The contents of the electronic sequence listing (“BROD-5360US_ST25.txt”; size is 205,235 bytes and it was created on Feb. 17, 2022) are herein incorporated by reference in their entirety.
The subject matter disclosed herein is generally directed to synthetic DNA spike-ins and their use for detecting, quantifying, and preventing amplification contamination in genome profiling analysis.
The COVID-19 pandemic has demonstrated, once again, the crucial role of genomic sequencing in combatting infectious disease outbreaks globally. Monitoring the emergence of pathogens and the spread of variants of concern has become commonplace in government, academic, and private laboratories [1,2]. Genomic data provide insights into the diversity, evolution and transmission of a virus, a critical guide for public health interventions ranging from contact tracing to identifying cases of reinfection and documenting resistance to clinical interventions [3-6]. In the year since, genomic data have provided new insights into the diversity, evolution and transmission of the virus, and these insights have increasingly been used to guide impactful public health interventions. In particular, scientists have employed viral genome sequencing to characterize the fine-scale epidemiology of clusters and superspreading events (Lemieux et al., 2021, Phylogenetic analysis of SARS-CoV-2 in Boston highlights the impact of superspreading events, Science, 371(6529); Popa et al., 2020, Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2, Science Translational Medicine, 12(573); Volz et al., 2021, Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data, bioRxiv, medRxiv). More recently, genome sequencing to monitor the emergence of new lineages and the spread of variants of concern (VoC) has become paramount (Washington et al., 2021, Genomic epidemiology identifies emergence and rapid transmission of SARS-CoV-2 B.1.1.7 in the United States, medRxiv). Laboratories are now performing viral genomic sequencing on SARS-CoV-2 at an unprecedented scale [7,8], which highlights the need for stringent requirements to ensure the integrity of genomes being produced.
Multiplexed amplicon-based genome sequencing methods have accelerated the massive scale of SARS-CoV-2 genomic surveillance due to their improved sensitivity, cost, and speed over other, lower-amplification RNA sequencing approaches, such as unbiased metagenomic sequencing [9]. Unsurprisingly, amplicon-based approaches that target the SARS-CoV-2 genome for amplification and subsequent sequencing have become the genomic surveillance method of choice during the ongoing pandemic (over 90% of Short Read Archive submissions). In just a year since the first genome sequence enabled the identification of SARS-CoV-2, hundreds of thousands of complete genomes have been sequenced and released by a relatively small group of several hundred laboratories. An open-access tiled primer set developed by the ARTIC network (artic.network/) is the most widely used method for SARS-CoV-2 specific genome amplification followed by sequencing on either Illumina or nanopore instruments (Quick et al., 2017; Tyson et al., 2020). A wide array of protocols and publications are now available that integrate these ARTIC primers with different amplification and library construction indexing strategies (Baker et al., 2020; Gohl et al., 2020). Approaches such as batching samples by viral load to increase sensitivity are impractical to scale to current needs, resulting in incomplete recovery of viral genomes, especially from low titer samples.
However, the risk for contamination during the amplification stage is especially high as the 35 or more cycles of virus-specific PCR produce trillions of SARS-CoV-2 amplicons in a single reaction. Other high-risk modes of contamination, including sample swaps, cross-contamination of samples, or aerosolization, can occur throughout the sample processing pipeline. With many laboratories performing viral sequencing by processing multiple large batches in parallel, the potential for contamination increases [10]. Even small amounts of sample mixing or contaminating amplicons could potentially confound studies where viral detection is sensitive to only tens of molecules [10,11]. Moreover, as SARS-CoV-2 has relatively low genetic diversity and often spreads in local outbreaks or clusters [11,12], many genomes are expected to be identical at the consensus level [11,15-17], a pattern that could also be observed due to contamination. The risk of contamination, and the challenges in detecting it, can confound a wide array of genomic analyses including estimates of the frequencies of variants, lineage dynamics, and transmission events. Additionally, methods to address the critical risk of sample processing errors in clinical sequencing could enable its use more widely in clinical decision making.
To meet the genomic surveillance goals laid out by local and world governments, sequencing efforts will need to be scaled to thousands of centers, many performing viral genomics for the first time. Additional laboratories will enter the SARS-CoV-2 sequencing space with an emphasis on rapidly surveilling VoCs for clinical significance, with even higher requirements to ensure the integrity of SARS-CoV-2 genomes being produced. While inclusion of internal standards is commonplace in many experimental approaches [13-15] and some technical assay controls exist for DNA sequencing [16-18], the use of internal controls is currently rare in amplicon-based genomic surveillance. Here Applicants developed and extensively tested a sample identification method using 96 synthetic DNA spike-ins (SDSIs) for amplicon-based sequencing approaches. Using the widely used open-access ARTIC tiled primer design (artic.network/), Applicants implemented these SDSIs for SARS-CoV-2 genomic sequencing from thousands of residual diagnostic (clinical) samples. The resulting user-friendly and highly versatile SDSI+AmpSeq protocol can be easily implemented to improve the quality of genomic data generated for epidemiological and clinical investigations of human pathogens.
Citation or identification of any document in this application is not an admission that such a document is available as prior art to the present invention.
In one aspect, the present invention provides for a method of detecting and preventing contamination in one or more cDNA samples comprising adding a synthetic DNA spike-in (SDSI) to each cDNA sample, wherein each SDSI is capable of amplification simultaneously with the cDNA, and wherein each SDSI comprises a unique sequence capable of differentiating each SDSI; amplifying one or more of the cDNA samples and SDSI; sequencing the amplified sample; and determining the number of reads of the spike-in from the one or more samples. In certain example embodiments, the sample is associated with drug resistance. In certain example embodiments, the sample is for sequencing a pathogen or family of pathogens. In certain example embodiments, the pathogen is a virus. In certain example embodiments, the pathogen is a bacteria and the region sequenced is associated with antibiotic resistance. In certain example embodiments, each sample contains a viral nucleic acid sequence. In certain example embodiments, the samples are for creating one or more sequencing families/clusters.
In certain example embodiments, the SDSI contains a core region and a primer binding region at the 3′ end and the 5′ end. In certain example embodiments, the core sequence of the SDSI is derived from a rare organism. In certain example embodiments, the rare organism is a thermophilic archaea. In certain example embodiments, the core sequence homology is less than 65%, or less than 60%, or less than 55%, or less than 50%, or less than 45%, or less than 40%, or less than 35%, or less than 30%, or less than 25%, or less than 20%, or less than 15%, or less than 5%, or less than 1% to a sample sequence. In certain example embodiments, the core sequence homology is less than 15, or less than 20, or less than 25, or less than 30, or less than 35, or less than 40, or less than 45, or less than 50 contiguous bases in common with the sample sequence.
In certain example embodiments, the synthetic DNA spike-in sequences are 50-5000 nucleotides in length. In certain example embodiments, the SDSI minimizes self-hybridization and cross-hybridization with nucleic acids in the sample. In certain example embodiments, the primer binding sites of the SDSI have a Tm between 55-65° C. In certain example embodiments, the method further comprises a plurality of SDSIs. In certain example embodiments, the core sequence of the synthetic DNA comprises a sequence as set forth in SEQ ID NOS: 1-96 and 193-291. In certain example embodiments, the primer binding sequences are complementary to the primers having SEQ ID NOS: 391 and 392. In certain example embodiments the SDSIs comprise one or more of SEQ ID NOS: 97-192 and 292-390. In example embodiments, sequences can be used in the alternative. In one example embodiment, sequence SEQ ID NO: 289 can substitute for sequence SEQ ID NO: 16. In one example embodiment, sequence SEQ ID NO: 290 can substitute for sequence SEQ ID NO: 57. In one example embodiment, sequence SEQ ID NO: 291 can substitute for sequence SEQ ID NO: 66. In one example embodiment, sequence SEQ ID NO: 388 can substitute for sequence SEQ ID NO: 112. In one example embodiment, sequence SEQ ID NO: 389 can substitute for sequence SEQ ID NO: 153. In one example embodiment, sequence SEQ ID NO: 390 can substitute for sequence SEQ ID NO: 162. In one example embodiment, one or more of SEQ ID NOS: 16, 57, 66, 112, 153, and 162 can be substituted with their alternative sequence SEQ ID NOS: 289, 290, 291, 388, 389, and 390, respectively.
In certain example embodiments, the concentration of synthetic DNA spike-ins ranges from 0.1 femtomolar to 1.0 femtomolar. In certain example embodiments, the presence of an amplified spike-in corresponding to the spike-in added to a sample indicates a decreased risk of contamination. In certain example embodiments, the presence of an amplified spike-in that does not correspond to the spike-in added to a sample indicates an increased risk of contamination.
In another aspect, the present invention is a set of synthetic DNA spike-ins (SDSIs), each SDSI in the set comprising a primer binding sequence at the 3′ and 5′ end and a unique core sequence between the 3′ and 5′ primer binding sequences. In certain example embodiments, the set comprises at least 96 spike-ins. In certain example embodiments, the unique core sequence is derived from a rare organism. In certain example embodiments, the rare organism is a thermophilic archaea. In certain example embodiments, the core sequence homology is less than 65%, or less than 60%, or less than 55%, or less than 50%, or less than 45%, or less than 40%, or less than 35%, or less than 30%, or less than 25%, or less than 20%, or less than 15%, or less than 5%, or less than 1% to a sample sequence. In certain example embodiments, the core sequence homology is less than 15, or less than 20, or less than 25, or less than 30, or less than 35, or less than 40, or less than 45, or less than 50 contiguous bases in common with the sample sequence.
In certain example embodiments, the sequence is 50-5000 nucleotides in length. In certain example embodiments, the SDSIs minimize self-hybridization and cross-hybridization with nucleic acids in the sample. In certain example embodiments, the primer binding sites have a Tm between 55-65° C. In certain example embodiments, the core sequences are the unique sequences as set forth in SEQ ID NOS: 1-96 and 193-291. In certain example embodiments, the primer binding sequences are complementary to the primers having SEQ ID NOS: 391 and 392. In certain example embodiments, the SDSIs comprise one or more of SEQ ID NOS: 97-192 and 292-390. In example embodiments, sequences can be used in the alternative. In one example embodiment, sequence SEQ ID NO: 289 can substitute for sequence SEQ ID NO: 16. In one example embodiment, sequence SEQ ID NO: 290 can substitute for sequence SEQ ID NO: 57. In one example embodiment, sequence SEQ ID NO: 291 can substitute for sequence SEQ ID NO: 66. In one example embodiment, sequence SEQ ID NO: 388 can substitute for sequence SEQ ID NO: 112. In one example embodiment, sequence SEQ ID NO: 389 can substitute for sequence SEQ ID NO: 153. In one example embodiment, sequence SEQ ID NO: 390 can substitute for sequence SEQ ID NO: 162. In one example embodiment, one or more of SEQ ID NOS: 16, 57, 66, 112, 153, and 162 can be substituted with their alternative sequence SEQ ID NOS: 289, 290, 291, 388, 389, and 390, respectively.
These and other aspects, objects, features, and advantages of the example embodiments will become apparent to those having ordinary skill in the art upon consideration of the following detailed description of example embodiments.
An understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention may be utilized, and the accompanying drawings of which:
The figures herein are for illustrative purposes only and are not necessarily drawn to scale.
Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Definitions of common terms and techniques in molecular biology may be found in Molecular Cloning: A Laboratory Manual, 2nd edition (1989) (Sambrook, Fritsch, and Maniatis); Molecular Cloning: A Laboratory Manual, 4th edition (2012) (Green and Sambrook); Current Protocols in Molecular Biology (1987) (F. M. Ausubel et al. eds.); the series Methods in Enzymology (Academic Press, Inc.): PCR 2: A Practical Approach (1995) (M. J. MacPherson, B. D. Hames, and G. R. Taylor eds.): Antibodies, A Laboratory Manual (1988) (Harlow and Lane, eds.): Antibodies A Laboratory Manual, 2nd edition 2013 (E. A. Greenfield ed.); Animal Cell Culture (1987) (R. I. Freshney, ed.); Benjamin Lewin, Genes IX, published by Jones and Bartlet, 2008 (ISBN 0763752223); Kendrew et al. (eds.), The Encyclopedia of Molecular Biology, published by Blackwell Science Ltd., 1994 (ISBN 0632021829); Robert A. Meyers (ed.), Molecular Biology and Biotechnology: a Comprehensive Desk Reference, published by VCH Publishers, Inc., 1995 (ISBN 9780471185710); Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, N.Y. 1994), March, Advanced Organic Chemistry Reactions, Mechanisms and Structure 4th ed., John Wiley & Sons (New York, N.Y. 1992); and Marten H. Hofker and Jan van Deursen, Transgenic Mouse Methods and Protocols, 2nd edition (2011).
As used herein, the singular forms “a”, “an”, and “the” include both singular and plural referents unless the context clearly dictates otherwise.
The term “optional” or “optionally” means that the subsequent described event, circumstance or substituent may or may not occur, and that the description includes instances where the event or circumstance occurs and instances where it does not.
The recitation of numerical ranges by endpoints includes all numbers and fractions subsumed within the respective ranges, as well as the recited endpoints.
The terms “about” or “approximately” as used herein when referring to a measurable value such as a parameter, an amount, a temporal duration, and the like, are meant to encompass variations of and from the specified value, such as variations of +/−10% or less, +/−5% or less, +/−1% or less, and +/−0.1% or less of and from the specified value, insofar as such variations are appropriate to perform in the disclosed invention. It is to be understood that the value to which the modifier “about” or “approximately” refers is itself also specifically, and preferably, disclosed.
As used herein, a “biological sample” may contain whole cells and/or live cells and/or cell debris. The biological sample may contain (or be derived from) a “bodily fluid”. The present invention encompasses embodiments wherein the bodily fluid is selected from amniotic fluid, aqueous humour, vitreous humour, bile, blood serum, breast milk, cerebrospinal fluid, cerumen (earwax), chyle, chyme, endolymph, perilymph, exudates, feces, female ejaculate, gastric acid, gastric juice, lymph, mucus (including nasal drainage and phlegm), pericardial fluid, peritoneal fluid, pleural fluid, pus, rheum, saliva, sebum (skin oil), semen, sputum, synovial fluid, sweat, tears, urine, vaginal secretion, vomit and mixtures of one or more thereof. Biological samples include cell cultures, bodily fluids, and cell cultures from bodily fluids. Bodily fluids may be obtained from a mammalian organism, for example by puncture, or other collecting or sampling procedures.
The terms “subject,” “individual,” and “patient” are used interchangeably herein to refer to a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets. Tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro are also encompassed.
Various embodiments are described hereinafter. It should be noted that the specific embodiments are not intended as an exhaustive description or as a limitation to the broader aspects discussed herein. One aspect described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced with any other embodiment(s). Reference throughout this specification to “one embodiment”, “an embodiment,” “an example embodiment,” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” or “an example embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to a person skilled in the art from this disclosure, in one or more embodiments. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention. For example, in the appended claims, any of the claimed embodiments can be used in any combination.
All publications, published patent documents, and patent applications cited herein are hereby incorporated by reference to the same extent as though each individual publication, published patent document, or patent application was specifically and individually indicated as being incorporated by reference.
Embodiments disclosed herein provide a method of detecting and preventing contamination during genome profiling using synthetic DNA spike-ins (SDSIs). Embodiments disclosed herein also provide methods to track sample contamination by implementing synthetic DNA spike-ins (SDSIs) for sample verification. Embodiments disclosed herein also provide synthetic DNA spike-ins (SDSIs) and methods for producing synthetic DNA spike-ins (SDSIs). The global spread and continued evolution of SARS-CoV-2 has driven an unprecedented surge in viral genomic surveillance. Amplicon-based sequencing methods provide a sensitive, low-cost and rapid approach but suffer a high potential for contamination, which can undermine laboratory processes and results. This challenge will only increase with expanding global production of sequences by diverse laboratories for epidemiological and clinical interpretation, as well as in genomic surveillance in future outbreaks. Applicants present SDSI+AmpSeq, an approach which uses synthetic DNA spike-ins (SDSIs) to track samples and detect inter-sample contamination through the sequencing workflow. Applying SDSIs to the ARTIC Consortium's amplicon design, Applicants demonstrated their utility and efficiency in a real-time investigation of a suspected hospital cluster of SARS-CoV-2 cases and across thousands of diagnostic samples at multiple laboratories. Applicants established that SDSI+AmpSeq provides increased confidence in genomic data by detecting and in some cases correcting for relatively common, yet previously unobserved modes of error without impacting genome recovery.
The methods described herein add a unique SDSI to each sample (e.g., cDNA) before performing a sequence amplification process during which the samples and SDSIs are amplified in the same reactions. This procedure can be repeated in parallel for each sample undergoing analysis. After the samples have been amplified, the presence of the SDSI is measured. If the SDSI introduced before amplification is the only SDSI present, then the sample is determined to be uncontaminated. However, the presence of any other SDSI immediately reveals contamination of the sample. This method provides a reliable safety measure for pathogen-genome studies and the resulting therapeutic and preventative medicine.
In one aspect, the present invention is directed to SDSIs and uses thereof. An example SDSI comprises, in a 5′ to 3′ direction, a 5′ primer binding sequence, a core sequence, and a 3′ primer binding sequence. In one example embodiment, spike-ins comprise sequences derived from a rare organism. A rare organism is a species that is limited in number or geographic occurrence relative to the distribution and abundance of other species making up the pool of interest (Raphael, M. et al., Conservation of Rare or Little-Known Species: Biological, Social, and Economic Considerations. Bibliovault OAI Repository (2007) the University of Chicago Press). In some embodiments, the rare organism is an archaea. In some embodiments, the archaea is thermophilic. A thermophilic archaea may exist in environments with temperatures greater than 50° C. In certain embodiments, the present invention includes spike-ins. In certain embodiments, a spike-in comprises a DNA sequence that is not from the target organism. In certain embodiments, a spike-in is an RNA molecule that can be added to a sample comprising pathogen RNA. In certain embodiments, the RNA is converted to cDNA concurrently with pathogen RNA. The cDNA derived from the RNA spike-in can then be amplified with pathogen cDNA using pathogen specific primers and spike-in specific primers.
In certain embodiments, a spike-in sequence is compared to the target organism and the host for the target organism to limit homology. Limited homology can be determined using a BLAST search of all SDSIs. In one example embodiment, a permissive BLAST search is used (e.g., blastn; 5000 max targets; E=10; ws=11; no mask for low-complexity). Results may be filtered by species of interest, e.g. Homo sapiens. In one example embodiment, results can be filtered for a pathogen of interest (e.g., SARS-CoV-2). The query coverage and sequence identity may each be set for 35-100%, preferably, 50-100%, and sequences having no significant hits can be selected for use as a spike-in. In certain embodiments, a spike-in set comprises different DNA sequences that can be easily distinguished using sequencing.
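The filtering step described above can be sketched in Python as follows. This is a minimal illustration only: it assumes the BLAST hits have already been parsed into simple records, and the record layout (BlastHit), the function name (select_candidates), and the 50% default thresholds are hypothetical choices made for the example rather than part of the disclosed method.

```python
from typing import Iterable, List, NamedTuple, Set


class BlastHit(NamedTuple):
    """One parsed BLAST hit (hypothetical record layout for this sketch)."""
    query: str             # name of the candidate spike-in sequence
    species: str           # species of the subject (database) sequence
    pct_identity: float    # percent identity of the alignment
    query_coverage: float  # percent of the query covered by the alignment


def select_candidates(candidates: Iterable[str],
                      hits: Iterable[BlastHit],
                      species_of_interest: Set[str],
                      min_identity: float = 50.0,
                      min_coverage: float = 50.0) -> List[str]:
    """Return candidates with no significant hit to a species of interest.

    A hit is treated as significant when both its percent identity and its
    query coverage reach the chosen thresholds (assumed defaults: 50%).
    """
    rejected = {
        h.query for h in hits
        if h.species in species_of_interest
        and h.pct_identity >= min_identity
        and h.query_coverage >= min_coverage
    }
    # Preserve the input order of the candidates that survive filtering.
    return [c for c in candidates if c not in rejected]
```

In this sketch a candidate is rejected only when a single hit exceeds both thresholds for a listed species, so hits to unrelated organisms, or weak partial hits, do not exclude it.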
In certain embodiments, the GC content of the spike-ins promotes similar amplification rates across pathogen targets and the different SDSIs in the set. In one example embodiment, a spike-in comprises a similar GC content as the target organism. In another example embodiment, the GC content of the primer may range from 30%-80% (Buck, G. A. et al., Design Strategies and Performance of Custom DNA Sequencing Primers, BioTechniques (1999) 27:3, 528-536). In another example embodiment, the GC content of the primer may range between 30%-40%, or between 40%-50%, or between 50%-60%, or between 60%-70%, or between 70%-80%. In general, GC content extremes are avoided. For example, sequences may have a median of 50% GC content, preferably, between 35-65%. In another example embodiment, the GC content of the primer may range between 40%-70%, or between 30%-50%, or between 30%-60%, or between 30%-70%.
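As a simple illustration of screening candidate sequences by GC content, the following Python sketch computes the GC fraction of a sequence and checks it against a window. The function names and the default 35-65% window (taken from the preferred range stated above) are assumptions for the example.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    gc_count = sum(1 for base in seq if base in "GC")
    return gc_count / len(seq)


def within_gc_window(seq: str, low: float = 0.35, high: float = 0.65) -> bool:
    """True when the GC fraction lies within [low, high] (assumed window)."""
    return low <= gc_fraction(seq) <= high
```

For example, within_gc_window("ATGCATGC") is true because half of the bases are G or C, whereas a sequence composed almost entirely of A and T would be excluded.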
Each SDSI in the set is differentiated by its core sequence. The SDSI cores are designed to minimize self-hybridization and cross-hybridization with other nucleic acids in a given sample. Accordingly, core sequences are selected based on the type of target sequence to be amplified and the type of sample the target sequence is to be derived from. For example, in the context of detecting a pathogen in a human sample, the core sequence should be selected with minimal homology to the target pathogen, other common microbes and non-target pathogens that might be present in the sample, and human sequences as well. In certain example embodiments, the core sequence has a homology of less than about 65%, or less than 64%, or less than 63%, or less than 62%, or less than 61%, or less than 60%, or less than 59%, or less than 58%, or less than 57%, or less than 56%, or less than 55%, or less than 54%, or less than 53%, or less than 52%, or less than 51%, or less than 50%, or less than 49%, or less than 48%, or less than 47%, or less than 46%, or less than 45%, or less than 44%, or less than 43%, or less than 42%, or less than 41%, or less than 40%, or less than 35%, or less than 30%, or less than 25%, or less than 20%, or less than 15%, or less than 10%, or less than 5%, or less than 1% to a sample sequence.
The core sequence may vary in length between 50-5,000 nucleotides, or between 50-4,500 nucleotides, or between 50-4,000 nucleotides, or between 50-3,500 nucleotides, or between 50-3,000 nucleotides, or between 50-2,500 nucleotides, or between 50-2,000 nucleotides, or between 50-1,500 nucleotides, or between 50-1,000 nucleotides, or between 50-500 nucleotides.
The core sequence may vary in length between 50-60 nucleotides, or between 50-70 nucleotides, or between 50-80 nucleotides, or between 50-90 nucleotides, or between 50-100 nucleotides, or between 50-110 nucleotides, or between 50-120 nucleotides, or between 50-130 nucleotides, or between 50-140 nucleotides, or between 50-150 nucleotides, or between 50-160 nucleotides, or between 50-170 nucleotides, or between 50-180 nucleotides, or between 50-190 nucleotides, or between 50-200 nucleotides, or between 50-210 nucleotides, or between 50-220 nucleotides, or between 50-230 nucleotides, or between 50-240 nucleotides, or between 50-250 nucleotides, or between 50-260 nucleotides, or between 50-270 nucleotides, or between 50-280 nucleotides, or between 50-290 nucleotides, or between 50-300 nucleotides, or between 50-310 nucleotides, or between 50-320 nucleotides, or between 50-330 nucleotides, or between 50-340 nucleotides, or between 50-350 nucleotides, or between 50-360 nucleotides, or between 50-370 nucleotides, or between 50-380 nucleotides, or between 50-390 nucleotides, or between 50-400 nucleotides, or between 50-410 nucleotides, or between 50-420 nucleotides, or between 50-430 nucleotides, or between 50-440 nucleotides, or between 50-450 nucleotides, or between 50-460 nucleotides, or between 50-470 nucleotides, or between 50-480 nucleotides, or between 50-490 nucleotides, or between 50-500 nucleotides, or between 50-510 nucleotides, or between 50-520 nucleotides, or between 50-530 nucleotides, or between 50-540 nucleotides, or between 50-550 nucleotides, or between 50-560 nucleotides, or between 50-570 nucleotides, or between 50-580 nucleotides, or between 50-590 nucleotides, or between 50-600 nucleotides, or between 50-610 nucleotides, or between 50-620 nucleotides, or between 50-630 nucleotides, or between 50-640 nucleotides, or between 50-650 nucleotides, or between 50-660 nucleotides, or between 50-670 nucleotides, or between 50-680 nucleotides, or between 
50-690 nucleotides, or between 50-700 nucleotides, or between 50-710 nucleotides, or between 50-720 nucleotides, or between 50-730 nucleotides, or between 50-740 nucleotides, or between 50-750 nucleotides, or between 50-760 nucleotides, or between 50-770 nucleotides, or between 50-780 nucleotides, or between 50-790 nucleotides, or between 50-800 nucleotides, or between 50-810 nucleotides, or between 50-820 nucleotides, or between 50-830 nucleotides, or between 50-840 nucleotides, or between 50-850 nucleotides, or between 50-860 nucleotides, or between 50-870 nucleotides, or between 50-880 nucleotides, or between 50-890 nucleotides, or between 50-900 nucleotides, or between 50-910 nucleotides, or between 50-920 nucleotides, or between 50-930 nucleotides, or between 50-940 nucleotides, or between 50-950 nucleotides, or between 50-960 nucleotides, or between 50-970 nucleotides, or between 50-980 nucleotides, or between 50-990 nucleotides, or between 50-1000 nucleotides, or between 50-1010 nucleotides.
The core sequence may vary in length between 100-5,000 nucleotides, or between 1,000-5,000 nucleotides, or between 2,000-5,000 nucleotides, or between 3,000-5,000 nucleotides, or between 4,000-5,000 nucleotides.
The core sequence may vary in length between 75-150 nucleotides, or between 100-150 nucleotides, or between 100-200 nucleotides, or between 100-300 nucleotides, or between 150-200 nucleotides, or between 150-250 nucleotides.
The homology to a target sequence or non-target sequence in the sample across the size of a given core sequence may be less than 1 nucleotide, or may be less than 2 nucleotides, or may be less than 3 nucleotides, or may be less than 4 nucleotides, or may be less than 5 nucleotides, or may be less than 6 nucleotides, or may be less than 7 nucleotides, or may be less than 8 nucleotides, or may be less than 9 nucleotides, or may be less than 10 nucleotides, or may be less than 11 nucleotides, or may be less than 12 nucleotides, or may be less than 13 nucleotides, or may be less than 14 nucleotides, or may be less than 15 nucleotides, or may be less than 16 nucleotides, or may be less than 17 nucleotides, or may be less than 18 nucleotides, or may be less than 19 nucleotides, or may be less than 20 nucleotides, or may be less than 21 nucleotides, or may be less than 22 nucleotides, or may be less than 23 nucleotides, or may be less than 24 nucleotides, or may be less than 25 nucleotides.
The homology to a target sequence or non-target sequence in the sample across the size of a given core sequence may vary in length between 1-5 nucleotides, or between 1-10 nucleotides, or between 1-15 nucleotides, or between 1-20 nucleotides, or between 1-25 nucleotides, or between 5-10 nucleotides, or between 10-15 nucleotides, or between 15-20 nucleotides, or between 20-25 nucleotides, or between 10-20 nucleotides, or between 20-30 nucleotides.
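One way to check a limit on contiguous shared bases is to compute the longest stretch that a core sequence and a reference sequence have in common. The Python sketch below does this with a standard dynamic-programming approach; the function names and the default limit of 15 bases are assumptions for illustration, and the sketch compares a single strand only (it does not consider the reverse complement).

```python
def longest_shared_run(core: str, reference: str) -> int:
    """Length of the longest contiguous stretch present in both sequences."""
    core, reference = core.upper(), reference.upper()
    best = 0
    # prev[j] holds the length of the match ending at the previous core base
    # and reference position j; only one previous row is needed at a time.
    prev = [0] * (len(reference) + 1)
    for core_base in core:
        curr = [0] * (len(reference) + 1)
        for j, ref_base in enumerate(reference, start=1):
            if core_base == ref_base:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def passes_contiguity_limit(core: str, reference: str,
                            max_shared: int = 15) -> bool:
    """True when fewer than max_shared contiguous bases are shared."""
    return longest_shared_run(core, reference) < max_shared
```

The quadratic running time is acceptable for short cores compared against short references such as a viral genome; comparing against a large host genome would in practice call for an alignment tool instead.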
These SDSIs can be implemented in a wide range of genome profiling applications including, but not limited to, investigations of SARS-CoV-2 epidemiology and emerging viral variants. Exemplary SDSIs are provided in Table 1.
Table 1. Sequences of 96 unique SDSIs. The unique core of each SDSI is 140 bp long (SEQ ID NOS: 1-96 and 193-291). The unique SDSIs including the priming regions are SEQ ID NOS: 97-192 and 292-390. Alternative sequences are also included. SEQ ID NOS: 16, 57, 66, 112, 153, and 162 can be, in the alternative, substituted with SEQ ID NOS: 289, 290, 291, 388, 389, and 390, respectively. Sequences for forward and reverse primers for amplifying the SDSIs are SEQ ID NOS: 391 and 392, respectively.
The 5′ and 3′ primer binding sequences are selected to be complementary to SDSI 5′ and 3′ primers, which are included in an amplification reaction and used to amplify SDSIs present in a given sample. The primer binding sites may be optimized for multiplex amplification with a set of primers used to amplify a genome for sequencing. In one example embodiment, the 5′ and 3′ primer binding sites have a Tm of between 55-65° C. In one example embodiment, the 5′ and 3′ primer binding sites are complementary to primers having SEQ ID NOS: 391 and 392.
In one example embodiment, a method of detecting and preventing contamination in one or more amplification reactions comprises adding a SDSI according to the example embodiments disclosed above to one or more samples to be assayed. An amplification reaction is then used to amplify a target sequence in the samples. The amplification reaction will include probes and primers needed to amplify the target sequence and to amplify the SDSI. The amplicons generated from the amplification step are then sequenced, and the number of reads of the SDSI from the one or more samples is determined, wherein detection of only a single SDSI in a sample indicates contamination-free amplification of the sample, and wherein detection of multiple SDSIs indicates possible contamination of the sample. Samples identified as potentially contaminated may then be discarded or marked for repeat to confirm accuracy of results.
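The read-count decision described above can be sketched in Python as follows. This is a minimal illustration only, not the method of the disclosure: the function name `classify_sample`, the `min_reads` detection threshold, and the extra status labels for a missing or unexpected SDSI are hypothetical assumptions introduced for the example.

```python
from typing import Dict


def classify_sample(sdsi_reads: Dict[str, int],
                    expected_sdsi: str,
                    min_reads: int = 10) -> str:
    """Classify one sample from its per-SDSI read counts.

    sdsi_reads    maps an SDSI name to the number of reads assigned to it.
    expected_sdsi is the SDSI that was added to this sample.
    min_reads     is a hypothetical detection threshold (an assumption,
                  not a value taken from the disclosure).
    """
    # An SDSI counts as "detected" only if it reaches the threshold.
    detected = {name for name, n in sdsi_reads.items() if n >= min_reads}

    if detected == {expected_sdsi}:
        return "clean"                    # only the single expected SDSI
    if not detected:
        return "no_sdsi_detected"         # assumed label: amplification failed
    if expected_sdsi not in detected:
        return "unexpected_sdsi"          # assumed label: possible sample swap
    return "possible_contamination"       # multiple SDSIs in one sample
```

Under these assumptions, a sample returning anything other than "clean" would be the kind of sample that is discarded or marked for repeat.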
The present invention solves this problem by providing for the sequencing of spike-DNA sequences at concentrations that can be amplified concurrently with the nucleic acids of interest. In one example embodiment, sequencing includes extracting total RNA or DNA from a biological sample, such as a sample collected with a swab (e.g., nasal, rectal, vaginal). Methods of extracting total RNA or DNA are known in the art and commercial kits are available. The presence of a pathogen may be confirmed in a sample. Exemplary methods for confirming include PCR, RT-PCR and RT-qPCR. In certain embodiments, sequencing includes DNase treatment to remove residual DNA. In certain example embodiments, sequencing may include depletion of ribosomal RNA (rRNA). In certain example embodiments, cDNA may be prepared from total RNA using RT-PCR. In certain example embodiments, RT-PCR may be performed using random hexamer priming. In one example embodiment, a SDSI is added to each cDNA sample. The SDSI can be added to the total cDNA sample. In certain example embodiments, cDNA samples may be normalized to a constant amplification level. In certain example embodiments, real time PCR may be performed on the cDNA using one or more standard primers and a Ct value is used to normalize cDNA samples. As used herein, standard primers refer to a primer set that is used for every sample. In certain embodiments, the standard primers are directed to a region of the pathogen to be sequenced. The samples can be diluted such that all of the samples for amplification have the same Ct value in the amplification reaction. In certain embodiments, each sample is normalized to a Ct value less than 35, 34, 33, 32, 30, 29, 28, 27, 26, 25, or 24. In preferred embodiments, the samples are normalized to a Ct value of 26 to 28, preferably 27. In one example embodiment, a SDSI is added to the normalized sample used for PCR amplification of the pathogen.
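As an illustration of the Ct-based normalization described above, the following Python sketch estimates how much a sample would be diluted to reach a target Ct. It assumes ideal amplification in which template doubles each cycle, so that each Ct unit corresponds to a two-fold difference in input; the function name `dilution_factor` and the `fold_per_cycle` parameter are hypothetical and are not taken from the disclosure.

```python
def dilution_factor(sample_ct: float,
                    target_ct: float = 27.0,
                    fold_per_cycle: float = 2.0) -> float:
    """Return the fold-dilution that brings a sample to target_ct.

    Assumes each PCR cycle multiplies template by fold_per_cycle
    (2.0 corresponds to an idealized 100% efficient reaction).
    Samples already at or above the target Ct cannot be concentrated
    by dilution, so they are left undiluted (factor 1.0).
    """
    if sample_ct >= target_ct:
        return 1.0
    # A sample that crosses threshold (target_ct - sample_ct) cycles early
    # holds fold_per_cycle ** (difference) times more template.
    return fold_per_cycle ** (target_ct - sample_ct)


# Example: a sample measured at Ct 24 with a target Ct of 27
factor = dilution_factor(24.0)
print(f"Dilute 1:{factor:g}")
```

Real reactions rarely reach perfect doubling, so in practice the per-cycle fold change would be estimated from a standard curve rather than fixed at 2.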
The cDNA may be amplified in the same reaction with pathogen specific primers and primers specific to the SDSI. Amplification may be performed in a multi-well plate (e.g., a standard PCR plate).
In certain example embodiments, the primer concentration is 100 μM. In certain example embodiments, the primer concentration is between 50 μM-150 μM, or between 50 μM-200 μM, or between 50 μM-250 μM, or between 50 μM-300 μM, or between 50 μM-350 μM, or between 50 μM-400 μM, or between 50 μM-450 μM, or between 50 μM-500 μM. In certain example embodiments, the primer concentration is between 50 μM-70 μM, or between 70 μM-90 μM, or between 90 μM-110 μM, or between 110 μM-130 μM, or between 130 μM-150 μM, or between 150 μM-170 μM, or between 170 μM-190 μM, or between 190 μM-210 μM, or between 210 μM-230 μM, or between 230 μM-250 μM, or between 250 μM-270 μM, or between 270 μM-290 μM, or between 290 μM-310 μM, or between 310 μM-330 μM, or between 330 μM-350 μM, or between 350 μM-370 μM, or between 370 μM-390 μM, or between 390 μM-410 μM, or between 410 μM-430 μM, or between 430 μM-450 μM, or between 450 μM-470 μM, or between 470 μM-490 μM. In certain example embodiments, the primer concentration is between 50 μM-100 μM, or between 100 μM-150 μM, or between 150 μM-200 μM, or between 200 μM-250 μM, or between 250 μM-300 μM, or between 300 μM-350 μM, or between 350 μM-400 μM, or between 400 μM-450 μM, or between 450 μM-500 μM.
In certain example embodiments, a spike-in may be relatively the same length as the amplicons generated for the target organism. In one example embodiment, spike-ins are the same size and share the same priming region to ensure similar amplification performance. In certain embodiments, a spike-in for MNase-seq, ChIP-seq, and genomic DNA are around 150 nucleotides in length. In one example embodiment, a spike-in accounts for 0.1%-3.5% reads. A spike-in to total sample ratio may be from 1,000:1 to 50:1. In one example embodiment, a spike-in includes primer binding sites on the 3′ end and/or the 5′ end. (Chen K., et al., The overlooked fact: fundamental need for spike-in control for virtually all genome-wide analyses. Mol Cell Biol (2016) 36:662-667) The primers and primer binding sites on the SDSI may range between 15-40 nucleotides in length. The primer's melting temperature (Tm) may range from 40° C.-95° C., preferably between 55-65° C.
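By way of illustration, a candidate primer can be screened against the length and Tm windows given above with a short Python sketch. The disclosure does not state how Tm is calculated; the sketch assumes the commonly used basic GC-content approximation, Tm = 64.9 + 41 × (G + C − 16.4) / N, for oligonucleotides longer than 13 nucleotides. The function names and the example primer are hypothetical and do not correspond to SEQ ID NOS: 391 or 392.

```python
def basic_tm(primer: str) -> float:
    """Estimate Tm (deg C) with the basic GC-content approximation.

    Assumption: Tm = 64.9 + 41 * (G + C - 16.4) / N, a rough rule for
    oligos longer than 13 nt; salt and nearest-neighbor effects are ignored.
    """
    seq = primer.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def primer_in_range(primer: str,
                    min_len: int = 15, max_len: int = 40,
                    tm_low: float = 55.0, tm_high: float = 65.0) -> bool:
    """Check the 15-40 nt length window and the preferred 55-65 deg C Tm window."""
    return (min_len <= len(primer) <= max_len
            and tm_low <= basic_tm(primer) <= tm_high)


# Hypothetical 24-nt primer used only to exercise the functions
example_primer = "ACGTTGCAATCGGATCCATGACTA"
print(round(basic_tm(example_primer), 2), primer_in_range(example_primer))
```

A nearest-neighbor thermodynamic model would give more accurate estimates; the basic formula is used here only because it needs no external parameters.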
After amplification of cDNA, standard sequence library generation can be performed. In certain embodiments, sequencing comprises high-throughput (formerly “next-generation”) technologies to generate sequencing reads. In DNA sequencing, a read is an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules or generating complementary DNA (cDNA) fragments, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads. Methods for constructing sequencing libraries are known in the art (see, e.g., Head et al., Library construction for next-generation sequencing: Overviews and challenges. Biotechniques. 2014; 56(2): 61-77). A “library” or “fragment library” may be a collection of nucleic acid molecules derived from one or more nucleic acid samples, in which fragments of nucleic acid have been modified, generally by incorporating terminal adapter sequences comprising one or more primer binding sites and identifiable sequence tags. In certain embodiments, the library members (e.g., genomic DNA, cDNA) may include sequencing adaptors that are compatible with use in, e.g., Illumina's reversible terminator method, long read nanopore sequencing, Roche's pyrosequencing method (454), Life Technologies' sequencing by ligation (the SOLiD platform) or Life Technologies' Ion Torrent platform. Examples of such methods are described in the following references: Margulies et al (Nature 2005 437: 376-80); Schneider and Dekker (Nat Biotechnol. 2012 Apr. 10; 30(4):326-8); Ronaghi et al. (Analytical Biochemistry 1996 242: 84-9); Shendure et al. (Science 2005 309: 1728-32); Imelfort et al. (Brief Bioinform. 2009 10:609-18); Fox et al. (Methods Mol. Biol. 2009; 553:79-108); Appleby et al. (Methods Mol. Biol. 2009; 513:19-39); and Morozova et al. (Genomics. 2008 92:255-64), which are incorporated by reference for the general descriptions of the methods and the particular steps of the methods, including all starting products, reagents, and final products for each of the steps.
In one example embodiment, any suitable RNA or DNA amplification technique may be used to amplify a sample and SDSI. In one example embodiment, the RNA or DNA amplification is an isothermal amplification. The isothermal amplification may be nucleic acid sequence-based amplification (NASBA), recombinase polymerase amplification (RPA), loop-mediated isothermal amplification (LAMP), strand displacement amplification (SDA), helicase-dependent amplification (HDA), or nicking enzyme amplification reaction (NEAR). In certain example embodiments, non-isothermal amplification methods may be used which include, but are not limited to, PCR, multiple displacement amplification (MDA), rolling circle amplification (RCA), ligase chain reaction (LCR), or ramification amplification method (RAM).
In one example embodiment, the present invention is used to improve any method of sequencing wherein the nucleic acids to be sequenced are amplified (i.e., amplicon-based methods). In certain example embodiments, the amplification method preferentially amplifies a contaminant nucleic acid if it is present in a sample. In preferred embodiments, samples comprising a pathogen of interest are sequenced. In more preferred embodiments, the pathogen of interest includes variants that can be clustered into families or a lineage. As used herein, the term “variant” refers to any virus having one or more mutations as compared to a known virus. A strain is a genetic variant or subtype of a virus. The terms ‘strain’, ‘variant’, and ‘isolate’ may be used interchangeably. In certain embodiments, a variant has developed a “specific group of mutations” that causes the variant to behave differently than that of the strain it originated from. In certain example embodiments, the families of variants are important for tracking and responding to epidemics and pandemics. For example, sequencing can be used to determine variants that are emerging as the dominant variants causing disease or are spreading more quickly. In another example, sequencing variants can be used to track community transmission and superspreading events (see e.g., Lemieux et al., 2020). Variants may also include those that are resistant to a specific treatment, such as drug resistance. In certain embodiments, variants are associated with more severe disease. As used herein, the term “epidemic” refers to the rapid spread of disease to a large number of people in a given population within a short period of time or the occurrence of more cases of disease, injury, or other health condition than expected in a given area or among a specific group of persons during a particular period. 
For example, in meningococcal infections, an attack rate in excess of 15 cases per 100,000 people for two consecutive weeks is considered an epidemic. Epidemics of infectious disease are generally caused by several factors including a change in the ecology of the host population (e.g., increased stress or increase in the density of a vector species), a genetic change in the pathogen reservoir or the introduction of an emerging pathogen to a host population (by movement of pathogen or host). Generally, an epidemic occurs when host immunity to either an established pathogen or newly emerging novel pathogen is suddenly reduced below that found in the endemic equilibrium and the transmission threshold is exceeded. An epidemic may be restricted to one location; however, if it spreads to other countries or continents and affects a substantial number of people, it may be termed a pandemic. Effective preparations for a response to a pandemic are multi-layered. The first layer is a disease surveillance system, which includes sequencing of all variants in a population. In certain embodiments, sequencing contaminants that were amplified from a sample would provide an incorrect identification and clustering of the variants.
Any method of sequencing variants in pathogens, such as viral pathogens, is applicable to the present invention (see e.g., Lemieux et al., 2020). Current sequencing methods all suffer from the risk of contamination and the user would be blind to whether the results were accurate.
In certain example embodiments, a pathogen with a DNA genome is sequenced. Sequencing may include whole genome sequencing. Whole genome sequencing (also known as WGS, full genome sequencing, complete genome sequencing, or entire genome sequencing) is the process of determining the complete DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast. “Whole genome amplification” (“WGA”) refers to any amplification method that aims to produce an amplification product that is representative of the genome from which it was amplified. In certain embodiments, the SDSIs of the present invention are added at the amplification step. Non-limiting WGA methods include Primer extension PCR (PEP) and improved PEP (I-PEP), Degenerated oligonucleotide primed PCR (DOP-PCR), Ligation-mediated PCR (LMP), T7-based linear amplification of DNA (TLAD), and Multiple displacement amplification (MDA).
In certain example embodiments, the present invention includes whole exome sequencing. Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding genes in a genome (known as the exome) (see, e.g., Ng et al., 2009, Nature volume 461, pages 272-276). It consists of two steps: the first step is to select only the subset of DNA that encodes proteins. These regions are known as exons—humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology. In certain embodiments, whole exome sequencing is used to determine germline mutations in genes associated with disease.
In certain example embodiments, targeted sequencing is used in the present invention (see, e.g., Mantere et al., PLoS Genet 12 e1005816 2016; and Carneiro et al. BMC Genomics, 2012 13:375). Targeted gene sequencing panels are useful tools for analyzing specific mutations in a given sample. Focused panels contain a select set of genes or gene regions that have known or suspected associations with the disease or phenotype under study. In certain embodiments, targeted sequencing is used to detect mutations associated with a disease in a subject in need thereof. Targeted sequencing can increase the cost-effectiveness of variant discovery and detection. In certain embodiments, targeted sequencing includes amplification and the SDSIs of the present invention are added at the amplification step.
In one example embodiment, the mitochondrial genome from more than one sample is sequenced. In certain embodiments, mitochondrial genome sequencing includes amplification and the SDSIs of the present invention are added at or before the amplification step. An exemplary method includes MitoRCA-seq (see e.g., Ni et al., MitoRCA-seq reveals unbalanced cytocine to thymine transition in Polg mutant mice. Sci Rep. 2015 Jul. 27; 5:12049. doi: 10.1038/srep12049). The method employs rolling circle amplification, which enriches the full-length circular mtDNA by either custom mtDNA-specific primers or a commercial kit and minimizes the contamination of nuclear encoded mitochondrial DNA (Numts). In certain embodiments, RCA-seq is used to detect low-frequency mtDNA point mutations starting with as little as 1 ng of total DNA.
In another example embodiment, multiple displacement amplification (MDA) is used to generate a sequencing library. Multiple displacement amplification (MDA) is a non-PCR-based isothermal method based on the annealing of random hexamers to denatured DNA, followed by strand-displacement synthesis at constant temperature (Blanco et al. J. Biol. Chem. 1989, 264, 8935-8940). It has been applied to samples with small quantities of genomic DNA, leading to the synthesis of high molecular weight DNA with limited sequence representation bias (Lizardi et al. Nature Genetics 1998, 19, 225-232; Dean et al., Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 5261-5266). As DNA is synthesized by strand displacement, a gradually increasing number of priming events occur, forming a network of hyper-branched DNA structures. The reaction can be catalyzed by enzymes such as the Phi29 DNA polymerase or the large fragment of the Bst DNA polymerase. The Phi29 DNA polymerase possesses a proofreading activity resulting in error rates 100 times lower than Taq polymerase (Lasken et al. Trends Biotech. 2003, 21, 531-535). In certain embodiments, the SDSIs of the present invention are added to samples and amplified during MDA or in a subsequent amplification step.
In one example embodiment, sequencing comprises sequencing of SARS-CoV-2 variants. The scale of the SARS-CoV-2 pandemic has led to a particular focus on reducing the cost and time of amplicon-based methods, often at the cost of slightly reduced sensitivity. However, viral loads of SARS-CoV-2 can vary widely between individuals, in particular when samples are caught early in infection or follow-up sampling is needed. An open-access tiled primer set developed by the ARTIC network is the most widely used method for SARS-CoV-2 specific genome amplification followed by sequencing on either Illumina or nanopore instruments (Quick et al., 2017; Tyson et al., 2020). A wide array of protocols and publications are now available that integrate these ARTIC primers with different amplification and library construction indexing strategies (Baker et al., 2020; Gohl et al., 2020). Approaches such as batching samples by viral load to increase sensitivity are impractical to scale to current needs, resulting in incomplete recovery of viral genomes, especially from low titer samples.
In certain embodiments, the methods described herein can be used to sequence viral samples with low viral loads. A viral load may also be interchangeably referred to as viral burden or viral titer. A viral load may be expressed in viral particles per mL, infectious particles per mL, copies per mL, or virus per mL. A low viral load may be a cycle threshold (CT) >30 or copies per mL <10^4. A high viral load may be a CT <30 or copies per mL >10^5. For example, a low viral load may be lower than 10,000, 1,000, 500, 400, 300, 200, 100, 50, 40, 30, 20, or 10 viral particles. In certain embodiments, a single viral particle is sequenced.
In certain embodiments, the SDSI is used to detect and prevent contamination in genomic analysis samples of pathogens. A pathogen may include viruses, bacteria, fungi, and protozoa. In certain embodiments, a virus may belong to any morphological category including helical, envelope, or icosahedral. In certain embodiments, a virus may comprise DNA or RNA, may be single stranded or double stranded, and may be linear or circular. In certain embodiments, the genome of the virus may be one nucleic acid molecule or several nucleic acid segments. In certain embodiments, a virus may belong to the family: Adenoviridae, Papovaviridae, Parvoviridae, Herpesviridae, Poxviridae, Anelloviridae, Pleolipoviridae, Reoviridae, Picornaviridae, Caliciviridae, Togaviridae, Arenaviridae, Flaviviridae, Orthomyxoviridae, Paramyxoviridae, Bunyaviridae, Rhabdoviridae, Filoviridae, Astroviridae, Bornaviridae, Arteriviridae, Hepeviridae, Retroviridae, Caulimoviridae, Hepadnaviridae, Coronaviridae. In certain embodiments, the virus is SARS-CoV-2. (Gelderblom HR. Structure and Classification of Viruses. In: Baron S, editor. Medical Microbiology. 4th edition. Galveston (Tex.): University of Texas Medical Branch at Galveston; 1996. Chapter 41)
In an exemplary embodiment, the pathogen sequenced is a coronavirus. As used herein, “coronavirus” refers to enveloped viruses with a positive-sense single-stranded RNA genome and a nucleocapsid of helical symmetry that constitute the subfamily Orthocoronavirinae, in the family Coronaviridae (see, e.g., Woo P C, Huang Y, Lau S K, Yuen K Y. Coronavirus genomics and bioinformatics analysis. Viruses. 2010; 2(8):1804-1820). Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the virus causing the ongoing Coronavirus Disease 2019 (COVID-19) pandemic (see, e.g., Zhou, et al. (2020). A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270-273). In preferred embodiments, the virus is SARS-CoV-2 or variants thereof. In preferred embodiments, the disease treated is COVID-19. SARS-CoV-2 is the third zoonotic betacoronavirus to cause a human outbreak after SARS-CoV in 2002 and Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012 (de Wit et al., 2016, SARS and MERS: recent insights into emerging coronaviruses. Nat Rev Microbiol 14, 523-534). While there are many thousands of variants of SARS-CoV-2 (Koyama, Takahiko; Platt, Daniela; Parida, Laxmi (June 2020). “Variant analysis of SARS-CoV-2 genomes”. Bulletin of the World Health Organization. 98: 495-504), there are also much larger groupings called clades. Several different clade nomenclatures for SARS-CoV-2 have been proposed. As of December 2020, GISAID, referring to SARS-CoV-2 as hCoV-19, identified seven clades (O, S, L, V, G, GH, and GR) (Alm E, Broberg E K, Connor T, et al. Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region, January to June 2020 [published correction appears in Euro Surveill. 2020 August; 25(33):]. Euro Surveill. 2020; 25(32):2001410). Also as of December 2020, Nextstrain identified five (19A, 19B, 20A, 20B, and 20C) (Cited in Alm et al. 2020). Guan et al. identified five global clades (G614, S84, V251, I378 and D392) (Guan Q, Sadykov M, Mfarrej S, et al. A genetic barcode of SARS-CoV-2 for monitoring global distribution of different clades during the COVID-19 pandemic. Int J Infect Dis. 2020; 100:216-223). Rambaut et al. proposed the term “lineage” in a 2020 article in Nature Microbiology; as of December 2020, there have been five major lineages (A, B, B.1, B.1.1, and B.1.777) identified (Rambaut, A.; Holmes, E. C.; O'Toole, A.; et al. “A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology”. Nature Microbiology. 5: 1403-1407).
Exemplary, non-limiting variants applicable to the present invention are described below. Genetic variants of SARS-CoV-2 have been emerging and circulating around the world throughout the COVID-19 pandemic (see, e.g., The US Centers for Disease Control and Prevention; www.cdc.gov/coronavirus/2019-ncov/variants/variant-info.html). Exemplary, non-limiting variants applicable to the present disclosure include variants of SARS-CoV-2, particularly those having substitutions of therapeutic concern. Table A shows exemplary, non-limiting genetic substitutions in SARS-CoV-2 variants.
Phylogenetic Assignment of Named Global Outbreak (PANGO) Lineages is a software tool developed by members of the Rambaut Lab. The associated web application was developed by the Centre for Genomic Pathogen Surveillance in South Cambridgeshire and is intended to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the PANGO nomenclature. It is available at cov-lineages.org.
In some embodiments, the SARS-CoV-2 variant is and/or includes: B.1.1.7, also known as Alpha (WHO) or UK variant, having the following spike protein substitutions: 69del, 70del, 144del, (E484K*), (S494P*), N501Y, A570D, D614G, P681H, T716I, S982A, and D1118H (K1191N*); B.1.351, also known as Beta (WHO) or South Africa variant, having the following spike protein substitutions: D80A, D215G, 241del, 242del, 243del, K417N, E484K, N501Y, D614G, and A701V; B.1.427, also known as Epsilon (WHO) or US California variant, having the following spike protein substitutions: L452R, and D614G; B.1.429, also known as Epsilon (WHO) or US California variant, having the following spike protein substitutions: S13I, W152C, L452R, and D614G; B.1.617.2, also known as Delta (WHO) or India variant, having the following spike protein substitutions: T19R, (G142D), 156del, 157del, R158G, L452R, T478K, D614G, P681R, and D950N; P.1, also known as Gamma (WHO) or Japan/Brazil variant, having the following spike protein substitutions: L18F, T20N, P26S, D138Y, R190S, K417T, E484K, N501Y, D614G, H655Y, and T1027I; and B.1.1.529, also known as Omicron (WHO), having the following spike protein substitutions: A67V, del69-70, T95I, del142-144, Y145D, del211, L212I, ins214EPE, G339D, S371L, S373P, S375F, K417N, N440K, G446S, S477N, T478K, E484A, Q493R, G496S, Q498R, N501Y, Y505H, T547K, D614G, H655Y, N679K, P681H, N764K, D796Y, N856K, Q954H, N969K, L981F, or any combination thereof.
In some embodiments, the SARS-CoV-2 variant is classified and/or otherwise identified as a Variant of Concern (VOC) by the World Health Organization and/or the U.S. Centers for Disease Control. A VOC is a variant for which there is evidence of an increase in transmissibility, more severe disease (e.g., increased hospitalizations or deaths), significant reduction in neutralization by antibodies generated during previous infection or vaccination, reduced effectiveness of treatments or vaccines, or diagnostic detection failures.
In some embodiments, the SARS-CoV-2 variant is classified and/or otherwise identified as a Variant of High Consequence (VHC) by the World Health Organization and/or the U.S. Centers for Disease Control. A variant of high consequence has clear evidence that prevention measures or medical countermeasures (MCMs) have significantly reduced effectiveness relative to previously circulating variants.
In some embodiments, the SARS-CoV-2 variant is classified and/or otherwise identified as a Variant of Interest (VOI) by the World Health Organization and/or the U.S. Centers for Disease Control. A VOI is a variant with specific genetic markers that have been associated with changes to receptor binding, reduced neutralization by antibodies generated against previous infection or vaccination, reduced efficacy of treatments, potential diagnostic impact, or predicted increase in transmissibility or disease severity.
In some embodiments, the SARS-CoV-2 variant is classified and/or is otherwise identified as a Variant of Note (VON). As used herein, VON refers to both “variants of concern” and “variants of note” as the two phrases are used and defined by Pangolin (cov-lineages.org) and provided in their “VOC reports” available at cov-lineages.org.
In some embodiments, the SARS-CoV-2 variant is a VOC. In some embodiments, the SARS-CoV-2 variant is or includes an Alpha variant (e.g., Pango lineage B.1.1.7); a Beta variant (e.g., Pango lineage B.1.351, B.1.351.1, B.1.351.2, and/or B.1.351.3); a Delta variant (e.g., Pango lineage B.1.617.2, AY.1, AY.2, AY.3 and/or AY.3.1); a Gamma variant (e.g., Pango lineage P.1, P.1.1, P.1.2, P.1.4, P.1.6, and/or P.1.7); an Omicron variant (e.g., Pango lineage B.1.1.529); or any combination thereof.
In some embodiments, the SARS-CoV-2 variant is a VOI. In some embodiments, the SARS-CoV-2 variant is or includes an Eta variant (e.g., Pango lineage B.1.525 (Spike protein substitutions A67V, 69del, 70del, 144del, E484K, D614G, Q677H, F888L)); an Iota variant (e.g., Pango lineage B.1.526 (Spike protein substitutions L5F, (D80G*), T95I, (Y144-*), (F157S*), D253G, (L452R*), (S477N*), E484K, D614G, A701V, (T859N*), (D950H*), (Q957R*))); a Kappa variant (e.g., Pango lineage B.1.617.1 (Spike protein substitutions (T95I), G142D, E154K, L452R, E484Q, D614G, P681R, Q1071H)); Pango lineage variant B.1.617.2 (Spike protein substitutions T19R, G142D, L452R, E484Q, D614G, P681R, D950N); a Lambda variant (e.g., Pango lineage C.37); or any combination thereof.
In some embodiments, the SARS-CoV-2 variant is a VON. In some embodiments, the SARS-CoV-2 variant is or includes Pango lineage variant P.1 (alias B.1.1.28.1, as described in Rambaut et al. 2020. Nat. Microbiol. 5:1403-1407) (spike protein substitutions: T20N, P26S, D138Y, R190S, K417T, E484K, N501Y, H655Y, T1027I); an Alpha variant (e.g., Pango lineage B.1.1.7); a Beta variant (e.g., Pango lineage B.1.351, B.1.351.1, B.1.351.2, and/or B.1.351.3); Pango lineage variant B.1.617.2 (Spike protein substitutions T19R, G142D, L452R, E484Q, D614G, P681R, D950N); an Eta variant (e.g., Pango lineage B.1.525); Pango lineage variant A.23.1 (as described in Bugembe et al. medRxiv. 2021. doi: https://doi.org/10.1101/2021.02.08.21251393) (spike protein substitutions: F157L, V367F, Q613H, P681R); or any combination thereof.
In certain embodiments, the pathogen sequenced is a pathogenic bacterium and may include: spirochetes; Spirilla; vibrios; gram-negative aerobic rods and cocci; enterics; pyogenic cocci; endospore-forming bacteria; actinomycetes and related bacteria; rickettsias and chlamydiae; and mycoplasmas, which are groups defined by some bacteriological criteria. A pathogenic bacterium may include: Escherichia coli, Salmonella enterica, Salmonella typhi, Shigella dysenteriae, Yersinia pestis, Pseudomonas aeruginosa, Vibrio cholerae, Bordetella pertussis, Haemophilus influenzae, Helicobacter pylori, Campylobacter jejuni, Neisseria gonorrhoeae, Neisseria meningitidis, Brucella abortus, Bacteroides fragilis, Staphylococcus aureus, Streptococcus pyogenes, Streptococcus pneumoniae, Bacillus anthracis, Bacillus cereus, Clostridium tetani, Clostridium perfringens, Clostridium botulinum, Clostridium difficile, Corynebacterium diphtheriae, Listeria monocytogenes, Mycobacterium tuberculosis, Mycobacterium leprae, Chlamydia trachomatis, Chlamydia pneumoniae, Mycoplasma pneumoniae, Rickettsia, Treponema pallidum, Borrelia burgdorferi, or a variant thereof (Todar, K. Textbook of Bacteriology (2020) Online).
In an exemplary embodiment, the pathogen sequenced is a pathogenic fungus and may include: Aspergillus; Blastomyces; Candida; Coccidioides; Cryptococcus; Fusarium; Microsporum; Epidermophyton; Trichophyton; Histoplasma; Rhizopus; Mucor; Rhizomucor; Syncephalastrum; Cunninghamella; Apophysomyces; Lichtheimia (formerly Absidia); Eumycetoma; Pneumocystis; Sporothrix; Paracoccidioides; Talaromyces; or a variant or species thereof (CDC).
In an exemplary embodiment, the pathogen sequenced is a pathogenic protozoan belonging to the group Sarcodina, Mastigophora, Ciliophora, or Sporozoa, which are defined by their mode of movement (CDC). In certain embodiments, the pathogenic protozoan may include: Entamoeba; Trichomonas; Leishmania; Chilomonas; Giardia; Isospora; Sarcocystis; Nosema; Balantidium; Eimeria; Histomonas; Trypanosoma; Plasmodium; Babesia; or Haemoproteus or a variant or species thereof.
Further embodiments are illustrated in the following Examples which are given for illustrative purposes only and are not intended to limit the scope of the invention.
Here Applicants designed, optimized, and implemented a novel sample identification method using synthetic DNA spike-ins (SDSIs) that is broadly compatible with SARS-CoV-2 sequencing approaches and settings. Applicants implemented these SDSIs for Illumina sequencing with SARS-CoV-2 specific amplification using the ARTIC consortium's amplicon designs. To maximize epidemiological utility by increasing the number of genomes recovered from samples with low viral loads, Applicants benchmarked key amplification and library construction steps. Applicants propose a modified protocol, hereafter termed SDSI+ARTIC, that provides increased confidence in the veracity of genomes with minimal extra cost and time and that can be applied to investigations of SARS-CoV-2 epidemiology and emerging viral variants.
Applicants sought to design a robust system for contamination tracing and sample tracking applicable to a wide variety of viral sequencing strategies via known synthetic DNA sequences. Applicants envisioned that these novel synthetic DNA spike-ins (SDSIs) would consist of a uniquely identifiable sequence such that each sample in a sequencing batch could be paired with a different SDSI, enabling in-sample labeling. SDSIs should be sufficiently distinct from one another as well as from common laboratory or human pathogens to ensure reliable identification. Each unique sequence is then flanked by constant priming regions so that a single additional primer set can be integrated into a multiplexed PCR to co-amplify the SDSI with the sample.
Excerpting DNA sequences from diverse, exotic archaea genomes to serve as the unique portion of the SDSI precludes false detection and cross-identification. To balance common sequencing library construction constraints, DNA synthesis costs, and providing enough sequence to be uniquely identifiable, Applicants generated SDSIs with a 140 bp stretch of variable sequence. Applicants confirmed that the various SDSIs were significantly different from each other to mitigate cross-identification; among all SDSIs, the minimum pairwise Hamming distance across the 140 bp stretch of unique sequence was 84 (mean=105; max=121). Since false detection of an SDSI would occur if its sequence shared significant homology with other genetic material in a sample, Applicants based these sequences on archaea, which are divergent from organisms found in typical laboratory or clinical settings (Table 2). A permissive search performed against the entire NCBI database confirmed that 44/48 SDSI sequences had significant homology (>75% sequence identity over >75% query cover) exclusively within the domain archaea; the remaining SDSIs had homology to a handful of bacterial genera unlikely to be found in laboratories (Table 2). In considering the application of these SDSIs to ARTIC SARS-CoV-2 amplicon sequencing, Applicants also specifically verified that each unique SDSI sequence was unlikely to be confused with expected COVID-19 clinical sample content, confirming that each sequence had very limited homology (nothing >50% sequence identity over >50% query cover) to both Homo sapiens and SARS-CoV-2. In designing these amplicon sequences, Applicants also avoided extremes of GC content (range: 35-65%) in order to promote similar amplification rates across different SDSIs, as well as other potential targets of the multiplexed reaction, such as viral amplicons. Applicants specifically ensured that the SDSIs had similar GC content to ARTIC SARS-CoV-2 amplicons.
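The pairwise-distance and GC-content checks described above can be illustrated with the following Python sketch. It is not the Applicants' pipeline; the function names are hypothetical, and the short sequences in the usage lines are toy inputs rather than sequences from Table 1.

```python
from itertools import combinations
from statistics import mean
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming_summary(cores: List[str]) -> Tuple[int, float, int]:
    """Return (min, mean, max) Hamming distance over all pairs of cores."""
    dists = [hamming(a, b) for a, b in combinations(cores, 2)]
    return min(dists), mean(dists), max(dists)


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def gc_within_range(seq: str, low: float = 0.35, high: float = 0.65) -> bool:
    """Check the 35-65% GC window used when designing the SDSI cores."""
    return low <= gc_fraction(seq) <= high


# Toy 8-nt sequences standing in for 140 bp SDSI cores
toy_cores = ["ACGTACGT", "ACGTTGCA", "TTTTACGT"]
print(pairwise_hamming_summary(toy_cores))
print([gc_within_range(c) for c in toy_cores])
```

Applied to the actual 140 bp cores, a summary of this kind would be the quantity reported above as a minimum of 84, a mean of 105, and a maximum of 121.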
Similarly, the design of common primers for SDSI amplicons enabled compatibility with a broad spectrum of amplicon-based sequencing reactions, including in clinical settings. To preclude off-target priming in the PCR reaction that could outcompete amplification of a primary target, Applicants limited SDSI primer homology to common organisms, particularly on the 3′ end of the primer. Applicants specifically confirmed that primers were unlikely to amplify human or SARS-CoV-2 to promote SDSI primer integration into the ARTIC SARS-CoV-2 amplicon sequencing PCR reaction. Primers were compatible with ARTIC v3 primer sets, with a similar length (24 bps each) and GC content (45.8% each) (
Applicants demonstrated that the addition of SDSIs into the ARTIC multiplexed PCR provided a sample-specific internal control and did not detrimentally affect the amplification of SARS-CoV-2 RNA. SDSI primers did not produce any nonspecific amplification, including in the presence of NP swab RNA, supporting the expectation that primers shared limited homology with genomic material from clinical samples (
Applicants performed SDSI+ARTIC sequencing on a batch of 48 SARS-CoV-2-positive clinical samples to demonstrate its feasibility and utility in tracking samples and identifying contamination. After adding a different SDSI to each sample, Applicants found that 47/48 SDSIs were identified exclusively in the anticipated sample, validating the use of SDSIs as an internal control for sample tracking. One SDSI (SDSI 48) was detected in the sample that it was added to as well as a neighboring sample in the batch (
As shorter amplicons have been purported to yield superior recovery for low viral load samples (Antonov et al., 2005; No et al., 2019), Applicants explored extending SDSIs to the Paragon Genomics CleanPlex SARS-CoV-2 panel, but identified fatal shortcomings. Paragon amplicons are on average half the size of ARTIC amplicons (149 bp vs 343 bp) and are compatible with the 140 bp SDSI length (SARS-CoV-2 COVID-19 Coronavirus Research and Surveillance, n.d.). However, the Paragon panel had dropout regions even in low CT samples, which resulted in missed SNP calls compared to ARTIC across 5 samples (CTs=20-37), consistent with other reports (
Applicants benchmarked various alterations to Illumina-based SDSI+ARTIC sequencing in order to maximize the number of complete, high-quality genomes recovered from clinically diverse samples. Higher CT samples prove especially challenging to sequence, but their recovery is still of critical importance to epidemiological and clinical applications of viral genomics. Applicants found that substituting a more processive reverse transcriptase provided the single biggest benefit. Comparing cDNA produced with SuperScript III, IV, or IV-VILO across a range of clinical CTs (low CT: <20, mid-low CT: 20-25, mid-high CT: 25-30, and high CT: >30), SSIV-VILO and SSIV produced the highest number of amplicons with at least 10× coverage across 13 samples (SSIII: 72.64%, SSIV: 81.93%, SSIV-VILO: 86.97%) (
Applicants also attempted protocol modifications to increase sequence depth uniformity in SDSI+ARTIC, which is crucial for recovering complete genomes in the fewest number of reads. When Applicants increased (2×) primer concentrations (20.8 nM final) for low efficiency amplicons, Applicants observed increased coverage in these amplicons that enabled whole genome recovery for multiple samples, especially those with higher CTs (
Applicants reduced the potential for highly amplified library contamination within the laboratory or clinical setting by scaling down (0.5×) the Illumina DNA Flex library construction kit, which also reduced per sample cost without impacting performance (Table 5; Table 6). In benchmarking library construction methods, Applicants confirmed Nextera DNA Flex generated greater coverage depth than DNA XT (
Highlighting the reliability and robustness of this approach, Applicants observed high sequence correlation and superior genome recovery with SDSI+ARTIC compared to an unbiased metagenomics approach, the gold standard in generating error-free viral genomes. Applicants sequenced a small batch of six samples (CTs=16-31) using ARTIC without SDSIs, and generated full length genomes with 100% concordance to those generated with metagenomic sequencing, indicating the accuracy of ARTIC-based sequencing methods (Lemieux et al., 2021). Applicants then resequenced 89 unique patient samples with SDSI+ARTIC that were previously sequenced using the same standard metagenomics approach (Lemieux et al., 2021) to serve as a direct comparison. The 89 samples in the validation batch consisted of diverse viral lineages and a broad range of CTs (range=11.9-37.4; mean=27.4) (
SDSI+ARTIC displayed high concordance in sequence variant identification to metagenomics, producing only two divergent SNP calls out of 331 total SNPs across 38 genomes (
SDSI+ARTIC is a powerful method for public health interventions, especially as superspreading events, and clusters of cases linked to close contact settings more broadly, have become a defining feature of the SARS-CoV-2 pandemic (Adam et al., 2020; Dearlove et al., 2020; Lemieux et al., 2021; Wong & Collins, 2020). Viral genomes can reveal whether these clusters are linked through transmission, based on shared viral sequences, providing useful information for public health interventions. Outbreaks in which a single case leads to many are distinguishable by their low viral sequence variation, but their investigation requires higher levels of confidence to ensure that such a pattern has not arisen from laboratory contamination. To demonstrate the utility of the novel SDSIs and modified protocol, Applicants applied the method to investigate a putative cluster of 14 SARS-CoV-2 cases from Massachusetts General Hospital (MGH), for which the infection control unit had suspicion of a nosocomial outbreak. Applicants sequenced 24 samples: 14 samples believed to be part of the cluster based on traditional contact-tracing, 8 unlinked samples, and 2 negative controls.
The SDSI+ARTIC method enabled fast and confident identification of a nosocomial cluster, with samples processed within 24 hours and final genomes assembled within 52 hours of bio-sample receipt. Applicants assembled 14 complete genomes (>98% complete) of which 9 were from cluster-associated samples. Those samples that did not yield a full genome were those with lower viral loads (CT>30). Phylogenetic analysis showed that samples from the cluster were genetically highly similar and clustered together (
As the SARS-CoV-2 pandemic intensifies and new genomic variants continue to emerge, it is imperative to build robust experimental confidence into genomic surveillance data interpretation. Here, Applicants report a novel design and implementation of Synthetic DNA Spike-ins (SDSI) as an essential component for tracking and tracing contamination, a potential confounder in amplicon-based sequencing methods of SARS-CoV-2. The in-silico design generated robust synthetic targets at low costs while mitigating inter-spike-in sequence homology as well as homology with human, SARS-CoV-2, and common laboratory reagents. While broadly applicable to most amplicon-based approaches, as a proof-of-principle Applicants coupled the SDSIs to an improved ARTIC amplicon sequencing protocol yielding faster throughput with an overall reduced cost compared to existing Illumina DNA Flex-based protocols.
SDSIs can readily be adopted by laboratories and platforms of all sizes with only minor changes to existing methodologies, little additional cost per sample ($0.006), and no interruption to standard workflow methodologies. Additional synthetic targets could be designed using the same principles to expand into 384-well formats and beyond. Primer sites could also be modulated for integration with new advancements in amplicon sequencing, like tailed primer approaches (Gohl et al., 2020). More broadly, standardizing controls across the viral surveillance community will increase accuracy and integrity of SARS-CoV-2 genomic data worldwide. These SDSIs not only enable profiling of in-batch contamination, but also laboratory-wide detection, as their presence in other data (amplicon, metagenomic, qPCR, or otherwise) would indicate a tagged amplification and thus contamination. Moreover, the approach is applicable to both Illumina and Nanopore sequencing platforms as well as any other existing or future tiled amplicon panel, such as those previously used for Zika, Ebola, and other recent outbreaks (Quick et al., 2016; Metsky et al., 2017). SDSIs could serve as a broad tool for tracing potential contamination across a plethora of fields that employ amplicon-based genomic sequencing, such as food safety, species identification, or environmental sampling.
In optimizing the SDSI+ARTIC protocol, Applicants tested and incorporated a number of cost- and time-saving adjustments. Modifications that can be used include implementing liquid handlers in high volume settings such as public health laboratories. Additional methodological improvements could allow for direct PCR amplification of SARS-CoV-2 using primers with indexing adapter compatible ends (Baker et al., 2020; Gohl et al., 2020) or the inclusion of unique molecular identifiers to understand intra-host variation. The SDSIs were designed to be compatible with such potential future approaches. Applicants note that there is still considerable non-uniformity in per-amplicon coverage for samples with low viral loads, highlighting the need for methods that can confidently capture this information. A recent update to the ARTIC protocol for nanopore suggests that a change in the annealing temperature from 65° C. to 63° C. can reduce dropout of amplicon 64 (Tyson et al., 2020), a particularly poorly performing amplicon. The results show that 2× primer concentration for a subset of underperforming amplicons improved performance, and matching primer concentrations with amplicon efficiency would likely yield more uniform coverage (Table 4). Alternative approaches for the recovery of genomes from samples with low viral load, such as targeted enrichment (Houldcroft et al., 2017; Metsky et al., 2019), are more costly and time-consuming.
Amplicon-based sequencing methods fill a critical need for rapid turnaround and full genome recovery for epidemiological surveillance where SNP identification is crucial. While benchmarking the modified protocol against the gold standard approach of metagenomics, Applicants observed that discordant SNPs were rare (2/331) but not absent. This emphasizes the need for caution and replication of libraries for highly important samples. Other commercial amplicon-based designs, such as those by Paragon Genomics, offer significantly faster workflows and use smaller amplicons, but the ARTIC primer set results in better overall coverage for the majority of samples (up to CT=35) and better genome accuracy. Applicants believe subsequent generations of amplicon-based sequencing will address this pressing need, pushing cost down while increasing genomic surveillance accuracy, which is sorely needed in the public health setting. The rapid deployment of SDSI+ARTIC in confirming a nosocomial infection cluster further emphasizes the utility of the SDSIs to confidently identify samples of high genetic similarity.
The potential emergence of SARS-CoV-2 immune and vaccine escape variants underscores the ongoing necessity of accurate, reliable, and accessible genome sequencing. The modifications and suggestions build upon a remarkable global genomic surveillance response that has developed new tools for the rapid sequencing of viral genomes at an unprecedented rate. In light of the latest surges in SARS-CoV-2 cases globally and the emergence of more transmissible lineages and variants of concern that are rising in frequency in multiple continents, continual innovation in these protocols to improve their efficiency, cost-effectiveness, and reliability is essential to meet the growing need for genomic surveillance of SARS-CoV-2. Moreover, stringent sample tracking and contamination detection strategies must become a standard practice, maximizing the utility of genomic data and its increasing importance for shaping public health interventions.
Applicants designed a simple and flexible system for sample tracking and contamination tracing using a core uniquely identifiable DNA sequence flanked by constant priming regions that satisfy several design requirements. This design allows in-sample tracking through the addition of a different SDSI to each sample during sample processing. Following sequencing, the data can be analyzed for both the presence of the expected SDSI and any other SDSI, illuminating both sample misassignment and contamination with high resolution and accuracy (
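The post-sequencing check described above, looking for the expected SDSI and for any other SDSI in each sample, can be sketched as follows. This is an illustrative outline only: the function name, the status labels, and the read-count and read-fraction cutoffs are assumptions introduced for the example and are not thresholds stated in this disclosure.

```python
def classify_sample(sdsi_counts, expected, min_reads=10, min_fraction=0.01):
    """Classify one sample from its per-SDSI read counts.

    sdsi_counts: dict mapping SDSI name to the number of reads assigned to it.
    expected: name of the SDSI that was added to this sample.
    min_reads, min_fraction: hypothetical cutoffs for calling an SDSI present.
    """
    total = sum(sdsi_counts.values())
    if total == 0:
        return {"status": "no_sdsi", "unexpected": []}

    expected_found = sdsi_counts.get(expected, 0) >= min_reads
    # Any other SDSI above both cutoffs is reported as unexpected.
    unexpected = sorted(
        name for name, count in sdsi_counts.items()
        if name != expected
        and count >= min_reads
        and count / total >= min_fraction
    )

    if expected_found and not unexpected:
        status = "ok"
    elif expected_found:
        status = "contaminated"
    elif unexpected:
        status = "possible_swap"
    else:
        status = "no_sdsi"
    return {"status": status, "unexpected": unexpected}


# Hypothetical counts: the expected spike-in plus a minor second spike-in.
print(classify_sample({"SDSI_48": 950, "SDSI_47": 50}, expected="SDSI_48"))
```

Requiring both an absolute read count and a fraction of SDSI reads is one way to avoid flagging a handful of stray reads in deeply sequenced samples; suitable cutoffs would depend on sequencing depth.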
Applicants selected a pair of primers and corresponding priming regions on each SDSI that are highly specific and show reliable amplification across SDSIs and under standard PCR conditions. Using Primer-BLAST, Applicants predicted that these sequences had limited homology to common organisms and thus were unlikely to amplify nonspecific templates that could outcompete amplification of a primary target. Experimentally Applicants confirmed that the SDSI primers did not produce any nonspecific amplification, including in the presence of cDNA from a nasopharyngeal (NP) swab sample (
Applicants determined that the addition of SDSIs into the ARTIC multiplexed PCR did not detrimentally affect or otherwise alter the amplification of SARS-CoV-2 cDNA from clinical samples. First, to prevent SDSIs from overtaking the amplification and sequencing of SARS-CoV-2 amplicons, Applicants optimized the amount of SDSI added to each reaction through limited titration. Using a randomly selected SDSI (SDSI 49), Applicants found that the highest concentration tested, 600 copies/μL, resulted in reliable SDSI detection with >96% of reads still mapping to SARS-CoV-2 and no apparent alteration in coverage across the genome (
As extensive PCR can result in the propagation of numerous types of errors, such as DNA polymerase base substitution errors, PCR recombination events, template switching, and thermocycling induced DNA damage, Applicants further compared SARS-CoV-2 genome concordance between the SDSI+AmpSeq method and unbiased, metagenomic sequencing9,10,20. Applicants performed SDSI+AmpSeq on a batch of 89 unique patient samples previously sequenced with unbiased metagenomics21. The samples consisted of diverse viral lineages and a broad range of viral loads (CT range=11.9-37.4; mean=27.4) with the more sensitive amplicon sequencing method generating more complete genomes at higher CTs (
Applicants explored a number of other technical modifications to the ARTIC amplicon sequencing protocol in order to improve genome recovery, limit contamination points, and enhance reproducibility of the SDSI approach. Foremost, increasing cDNA length by use of more processive reverse transcriptases improves amplicon coverage (
The SDSI+AmpSeq method is compatible with a range of viral CTs, SARS-CoV-2 lineages, patient sample origins, and implementing laboratories, demonstrating that this is a robust and flexible approach that can be readily implemented for surveillance. A half plate of SDSIs was used at two large-scale sequencing facilities, the Broad Institute and Jackson Laboratories (JAX), for SDSI+AmpSeq SARS-CoV-2 surveillance across a total of 6,741 clinical samples and controls (JAX: N=3,838; Broad: N=2,903). Individual batches typically consisted of 92 clinical samples with 4 designated water controls. Clinical samples were largely from Maine, Massachusetts, and Rhode Island from December 2020 to July 2021 and covered a wide range of viral CT values (CT 8.4-39.9) and pango lineages (77 total lineages) (
The SDSI+AmpSeq is a tractable and easily-implemented method for genome quality control when applied to high-throughput processing of clinical samples. Across thousands of clinical samples, the SDSIs performed consistently and reliably (
SDSIs enable detection of sample swaps and contamination events that occur in large scale batch processing which may otherwise go undetected. In a controlled experiment, Applicants demonstrated that the SDSI+AmpSeq approach provides a feasible method to accurately detect contamination. Applicants mixed two SDSIs at various ratios prior to the ARTIC PCR and found that those SDSI ratios were reflected in the sequencing output (
SDSI+AmpSeq also enables fine-resolution insight into sample processing errors with high specificity. In one example, SDSI counts indicated columns were unintentionally mixed together (
To demonstrate the application of the SDSIs for confident interpretation of sequencing data, Applicants used SDSI+AmpSeq to investigate a putative SARS-CoV-2 cluster from Massachusetts General Hospital (MGH) for which the Infection Control Unit suspected nosocomial transmission, a context in which both sample swaps and contamination could easily undermine findings. Applicants sequenced 22 samples with SDSI+AmpSeq (14 samples suspected to be part of the cluster based on epidemiological contact-tracing and 8 unlinked samples as controls) within 24 hours, and final genomes were assembled within 52 hours of biosample receipt. Of the 11 samples suspected to be part of the cluster from which Applicants assembled genomes, 10 were genetically highly similar (0-1 consensus nucleotide difference) (
To further increase the confidence in AmpSeq methods for viral genomics, Applicants sought to capture contamination and sample swaps that might occur before the cDNA stage. Applicants explored the feasibility of modifying the SDSI approach to enable synthetic RNA spike-ins (SRSI) from the same constructs, which could be added to clinical sample RNA to provide end-to-end quality control. For a subset of SDSIs, Applicants included a T7 promoter site to enable in-vitro production of these constructs as RNAs. For two clinical samples representing low (20) and mid (26) CTs, Applicants detected reads from the RNA spike-ins added directly to extracted viral RNA as a proof of principle (
Amplicon-based sequencing methods crucially empower rapid, full genome recovery for emerging SARS-CoV-2 variant surveillance; however, robust tools are needed to ensure accuracy in genomic data. SDSI+AmpSeq is a reliable technique for detecting key modes of contamination, addressing this critical gap in standard controls and practices. SDSIs do not compromise genome quality, have been successfully deployed in thousands of clinical samples, and are in use across multiple laboratories with differing protocols. These SDSIs revealed numerous instances of sample swaps and contamination, many of which would go unnoticed with standard batch-level controls. SDSIs further provide critical confidence in the interpretation of clusters of identical genomes, a renewed challenge in the surveillance of more transmissible variants. The common primer design of the SDSI approach enables them to be readily applied to multiple short amplicon designs and sequencing strategies, adding only minor changes to existing protocols and minimal additional cost.
SDSIs overcome multiple modes of error in the production of amplicon-based genomic sequencing data and are a critical component of quality control measures. The approach is most effective when adopted fully within a laboratory setting, and thus Applicants propose routine use of the SDSI+AmpSeq method to flag laboratory-wide contamination. Applicants have implemented SDSIs across diverse approaches and provide an extensively tested protocol with ARTIC v3 and Illumina-based tagmentation. The method can also be applied to other sequencing pipelines, though this potentially requires further optimization. The pathogen-exclusion design criteria allow the 96 validated SDSIs to be immediately incorporated into other tiled amplicon panels, such as existing ones for Zika, Ebola, and other viruses of epidemic potential26,27.
The SDSI-labeling paradigm is broadly applicable to many amplicon-based needs: amenable to a variety of technical enhancements, flexible to remaining error modes, and expandable to additional targets. One could apply the same design parameters to expand the set of SDSIs, such as to 384 well formats. Additionally, uniquely permuted sets of any size could be created for specific sample batches. To design larger panels of SDSIs, Applicants could use artificial core sequences, rather than excerpting from archaea. Primer sites could also be easily adapted for integration with new advancements in amplicon sequencing, like tailed primer approaches or new primer schemes38-32. In its current implementation, the SDSIs detect contamination or workflow errors that occur during and after amplification, but not issues arising at the RNA or cDNA generation stage, and act qualitatively, rather than quantitatively. Further refinement of the RNA spike-in approach could address other modes of contamination, enabling end-to-end sample tracking at scale. Future work improving quantification and SDSI analysis pipelines may enable them to serve as within sample controls, since samples or batches with outlier SDSI read counts may reveal missing or defective PCR components, incomplete mixing, thermocycling issues, or other types of experimental error.
The integration of SDSIs can mitigate a critical vulnerability of amplicon-based sequencing while preserving the many advantages, increasing the robustness of its use across laboratory and clinical settings. Adoption of controls across the viral surveillance community would increase accuracy and integrity of genomic data worldwide. Looking forward, SDSIs could serve as a crucial component in improving data integrity in amplicon based genomic sequencing beyond infectious disease surveillance, such as food safety, species identification and environmental sampling.
Applicants designed synthetic DNA fragments that each contained a 140 bp unique sequence and constant priming regions. Core SDSI sequence homology to sequences from various organisms was predicted by a permissive BLAST search (blastn; 5000 max targets; E=10; word size=11; no mask for low complexity). Applicants considered homologies identified with this BLASTn search that were additionally >50 bps (>35% query cover) and >90% sequence identity to be significant homologies. For all 96 selected SDSIs, there were no such significant homologies when results were filtered to all Homo sapiens (taxid:9606) or viral (taxid:10239) sequences in the NCBI database. For significant homologies to bacterial or eukaryotic sequences in the NCBI database (excluding archaea: taxid:2157), Applicants report both the SDSI and the genus it mapped to in each case (
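The significance filter described above (alignment length >50 bp and sequence identity >90%) can be sketched as a simple pass over BLAST hits. The sketch assumes the hits were exported in the standard BLAST tabular layout, whose first four columns are query ID, subject ID, percent identity, and alignment length; the function name and the example rows are hypothetical and do not represent actual search results.

```python
def significant_hits(tabular_text, min_length=50, min_identity=90.0):
    """Return hits longer than min_length bp with identity above min_identity %.

    tabular_text: tab-separated BLAST hits, one per line, with columns
    qseqid, sseqid, pident, length, followed by any further columns.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        if length > min_length and pident > min_identity:
            hits.append((qseqid, sseqid, pident, length))
    return hits


# Made-up rows for illustration; subject IDs are placeholders.
example = "\n".join([
    "SDSI_x\tsubject_A\t95.0\t60\t3\t0\t1\t60\t100\t159\t1e-20\t100",
    "SDSI_x\tsubject_B\t98.0\t30\t1\t0\t5\t34\t200\t229\t1e-05\t50",
    "SDSI_y\tsubject_C\t85.0\t120\t18\t0\t1\t120\t10\t129\t1e-30\t130",
])
for hit in significant_hits(example):
    print(hit)
```

Restricting the result to a taxonomic group (for example human or viral sequences) would be done either in the search itself or by an additional filter on the subject taxonomy, which is not shown here.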
Applicants confirmed that SDSI primers and amplicons were predicted to amplify specifically and consistently with ARTIC v3 amplicons. Applicants used Primer-BLAST to predict 50-5000 bp amplicons produced on templates in the entire nr database; no amplicons were identified. Applicants calculated the length and GC content of SDSI primers and full SDSI amplicon sequences and ARTIC v3 primers and amplicons using Geneious Prime (2019.2.1) and compared their distributions (
Applicants sought to validate in silico predictions for the performance of the SDSI primers and amplicons. Applicants ordered primers (IDT) (oligo sequences in Supplementary Data File 1) and performed qPCR using the Q5 Hotstart 2× Mastermix, with 500 nM SDSI primers and 0.17×SYBR Gold (ThermoFisher #S11494), and without ARTIC primer pools. Applicants performed this assay in triplicate in 10 μL reactions on a QuantStudio 6 with the following cycling conditions: 95° C. for 30 seconds, followed by 35 cycles of 95° C. for 15 seconds and 65° C. for 5 minutes. Applicants tested 4 conditions: (1) 0.5 μL of an SDSI gene block (IDT) (1 pM), (2) 0.5 μL of an SDSI gene block+0.5 μL of cDNA from an NP swab, (3) 0.5 μL of cDNA from an NP swab, and (4) no template to detect any nonspecific amplification of the primers (
Applicants ordered unique oligos as TruGrade ultramers (IDT), then resuspended and stored them at 10 μM in water (oligo sequences in Table 1). Further characterization for identification of 96 SDSIs was achieved by direct PCR amplification with primers containing the constant SDSI handle and an Illumina P5/P7 adapter followed by sequencing with a MiSeq Nano 2×150 bp kit (Illumina #MS-102-2002). SDSI reads were quantified by mapping each SDSI against other SDSIs with the align_and_count_multiple_report wdl implemented in Terra, as described below, and purity and sequence fidelity of SDSIs were assessed by calculating the percentage of reads mapping to each SDSI out of total SDSI reads (
Research was conducted at the Broad Institute with an exempt determination from the Broad Office of Research Subjects Protections and with approval from the MIT Institutional Review Board under protocol #1612793224. Samples were obtained from Massachusetts General Hospital (MGH), Massachusetts Department of Public Health, the Rhode Island Department of Public Health and the Broad Institute Clinical Research Sequencing Platform. Samples from Massachusetts General Hospital (MGH) fall under Partners Institutional Review Board under protocol #2019P003305. Samples were secondary-use or residual clinical and diagnostic specimens (referred to collectively throughout as clinical samples), obtained by researchers under a waiver of consent. All samples were nasopharyngeal or anterior nares swabs in a stabilizing medium (e.g., MTM or VTM). These unique biological materials are not available to other researchers as they are human patient samples from clinical excess material and thus are of limited volume. Samples sequenced at Jackson Laboratories (JAX) were approved under protocol 2020-NHSR-019-BH.
Viral cycle threshold (CT) values for all samples sequenced at the Broad Institute were obtained using the CDC RT-qPCR assay with the N1 probe as previously described21. Viral CTs for samples sequenced at JAX were obtained from various providers and thus the RT-qPCR assays used are variable.
CT normalization was performed by first setting a desired mock viral CT and calculating the difference between this desired mock viral CT and the measured viral CT of a given sample, rounding to the nearest whole number. Applicants next calculated the fold dilution required to reach the mock viral CT (one doubling per CT, assuming 100% PCR efficiency) and multiplied this by the volume of cDNA input to be used for the normalization to obtain the final diluted volume. The volume of water used to dilute the cDNA was this final volume minus the volume of cDNA input. An example calculation is illustrated below:
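A minimal sketch of this calculation is given here, assuming one two-fold dilution per CT unit (100% PCR efficiency) as stated above. The function name is hypothetical, and the handling of samples whose measured CT is already at or above the mock CT (no dilution) is an assumption made for the example.

```python
import math


def ct_normalization(measured_ct, mock_ct, cdna_volume_ul):
    """Return (fold_dilution, water_volume_ul) to dilute cDNA to a mock CT.

    The CT difference is rounded to the nearest whole number (halves round up).
    """
    delta_ct = math.floor((mock_ct - measured_ct) + 0.5)
    if delta_ct <= 0:
        # Assumption: samples at or above the target CT are left undiluted.
        return 1, 0.0
    fold_dilution = 2 ** delta_ct          # one doubling per CT unit
    final_volume_ul = fold_dilution * cdna_volume_ul
    water_volume_ul = final_volume_ul - cdna_volume_ul
    return fold_dilution, water_volume_ul


# A sample measured at CT 20, normalized to mock CT 25 using 1 uL of cDNA:
# 5 doublings give a 32-fold dilution, i.e. 1 uL cDNA plus 31 uL water.
fold, water = ct_normalization(measured_ct=20, mock_ct=25, cdna_volume_ul=1.0)
print(fold, water)
```

The 32-fold result matches the dilution described elsewhere in this disclosure for a sample moved from CT 20 to a mock CT of 25.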
This CT normalization was done for certain method development samples which are described throughout the manuscript as being “mock diluted” or “normalized to CT X”. The nosocomial cluster was normalized to CT 27. The majority of batch data generated at the Broad Institute underwent CT normalization to CT 25. Batch data from JAX did not undergo CT normalization. CT normalization of the cDNA prior to the ARTIC PCR should reduce the potential for generating excessively large libraries from very high viral load samples, keep the percentage of SDSI reads in a detectable range (
cDNA Generation and ARTIC Amplification Optimization
Applicants tested reverse transcriptase enzymes using extracted RNA from four SARS-CoV-2 positive clinical samples (CTs=13.9, 23.9, 29.6, 33.6) (
Applicants tested PCR enzyme efficiency using extracted RNA from SARS-CoV-2 positive clinical samples followed by cDNA generation using SuperScript IV and diluted the resulting cDNA to a mock CT value of 35 for standardization across all PCR enzyme tests. Applicants set up the standard ARTIC PCR pool #1 and pool #2 using an input of 2.5 μL, altering only the PCR enzyme and corresponding buffer. Applicants tested NEB Q5 Hot Start High-fidelity 2× Master Mix (Q5 2× MM) (NEB #M0494L), NEB Q5 Hot Start High-fidelity 2× Master Mix plus 0.01% SDS, NEB Q5 Ultra II Master Mix (NEB #M0544L), KAPA HiFi HotStart (Roche #KK2601), and KOD Hot Start DNA polymerase (Sigma-Aldrich #71842) (
Applicants optimized PCR cycling conditions on mock CT 35 cDNA (generated as described above) using standard ARTIC PCR primer conditions. Applicants performed a catch-up/rehybridization PCR under the following conditions: 98° C. for 30s, 95° C. for 15s then 65° C. for 5 min (10 cycles), 95° C. for 15s then 80° C. for 30s then 65° C. for 5 min (2 cycles), 95° C. for 15s then 65° C. for 5 min (8 cycles), 4° C. hold (
Applicants further optimized ARTIC PCR by modifying PCR cycle numbers. Extracted RNA from six SARS-CoV-2 positive clinical samples ranging from CT 27-37 was converted to cDNA with SuperScript IV and amplified under standard ARTIC PCR reaction components (with Q5 2× MM), varying the final number of PCR cycles among 35, 40, and 45 (
Applicants used mock CT 35 cDNA to test the effect of decreased ramp speed on genome recovery and coverage. ARTIC PCR conditions for this experiment were 98° C. for 30 seconds, followed by 40 cycles of 95° C. for 15 seconds and 65° C. for 5 minutes with a cooling and heating ramping speed of 3° C./s. Applicants tested a slow ramp PCR protocol with the ramp speed reduced to 1.5° C./s (
Under standard ARTIC protocol conditions, Applicants ordered lyophilized ARTIC v3 primers from IDT and resuspended in water at 100 μM each. Pool #1 primers consisted of all odd numbered amplicons whereas pool #2 primers consisted of all even numbered amplicons. To generate the 100 μM pool #1 primer stock, Applicants combined 5 μL of each 100 μM pool #1 primer, and repeated this protocol for the even numbered primers to give a 100 μM pool #2 primer stock. Applicants selected a total of 20 amplicons as regions of low coverage from previous sequencing data (Table 4). Low coverage amplicons were present in both pools, with 11 coming from pool #1 and 9 coming from pool #2. For the primer 2× pools, Applicants spiked in primers for the corresponding amplicons at 2× the concentration (20.8 nM final) of the other primers in the pool. For these low coverage primers, Applicants used 10 μL of the 100 μM stock rather than 5 μL. Applicants diluted both the original and 2× primer pools 1:10 in nuclease free water to generate a 10 μM working stock. Applicants then selected 8 samples with varying CT values to determine if selectively increasing primer concentrations reduced amplicon dropout (
The CT normalization experiment was performed by taking four individual clinical samples (CT=18-25) with four randomly selected SDSIs and either not normalizing the cDNA or normalizing to CT 25, 26, or 27 prior to the ARTIC PCR (
Applicants performed a head-to-head comparison of standard Illumina Nextera DNA Flex and Nextera XT (Illumina #FC-131-1096) library construction kits (
Applicants optimized Illumina DNA Flex library construction (Illumina #20018705) with the goal of reducing normalization steps and cost and increasing throughput. Applicants scaled down (0.5×) Illumina DNA Flex throughout the standard Illumina sequencing protocol, also scaling down sample input for a total of 50 ng (25 ng from each primer pool). Due to the CT normalization step, Applicants removed the pre-DNA Flex DNA concentration and pooling step. Applicants used 1-2 μL of post ARTIC PCR amplicon as input into the scaled down DNA Flex library construction and performed post library construction quantification and pooling with more uniform library size and concentration, further reducing time and cost of pooling libraries for sequencing. This protocol was used for all method development experiments, the cluster investigation, and a portion of the batch data generated from both the Broad Institute and JAX.
To determine an optimal concentration for SDSIs in ARTIC SARS-CoV-2 sequencing, applicants diluted SDSI 49 to 0.6, 6, 60, and 600 copies/μL (1, 0.1, 0.01, and 0.001fM); 1 μL of SDSI 49 was added to 5 μL of cDNA, to be split to 2×3 μL for each ARTIC pool (
Full protocol details can be found here: benchling.com/s/prt-R95g0tCxKOeCAqn8lAk3 (
The batch data from the Broad Institute was generated using SDSI+AmpSeq with minor modifications (
The correlation between the GC percent of each SDSI and the percent of SDSI reads over total reads was assessed for SDSIs 2-48 using the samples sequenced at the Broad Institute (N=2,903) (
Data generated at Jackson Laboratory (JAX) used two different protocols publicly available here: github.com/tewhey-lab/SARS-CoV-2-Consensus (
Of note, the SDSIs (used at the lowest recommended concentration of 600 copies/μL) were reliably detected in the samples sequenced at JAX. This reliable detection, however, also depends on the sequencing depth used by the institution.
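The copy-number and molar concentrations quoted in these methods (for example, 600 copies/μL and 1 fM) are related through Avogadro's number. The following sketch shows the conversion; the function names are hypothetical, and the protocol's round figures (600 rather than about 602 copies/μL) are approximations of the exact values computed here.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def fm_to_copies_per_ul(conc_fm: float) -> float:
    """Convert a femtomolar concentration to copies per microliter."""
    mol_per_liter = conc_fm * 1e-15
    copies_per_liter = mol_per_liter * AVOGADRO
    return copies_per_liter / 1e6  # 1 L = 1e6 microliters


def copies_per_ul_to_fm(copies_per_ul: float) -> float:
    """Convert copies per microliter to a femtomolar concentration."""
    mol_per_liter = copies_per_ul * 1e6 / AVOGADRO
    return mol_per_liter / 1e-15


if __name__ == "__main__":
    for fm in (0.001, 0.01, 0.1, 1.0, 10.0):
        print(f"{fm} fM = {fm_to_copies_per_ul(fm):.1f} copies/uL")
```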
For +/−SDSI experiments testing impact on recovery of viral genomes, fourteen clinical samples spanning a range of CTs (CT=17.6-30) were selected (
Statistical analysis for the plus/minus SDSI experiment compared the mean coverage of each of the 98 amplicons across the full sample set using a two-tailed Mann-Whitney test, with multiple-comparison correction by the two-stage step-up method of Benjamini, Krieger, and Yekutieli and the FDR set to 5%. None of the 98 amplicons differed significantly (p-value >0.05) between the plus and minus SDSI groups. Samples were also separated into three CT bins (CT<27 (n=4), 27-29 (n=6), CT>30 (n=4)) and the test was repeated for each CT bin. This analysis likewise revealed no significant difference (p-value >0.05) in the mean coverage of any amplicon for any CT bin.
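For readers unfamiliar with the two-stage step-up correction, the sketch below implements the procedure as published by Benjamini, Krieger, and Yekutieli (2006) in plain Python. It is an illustration only and is not the statistical software used for the analysis; it assumes the per-amplicon p-values have already been computed elsewhere, and the function names are hypothetical.

```python
def bh_reject_count(sorted_p, q):
    """Number of hypotheses rejected by the linear (Benjamini-Hochberg)
    step-up rule at level q, given p-values sorted in ascending order."""
    m = len(sorted_p)
    k = 0
    for i, p in enumerate(sorted_p, start=1):
        if p <= q * i / m:
            k = i  # keep the largest rank that satisfies the bound
    return k


def two_stage_bky(p_values, q=0.05):
    """Return a list of booleans (True = rejected) in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    order = sorted(range(m), key=lambda i: p_values[i])
    sorted_p = [p_values[i] for i in order]

    # Stage 1: step-up at the reduced level q' = q / (1 + q).
    q1 = q / (1.0 + q)
    r1 = bh_reject_count(sorted_p, q1)
    if r1 == 0:
        n_reject = 0
    elif r1 == m:
        n_reject = m
    else:
        # Stage 2: estimate the number of true nulls and rerun the step-up.
        m0_hat = m - r1
        q2 = q1 * m / m0_hat
        n_reject = bh_reject_count(sorted_p, q2)

    reject = [False] * m
    for rank, idx in enumerate(order):
        reject[idx] = rank < n_reject
    return reject


if __name__ == "__main__":
    print(two_stage_bky([0.001, 0.5, 0.002, 0.9], q=0.05))
```

A result in which every entry is False corresponds to the outcome reported above, where no amplicon differed significantly between groups.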
The intentional contamination experiment used SDSI 87 and SDSI 94 (SDSI 87: SDSI 94). The SDSIs were mixed at five different proportions (100:0, 75:25, 50:50, 25:75, and 0:100) (
Applicants ordered SDSI oligos with minor modifications to enable in-vitro transcription of RNAs (including a T7 promoter upstream of the SDSI amplicon, as well as 17 bp of constant sequence within the primer region) (Twist Bioscience) (sequences in attached Sup Data File 1). For two SDSIs (SDSI 1 and SDSI 4), Applicants in-vitro transcribed RNA using a T7 transcription kit (NEB E2050), quantified it by RNA screen tape (Agilent 5067-5579 and 5067-5580), then diluted it in water to 10 fM (6,000 copies/μL), 1 fM (600 copies/μL), 100 aM (60 copies/μL), and 10 aM (6 copies/μL). Applicants added 1 μL of SRSI at each concentration directly to 5 μL of RNA from two patient samples with high and intermediate viral loads, respectively, and prepared sequencing libraries using the SDSI+AmpSeq protocol (without the SDSI addition step at the cDNA stage). For the sample with a high viral load, Applicants performed a dilution at the cDNA stage (diluting 32-fold for a mock CT of 25 rather than 20). Reads mapping to unique SDSI sequences and SARS-CoV-2 were quantified using the align_and_count_multiple_report and assemble_refbased WDLs, respectively, and % SDSI/total reads was reported (
Applicants analyzed sequencing data on the Terra platform (app.terra.bio) using viral-ngs 2.1.28 with workflows that are publicly available on the Dockstore Tool Repository Service (dockstore.org/organizations/BroadInstitute/collections/pgs). Samples were demultiplexed using the demux_plus workflow with a spike-in database file for the SDSIs. Applicants performed any separate analyses to quantify read counts, including those for SDSIs, with the align_and_count_multiple_report workflow with the relevant database. For most analyses involving direct comparisons between samples, Applicants performed downsampling to the lowest number of reads passing filter with the downsample workflow. Applicants performed assembly using the assemble_refbased workflow against the following reference FASTA: www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta. Applicants used iVar version 1.2.1 for primer trimming on all samples followed by assembly with minimap2 set to a minimum coverage of either 3, 10, or 20, skipping deduplication procedures. The computational pipeline for all samples sequenced at JAX is publicly available at the following: github.com/tewhey-lab/SARS-CoV-2-Consensus.
Samples from the batch data were subset in the following way for analysis. All samples in which an SDSI was detected were used for the analysis of the percent of reads for each SDSI out of the sum of all SDSI reads (JAX: N=3,838, Broad: N=2,903). Samples with known experimental contamination errors or where the dominant (>50%) SDSI was not the correct SDSI were removed. For the analysis of the percent of SDSI reads over the total of all sequenced reads (JAX: N=3,093, Broad: N=2,670), non-template controls (waters) and clinical samples with no detectable viral load (CT>40 or not detected via qPCR as described above) were removed from analysis.
Metagenomic sequencing data and genome assemblies used for the comparison of amplicon-based sequencing were prepared, sequenced, and analyzed as described previously,21 and the data are publicly available at NCBI's GenBank and SRA databases under BioProject PRJNA622837. Applicants prepared amplicon sequencing libraries from the sample RNA extract following the SDSI+AmpSeq protocol (
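The effect of the minimum coverage setting (3, 10, or 20) on a consensus genome can be illustrated with a short sketch. It does not reproduce the internals of the viral-ngs workflows; it only assumes that positions covered by fewer reads than the threshold are reported as ambiguous (N). The function names and the toy inputs are hypothetical.

```python
def mask_low_coverage(bases: str, depths: list, min_coverage: int) -> str:
    """Replace consensus bases with 'N' where read depth is below the threshold."""
    if len(bases) != len(depths):
        raise ValueError("bases and depths must have the same length")
    return "".join(
        base if depth >= min_coverage else "N"
        for base, depth in zip(bases, depths)
    )


def fraction_unambiguous(consensus: str) -> float:
    """Fraction of positions that are not 'N' (0.0 for an empty sequence)."""
    if not consensus:
        return 0.0
    return sum(1 for base in consensus if base != "N") / len(consensus)


if __name__ == "__main__":
    bases = "ACGT"
    depths = [25, 12, 5, 2]
    for threshold in (3, 10, 20):
        consensus = mask_low_coverage(bases, depths, threshold)
        print(threshold, consensus, fraction_unambiguous(consensus))
```

Raising the threshold trades genome completeness for confidence in each called base.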
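The subsetting rules above can be sketched as follows. The sketch assumes per-sample SDSI read counts are available as a dictionary keyed by SDSI name; the function names, the return labels, and the example counts are hypothetical and serve only to show how the dominant (>50%) SDSI check and the two percentages might be computed.

```python
def sdsi_fractions(counts: dict) -> dict:
    """Fraction of reads for each SDSI out of the sum of all SDSI reads."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def check_sample(counts: dict, expected_sdsi: str, dominant_threshold: float = 0.5) -> str:
    """Classify a sample by whether the expected SDSI dominates its SDSI reads."""
    fractions = sdsi_fractions(counts)
    if not fractions:
        return "no_sdsi_detected"
    dominant = max(fractions, key=fractions.get)
    if dominant == expected_sdsi and fractions[dominant] > dominant_threshold:
        return "pass"
    return "flag"


def percent_sdsi_of_total(sdsi_reads: int, total_reads: int) -> float:
    """Percent of SDSI reads over the total of all sequenced reads."""
    if total_reads == 0:
        return 0.0
    return 100.0 * sdsi_reads / total_reads


if __name__ == "__main__":
    counts = {"SDSI_87": 950, "SDSI_94": 50}  # illustrative counts only
    print(sdsi_fractions(counts))
    print(check_sample(counts, expected_sdsi="SDSI_87"))
    print(percent_sdsi_of_total(1000, 200000))
```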
Applicants received NP swab samples in UTM and extracted RNA from 200 μL of biosample as previously described21. Applicants prepared amplicon sequencing libraries as described above and analyzed them as detailed in the methods below. A pairwise distance was calculated between all partial genomes (>80% complete), excluding gaps, to determine whether samples were likely to be the result of nosocomial transmission (
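The pairwise distance calculation described above can be sketched as follows. The sketch assumes the genomes have already been aligned to the same reference coordinates, interprets ">80% complete" as the fraction of unambiguous A/C/G/T positions, and ignores any position where either genome has a gap or ambiguous base. The function names and toy sequences are hypothetical.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def completeness(seq: str) -> float:
    """Fraction of positions carrying an unambiguous base."""
    if not seq:
        return 0.0
    return sum(1 for base in seq.upper() if base in VALID_BASES) / len(seq)


def pairwise_distance(a: str, b: str) -> int:
    """Count mismatches, skipping positions with a gap or ambiguous base."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in VALID_BASES and y in VALID_BASES and x != y
    )


def distance_matrix(genomes: dict, min_completeness: float = 0.8) -> dict:
    """Pairwise distances between all genomes above the completeness cutoff."""
    kept = {name: seq for name, seq in genomes.items()
            if completeness(seq) > min_completeness}
    return {
        (a, b): pairwise_distance(kept[a], kept[b])
        for a, b in combinations(sorted(kept), 2)
    }


if __name__ == "__main__":
    genomes = {
        "s1": "ACGTACGTAC",
        "s2": "ACGTACGTAT",
        "s3": "ACNNNNGTAC",  # 60% complete, excluded
        "s4": "ACGT-CGTAC",
    }
    print(distance_matrix(genomes))
```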
For phylogenetic tree reconstruction applicants placed the suspected nosocomial cluster in a broader genomic context by performing a subsampling of the genome sequences available in GISAID (as of Jan. 26, 2021) (
Data analysis and graphing were performed using R Statistical Software (version 1.3.959; R Foundation for Statistical Computing, Vienna, Austria), GraphPad PRISM (version 9.0.2; GraphPad Software, La Jolla, Calif., USA, www.graphpad.com) and Python (version 3.7). Applicants created original figures using BioRender (BioRender.com).
Viral genomes were processed using the Terra platform (app.terra.bio) using viral-ngs 2.1.1 with workflows that are publicly available on the Dockstore Tool Repository Service (dockstore.org/organizations/BroadInstitute/collections/pgs). Downstream analyses were performed using Geneious or standard R packages. Custom scripts used to generate figures are available upon request.
Sequences and genome assembly data are publicly available on NCBI's Genbank and SRA databases under BioProject PRJNA622837. GenBank accessions for SARS-CoV-2 genomes newly reported in this study are MW454553-MW454562.
TCTCCTTCTTAGCTTCGTGAGAAC (SEQ ID NO: 391)
CTTGGTCGTCTACTACATGATGTG (SEQ ID NO: 392)
mesorhizobium;
neorhizobium
mesorhizobium;
neorhizobium;
rhizobium;
neorhizobium;
aminobacter;
sinorhizobium;
shinella;
Various modifications and variations of the described methods, pharmaceutical compositions, and kits of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific embodiments, it will be understood that it is capable of further modifications and that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention that are obvious to those skilled in the art are intended to be within the scope of the invention. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known customary practice within the art to which the invention pertains and as may be applied to the essential features hereinbefore set forth.
This application claims the benefit of U.S. Provisional Application Nos. 63/155,258, filed Mar. 1, 2021, and 63/273,117, filed Oct. 10, 2021. The entire contents of the above-identified applications are hereby fully incorporated herein by reference.
This invention was made with government support under Grant Nos. AI110818, AI147868, HG010669, and CK000490 awarded by the National Institutes of Health, Grant No. 223-101-8101 awarded by the United States Food and Drug Administration, and Grant No. 75D30120009605 awarded by the Centers for Disease Control. The government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
63155258 | Mar 2021 | US | |
63273117 | Oct 2021 | US |