Nucleic acid sequencing can be used to evaluate biological samples for one or more indicia of disease. For example, nucleic acid sequencing can be used to determine whether a patient sample contains one or more genomic mutations associated with a disease or disorder, or to interrogate a patient sample for the presence of one or more sequences indicative of an infection (e.g., a viral, bacterial, or other microbial infection).
In order to process many samples efficiently, nucleic acid sequencing is often performed in multiplexed sequencing reactions that allow nucleic acid templates obtained from many different samples (e.g., from different patients) to be sequenced together in the same reaction. In a typical multiplexed reaction, nucleic acids from different samples are tagged by attaching a sample-specific barcode to the nucleic acids prior to combining them for sequencing. The resulting sequencing data contains many different sequences having different barcodes. An initial step in the sequence analysis can involve identifying the barcodes associated with the different sequences in order to match the sequences to the samples they were obtained from. Barcode misidentification can be a source of error that leads to incorrect or inconclusive diagnosis or disease detection. Accordingly, new methods of identifying nucleic acids having a particular barcode are needed.
Methods and systems of the application are useful to identify nucleic acid barcode sequences in data obtained from multiplexed sequencing reactions. The sequencing data can be obtained from any sequencing platform, for example using any sequencing protocol that involves adding barcodes to different nucleic acids (e.g., from different samples) and combining the barcoded nucleic acids in a common sequencing reaction. The inventors have discovered a reliable and robust method of detecting barcodes that involves generating an alignment between a target nucleic acid and a reference nucleic acid prior to scoring the aligned target nucleic acid against a scoring region of the reference nucleic acid that includes, in some embodiments, a particular barcode sequence and flanking nucleotides from fixed context sequences (e.g., primer sequences). Accordingly, in some aspects, the disclosure provides a method of determining whether a target nucleic acid (e.g., a target nucleic acid in a multiplexed sample) comprises a particular barcode sequence.
In some aspects, the disclosure provides a method comprising using at least one computer hardware processor to perform:
Further aspects of the disclosure provide systems for performing any of the methods described herein. For example, in some embodiments, a system comprises at least one computer hardware processor; and at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
Still further aspects of the disclosure provide at least one non-transitory computer readable storage medium storing processor executable instructions for performing any of the methods described herein. For example, in some embodiments, at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, causes the at least one computer hardware processor to perform:
In some embodiments, the reference nucleic acid further comprises a second context sequence and the scoring region further comprises no more than a second threshold of nucleotides of the second context sequence. In some embodiments, prior to generating the alignment in step (i), generating an initial alignment between the at least a segment of the target nucleic acid and an initial region of the reference nucleic acid that contains at least the barcode sequence and the first context sequence, wherein generating the alignment in step (i) is performed based on the initial alignment, and wherein the segment of the reference nucleic acid is the scoring region of the reference nucleic acid.
In some aspects, the disclosure provides a method comprising using at least one computer hardware processor to perform:
In some aspects, a system comprises at least one computer hardware processor; and at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
In some aspects, at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, causes the at least one computer hardware processor to perform:
In some embodiments, each of the plurality of reference nucleic acids further comprises a second context sequence and the first scoring region further comprises no more than a second threshold of nucleotides of the second context sequence. In some embodiments, prior to generating the plurality of alignments in step (i), generating a plurality of initial alignments between the at least a segment of the target nucleic acid and an initial region of each of the reference nucleic acids that contains at least the barcode sequence and the first context sequence, wherein generating the plurality of alignments in step (i) is performed based on the plurality of initial alignments, and wherein the segment of the first reference nucleic acid is the first scoring region of the reference nucleic acid. In some embodiments, each of the plurality of reference nucleic acids comprises a respective barcode sequence having a different and unique nucleotide sequence. In some embodiments, the plurality of reference nucleic acids comprises at least 8, 16, 32, 64, 96, 192, 288, 384, or 480 reference nucleic acids.
Some aspects of the disclosure provide a method comprising using at least one computer hardware processor to perform:
In some aspects, a system comprises at least one computer hardware processor; and at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
In some aspects, at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, causes the at least one computer hardware processor to perform:
In some embodiments, the reference nucleic acid further comprises a second context sequence and the scoring region further comprises no more than a second threshold of nucleotides of the second context sequence. In some embodiments, prior to generating the plurality of alignments in step (i), generating a plurality of initial alignments between each of the plurality of target nucleic acids and an initial region of the reference nucleic acid that contains at least the barcode sequence and the first context sequence, wherein generating the plurality of alignments in step (i) is performed based on the plurality of initial alignments, and wherein the segment of the reference nucleic acid is the scoring region of the reference nucleic acid. In some embodiments, the plurality of target nucleic acids comprises at least 8, 16, 32, 64, 96, 192, 288, 384, or 480 target nucleic acids, and wherein each target nucleic acid comprises a discrete sequence or is from a discrete human patient.
In some aspects, the disclosure provides a method comprising using at least one computer hardware processor to perform:
In some embodiments, the reference nucleic acid further comprises a second context sequence.
In some embodiments, the method further comprises obtaining sequencing data from the plurality of target nucleic acids prior to step (i).
The segment of the reference nucleic acid or the segment of each of the plurality of reference nucleic acids may comprise the barcode sequence, at least a portion of the first context sequence, and/or at least a portion of the second context sequence. In some embodiments, the length of the segment of the reference nucleic acid or the segment of each of the plurality of reference nucleic acids is 25-50, 50-150, 100-200, 150-300, or 250-500 nucleotides. In some embodiments, the length of the barcode sequence is 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15-20, or 20-25 nucleotides. In some embodiments, the length of the first context sequence is 5-10, 10-15, 15-20, 20-25, or 25-50 nucleotides. In some embodiments, the length of the second context sequence is 5-10, 10-15, 15-20, 20-25, or 25-50 nucleotides. In some embodiments, the first threshold number is 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10. In some embodiments, the second threshold number is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10. In some embodiments, the ratio of the first threshold number relative to the length of the barcode sequence is less than or equal to 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, or 1:10. In some embodiments, the ratio of the second threshold number relative to the length of the barcode sequence is less than or equal to 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, or 1:10.
In some embodiments, the at least one and no more than a first threshold number of nucleotides of the first context sequence in the scoring region are contiguous with the barcode sequence. In some embodiments, the no more than a second threshold number of nucleotides of the second context sequence in the scoring region are contiguous with the barcode sequence.
In some embodiments, the scoring region comprises 1-10 nucleotides of the first context sequence and 0-10 nucleotides of the second context sequence. In some embodiments, the scoring region comprises one nucleotide of the first context sequence and one nucleotide of the second context sequence.
In some embodiments, generating an alignment comprises generating data encoding an association between (a) the at least a segment of the target nucleic acid and the at least a segment of the reference nucleic acid; (b) the at least a segment of the target nucleic acid and the at least a segment of each of the plurality of reference nucleic acids; or (c) the at least a segment of each of the plurality of target nucleic acids and the at least a segment of the reference nucleic acid.
Determining the sequence similarity may comprise determining a score indicative of how many nucleotides of the target nucleic acid are aligned to similar nucleotides in the scoring region of the reference nucleic acid. In some embodiments, determining the sequence similarity comprises determining a percentage of nucleotides of the target nucleic acid that are aligned to similar nucleotides in the scoring region of the reference nucleic acid. In some embodiments, determining the sequence similarity comprises determining a score indicative of how many nucleotides of the target nucleic acid are aligned to identical nucleotides in the scoring region of the reference nucleic acid. In some embodiments, determining the sequence similarity comprises determining the percentage of nucleotides of the target nucleic acid that are aligned to identical nucleotides in the scoring region of the reference nucleic acid.
In some embodiments, barcodes are used in a combinatorial fashion, wherein more than one barcode is used to identify the origin. For example, the use of two instances of 96 barcodes in a combination provides 9216 identifiers, and the use of two instances of 384 barcodes provides 147456 identifiers.
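By way of non-limiting illustration, the identifier counts above follow from multiplying the number of barcodes available at each barcode position. The following is a minimal sketch of that arithmetic; the function name `combinatorial_identifiers` and its parameters are hypothetical and are introduced only for this example.

```python
def combinatorial_identifiers(barcodes_per_position: int, positions: int = 2) -> int:
    """Number of distinct identifiers when one barcode is chosen per position.

    Assumes the same number of barcodes is available at every position
    (an assumption made for this sketch).
    """
    if barcodes_per_position < 1 or positions < 1:
        raise ValueError("both arguments must be positive integers")
    return barcodes_per_position ** positions


print(combinatorial_identifiers(96))   # 9216
print(combinatorial_identifiers(384))  # 147456
```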
The target nucleic acid or plurality of target nucleic acids may be amplified prior to step (i) of the method (e.g., using loop-mediated isothermal amplification (LAMP), polymerase chain reaction (PCR), multiple displacement amplification, rolling circle amplification (RCA), or ligase chain reaction). The amplification step may be carried out to amplify RNA nucleic acid, such as by RT-LAMP. LAMP and RT-LAMP methods of amplification are disclosed in WO01/77317, WO02/24902 and WO01/34790, hereby incorporated by reference in their entirety.
The target nucleic acid or at least one of the plurality of target nucleic acids may be from a human or veterinary patient. In some embodiments, the target nucleic acid or at least one of the plurality of target nucleic acids is indicative of disease or a genetic trait or marker. In some embodiments, identification of the barcode sequence in the target nucleic acid indicates that the patient associated with that barcode has or has had an infection (e.g., a viral or bacterial infection). In some embodiments, an infection is a SARS-CoV-2 infection. The target nucleic acid may comprise at least a segment of a gene associated with a SARS-CoV-2 infection (e.g., a SARS-CoV-2 ORF1a, SARS-CoV-2 envelope, or SARS-CoV-2 nucleocapsid gene). The origin nucleic acid may be derived from plants, animals, fungi, protists, archaea, or bacteria. The origin nucleic acid may be viral and comprise RNA.
In some embodiments, the method further comprises determining that the patient associated with the barcode sequence does not have an infection when a nucleic acid containing the barcode sequence is not detected.
The sequencing data for the target nucleic acid or plurality of nucleic acids may be obtained from measurement of a nucleic acid or plurality of nucleic acids using a variety of different sequencing methods, such as single molecule sequencing, sequencing by synthesis, or pyrosequencing. The detection means may be electrical or optical. Examples of single molecule sequencing include nanopore sequencing, and sequencing using a zero-mode waveguide such as SMRT sequencing using devices developed by Pacific Biosciences of California Inc., such as disclosed in WO2007/002893 and WO2009/120372. Examples of nanopore sequencing devices are disclosed in WO2015/055981, WO2014/064443, WO2017/149316, WO2019/002893, WO2015/110813, and WO2014/135838, hereby incorporated by reference in their entirety. Examples of sequencing by synthesis include ion semiconductor sequencing developed by Ion Torrent such as disclosed in WO2009/158006, sequencing based on fluorophore-labelled dNTPs with reversible terminator elements as developed by Illumina such as disclosed in WO00/18957, semiconductor chip-based single-molecule sequencing technology as developed by Roswell Technologies such as disclosed in WO16/210386, and sequencing by synthesis methods as developed by Genia Technologies, such as disclosed in WO2015/148402.
In some embodiments, the target nucleic acid and/or plurality of nucleic acids are 1 kilobase or longer.
Some aspects of the disclosure provide a kit comprising a plurality of nucleic acids, wherein each of the plurality comprises a respective barcode having fewer than ten nucleotides and at least one fixed context sequence. In some embodiments, each of the plurality comprises one fixed context sequence on each side of the barcode. In some embodiments, each of the plurality further comprises a primer sequence, and wherein the primer sequence is complementary to a segment of a target nucleic acid. In some embodiments, the at least one fixed context sequence comprises at least a part of the primer sequence. In some embodiments, the kit further comprises a polymerase.
Described herein are methods, kits, systems, and computer readable storage media storing processor executable instructions for detection of nucleic acids (e.g., a target nucleic acid) having a particular barcode sequence. The inventors have discovered a novel method of determining whether a target nucleic acid comprises a particular barcode sequence that can be performed rapidly and with a high degree of accuracy (e.g., identification of true positives). This method, in some embodiments, involves generating at least one sequence alignment between a target nucleic acid (e.g., at least a segment of a target nucleic acid) and a reference nucleic acid, followed by a determination of sequence similarity between a scoring region of the reference nucleic acid and a corresponding segment of the target nucleic acid to enable determination of whether the target nucleic acid comprises a particular barcode, wherein the scoring region includes the particular barcode sequence and flanking nucleotides belonging to at least one context sequence (e.g., fixed context sequence). In some embodiments, the method is used to identify a target nucleic acid having a particular barcode in order to determine whether a subject (e.g., a human patient) with whom the particular barcode is associated has a disease or infection (e.g., a SARS-CoV-2 infection).
The methods of the disclosure involve complex computations, namely generating sequence alignments and determining sequence similarities between two nucleic acid segments, that necessitate the use of a system (e.g., a computer system as described by
Furthermore, the methods described herein reduce incorrect assignment of barcodes, particularly relative to methods of assigning barcodes that were known in the art. Incorrect assignment may be caused by sequencing errors, spurious alignments, alignment artifacts, or other issues either individually or in combination. The methods described address spurious alignments and alignment artifacts around edges of barcodes in the presence of sequencing errors.
Employing barcode identification techniques described in this application also provides an improvement to sequencing technology and computer technology. Sequencing data that is correctly assigned to an origin-specific sample (e.g., a particular patient sample) by correctly identifying a barcode sequence reduces or eliminates errors in downstream applications (e.g., identifying the presence of one or more indicia of an infection, identifying one or more biomarkers indicative of a disease or condition, recommending and/or administering an appropriate therapy to a patient, etc.). Also, correctly identifying barcode sequences can prevent computationally expensive processes from being executed by avoiding unnecessary interpretation and analysis of complex sequencing data that is associated with an incorrect sample source. This can reduce or eliminate wasteful use of computing resources, saving processing power, memory, and networking resources (which is an improvement to computing technology in addition to being an improvement to sequencing technology). Reducing errors in barcode identification will also reduce waste of resources at a laboratory that processes multiple samples, by freeing up equipment for processing biological samples that are correctly associated with the sample sources and avoiding duplicating and/or repeating experimental analysis for samples that produced incorrect or inconclusive results due to errors in barcode identification. In addition, sequence data for which the source is correctly identified can be useful to select more effective therapies for a patient, improve the ability to determine whether one or more cancer therapies will be effective if administered to the patient, improve the ability to identify clinical trials in which the subject may participate, and/or improve numerous other prognostic, diagnostic, and clinical applications.
As shown in
As shown in
As shown in
In some embodiments, generating an alignment comprises generating data encoding an association between segments of two nucleic acids (e.g., a target nucleic acid and a reference nucleic acid). In some embodiments, an alignment between two nucleic acid sequences may include any information indicative of an association between the two nucleic acid sequences. In some embodiments, the information indicative of the association between two sequences may indicate corresponding segments of the two sequences (e.g., by indicating, for a first segment of a first sequence, a second segment of the second sequence to which the first segment corresponds). This may be done in any suitable way. For example, an alignment may comprise information indicating, for a first segment of the first sequence, the position(s) in the second sequence of at least some nucleotides of a second segment that corresponds to the first segment. The positions may be specified in any suitable way (e.g., first and last positions, first position and an offset, all positions, etc.), as aspects of the disclosure described herein are not limited in this respect.
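By way of non-limiting illustration, one possible way to encode such an association is a small record holding the positions of the corresponding segments. The following is a minimal sketch only; the class name `SegmentAlignment`, its field names, and the 0-based half-open coordinate convention are assumptions made for this example and are not prescribed by the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SegmentAlignment:
    """Hypothetical record associating a segment of a first sequence
    with the corresponding segment of a second sequence.

    Coordinates are 0-based and half-open (end is exclusive).
    """
    first_start: int
    first_end: int
    second_start: int
    second_end: int

    def second_first_and_last(self) -> Tuple[int, int]:
        """Encoding by first and last (inclusive) positions in the second sequence."""
        return self.second_start, self.second_end - 1

    def second_first_and_offset(self) -> Tuple[int, int]:
        """Encoding by first position and an offset (segment length)."""
        return self.second_start, self.second_end - self.second_start

    def second_all_positions(self) -> List[int]:
        """Encoding by listing every position in the second sequence."""
        return list(range(self.second_start, self.second_end))


aln = SegmentAlignment(first_start=12, first_end=24, second_start=0, second_end=12)
print(aln.second_first_and_last())    # (0, 11)
print(aln.second_first_and_offset())  # (0, 12)
```

The three accessor methods correspond to the three position encodings mentioned above; any one of them carries the same information for a contiguous segment.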
In some embodiments, corresponding segments of two nucleic acid sequences may be identical or, if not identical, may have some similarity. For example, corresponding sequence segments may have the same nucleotides at some (e.g., at least a threshold percentage) or all of the corresponding positions. As another example, associated segments may have complementary nucleotides at some (e.g., at least a threshold percentage) or all of the corresponding positions (e.g., in this context, “G” is a complementary nucleotide to a “C” and an “A” is a complementary nucleotide to a “T”).
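By way of non-limiting illustration, the two kinds of correspondence described above can be checked position by position. The following is a minimal sketch, assuming two ungapped segments of equal length; the function names and the 0.8 threshold in the usage lines are hypothetical.

```python
# Complementary pairs as stated above: G with C, and A with T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def fraction_identical(seg_a: str, seg_b: str) -> float:
    """Fraction of corresponding positions holding the same nucleotide."""
    if len(seg_a) != len(seg_b) or not seg_a:
        raise ValueError("segments must be non-empty and of equal length")
    return sum(a == b for a, b in zip(seg_a, seg_b)) / len(seg_a)


def fraction_complementary(seg_a: str, seg_b: str) -> float:
    """Fraction of corresponding positions holding complementary nucleotides.

    Positions are compared in the same order (no reversal is applied).
    """
    if len(seg_a) != len(seg_b) or not seg_a:
        raise ValueError("segments must be non-empty and of equal length")
    return sum(COMPLEMENT.get(a) == b for a, b in zip(seg_a, seg_b)) / len(seg_a)


# Five of six positions match, which meets a hypothetical 80% threshold.
print(fraction_identical("ACGTAC", "ACGTTC") >= 0.8)  # True
print(fraction_complementary("ACGT", "TGCA"))         # 1.0
```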
In some embodiments, generating the alignment comprises using a scoring function based on expected properties of the sequence data. In some embodiments, the expected properties of the sequence data comprise features corresponding to platform specific error modalities. In some embodiments, the expected properties of the sequence data comprise features corresponding to the variations and/or distribution and/or positions of bases within the expected barcode sequences.
In some embodiments, an alignment between two nucleotide sequences may be stored in any suitable non-transitory computer-readable storage medium (e.g., a volatile memory or a non-volatile memory), using any suitable data structure(s), and in any suitable format, as aspects of the disclosure described herein are not limited in this respect.
In some embodiments, generating an alignment between two nucleotide sequences may be performed using one or more sequence alignment algorithms. In some embodiments, a dynamic programming-based sequence alignment algorithm may be used. Non-limiting examples of dynamic programming-based sequence alignment algorithms include the Needleman-Wunsch algorithm (e.g., as described in Needleman, Saul B. & Wunsch, Christian D. (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. Journal of Molecular Biology. 48 (3): 443-53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325, which is incorporated by reference in its entirety herein) and the Smith-Waterman algorithm (e.g., as described in Smith, Temple F. & Waterman, Michael S. (1981). “Identification of Common Molecular Subsequences”. Journal of Molecular Biology. 147 (1): 195-197. CiteSeerX 10.1.1.63.2897. doi:10.1016/0022-2836(81)90087-5. PMID 7265238, which is incorporated by reference in its entirety herein). However, any other suitable sequence alignment algorithm(s) may be used (e.g., FASTA, BLAST, brute force, dot-matrix alignments, etc.), as aspects of the technology described herein are not limited in this respect.
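By way of non-limiting illustration, a global alignment in the style of the Needleman-Wunsch algorithm may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular implementation: the function name `needleman_wunsch`, the linear gap penalty, and the default scores (match +1, mismatch -1, gap -1) are all chosen only for this example.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[int, str, str]:
    """Globally align a and b; return (score, aligned_a, aligned_b).

    Gaps are written as '-' and use a linear (per-position) penalty.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The Smith-Waterman algorithm differs mainly in that cell scores are floored at zero and the traceback starts from the highest-scoring cell, which yields a local rather than global alignment.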
In some embodiments, determining a sequence similarity comprises generating data encoding the sequence similarity (e.g., sequence identity) between corresponding segments of two nucleic acids (e.g., a scoring region of a reference nucleic acid and a corresponding segment of a target nucleic acid). In some embodiments, sequence similarity between corresponding segments of two nucleic acid sequences is determined based on the presence of identical nucleotides at some (e.g., at least a threshold percentage) or all of the corresponding positions of the two nucleic acid sequences. In some embodiments, sequence similarity between corresponding segments of two nucleic acid sequences is determined based on the presence of purines (e.g., adenine or guanine) at some (e.g., at least a threshold percentage) or all of the corresponding positions. In some embodiments, sequence similarity between corresponding segments of two nucleic acid sequences is determined based on the presence of pyrimidines (e.g., thymine or cytosine) at some (e.g., at least a threshold percentage) or all of the corresponding positions.
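By way of non-limiting illustration, a similarity measure that treats two nucleotides as similar when both are purines or both are pyrimidines can be sketched as follows. The function names are hypothetical, and the sketch assumes ungapped DNA segments of equal length.

```python
from typing import Optional

PURINES = frozenset("AG")      # adenine, guanine
PYRIMIDINES = frozenset("CT")  # cytosine, thymine


def base_class(base: str) -> Optional[str]:
    """Return 'R' for a purine, 'Y' for a pyrimidine, or None otherwise."""
    if base in PURINES:
        return "R"
    if base in PYRIMIDINES:
        return "Y"
    return None


def class_similarity(seg_a: str, seg_b: str) -> float:
    """Fraction of corresponding positions whose bases share a chemical class."""
    if len(seg_a) != len(seg_b) or not seg_a:
        raise ValueError("segments must be non-empty and of equal length")
    same = sum(
        1
        for a, b in zip(seg_a, seg_b)
        if base_class(a) is not None and base_class(a) == base_class(b)
    )
    return same / len(seg_a)


print(class_similarity("AGCT", "GACC"))  # 1.0
print(class_similarity("AGCT", "CGCT"))  # 0.75
```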
In some embodiments, determining a sequence similarity involves determining a percentage of nucleotides of a segment of a first nucleic acid (e.g., target nucleic acid) that are aligned to similar nucleotides (e.g., identical nucleotides) in a segment of a second nucleic acid (e.g., the scoring region of the reference nucleic acid). In some embodiments, the percentage of nucleotides in the segment of the first nucleic acid that are aligned to similar nucleotides in the segment of the second nucleic acid is at least 50%, 60%, 70%, 80%, 90%, 95%, or 99%.
In some embodiments, determining a sequence similarity comprises determining a score indicative of how many nucleotides of a segment of a first nucleic acid (e.g., target nucleic acid) are aligned to similar nucleotides (e.g., identical nucleotides) in a segment of a second nucleic acid (e.g., the scoring region of the reference nucleic acid). In some embodiments, the number of nucleotides in the segment of the first nucleic acid that are aligned to similar nucleotides in the segment of the second nucleic acid is at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20. In some embodiments, the number of nucleotides in the segment of the first nucleic acid that are aligned to similar nucleotides in the segment of the second nucleic acid is 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20.
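By way of non-limiting illustration, a count and a percentage of identically aligned nucleotides within a scoring region can be computed from a gapped pairwise alignment. The following is a minimal sketch; the function name `score_in_region`, the use of '-' for gaps, the half-open reference coordinates, and the choice of the scoring region length as the denominator of the percentage are all assumptions made for this example.

```python
from typing import Tuple


def score_in_region(aligned_target: str, aligned_ref: str,
                    region_start: int, region_end: int) -> Tuple[int, float]:
    """Count target nucleotides aligned to identical reference nucleotides
    inside the reference scoring region [region_start, region_end).

    Returns (identical_count, percent_of_scoring_region).
    """
    if len(aligned_target) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    if region_end <= region_start:
        raise ValueError("scoring region must be non-empty")
    ref_pos = 0  # position in the ungapped reference
    identical = 0
    for t, r in zip(aligned_target, aligned_ref):
        if r == "-":
            continue  # insertion in the target; no reference position is consumed
        if region_start <= ref_pos < region_end and t == r:
            identical += 1
        ref_pos += 1
    return identical, 100.0 * identical / (region_end - region_start)


# Hypothetical reference: context "TTGCA", barcode "ACGTAC", context "GGATC".
# The scoring region [4, 12) covers the barcode plus one flanking base per side.
ref = "TTGCAACGTACGGATC"
tgt = "TTGCAACGAACGGATC"  # one substitution inside the barcode
print(score_in_region(tgt, ref, 4, 12))  # (7, 87.5)
```

A caller could then compare either returned value against a chosen threshold (e.g., at least 80%) to decide whether the barcode is considered present.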
In some embodiments, determining the sequence similarity comprises determining a score indicative of how many nucleotides of a segment of a first nucleic acid (e.g., target nucleic acid) are not aligned to a segment of a second nucleic acid (e.g., the scoring region of the reference nucleic acid). In some embodiments, determining the sequence similarity comprises determining a score indicative of the number of insertions and deletions in the alignment between the segment of the first nucleic acid (e.g., target nucleic acid) and the segment of the second nucleic acid (e.g., the scoring region of the reference nucleic acid). In some embodiments, determining the sequence similarity comprises determining the edit distance between the segment of the first nucleic acid (e.g., target nucleic acid) and the segment of the second nucleic acid (e.g., the scoring region of the reference nucleic acid). In some embodiments, determining the sequence similarity comprises determining an alignment score between the segment of the first nucleic acid (e.g., target nucleic acid) and the segment of the second nucleic acid (e.g., the scoring region of the reference nucleic acid) using a scoring function based on expected properties of the sequence data. In some embodiments, the expected properties of the sequence data comprise features corresponding to platform specific error modalities. In some embodiments, the expected properties of the sequence data comprise features corresponding to the variations and/or distribution and/or positions of bases within the expected barcode sequences.
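By way of non-limiting illustration, two of the scores mentioned above, an edit distance and a count of insertions and deletions, can be sketched as follows. The sketch assumes the Levenshtein definition of edit distance (unit cost for substitution, insertion, and deletion) and a gapped alignment written with '-' characters; the function names are hypothetical.

```python
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion from a
                            curr[j - 1] + 1,            # insertion into a
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def count_indels(aligned_target: str, aligned_ref: str) -> Tuple[int, int]:
    """Return (insertions, deletions) in the target relative to the reference."""
    if len(aligned_target) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    pairs = list(zip(aligned_target, aligned_ref))
    insertions = sum(1 for t, r in pairs if r == "-" and t != "-")
    deletions = sum(1 for t, r in pairs if t == "-" and r != "-")
    return insertions, deletions


print(edit_distance("AACGAACG", "AACGTACG"))  # 1
print(count_indels("A-GT", "ACGT"))           # (0, 1)
```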
In some embodiments, determining a sequence similarity in the context of the methods described herein involves determining a sequence similarity between a scoring region of a reference nucleic acid and a corresponding segment of a target nucleic acid. A scoring region of a reference nucleic acid may comprise at least a portion of a barcode sequence and at least one and no more than a first threshold number of nucleotides of a first context sequence. In some embodiments, the scoring region further comprises no more than a second threshold number of nucleotides of a second context sequence.
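By way of non-limiting illustration, the bounds of such a scoring region can be computed from the lengths of the context and barcode sequences. The following is a minimal sketch that assumes the reference is laid out as first context sequence, barcode sequence, then second context sequence, that the entire barcode is included, and that the selected context nucleotides are contiguous with the barcode; the function name, parameter names, and default thresholds are hypothetical.

```python
from typing import Tuple


def scoring_region_bounds(context1: str, barcode: str, context2: str = "",
                          n_first: int = 1, n_second: int = 0,
                          first_threshold: int = 10,
                          second_threshold: int = 10) -> Tuple[int, int]:
    """Return half-open (start, end) of the scoring region within the
    reference context1 + barcode + context2.

    n_first and n_second are the numbers of flanking context nucleotides
    taken from the first and second context sequences, respectively.
    """
    if not 1 <= n_first <= min(first_threshold, len(context1)):
        raise ValueError("n_first must be at least 1 and within the first threshold")
    if not 0 <= n_second <= min(second_threshold, len(context2)):
        raise ValueError("n_second must be within the second threshold")
    start = len(context1) - n_first
    end = len(context1) + len(barcode) + n_second
    return start, end


reference = "TTGCA" + "ACGTAC" + "GGATC"
start, end = scoring_region_bounds("TTGCA", "ACGTAC", "GGATC", n_first=1, n_second=1)
print(start, end)            # 4 12
print(reference[start:end])  # AACGTACG
```

Keeping the flanks short confines scoring to the barcode and its immediate edges rather than to the full context sequences.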
In some embodiments, determining a sequence similarity comprises using a scoring function based on expected properties of the sequence data. In some embodiments, the expected properties of the sequence data comprise features corresponding to platform specific error modalities. In some embodiments, the expected properties of the sequence data comprise features corresponding to the variations and/or distribution and/or positions of bases within the expected barcode sequences.
A barcode sequence may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 nucleotides in length. In some embodiments, a barcode sequence is 4-25, 4-20, 4-15, 4-10, 5-25, 5-20, 5-10, 10-25, 10-20, 10-15, 15-25, or 20-25 nucleotides in length. In some embodiments, a scoring region comprises the entire length of the barcode sequence. In some embodiments, a scoring region comprises 50-75%, 50-100%, 60-80%, 70-100%, or 80-95% of the barcode sequence. In some embodiments, a scoring region comprises a contiguous portion of a barcode sequence. In some embodiments, a scoring region comprises a non-contiguous portion of a barcode sequence.
A context sequence (the first and/or second context sequences) may be 5-10, 10-15, 15-20, 20-25, 20-50, 30-60, 25-50, 30-75, 50-75, 50-100, or 75-150 nucleotides in length. The first threshold number may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or 25. The second threshold number may be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or 25.
In some embodiments, the ratio of the first threshold number relative to the length of the barcode sequence is less than 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:12, 1:15, or 1:20. In some embodiments, the ratio of the first threshold number relative to the length of the barcode sequence is equal to about 1:1, 1:2, 2:3, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:12, 1:15, or 1:20. In some embodiments, the ratio of the second threshold number relative to the length of the barcode sequence is less than 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:12, 1:15, or 1:20. In some embodiments, the ratio of the second threshold number relative to the length of the barcode sequence is equal to about 1:1, 1:2, 2:3, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:12, 1:15, or 1:20. In some embodiments, the ratio of the first threshold number relative to the length of the barcode sequence is 1:10; and the ratio of the second threshold number relative to the length of the barcode sequence is 1:10. For example, in some embodiments, if the length of the barcode is 10 nucleotides, then the first threshold number may be 1 (i.e., ratio of the first threshold number relative to the length of the barcode sequence is 1:10).
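By way of non-limiting illustration, the largest whole threshold number that keeps the ratio at or below a chosen value can be derived from the barcode length with exact rational arithmetic. The following is a minimal sketch; the function name `max_threshold_for_ratio` is hypothetical, and the at-or-below interpretation of the ratio is an assumption made for this example.

```python
from fractions import Fraction


def max_threshold_for_ratio(barcode_length: int, ratio: Fraction) -> int:
    """Largest integer t such that t / barcode_length <= ratio."""
    if barcode_length < 1 or ratio <= 0:
        raise ValueError("barcode_length and ratio must be positive")
    # Integer floor of barcode_length * ratio, avoiding floating point.
    return (barcode_length * ratio.numerator) // ratio.denominator


print(max_threshold_for_ratio(10, Fraction(1, 10)))  # 1
print(max_threshold_for_ratio(8, Fraction(1, 3)))    # 2
```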
In some embodiments, the at least one and no more than a first threshold number of nucleotides of the first context sequence in the scoring region are contiguous with the barcode sequence. In some embodiments, the no more than a second threshold number of nucleotides of the second context sequence in the scoring region are contiguous with the barcode sequence. In some embodiments, the at least one and no more than a first threshold number of nucleotides of the first context sequence in the scoring region are not contiguous with the barcode sequence. In some embodiments, the no more than a second threshold number of nucleotides of the second context sequence in the scoring region are not contiguous with the barcode sequence.
In some embodiments, when the first threshold number is two or more, the two or more nucleotides are contiguous with one another. In some embodiments, when the second threshold number is two or more, the two or more nucleotides are contiguous with one another. In some embodiments, when the first threshold number is two or more, the two or more nucleotides are not contiguous with one another. In some embodiments, when the second threshold number is two or more, the two or more nucleotides are not contiguous with one another.
In some embodiments, a scoring region comprises 1-10, 1-5, 5-10, 5-15, or 5-20 nucleotides of a first context sequence. In some embodiments, a scoring region comprises 0-10, 0-5, 5-10, 5-15, or 5-20 nucleotides of the second context sequence. In some embodiments, a scoring region comprises one, two, three, or four nucleotide(s) of a first context sequence and one, two, three, or four nucleotide(s) of a second context sequence.
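By way of illustration only, the construction of a scoring region from a barcode and a small number of contiguous flanking context nucleotides can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name `scoring_region`, its parameters, and the two context sequences are hypothetical, and the sketch assumes that the first context sequence lies immediately before the barcode and the second context sequence immediately after it. The barcode used is BC06 as quoted in the examples of this disclosure.

```python
def scoring_region(first_context: str, barcode: str, second_context: str,
                   first_flank: int = 3, second_flank: int = 3) -> str:
    """Return the barcode together with its contiguous flanking nucleotides.

    Assumption: the first context ends where the barcode starts, and the
    second context starts where the barcode ends.
    """
    # At least one nucleotide of the first context is included.
    if not 1 <= first_flank <= len(first_context):
        raise ValueError("first_flank must be between 1 and the first context length")
    # The second context may contribute zero nucleotides.
    if not 0 <= second_flank <= len(second_context):
        raise ValueError("second_flank must be between 0 and the second context length")
    left = first_context[len(first_context) - first_flank:]
    right = second_context[:second_flank]
    return left + barcode + right


# Hypothetical context sequences; only the barcode comes from the disclosure.
region = scoring_region("ACGTACGTAA", "TCTGCATCGT", "GGCCTTAAGG")
print(region, len(region))
```

With a 10-nucleotide barcode and three flanking nucleotides on each side, the resulting scoring region spans 16 reference bases.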
A barcode sequence is a variable nucleic acid sequence that is origin- and/or sample-specific. A barcode sequence may be used to uniquely tag or link a target nucleic acid to a specific subject (e.g., a human or veterinary patient).
In some embodiments, a barcode sequence is short (e.g., for chemistry-driven reasons, e.g., ease of synthesis and purification). In some embodiments, the methods described herein utilize a large number of barcode sequences (e.g., more than 2, 5, 10, 15, etc.) in a multiplexed assay to tag or identify a large number of samples. In some embodiments, barcode sequences are utilized within different contexts. In some embodiments, barcode sequences are utilized within the same context. In some embodiments, a barcode sequence can share nucleotides with a surrounding (e.g., contiguous) context sequence.
In some embodiments, a barcode sequence is 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15-20, or 20-25 nucleotides in length. In some embodiments, a barcode sequence comprises less than 10%, 15%, 20%, 25%, 30%, or 50% of the entire number of nucleotides in a target sequence.
In some embodiments, a multiplexed sample comprising more than one nucleic acid comprising a barcode sequence includes at least 2, 4, 8, 16, 32, 64, 96, 192, 288, 384, 480 or 9216 nucleic acids, each of which comprises a respective and unique barcode sequence (e.g., reference nucleic acids comprising respective and unique barcodes). In some embodiments, barcoding is single (one barcode per read) or combinatorial (e.g., dual barcoding, with two barcodes per read).
A context sequence is generally a fixed (e.g., constant) nucleic acid sequence that is present on a target nucleic acid comprising a barcode sequence. In some embodiments, a fixed context sequence consists of a single nucleic acid sequence that is identical across multiple target nucleic acids, wherein each of the multiple target nucleic acids comprises its own respective barcode sequence. Context sequences are typically larger than the barcode and can be on one or both sides of the barcode. In some embodiments, a context sequence is contiguous with the barcode. In some embodiments, a context sequence is not contiguous with the barcode. A context sequence (e.g., a first and/or second context sequence) may be 5-10, 10-15, 15-20, 20-25, 20-50, 30-60, 25-50, 30-75, 50-75, 50-100, or 75-150 nucleotides in length.
In some embodiments, a context sequence comprises at least a part of a primer sequence. In some embodiments, a context sequence comprises at least a part of an amplification primer. In some embodiments, a context sequence comprises at least a part of a sequencing primer. In some embodiments, a context sequence comprises at least a part of a universal primer.
In some embodiments, a context sequence comprises a consistent and identical sequence (e.g., multiple or all nucleic acids in a multiplexed sample comprise identical context sequences). In some embodiments, a context sequence comprises sequence that is consistent in content but variable in length (e.g. a polyA of variable length, e.g., a polyA tail on a transcript). In some embodiments, a context sequence is consistent in length but has a pattern of variation. For example, in some embodiments, a context sequence may always start and/or end with the same nucleotides and/or have consistent nucleotides at specific positions (e.g., the third base is always A).
In some embodiments, the context sequence comprises a technology specific sequence. By way of non-limitative examples, a technology specific sequence may be part of a sequencing adapter that is used to hybridise to a substrate or otherwise immobilise DNA, a leader sequence, an enzyme binding sequence, an enzyme stalling sequence, a registration sequence, a calibration sequence, a ligatable sticky-end, or a transposable element.
A target nucleic acid, in some embodiments, is a nucleic acid that comprises a particular barcode. Accordingly, a target nucleic acid comprising a particular barcode is a target nucleic acid that is an origin-specific nucleic acid. By way of example, an origin-specific nucleic acid may be associated with a particular subject, for example, a human or animal subject (for example a human or veterinary patient), or a plant subject. An origin-specific nucleic acid may be associated with an environmental sample. An origin-specific nucleic acid may be derived from a synthetic nucleic acid sequence, for example a synthetic nucleic acid sequence produced as part of an experimental or industrial process or assay, for example a synthetic nucleic acid sequence produced using a DNA data-storage system. An origin-specific nucleic acid may be DNA or RNA.
In some embodiments, a target nucleic acid (e.g., a target nucleic acid comprising a particular barcode) comprises a nucleotide sequence corresponding to a gene, a segment of a gene, and/or a regulatory element of a gene (e.g., a promoter region). The gene may be associated with a disease, genetic trait, or marker. In some embodiments, the gene sequence is associated with a bacterial or viral infection. In some embodiments, the gene sequence is associated with a SARS-CoV-2 infection. In some embodiments, the gene sequence is a SARS-CoV-2 ORF1a, SARS-CoV-2 envelope, or SARS-CoV-2 nucleocapsid gene.
Accordingly, in some embodiments, detection of a target nucleic acid comprising a barcode and a gene, a segment of a gene, and/or a regulatory element of a gene (e.g., a promoter region) associated with a disease, genetic trait, or marker indicates that the subject associated with that particular barcode has or has had an infection (e.g., a viral infection, e.g., a SARS-CoV-2 infection).
In some embodiments, a multiplexed sample comprising a plurality of target nucleic acids includes at least 2, 4, 8, 16, 32, 64, 96, 192, 288, 384, 480 or 9216 target nucleic acids. In some embodiments, each target nucleic acid comprises a respective and unique barcode sequence. In some embodiments, each target nucleic acid comprises a discrete sequence or is from a discrete human patient.
In some embodiments, a target nucleic acid comprises a nucleotide sequence corresponding to a region of variant sequence. In some embodiments, this variant may be a single nucleotide polymorphism, a small insertion or deletion, or a larger structural variant. In some embodiments, identification of target nucleic acids may be used to estimate the proportion of a variant present in a particular sample.
In some embodiments, multiple copies of the target nucleic acid are present in the sequencing read. To increase specificity, where conflicting targets are detected, these reads may be discarded. Alternatively, multiple copies may be used to form a consensus sequence. In some embodiments, the consensus sequence may be used to call one or more sequence variants. In some embodiments sequence variants may be called from multiple sequence alignments to the target region, or from a multiple sequence alignment. In some embodiments, the consensus sequence may be used to further refine target classification, for example by discriminating between similar targets that differ by one or more regions of variant sequence.
In some embodiments, barcodes are used in a combinatorial fashion, wherein more than one barcode is used to identify the origin. For example, the use of two instances of 96 barcodes in a combination provides 9216 identifiers, and the use of two instances of 384 barcodes provides 147456 identifiers.
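The identifier counts given above follow from pairing every barcode of a first set with every barcode of a second set. A minimal sketch of that arithmetic is shown below; the function name `combinatorial_identifiers` and the barcode labels are hypothetical and serve only to make the counting concrete.

```python
from itertools import product


def combinatorial_identifiers(first_set, second_set):
    """Return every (first barcode, second barcode) pair as an identifier."""
    return list(product(first_set, second_set))


# Hypothetical labels for two instances of the same 96-barcode set.
barcodes_96 = [f"BC{i:02d}" for i in range(1, 97)]
pairs = combinatorial_identifiers(barcodes_96, barcodes_96)
print(len(pairs))  # 96 * 96 identifiers
```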
In some embodiments, barcodes are present at the start of a sequencing read. In some embodiments, these barcodes are added prior to, or as part of, the addition of a sequencing adapter. Where barcodes are expected at the start of a read, reads in which those barcodes or their contexts are detected within the read can be discarded to improve specificity.
In some embodiments, barcodes are present at the end of a sequencing read. In some embodiments, barcodes are present within a sequencing read. In some embodiments the same barcode may appear multiple times within a sequencing read. In some embodiments a barcode at the start of a sequencing read and a barcode within the sequencing read are used in a combinatorial fashion.
In embodiments where multiple barcodes are present, the multiple barcodes can be used to further refine barcode calling specificity. By way of non-limitative example, in combinatorial barcoding, where not all combinations are expected, reads where an unexpected combination is detected can be discarded. Where multiple identical barcodes are expected within a read, or at the start and end of reads, reads where conflicting barcodes are detected may be discarded. Alternatively, multiple barcodes, and their flanking sequence, may be used to form a barcode consensus before classification. Alternatively, majority voting could be used to identify the barcode.
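The refinement strategies described above can be sketched as three small decision rules. This is an illustrative sketch under stated assumptions rather than the method of the disclosure: the function names are hypothetical, each read is assumed to arrive with a list of per-site barcode calls already made, and the majority rule shown requires a strict majority, which is one possible choice among several.

```python
from collections import Counter
from typing import Optional, Sequence, Set, Tuple


def call_by_agreement(calls: Sequence[str]) -> Optional[str]:
    """Keep a read only if every detected barcode is identical."""
    unique = set(calls)
    return next(iter(unique)) if len(unique) == 1 else None


def call_by_majority(calls: Sequence[str]) -> Optional[str]:
    """Return the barcode seen in more than half of the calls, else None.

    Assumption: a strict majority is required; ties are left unclassified.
    """
    if not calls:
        return None
    barcode, count = Counter(calls).most_common(1)[0]
    return barcode if 2 * count > len(calls) else None


def call_expected_combination(first: str, second: str,
                              expected: Set[Tuple[str, str]]
                              ) -> Optional[Tuple[str, str]]:
    """Discard combinatorial calls whose pairing was never used in the assay."""
    pair = (first, second)
    return pair if pair in expected else None


print(call_by_agreement(["BC06", "BC06"]))
print(call_by_majority(["BC06", "BC06", "BC04"]))
print(call_expected_combination("BC06", "BC01", {("BC06", "BC02")}))
```

Forming a consensus over the barcodes and their flanking sequence before classification, which is also mentioned above, would require the underlying sequences and is not covered by this sketch.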
In some embodiments, target sequences corresponding to more than one target are present. For example, multiple barcoded primers may be mixed to target multiple sequences or genes or regions of genes. In some embodiments multiple barcoded primers may share the same barcode for a given origin, with that barcode appearing in a different primer-dependent context.
In some embodiments multiple barcoded primers may have different barcodes for a given origin. In some embodiments one of the multiple targets may be interpreted as an in-assay control to indicate that amplification has occurred correctly and/or that the sample was collected correctly and/or for other control purposes.
Nucleic acid sequencing data of a target nucleic acid comprising a barcode or a plurality of target nucleic acids comprising respective barcodes are often obtained prior to performance of the methods described herein. Sequencing data may be obtained using any method known to a skilled person in the art. In some embodiments, sequencing data are obtained from measurement of a nucleic acid or plurality of nucleic acids using a single molecule sequencing device, a nanopore sequencing device, a zero-mode waveguide, or sequencing by synthesis. In some embodiments, the sequencing data produces nucleic acid reads that are at least 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 4, or 5 kilobases in length. In some embodiments, the sequencing data produces nucleic acid reads that are 0.5-1, 0.75-1.25, 1-1.5, 1.5-2.5, 2-4, 2.5-5, 3-5, 4-6, or 5-7 kilobases in length.
In some embodiments, nanopore sequencing involves the measurement of an electrical current as template nucleic acids pass through each pore on a flow-cell array. This measurement of electrical current can be used to determine the sequence identity of an unknown nucleic acid. Nanopore sequencing, which does not have a fixed run time, can be matched to the data requirements. As a consequence, data analysis can be performed in real time, and results can be returned very rapidly.
In some embodiments, to maintain the speed/rapidity advantages offered by nanopore sequencing, it is beneficial to use a correspondingly rapid method of library preparation to convert amplified nucleic acids into a form that is compatible with sequencing. In some embodiments, a rapid method of library preparation involves the use of a barcoded rapid library prep kit which uses a transposase to convert DNA to a barcoded library that is ready to sequence in approximately 10 minutes. In some embodiments, 96 barcodes are available, allowing prepared samples to be pooled for multiplexed sequencing.
In some embodiments, sequencing data may be obtained from measurement of a nucleic acid or plurality of nucleic acids using a variety of different sequencing methods, such as single molecule sequencing, sequencing by synthesis, or pyrosequencing. The detection means may be electrical or optical. Examples of single molecule sequencing include nanopore sequencing, and sequencing using a zero-mode waveguide such as SMRT sequencing using devices developed by Pacific Biosciences of California Inc., such as disclosed in WO2007/002893 and WO2009/120372. Examples of nanopore sequencing devices are disclosed in WO2015/055981, WO2014/064443, WO2017/149316, WO2019/002893, WO2015/110813 and WO2014/135838, hereby incorporated by reference in their entirety. Examples of sequencing by synthesis include ion semiconductor sequencing developed by Ion Torrent such as disclosed in WO2009/158006, sequencing based on fluorophore-labelled dNTPs with reversible terminator elements as developed by Illumina such as disclosed in WO00/18957, semiconductor chip-based single-molecule sequencing technology as developed by Roswell Technologies such as disclosed in WO16/210386 and sequencing by synthesis methods as developed by Genia Technologies, such as disclosed in WO2015/148402.
In some embodiments, a target nucleic acid comprising a barcode or a plurality of target nucleic acids comprising respective barcodes are amplified prior to performance of the methods described herein. Nucleic acids may be amplified using any method known to a skilled person in the art. In some embodiments, nucleic acids are amplified using loop-mediated isothermal amplification (LAMP), polymerase chain reaction, multiple displacement amplification, rolling circle amplification, or ligase chain reaction. In some embodiments, any technique known to a skilled person for adding a barcode to a target nucleic acid may be used. In some embodiments, a barcode is added to a target nucleic acid using chemical ligation or amplification techniques.
LAMP is a method of targeted isothermal amplification which can generate micrograms of product from tens of copies of a segment of a target nucleic acid, within 30 minutes at 65° C. Successful amplification is often inferred from a proxy measurement, such as increased turbidity, a color change or changes in fluorescence. However, although the LAMP reaction itself is very robust, these proxy measurements are less robust and can be affected by substances present in biological samples. It is also not uncommon to see a color change or increase in turbidity in no-template controls, arising from amplification of primer artefacts, which would lead to a false positive call. In some embodiments, a target nucleic acid is amplified using LAMP and then subsequently sequenced (e.g., using nanopore sequencing). On-target amplification events contain sequences that are not present in the primers and can be identified without ambiguity by alignment and subsequent scoring as described herein.
The disclosure further provides a kit for use in a method of the disclosure. In some embodiments, a kit comprises a plurality of nucleic acids, wherein each of the plurality comprises a respective barcode having fewer than ten nucleotides and at least one fixed context sequence. In some embodiments, each of the plurality comprises one fixed context sequence on each side of the barcode. In some embodiments, each of the plurality comprises a primer sequence, wherein the primer sequence is complementary to a segment of a target nucleic acid. The primer sequence may overlap with one of the context sequences in part or in full. In some embodiments, the kit further comprises one or more other reagents or instruments which enable any of the embodiments of the method. Such reagents or instruments include one or more of the following: suitable buffer(s) (aqueous solutions), means to obtain a sample from a subject (such as a vessel or an instrument comprising a needle), means to amplify and/or express polynucleotides, a membrane or voltage or patch clamp apparatus. Reagents may be present in the kit in a dry state such that a fluid sample is used to resuspend the reagents. The kit may also, optionally, comprise instructions to enable the kit to be used in the method of the disclosure. The kit may comprise a magnet or an electromagnet. The kit may, optionally, comprise nucleotides and/or a polymerase. Example polymerases suitable for use in RT-LAMP amplification and PCR include Bst DNA Polymerases and Taq DNA Polymerases, examples of which are available from New England BioLabs Inc.
An illustrative implementation of a computer system 1100 that may be used in connection with any of the embodiments of the technology described herein is shown in
Computing device 1100 may also include a network input/output (I/O) interface 1140 via which the computing device may communicate with other computing devices (e.g., over a network), and may also include one or more user I/O interfaces 1150, via which the computing device may provide output to and receive input from a user. The user I/O interfaces may include devices such as a keyboard, a mouse, a microphone, a display device (e.g., a monitor or touch screen), speakers, a camera, and/or various other types of I/O devices.
The above-described embodiments can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor (e.g., a microprocessor) or collection of processors, whether provided in a single computing device or distributed among multiple computing devices. It should be appreciated that any component or collection of components that perform the functions described above can be generically considered as one or more controllers that control the above-discussed functions. The one or more controllers can be implemented in numerous ways, such as with dedicated hardware, or with general purpose hardware (e.g., one or more processors) that is programmed using microcode or software to perform the functions recited above.
In this respect, it should be appreciated that one implementation of the embodiments described herein comprises at least one computer-readable storage medium (e.g., RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other tangible, non-transitory computer-readable storage medium) encoded with a computer program (i.e., a plurality of executable instructions) that, when executed on one or more processors, performs the above-discussed functions of one or more embodiments. The computer-readable medium may be transportable such that the program stored thereon can be loaded onto any computing device to implement aspects of the techniques discussed herein. In addition, it should be appreciated that the reference to a computer program which, when executed, performs any of the above-discussed functions, is not limited to an application program running on a host computer. Rather, the terms computer program and software are used herein in a generic sense to reference any type of computer code (e.g., application software, firmware, microcode, or any other form of computer instruction) that can be employed to program one or more processors to implement aspects of the techniques discussed herein.
The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of processor-executable instructions that can be employed to program a computer or other processor to implement various aspects of embodiments as discussed above. Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the disclosure provided herein need not reside on a single computer or processor, but may be distributed in a modular fashion among different computers or processors to implement various aspects of the disclosure provided herein.
Processor-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in one or more non-transitory computer-readable storage media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a non-transitory computer-readable medium that convey relationship between the fields. However, any suitable mechanism may be used to establish relationships among information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationships among data elements.
A series of simulations were performed using a query nucleic acid (e.g., target nucleic acid) comprising a particular barcode, BC06 (TCTGCATCGT (SEQ ID NO: 3)). The data were analysed against a set of 8 barcodes (BC01-BC08). As demonstrated below, the query nucleic acid was aligned against (1) the full length of a reference nucleic acid comprising BC06 and then scored using the barcode of the reference and the corresponding segment of the query to provide a score of 99; (2) the correct BC06 barcode alone and then scored using the barcode to provide a score of 12; and (3) the incorrect BC04 barcode alone and then scored using the barcode to provide a score of 14. Notably, the incorrect barcode BC04 provided a better score than the correct barcode BC06 because of a spurious alignment.
However, when the scoring region was increased to include the barcode sequences and three contiguous nucleotides from context sequences on either side of the barcode, the correct barcode BC06 provided a score of 20 while the incorrect barcode BC04 provided a lower score of 18. This example demonstrates that the use of flanking nucleotides from the context sequence when scoring after an alignment provides discrimination of a correct barcode relative to an incorrect barcode.
A series of simulations were performed using a query nucleic acid (e.g., target nucleic acid) comprising a particular barcode, BC06 (TCTGCATCGT (SEQ ID NO: 3)). The data were analysed against a set of 8 barcodes (BC01-BC08). As demonstrated below, the query nucleic acid was aligned against (1) the full length of a reference nucleic acid comprising BC06 and then scored using the barcode of the reference and the corresponding segment of the query to provide a score of 102; (2) the correct BC06 barcode alone and then scored using the barcode to provide a score of 14; (3) the incorrect BC07 barcode alone and then scored using the barcode to provide a score of 15. Notably, the incorrect barcode BC07, which aligned to a similar location on the query as did BC06 barcode, provided a better score than the correct barcode BC06 because it was able to steal a nucleotide from the right-hand context (underlined twice).
However, when the scoring region was increased to include the barcode sequences and three contiguous nucleotides from context sequences on either side of the barcode, the correct barcode BC06 provided a score of 26 while the incorrect barcode BC07 provided a lower score of 22. This example demonstrates that the use of flanking nucleotides from the context sequence when scoring after an alignment provides discrimination of a correct barcode relative to an incorrect barcode that aligns to a similar location as the correct barcode and is capable of stealing a nucleotide from a context sequence.
A series of simulations were performed using a query nucleic acid (e.g., target nucleic acid) comprising a particular barcode, BC06 (TCTGCATCGT (SEQ ID NO: 3)). In this example, there were 15 errors in the full sequence alignment out of 60 reference bases (25%). However, restriction of scoring to a flanked barcode (the barcode sequence and three contiguous nucleotides from context sequences on either side of the barcode) lowered the error rate such that only 2 errors were captured out of 16 reference bases (12.5%). This sequence would be classified correctly based on focused scoring, but would be discarded as incorrect based on a full scoring with a reasonable score threshold.
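The difference between full scoring and focused scoring can be illustrated by counting alignment errors only within the columns of an existing alignment that fall inside the scoring region. The sketch below is a minimal illustration under several assumptions: the alignment is supplied as two equal-length gapped strings produced by any aligner, the function name `focused_errors` and the toy sequences are hypothetical, errors are counted as mismatches, deletions and insertions of equal weight, and an insertion is counted only when it lies strictly inside the region. It does not reproduce the sequences or counts of the example above.

```python
from typing import Tuple


def focused_errors(ref_aln: str, query_aln: str,
                   start: int, end: int) -> Tuple[int, int]:
    """Count errors in reference positions [start, end) of a gapped alignment.

    Returns (errors, reference bases scored). Gaps are written as '-'.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned strings must have equal length")
    errors = 0
    ref_pos = 0  # number of reference bases consumed so far
    for ref_base, query_base in zip(ref_aln, query_aln):
        if ref_base == "-":
            # Insertion in the read, located between ref_pos - 1 and ref_pos.
            if start < ref_pos < end:
                errors += 1
            continue
        # Mismatch or deletion at reference position ref_pos.
        if start <= ref_pos < end and query_base != ref_base:
            errors += 1
        ref_pos += 1
    return errors, end - start


# Toy reference: 10-base context, BC06 barcode, 10-base context (contexts are
# hypothetical). The read has one mismatch in the left context, one mismatch
# in the barcode, and one deletion in the right context.
ref_aln = "ACGTACGTAATCTGCATCGTGGCCTTAAGG"
qry_aln = "ATGTACGTAATCAGCATCGTGGCCT-AAGG"

full = focused_errors(ref_aln, qry_aln, 0, 30)      # whole reference
focused = focused_errors(ref_aln, qry_aln, 7, 23)   # barcode +/- 3 bases
for label, (errors, bases) in (("full", full), ("focused", focused)):
    print(label, errors, bases, errors / bases)
```

In this toy case the errors outside the flanked barcode do not contribute to the focused count, so the focused error rate is lower than the full-length rate, mirroring the behaviour described above.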
In this example, had the flanked barcode region been considered, 4 errors would have been counted in both the correct (underlined once) and incorrect (underlined twice) barcode classifications out of 16 reference bases (25%). This sequence would not be classified with a reasonable scoring threshold. However, because there were only 4 further errors in the full context sequences, the total error rate was 13% which may exceed a reasonable scoring threshold based on full sequences.
Plotting number of correct identifications vs number of incorrect identifications from 1000 simulated examples of fixed context containing BC05. Focused scoring on the expected position of the flanked barcode sequence in the alignment achieved similar benefits to alignment of the flanked sequences. Performance was further improved relative to alignment of flanked sequences, since the focused scoring filters out spurious alignments that occur outside of the expected region (e.g. in the primer sequences). In this example, addition of 1 flanking base was sufficient to fully realise a benefit.
A nucleic acid sequencing read contains two barcode contexts for the AS1 target (a SARS-CoV-2 target), one of which contains the correct barcode (FIP08), the other contains a spurious alignment to an incorrect barcode (FIP02). For the correct hit (top) the addition of a flanking base to the scoring region (FL=1) had no effect on edit distance (ED), however for the incorrect hit (bottom) the addition of flanking bases increased the edit distance. In this particular case the increase in specificity of barcode assignment also led to an increase in sensitivity as this read failed QC due to conflicting barcode assignments at FL=0 but had a unique assignment at FL=1.
SARS-CoV-2 emerged in late 2019 and spread rapidly around the world, causing hundreds of thousands of COVID-19-related deaths. The discovery of the first SARS-CoV-2 genome sequence allowed the development of tests for the presence or absence of viral RNA from biological samples, which provide a way to identify people who are infected by the virus. Although there is some uncertainty about how infectious asymptomatic people are, it is more certain that many people can transmit the virus while being pre-symptomatic, or having mild symptoms. As a consequence, rather than only testing people who show symptoms, it is becoming important to enable frequent and routine screening of large numbers of people who are not presently showing symptoms, to help return to pre-pandemic activities more safely. For wide-scale screening to be worthwhile, it is important to have assays that are high throughput, accurate and very fast. Epidemiological models show that testing frequency and time-to-result are important components of a surveillance system. However, there is little benefit in being able to screen a large number of samples if the results are not made available quickly enough to inform quarantine decisions or contact tracing. In the US, many labs are taking 5-7 days or more to turn tests around.
This example describes a method which combines the rapid target-specific amplification provided by LAMP, a method of transposase-based library preparation, and real-time nanopore sequencing and data analysis. The resulting combination, LamPORE, is rapid, sensitive and highly scalable, and its efficacy for detecting the presence or absence of SARS-CoV-2 RNA in clinical samples is demonstrated here. The end-to-end procedure, beginning with 96 RNA extracts, and ending with positive and negative calls, can be performed in 115 minutes when sequencing on a MinION or GridION. The number of samples that can be sequenced in parallel can be increased by expanding either the number of LAMP barcodes or the number of ONT rapid barcodes. In these circumstances, it was useful to extend the length of the sequencing run. When using 12 different LAMP barcodes combined with 96 rapid barcodes (=1,152 samples) it was found that 4 hours of MinION sequencing was sufficient. After sequencing it is possible to remove the sample strands from the flowcell with a nuclease flush, and to load a fresh set of samples. Because the PromethION has a larger number of pores per flowcell, the corresponding sequencing run is shorter on that device; alternatively, it was possible to use larger multiplexes.
The remaining bottleneck in the current end-to-end workflow was that of extracting RNA from the biological samples. Recent publications have indicated that saliva is a suitable source of SARS-CoV-2 RNA in infected patients, and it has been found that following heat-treatment, saliva with spiked-in inactivated SARS-CoV-2 virions can be amplified and sequenced successfully.
LAMP is capable of amplifying several targets simultaneously. LamPORE relies on sequencing, as opposed to a colour change, which raises the possibility of using a single multiplexed LamPORE reaction to detect many different pathogens. In the case of co-infection, it should also be possible to identify which combination of pathogens is present. A LamPORE assay is currently being developed to cover several viral respiratory illnesses including influenza, in multiplex.
Methods
A multiplexed amplification reaction, in which three separate regions of the SARS-CoV-2 genome were targeted, was shown to perform with high sensitivity (to around Ct37 as measured by RT-qPCR). In addition, the inclusion of a fourth primer set, targeting human actin mRNA, allowed true negatives to be distinguished from invalid results, where the initial sample was not taken or processed adequately. Suboptimal sampling was suspected to be responsible for false negative results in many SARS-CoV-2 tests (8, 9). Starting with RNA extracted from swabs, results were obtained from a small number of samples in approximately an hour, and from 96 samples in approximately 115 minutes. This assay was simple to scale from a small number of samples to thousands, with greater degrees of multiplexing achievable by increasing the numbers of LAMP barcodes and/or rapid barcodes.
1. Amplification and Library Preparation
Primer sequences for the amplification of three SARS-CoV-2 targets and human actin mRNA were obtained from New England Biolabs and short barcodes were added to the forward inner primers (FIP) as described. Primers were synthesised and HPLC-purified by IDT (Coralville, Iowa). The concentration of actin primers was intentionally lower than for the SARS-CoV-2 primers to prevent amplification of the human target overwhelming any SARS-CoV-2 amplification.
For each FIP barcode, a 10× primer pool was prepared in 400 mM guanidine hydrochloride, containing each oligonucleotide at the appropriate concentration. Reactions were performed in 96-well plates in such a way that each well in a row received the same barcoded FIP primer mix, with different barcoded FIPs being used in the different rows. Each LAMP reaction consisted of 25 μl 2× LAMP Master Mix (NEB E1700), 5 μl 10× primer pool and 20 μl RNA sample (or no-template control). Reactions were incubated at 65° C. for 35 minutes, followed by 80° C. for 5 minutes. Following amplification, reactions were pooled by column, giving 12 pools, each consisting of 8 separate reactions (
Library preparation was performed separately on each of the 12 pools, in a volume of 10 μl per reaction. Each reaction consisted of 6.5 μl nuclease-free water, 1 μl of pooled LAMP product and 2.5 μl of the appropriate rapid barcode (Oxford Nanopore Technologies, SQK-RBK004). Reactions were mixed and spun down, before being incubated at 30° C. for 2 minutes and then 80° C. for 2 minutes. All reactions were then pooled into a single 1.5 ml Eppendorf LoBind tube.
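The plate layout described above means that each well is identified by a pair of barcodes: the FIP barcode encodes the plate row and the rapid barcode encodes the column pool. A minimal sketch of that lookup follows. The barcode labels (`FIP01`, `RB01`, and so on) and the function name `well_for_read` are hypothetical and are used only for illustration.

```python
import string

ROWS = string.ascii_uppercase[:8]   # plate rows A-H, one FIP barcode per row
COLUMNS = range(1, 13)              # plate columns 1-12, one rapid barcode per column pool

# Hypothetical barcode labels; the actual primer and kit names may differ.
FIP_BY_ROW = {row: f"FIP{i + 1:02d}" for i, row in enumerate(ROWS)}
RBK_BY_COLUMN = {col: f"RB{col:02d}" for col in COLUMNS}

# Invert the layout so that a (FIP, rapid barcode) pair found in a read
# resolves to the well the sample was loaded into.
WELL_BY_BARCODES = {
    (FIP_BY_ROW[row], RBK_BY_COLUMN[col]): f"{row}{col}"
    for row in ROWS
    for col in COLUMNS
}


def well_for_read(fip_barcode, rapid_barcode):
    """Return the well name for a barcode pair, or None if the pair is unknown."""
    return WELL_BY_BARCODES.get((fip_barcode, rapid_barcode))
```

With 8 FIP barcodes and 12 rapid barcodes the table holds 96 entries, one per well, which is why adding barcodes of either type increases the number of samples per run.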
The pooled products were purified using 0.8× AMPure beads, were washed in fresh 80% ethanol and were eluted in 15 μl EB buffer. 11 μl of eluate was transferred to a clean 1.5 ml Eppendorf LoBind tube, along with 1 μl rapid adapter (RAP). Reactions were incubated for 5 minutes at room temperature, before being sequenced on a single MinION flowcell for 1 hour, following the manufacturer's instructions.
2. Data Analysis
In order to call the presence or absence of virus in the sample, the number of reads from each LAMP target may be counted for each sample in the sequencing run. This requires the accurate identification of i) the barcode added during library preparation by the rapid barcoding kit (RBK), ii) the barcode added as part of the FIP primer during the LAMP reaction and iii) the sequence of the LAMP product associated with each target region.
The RBK barcodes were identified using the guppy_barcoder software (version 4.0.11; command line options “--barcode_kits SQK-RBK004 --detect_mid_strand_barcodes --min_score_mid_barcodes 40”).
The FIP barcode was detected in a two-step process. Firstly, candidate regions were identified by aligning a sequence consisting of the FIP primer with Ns in place of the barcode sequence against all reads using the VSEARCH tool (11) (version 2.14.2; command line options: “--maxaccepts 0 --maxrejects 0 --id 0.75 --strand both --wordlength 5 --minwordmatches 2”). This returned a maximum of 2 candidate regions for each read which were subsequently filtered to remove alignments shorter than 30 nucleotides.
The second step identified the actual barcode sequence within the candidate region. A strategy was selected to maximise discrimination for these relatively short sequences. Aligning and scoring over the whole candidate region reduced discrimination due to the possibility of sequencing errors in the flanking primer regions. Restricting scoring to only the barcode sequence reduced discrimination due to alignment artifacts around the ends of the barcodes. To avoid such alignment artifacts, whilst maintaining discrimination, 1 nucleotide of the flanking primer sequence was added to each barcode before alignment within the candidate region. Each of the expanded FIP barcode sequences was aligned against the candidate region using the edlib package allowing a maximum edit distance of 1.
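The second step can be sketched as follows. This is not the edlib-based implementation used in the analysis: it is a pure-Python stand-in that computes the same kind of infix edit distance (the expanded barcode may match anywhere inside the candidate region). The function names are hypothetical, the flanking bases and barcodes are supplied by the caller, and returning no call when two barcodes tie is an assumption of this sketch rather than something stated in the text.

```python
def infix_edit_distance(query, target):
    """Smallest edit distance between query and any substring of target."""
    # previous[j] is the best distance for aligning the query bases seen so
    # far to a substring of target ending at position j. Row 0 is all zeros
    # because the alignment may start anywhere in the target at no cost.
    previous = [0] * (len(target) + 1)
    for i, q in enumerate(query, start=1):
        current = [i] + [0] * len(target)
        for j, t in enumerate(target, start=1):
            current[j] = min(
                previous[j - 1] + (q != t),  # match or substitution
                previous[j] + 1,             # query base with no partner in the target
                current[j - 1] + 1,          # target base with no partner in the query
            )
        previous = current
    # The alignment may also end anywhere in the target.
    return min(previous)


def call_fip_barcode(candidate_region, barcodes, left_flank, right_flank, max_distance=1):
    """Return the name of the unique best-matching barcode, or None."""
    hits = []
    for name, barcode in barcodes.items():
        # Expand the barcode with flanking primer bases before alignment.
        expanded = left_flank + barcode + right_flank
        distance = infix_edit_distance(expanded, candidate_region)
        if distance <= max_distance:
            hits.append((distance, name))
    hits.sort()
    # Assumption: a tie between the two best barcodes gives no call.
    if not hits or (len(hits) > 1 and hits[0][0] == hits[1][0]):
        return None
    return hits[0][1]
```

For example, with the made-up barcodes `{"FIP01": "ACGTAC", "FIP02": "TGCATG"}` and single flanking bases `"G"` and `"C"`, a candidate region containing `GACGTACC` is assigned to `FIP01`, and it is still assigned to `FIP01` if one base of the barcode is miscalled.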
The LAMP product associated with each read was identified using the same VSEARCH parameters to align the genome/transcript sequence spanning the F2-B1 primer locations against each read. A valid LAMP product was detected if the alignment was longer than 80 nucleotides and had greater than 80% identity.
The multimeric nature of the LamPORE reads allowed an additional layer of quality control. Each read may only contain sequence from a single LAMP target for a single sample, therefore reads with multiple rapid barcodes, conflicting FIP barcodes or incompatible FIP-product pairings were removed from further consideration. The specific nature of the sequencing analysis allowed non-specific amplification, for example primer artefacts, to be measured and excluded. Reads with RBK and FIP classifications, but which failed product classification or contained conflicting product regions, were counted as “unclassified”.
Per-sample results of the assay were returned as either positive, negative, inconclusive, or invalid. The calls were made based on the aggregated read counts for each sample across the various targets (i.e. human actin and the three SARS-CoV-2 target regions) and cutoffs were chosen based on 1 hour of sequencing. An invalid call was returned if <50 total classified reads were obtained from across all targets (including both human actin and SARS-CoV-2). A negative call was returned if a sum of <20 reads were obtained from the three SARS-CoV-2 targets (and >=50 reads in total). An inconclusive call was returned if a sum of >=20 and <50 reads were obtained from the three SARS-CoV-2 targets. A positive call was returned if a sum of >=50 reads was obtained from the three SARS-CoV-2 targets.
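The calling rules above can be expressed as a short decision function. The sketch below assumes that per-target read counts are held in a dictionary keyed by target name; the key `"actin"` and the function name `call_sample` are hypothetical. Checking the invalid condition first is also an assumption, since the text does not state which rule applies when a sample has 20 to 49 SARS-CoV-2 reads but fewer than 50 classified reads in total.

```python
SARS_COV_2_TARGETS = ("AS1", "E1", "N2")


def call_sample(read_counts):
    """Classify one sample from its per-target classified read counts."""
    viral = sum(read_counts.get(target, 0) for target in SARS_COV_2_TARGETS)
    total = viral + read_counts.get("actin", 0)  # "actin" key is hypothetical
    if total < 50:
        return "invalid"        # too few classified reads across all targets
    if viral < 20:
        return "negative"       # enough reads overall, few SARS-CoV-2 reads
    if viral < 50:
        return "inconclusive"   # 20 to 49 SARS-CoV-2 reads
    return "positive"           # 50 or more SARS-CoV-2 reads
```

For instance, a sample with 200 actin reads and 5 E1 reads is called negative, while a sample with only 10 actin reads and no viral reads is called invalid.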
iii) ROC and F1 Score Curves
To evaluate the sensitivity and specificity of the assay against the known status of 80 COVID-19 positive clinical RNA samples and a similar number of human RNA-only negatives, receiver operating characteristic (ROC) curves were generated using the metrics.roc_curve function from the scikit-learn package. The sum of read counts across each of the three SARS-CoV-2 targets (AS1, E1, and N2) served as the scoring metric for calling the results positive, negative, inconclusive, or invalid. The ROC curve therefore revealed the sensitivity and specificity of the assay at various thresholding values of that scoring metric. In addition to the curve generated for the SARS-CoV-2 read count sum, curves for read counts were also generated from each individual SARS-CoV-2 target.
The F1 score represents the harmonic mean of the sensitivity and specificity of the assay, defined as 2*[(1−FPR)*TPR]/[(1−FPR)+TPR], where TPR is the true positive rate and FPR is the false positive rate. The read count threshold (>=50 total SARS-CoV-2 target reads) was chosen in order to maximize the F1 score.
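As an illustration of how a threshold could be chosen with this score, the sketch below evaluates the formula given above at several candidate thresholds and keeps the best one. The function names are hypothetical, and the read counts in the usage note are invented for the example; they are not data from the study.

```python
def harmonic_score(tpr, fpr):
    """Harmonic mean of sensitivity (TPR) and specificity (1 - FPR)."""
    specificity = 1.0 - fpr
    if tpr + specificity == 0:
        return 0.0
    return 2.0 * specificity * tpr / (specificity + tpr)


def best_threshold(positive_counts, negative_counts, thresholds):
    """Return (threshold, score) with the highest score.

    A sample is called positive when its read count is >= threshold.
    On ties, the first threshold encountered is kept.
    """
    best = None
    for threshold in thresholds:
        tpr = sum(c >= threshold for c in positive_counts) / len(positive_counts)
        fpr = sum(c >= threshold for c in negative_counts) / len(negative_counts)
        score = harmonic_score(tpr, fpr)
        if best is None or score > best[1]:
            best = (threshold, score)
    return best
```

With invented counts of `[100, 200, 60]` for known positives and `[0, 5, 30]` for known negatives, and candidate thresholds `[10, 50, 80]`, the threshold of 50 separates the two groups completely and is selected.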
Results
An assay that targeted a single locus from the SARS-CoV-2 genome would potentially lack robustness to sequence variants that occur as the virus evolves. To overcome this, three different regions were targeted in the SARS-CoV-2 genome, in a single multiplexed reaction. These are ORF1a and the envelope (E) and nucleocapsid (N) genes, with primer sets AS1 (10), E1 and N2 (14), respectively. In addition, as a control for the quality of the initial sample preparation, RNA extraction, reverse transcription and LAMP amplification, a set of primers were included to amplify the human actin mRNA (14). The primers target either side of a splice junction and do not amplify from genomic DNA. As long as the sample has been taken and prepared correctly, actin RNA may be present in all the swab samples, regardless of their SARS-CoV-2 status, and so this provides a way to differentiate between true negatives and invalid samples.
To assess the inclusivity of the triplex SARS-CoV-2 assay, all primer sequences were aligned to the 46,872 human SARS-CoV-2 genomes deposited at GISAID on Jun. 16, 2020. Since not all genomes were high coverage or complete, 2,105 sequences belonging to 1,939 samples were excluded from analysis of at least one primer set because they covered fewer than 90% of all bases in that region. Of the 44,933 genomes with sufficient coverage in all three regions, 2,554 (5.68%) genomes and 179 (0.40%) genomes had a mismatch in one or two primer sets respectively, but a full match for the others. Only 2 (0.004451%) genomes had a mismatch in all three primer sets. The primer sets used have a 100% match with most sequences: 97.1% for AS1, 98.7% for E1, and 97.6% for N2. Given the widespread mutations that have been identified in SARS-CoV-2, each primer set had one mismatch for 1.3-2.9% of the strains deposited in GISAID (Table 1). The presence of a single mismatch, however, was unlikely to have a significant impact on the limit of detection, as previously shown in work on MERS-CoV LAMP assays.
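The genome counts and percentages quoted above are mutually consistent, as the following short check shows. It uses only the figures stated in the paragraph; the helper name `percent_of` is hypothetical.

```python
def percent_of(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


deposited = 46_872   # genomes deposited at GISAID
excluded = 1_939     # samples excluded for insufficient coverage
analysed = deposited - excluded

one_set = percent_of(2_554, analysed)    # mismatch in one primer set
two_sets = percent_of(179, analysed)     # mismatch in two primer sets
three_sets = percent_of(2, analysed)     # mismatch in all three primer sets

print(analysed, round(one_set, 2), round(two_sets, 2), round(three_sets, 6))
```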
In order to assess the potential for cross-reactivity with other viruses, the LAMP primer sequences were aligned against sequences of common viruses as well as coronaviruses related to SARS-CoV-2. Sequence identity was determined by dividing the sum of aligned primer bases by the sum of primer lengths (Table 2).
Bordetella pertussis (BPP-1)
Candida albicans (L757)
Haemophilus influenzae
Legionella pneumophila
Pseudomonas aeruginosa
Staphylococcus epidermidis
Streptococcus pneumoniae
Streptococcus pyogenes (AP1)
Streptococcus salivarius
SARS-CoV, which is closely related to SARS-CoV-2, was the sole virus to have a match against the total sequence length of the SARS-CoV-2 primers greater than the recommended threshold of 80%. The E-gene primer set has a match >90% with SARS-CoV, but the AS1 and N2 primer sets differ significantly, matching at only 44.5% and 74%, respectively. The likelihood of a false positive is low since SARS-CoV is not known to be in active circulation at present. Furthermore, should this situation change, the presence/absence stage of the analysis can be modified to identify positive results that are dependent entirely on amplification of the E-gene primer.
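The identity measure used for this cross-reactivity assessment (sum of aligned primer bases divided by sum of primer lengths) and its comparison with the 80% threshold can be sketched as below. The function names are hypothetical, and the numbers in the usage note are invented for illustration; they are not values from Table 2.

```python
def primer_set_identity(aligned_bases, primer_lengths):
    """Sum of aligned primer bases divided by the sum of primer lengths."""
    if len(aligned_bases) != len(primer_lengths):
        raise ValueError("one aligned-base count is needed per primer")
    total_length = sum(primer_lengths)
    if total_length == 0:
        raise ValueError("primer lengths must sum to a positive value")
    return sum(aligned_bases) / total_length


def exceeds_threshold(identity, threshold=0.80):
    """True when identity is greater than the cross-reactivity threshold."""
    return identity > threshold
```

For example, three primers of lengths 20, 20 and 40 with 18, 20 and 30 aligned bases give an identity of 68/80 = 0.85, which is above the 80% threshold.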
LAMP products contain multiple copies of each ˜150 bp target region joined end-to-end, forming strands of up to approximately 5 kb, with consecutive copies of the target region in alternating orientation.
More than one forward and reverse primer was used in each LAMP reaction at each target region, so the repeating units were not of a uniform length.
Primer artifacts can accumulate during the LAMP reaction, so judging successful amplification by a proxy measurement, such as a colour change or an increase in turbidity, can produce a false positive call. This is avoided when sequencing is used as a readout: reads are aligned to a reference sequence, and for a read to be considered valid, it may consist of inverted repeats of large stretches of the target region, including target-specific sequences that are not present in the primers. Alignments of valid reads were contiguous across the majority of the target region.
iii) FIP Barcode Optimisation
Verification of the FIP barcodes for each target was carried out using a dilution series of the Twist Synthetic RNA Control 2 (Twist Biosciences) for the SARS-CoV-2 loci and total human RNA extracted from GM12878 (Coriell) for the actin control. Template quantities ranged from 20 to 250 copies per reaction. It was observed that not only does the presence of the barcode influence the sensitivity of the reaction, but the sequence of the barcode also affects performance, with some barcoded FIPs working with higher sensitivity than others. The worst-performing barcoded FIPs were excluded and in this way the initial 12 barcodes were reduced to the best-performing 8, all of which were capable of amplifying from 20 copies in a 50 μl LAMP reaction.
To expand the evaluation of the assay's performance, 80 clinical samples, consisting of RNA which had been extracted from nasopharyngeal swabs, were obtained. The samples had been found to be positive for SARS-CoV-2 RNA by RT-qPCR, and spanned a range of Ct values, from Ct=19 for the highest viral load to Ct=38 for the lowest. In the absence of RT-qPCR-verified negative samples, a similar number of reaction negatives using total human RNA were prepared. A sufficient number of sequences corresponding to the actin control fragment were obtained in all negative samples for these to be called as valid, and in 81 out of 85 samples, a negative call was obtained. Read-count results indicated that the amplification of targets E1 or N2 in the four false positives was due to contamination. Out of the 80 RT-qPCR-verified positives, 79 were called positive in the LamPORE analysis. The false negative corresponded to the sample with the lowest viral load, Ct=38. The two samples at Ct=37 were called positive (Table 3).
ROC curves generated from 80 COVID-19 positive clinical samples and 85 human RNA-only negatives show good concordance between SARS-CoV-2 detection via the LamPORE assay and the RT-qPCR verified status, with an area under the curve (AUC) of 0.993 for the metric used in calling the results (sum of SARS-CoV-2 target reads).
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application U.S. Ser. No. 63/063,178 filed Aug. 7, 2020, the contents of which are incorporated herein by reference. This application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Aug. 4, 2020 is named 0036670103US01-SEQ-MSB and is 3,760 bytes in size.
Number | Date | Country
---|---|---
63063178 | Aug 2020 | US