The present disclosure relates generally to systems and methods for detecting oncogenic pathogen infections in cancer patients.
Oncogenic pathogen infections account for 10 to 12% of all cancers. For example, consider the case of gastric cancer, which is the third most common cause of cancer death worldwide, with more than 700,000 deaths estimated to have occurred in 2012. See, Ferlay, et al., 2013, “Cancer Incidence and Mortality Worldwide,” IARC CancerBase 11, [Internet]. Lyon, France: International Agency for Research on Cancer. Beyond genetic factors, gastric carcinogenesis is thought to be associated with multiple environmental factors. Among the environmental factors, increasing evidence suggests that a subset of gastric cancers is associated with Epstein-Barr virus (EBV) infection. See, Burke et al., 1990, “Lymphoepithelial carcinoma of the stomach with Epstein-Barr virus demonstrated by polymerase chain reaction,” Mod Pathol. 3:377–380. In fact, recent cancer genome atlas research has provided a molecular classification defining EBV-positive gastric cancer as a specific subtype. See, 2014, “Cancer Genome Atlas Research Network. Comprehensive molecular characterization of gastric adenocarcinoma,” Nature. 513, pp. 202-09.
As such, the presence of such oncogenic pathogens affects the prognosis of the associated cancer. Accordingly, when a subject has a type of cancer that is known to frequently arise in conjunction with an oncogenic pathogen, knowledge of the pathogen status of the subject is important to have because it may change the treatment options of the subject. For example, numerous clinical trials investigating the benefit of radiation or chemotherapy dose reduction for HPV positive head and neck cancers have shown promising results. Additionally, pathogen-associated tumors are more likely to present higher levels of inflammation and immune infiltration, which make them good candidates for immunotherapy.
A drawback with conventional diagnosis is that, in order to determine whether a subject is afflicted with a particular pathogen, a completely independent assay is performed separate and apart from the assays that were used to diagnose a subject with cancer in the first instance, or used to evaluate a stage of the cancer. For example, in the case of EBV, separate laboratory methods such as in situ hybridization (ISH) or polymerase chain reaction (PCR) for resected tissue, biopsy, or blood, or enzyme-linked immunosorbent assay (ELISA) or immunofluorescence assay (IFA) for serum samples, are performed to detect the EBV infection. This is unsatisfactory because it increases the expense of diagnosis and, in some instances, where the pathogen test is only run after a type of cancer that is known to be associated with an oncogenic pathogen has been diagnosed, delays the development of a treatment plan for the subject until the pathogen assay results have been obtained.
Given the above background, what is needed in the art are improved systems and methods for pathogen detection that directly determine the presence of a given pathogen without requiring a separate, independent assay for the pathogen detection.
Accordingly, improved methods for distinguishing cancers associated with oncogenic pathogen infections that contribute to the cancer pathology and cancers that are not associated with oncogenic pathogen infections are provided. Improved methods are also provided for treating cancer patients based on whether their cancer is associated with an oncogenic pathogen infection. The present disclosure addresses these needs, for example, by providing methods for determining whether a subject is afflicted with an oncogenic pathogen based on sequencing data generated from a biological sample of the subject. In some embodiments, these methods include computational subtraction of human sequence reads prior to alignment of the remaining sequence reads against oncogenic pathogen reference constructs.
One aspect of the present disclosure provides a method of determining whether a subject is afflicted with an oncogenic pathogen. The method includes obtaining sequencing data from a nucleic acid sample isolated from a biological sample of the subject and determining whether each sequence read aligns to a human reference genome. The method then includes determining whether sequence reads that do not align to the human reference genome align to a reference genome of an oncogenic pathogen. The method also includes, for each respective oncogenic pathogen in a plurality of oncogenic pathogens, tracking the number of sequence reads that (i) fail to align to the human reference genome and (ii) align to the reference genome of the respective oncogenic pathogen, thereby obtaining a sequence read count for each oncogenic pathogen. The method then includes using the sequence read count for each oncogenic pathogen to ascertain whether the subject is afflicted with an oncogenic pathogen.
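By way of non-limiting illustration only, the read-counting steps of this aspect can be sketched as follows. The sketch is a simplification under stated assumptions: the function names, the toy reference strings, and the `min_reads` cutoff are hypothetical and do not appear in the disclosure, and exact substring matching stands in for a real alignment algorithm.

```python
from collections import Counter
from typing import Dict, Iterable


def aligns(read: str, reference: str) -> bool:
    """Toy stand-in for an aligner: exact substring match (assumption)."""
    return read in reference


def count_pathogen_reads(
    reads: Iterable[str],
    human_ref: str,
    pathogen_refs: Dict[str, str],
) -> Counter:
    """Count reads that (i) fail to align to the human reference and
    (ii) align to each pathogen reference."""
    counts = Counter({name: 0 for name in pathogen_refs})
    for read in reads:
        if aligns(read, human_ref):
            continue  # computational subtraction of human reads
        for name, ref in pathogen_refs.items():
            if aligns(read, ref):
                counts[name] += 1
    return counts


def call_pathogens(counts: Counter, min_reads: int) -> Dict[str, bool]:
    """Hypothetical decision rule: call a pathogen when its read count
    meets a fixed cutoff. The disclosure does not mandate this rule."""
    return {name: n >= min_reads for name, n in counts.items()}


# Toy data, invented for illustration only.
human_ref = "ACGTACGTTTGACCA"
pathogen_refs = {"EBV": "GGGCCCAAATTT", "HPV": "TTTTCCCCGGGGAAAA"}
reads = ["ACGTAC", "GGGCCC", "CCAAAT", "CCCCGG", "NNNNNN"]

counts = count_pathogen_reads(reads, human_ref, pathogen_refs)
calls = call_pathogens(counts, min_reads=2)
```

In practice, the `aligns` placeholder would be replaced by an alignment against an indexed reference, and the decision rule could be any function of the per-pathogen read counts.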
In some embodiments, the method includes isolating nucleic acids from the biological sample of the subject, and hybridizing the isolated nucleic acids to a probe set including (i) a plurality of nucleic acid probes for a plurality of human genomic loci and (ii) a respective set of nucleic acid probes for genomic loci of each respective oncogenic pathogen in a plurality of oncogenic pathogens.
In some embodiments, determining whether each sequence read aligns to the human reference genome is performed using an index-based alignment algorithm.
In some embodiments, determining, for each respective sequence read that does not align to the human reference genome, whether the respective sequence read aligns to a reference genome for an oncogenic pathogen is performed using an index-based alignment algorithm. In some such embodiments, this alignment is further confirmed by performing a competitive alignment against the human reference genome.
In some embodiments, the results of the method are further used to generate a clinical report about the cancer status of the subject. In some embodiments, the clinical report includes information selected from whether the subject is afflicted with cancer, a type of cancer the subject is afflicted with, a primary origin of a cancer the subject is afflicted with, a recommendation for treatment of a cancer the subject is afflicted with, and a prognosis for the subject.
In some embodiments, a method is provided for determining whether a subject is afflicted with an oncogenic pathogen by sequencing both DNA and RNA obtained from one or more biological samples from the subject. In some embodiments, the method includes making a first determination of whether the subject is afflicted with an oncogenic pathogen based on the DNA sequencing data, using one or more of the methods disclosed herein, and a second determination of whether the subject is afflicted with an oncogenic pathogen based on the RNA sequencing data, using one or more of the methods disclosed herein, and then combining the first and second determinations to make a final determination of whether the subject is afflicted with an oncogenic pathogen. In some embodiments, the combining includes determining whether both the first determination and the second determination indicate that the subject is afflicted with the oncogenic pathogen and accepting the determination if both indicate that the subject is afflicted with the oncogenic pathogen or rejecting the determination if at least one of the determinations does not indicate that the subject is afflicted with the oncogenic pathogen. In some embodiments, the combining includes determining whether either of the first determination and the second determination indicate that the subject is afflicted with the oncogenic pathogen and accepting the determination if at least one of the determinations indicates that the subject is afflicted with the oncogenic pathogen or rejecting the determination if both of the determinations do not indicate that the subject is afflicted with the oncogenic pathogen. 
In some embodiments, the first determination and the second determination are each a probability or likelihood that the subject is afflicted with the oncogenic pathogen and the combining includes averaging the probabilities or likelihoods to generate a final probability or likelihood that the subject is afflicted with the oncogenic pathogen.
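The three combining strategies described above (requiring both determinations, accepting either determination, and averaging probabilities) can be sketched as follows. This is a minimal illustration only; the function names are hypothetical and are not part of the disclosure.

```python
def combine_and(dna_call: bool, rna_call: bool) -> bool:
    """Accept only if both the DNA-based and RNA-based determinations
    indicate that the subject is afflicted."""
    return dna_call and rna_call


def combine_or(dna_call: bool, rna_call: bool) -> bool:
    """Accept if at least one determination indicates that the subject
    is afflicted."""
    return dna_call or rna_call


def combine_average(p_dna: float, p_rna: float) -> float:
    """Average two probabilities or likelihoods into a final value."""
    return (p_dna + p_rna) / 2.0
```

The conjunctive rule favors specificity and the disjunctive rule favors sensitivity, so the choice between them depends on which type of error is more costly in a given embodiment.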
In some embodiments, a first determination of whether the subject is afflicted with one or more oncogenic pathogens in a first plurality of oncogenic pathogens is made based on DNA sequencing of a biological sample from the subject, according to any of the methods described herein, and a second determination of whether the subject is afflicted with one or more oncogenic pathogens in a second plurality of oncogenic pathogens is made based on RNA sequencing of a biological sample from the subject (e.g., the same biological sample or a different biological sample from the subject), according to any of the methods described herein. In some embodiments, the first plurality of oncogenic pathogens and the second plurality of oncogenic pathogens are the same set of oncogenic pathogens. In some embodiments, the first plurality of oncogenic pathogens and the second plurality of oncogenic pathogens are different sets of oncogenic pathogens. In some embodiments, when the first and second pluralities of oncogenic pathogens are different sets of oncogenic pathogens, there is an overlap between the two sets of oncogenic pathogens. In some embodiments, when the first and second pluralities of oncogenic pathogens are different sets of oncogenic pathogens and there is an overlap in the two sets of oncogenic pathogens, a single determination that the subject is afflicted with an oncogenic pathogen that is part of both sets is sufficient to call the pathogenic infection. 
In other embodiments, when the first and second pluralities of oncogenic pathogens are different sets of oncogenic pathogens and there is an overlap in the two sets of oncogenic pathogens, a single determination that the subject is afflicted with an oncogenic pathogen that is part of both sets is not sufficient to call the pathogenic infection, but a single determination that the subject is afflicted with a second oncogenic pathogen that is part of only one of the two sets is sufficient to call the second pathogenic infection. In some embodiments, when the first and second pluralities of oncogenic pathogens are different sets of oncogenic pathogens, there is no overlap in the two sets of oncogenic pathogens.
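For illustration, the handling of overlapping and non-overlapping pathogen sets can be sketched as below. The sketch assumes, as one reading of the embodiments above, that when a single determination is not sufficient for a pathogen present in both sets, both the DNA-based and RNA-based determinations must be positive. The function name, the `require_both_for_shared` flag, and the example pathogens are hypothetical.

```python
from typing import Dict


def call_infections(
    dna_calls: Dict[str, bool],
    rna_calls: Dict[str, bool],
    require_both_for_shared: bool,
) -> Dict[str, bool]:
    """Merge per-pathogen calls from a DNA panel and an RNA panel.

    Pathogens tested by only one panel are called on that single result.
    Pathogens tested by both panels are called on either result, or on
    both results when require_both_for_shared is True (assumption).
    """
    shared = dna_calls.keys() & rna_calls.keys()
    result: Dict[str, bool] = {}
    for pathogen in sorted(dna_calls.keys() | rna_calls.keys()):
        dna = dna_calls.get(pathogen, False)
        rna = rna_calls.get(pathogen, False)
        if pathogen in shared and require_both_for_shared:
            result[pathogen] = dna and rna
        else:
            result[pathogen] = dna or rna
    return result


# Toy example: EBV is on both panels, HPV only on DNA, HBV only on RNA.
dna_calls = {"EBV": True, "HPV": False}
rna_calls = {"EBV": False, "HBV": True}
lenient = call_infections(dna_calls, rna_calls, require_both_for_shared=False)
strict = call_infections(dna_calls, rna_calls, require_both_for_shared=True)
```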
Other embodiments are directed to systems, portable consumer devices, and computer readable media associated with the methods described herein.
As disclosed herein, any embodiment disclosed herein when applicable can be applied to any aspect.
Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
The present disclosure provides systems and methods useful for determining whether a subject is afflicted with an oncogenic pathogen. The present disclosure further provides systems and methods useful for treating cancer patients, based on whether their cancer is associated with an oncogenic pathogen infection or not.
For example, in one aspect, the present disclosure provides systems and methods for determining whether a subject is afflicted with an oncogenic pathogen based on data generated for the classification of a cancer in a subject. As described herein, in some embodiments, the method includes using sequencing data that is generated by probe-based capture of nucleic acids from a biological sample from the subject. Advantageously, employing a single assay for cancer classification and oncogenic pathogen detection decreases the time, capital, and resources needed to provide comprehensive information about the cancer status of a patient. This is in contrast with conventional methods for detecting oncogenic pathogens that require a separate assay solely dedicated to the oncogenic pathogen detection, and which require additional resources beyond those used to classify a subject’s cancer status and/or take additional time to obtain thereby delaying development of a treatment plan.
In some embodiments, the sequence reads are first aligned against a reference human genome and then sequences that do not align to the human genome are aligned against reference sequences, e.g., all or portions of reference pathogenic genomes, of one or more oncogenic pathogens. Advantageously, pre-filtering the sequence reads by removing those that align to the reference human genome greatly decreases the time needed to perform the auxiliary alignments against the pathogenic genomes, particularly when many pathogenic genomes are being sampled.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, to the extent that the terms “including,” “includes,” “having,” “has,” “with,” or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.
It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first subject could be termed a second subject, and, similarly, a second subject could be termed a first subject, without departing from the scope of the present disclosure. The first subject and the second subject are both subjects, but they are not the same subject. Furthermore, the terms “subject,” “user,” and “patient” are used interchangeably herein.
As used herein, the term “subject” refers to any living or non-living human. In some embodiments, a subject is a male or female of any stage (e.g., a man, a woman, or a child).
As used herein, the terms “control,” “control sample,” “reference,” “reference sample,” “normal,” and “normal sample” describe a sample from a subject that does not have a particular condition, or is otherwise healthy. In an example, a method as disclosed herein can be performed on a subject having a tumor, where the reference sample is a sample taken from a healthy tissue of the subject. A reference sample can be obtained from the subject, or from a database. The reference can be, e.g., a reference genome that is used to map sequence reads obtained from sequencing a sample from the subject. A reference genome can refer to a haploid or diploid genome to which sequence reads from the biological sample and a constitutional sample can be aligned and compared. An example of constitutional sample can be DNA of white blood cells obtained from the subject. For a haploid genome, there can be only one nucleotide at each locus. For a diploid genome, heterozygous loci can be identified; each heterozygous locus can have two alleles, where either allele can allow a match for alignment to the locus.
As used herein, the term “locus” refers to a position (e.g., a site) within a genome, e.g., on a particular chromosome. In some embodiments, a locus refers to a single nucleotide position within a genome, i.e., on a particular chromosome. In some embodiments, a locus refers to a small group of nucleotide positions within a genome, e.g., as defined by a mutation (e.g., substitution, insertion, or deletion) of consecutive nucleotides within a cancer genome. Because normal mammalian cells have diploid genomes, a normal mammalian genome (e.g., a human genome) will generally have two copies of every locus in the genome, or at least two copies of every locus located on the autosomal chromosomes, e.g., one copy on the maternal autosomal chromosome and one copy on the paternal autosomal chromosome.
As used herein, the term “allele” refers to a particular sequence of one or more nucleotides at a chromosomal locus.
As used herein, the term “reference allele” refers to the sequence of one or more nucleotides at a chromosomal locus that is either the predominant allele represented at that chromosomal locus within the population of the species (e.g., the “wild-type” sequence), or an allele that is predefined within a reference genome for the species.
As used herein, the term “variant allele” refers to a sequence of one or more nucleotides at a chromosomal locus that is either not the predominant allele represented at that chromosomal locus within the population of the species (e.g., not the “wild-type” sequence), or not an allele that is predefined within a reference genome for the species.
As used herein, the term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y may be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”
As used herein, the term “mutation” refers to a detectable change in the genetic material of one or more cells. In a particular example, one or more mutations can be found in, and can identify, cancer cells (e.g., driver and passenger mutations). A mutation can be transmitted from a parent cell to a daughter cell. A person having skill in the art will appreciate that a genetic mutation (e.g., a driver mutation) in a parent cell can induce additional, different mutations (e.g., passenger mutations) in a daughter cell. A mutation generally occurs in a nucleic acid. In a particular example, a mutation can be a detectable change in one or more deoxyribonucleic acids or fragments thereof. A mutation generally refers to one or more nucleotides that are added, deleted, substituted for, inverted, or transposed to a new position in a nucleic acid. A mutation can be a spontaneous mutation or an experimentally induced mutation. A mutation in the sequence of a particular tissue is an example of a “tissue-specific allele.” For example, a tumor can have a mutation that results in an allele at a locus that does not occur in normal cells. Another example of a “tissue-specific allele” is a fetal-specific allele that occurs in the fetal tissue, but not the maternal tissue.
As used herein, the term “cancer,” “cancerous tissue,” or “tumor” refers to an abnormal mass of tissue in which the growth of the mass surpasses and is not coordinated with the growth of normal tissue. A cancer or tumor can be defined as “benign” or “malignant” depending on the following characteristics: degree of cellular differentiation including morphology and functionality, rate of growth, local invasion, and metastasis. A “benign” tumor can be well differentiated, have characteristically slower growth than a malignant tumor, and remain localized to the site of origin. In addition, in some cases a benign tumor does not have the capacity to infiltrate, invade, or metastasize to distant sites. A “malignant” tumor can be poorly differentiated (anaplasia) and have characteristically rapid growth accompanied by progressive infiltration, invasion, and destruction of the surrounding tissue. Furthermore, a malignant tumor can have the capacity to metastasize to distant sites. Accordingly, a cancer cell is a cell found within the abnormal mass of tissue whose growth is not coordinated with the growth of normal tissue. Accordingly, a “tumor sample” refers to a biological sample obtained or derived from a tumor of a subject, as described herein.
As used herein, a “cancer condition associated with an oncogenic pathogen infection,” either generically or with reference to a specific oncogenic pathogen, refers to the condition in which a cancer subject, afflicted with a specific cancer, is further afflicted with a pathogen (e.g., virus) known to associate with the specific cancer.
As used herein, a “cancer condition that is not associated with an oncogenic pathogen infection,” either generically or with reference to a specific oncogenic pathogen, refers to the condition in which a cancer subject, afflicted with a specific cancer, is specifically not afflicted with a pathogen (e.g., virus) known to associate with the specific cancer.
As used herein, the terms “sequencing,” “sequence determination,” and the like refer generally to any and all biochemical processes that may be used to determine the order of biological macromolecules such as nucleic acids or proteins. For example, sequencing data can include all or a portion of the nucleotide bases in a nucleic acid molecule such as an mRNA transcript or a genomic locus.
As used herein, the term “sequence reads” or “reads” refers to nucleotide sequences produced by any sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”), and sometimes are generated from both ends of nucleic acids (e.g., paired-end reads, double-end reads). The length of the sequence read is often associated with the particular sequencing technology. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). In some embodiments, the sequence reads are of a mean, median or average length of about 15 bp to 900 bp (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp). In some embodiments, the sequence reads are of a mean, median or average length of about 1000 bp, 2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more. Nanopore sequencing, for example, can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs. Illumina parallel sequencing can provide sequence reads that do not vary as much, for example, most of the sequence reads can be smaller than 200 bp. A sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides). For example, a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment.
A sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
As used herein, the term “read segment” or “read” refers to any nucleotide sequences including sequence reads obtained from an individual and/or nucleotide sequences derived from the initial sequence read from a sample obtained from an individual. For example, a read segment can refer to an aligned sequence read, a collapsed sequence read, or a stitched read. Furthermore, a read segment can refer to an individual nucleotide base, such as a single nucleotide variant.
As used herein, the term, “reference exome” refers to any particular known, sequenced or characterized exome, whether partial or complete, of any tissue from any organism or pathogen that may be used to reference identified sequences from a subject. Example reference exomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”).
As used herein, the term “reference genome” refers to any particular known, sequenced or characterized genome, whether partial or complete, of any organism or pathogen that may be used to reference identified sequences from a subject. Exemplary reference genomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”) or the University of California, Santa Cruz (UCSC). A “genome” refers to the complete genetic information of an organism or pathogen, expressed in nucleic acid sequences. As used herein, a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals. The reference genome can be viewed as a representative example of a species’ set of genes. In some embodiments, a reference genome comprises sequences assigned to chromosomes. Exemplary human reference genomes include but are not limited to NCBI build 34 (UCSC equivalent: hg16), NCBI build 35 (UCSC equivalent: hg17), NCBI build 36.1 (UCSC equivalent: hg18), GRCh37 (UCSC equivalent: hg19), and GRCh38 (UCSC equivalent: hg38).
As used herein, the term “minimum edit distance” refers to the minimum number of editing operations required to change one sequence, e.g., a locus within a reference genome, to exactly match another sequence, e.g., a sequence read. With reference to the editing of a locus of a reference genome to match a sequence read, possible editing operations include inserting a nucleotide (e.g., where an alignment between the sequences shows that a gap must exist in the reference sequence in order to align with the sequence read), deleting a nucleotide (e.g., where an alignment between the sequences shows that a gap must exist in the sequence read in order to align to the reference sequence), and substituting one nucleotide for another (e.g., where an alignment between the sequences shows that there is a mismatch at a particular nucleic acid position). In some embodiments, weights are independently assigned to each editing operation when calculating a minimal editing distance score between two sequences, in order to prioritize the importance of one or more particular types of editing operations relative to the other editing operations.
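As a worked illustration of this definition, a weighted minimum edit distance can be computed with standard dynamic programming. The sketch below is one conventional way to do so and is not asserted to be the implementation used in any embodiment; the function name and default weights are assumptions.

```python
def weighted_edit_distance(
    reference: str,
    read: str,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    sub_cost: float = 1.0,
) -> float:
    """Minimum total cost of editing `reference` so that it exactly
    matches `read`, with an independent weight per operation type."""
    n, m = len(reference), len(read)
    # prev[j]: cost of turning an empty reference prefix into read[:j],
    # which takes j insertions.
    prev = [j * ins_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        # Turning reference[:i] into an empty read takes i deletions.
        curr = [i * del_cost] + [0.0] * m
        for j in range(1, m + 1):
            mismatch = 0.0 if reference[i - 1] == read[j - 1] else sub_cost
            curr[j] = min(
                prev[j - 1] + mismatch,  # match or substitution
                prev[j] + del_cost,      # delete a reference nucleotide
                curr[j - 1] + ins_cost,  # insert a read nucleotide
            )
        prev = curr
    return prev[m]
```

With unit weights this reduces to the ordinary Levenshtein distance. Raising `ins_cost` and `del_cost` relative to `sub_cost`, for example, penalizes gapped alignments more heavily than mismatches.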
As used herein, the term “assay” refers to a technique for determining a property of a substance, e.g., a nucleic acid, a protein, a cell, a tissue, or an organ. An assay (e.g., a first assay or a second assay) can comprise a technique for determining the copy number variation of nucleic acids in a sample, the methylation status of nucleic acids in a sample, the fragment size distribution of nucleic acids in a sample, the mutational status of nucleic acids in a sample, or the fragmentation pattern of nucleic acids in a sample. Any assay known to a person having ordinary skill in the art can be used to detect any of the properties of nucleic acids mentioned herein. Properties of nucleic acids can include a sequence, genomic identity, copy number, methylation state at one or more nucleotide positions, size of the nucleic acid, presence or absence of a mutation in the nucleic acid at one or more nucleotide positions, and pattern of fragmentation of a nucleic acid (e.g., the nucleotide position(s) at which a nucleic acid fragments). An assay or method can have a particular sensitivity and/or specificity, and its relative usefulness as a diagnostic tool can be measured using ROC-AUC statistics.
The term “classification” can refer to any number(s) or other character(s) that are associated with a particular property of a sample. For example, a “+” symbol (or the word “positive”) can signify that a sample is classified as having deletions or amplifications. In another example, the term “classification” can refer to an oncogenic pathogen infection status, an amount of tumor tissue in the subject and/or sample, a size of the tumor in the subject and/or sample, a stage of the tumor in the subject, a tumor load in the subject and/or sample, and presence of tumor metastasis in the subject. The classification can be binary (e.g., positive or negative) or have more levels of classification (e.g., a scale from 1 to 10 or 0 to 1). The terms “cutoff” and “threshold” can refer to predetermined numbers used in an operation. For example, a cutoff size can refer to a size above which fragments are excluded. A threshold value can be a value above or below which a particular classification applies. Either of these terms can be used in either of these contexts.
Several aspects are described below with reference to example applications for illustration. It should be understood that numerous specific details, relationships, and methods are set forth to provide a full understanding of the features described herein. One having ordinary skill in the relevant art, however, will readily recognize that the features described herein can be practiced without one or more of the specific details or with other methods. The features described herein are not limited by the illustrated ordering of acts or events, as some acts can occur in different orders and/or concurrently with other acts or events. Furthermore, not all illustrated acts or events are required to implement a methodology in accordance with the features described herein.
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
DNA sequencing-based pathogen detection - Now that an overview of some aspects of the present disclosure and some definitions used in the present disclosure have been provided, details of an exemplary system are now described in conjunction with
In various implementations, one or more of the above identified elements are stored in one or more of the previously mentioned memory devices, and correspond to a set of instructions for performing a function described above. The above identified modules, data, or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures, datasets, or modules, and thus various subsets of these modules and data may be combined or otherwise re-arranged in various implementations. For instance, in some embodiments, sequence alignment data store 136 is integrated in test subject data store 122. Likewise, in some embodiments, rather than having a separate sequence alignment data store 136, the system annotates sequence read entries 128 to indicate the results of the first alignment, second alignment, and/or competitive alignment. For instance, in some embodiments, each entry 128 includes a field for the nucleic acid sequence of the sequence read, a field for the result of alignment against the human reference construct 132 (e.g., whether the sequence read was positively mapped to the human reference construct and/or the location or sequence in the human reference construct that the sequence read was aligned to), a field for the result of alignment against the oncogenic pathogen reference constructs 134 (e.g., whether the sequence read was positively mapped to an oncogenic pathogen reference construct, the identity of the oncogenic pathogen to which the sequence was mapped, and/or the location or sequence in the oncogenic pathogen reference construct that the sequence read was aligned to), and a field for the result of competitive alignment against both the human reference construct 132 and the oncogenic pathogen reference constructs 134 (e.g., the identity of the reference construct to which the sequence read was positively mapped to and/or the location or sequence in the reference construct that the sequence read was aligned to).
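By way of non-limiting illustration, an annotated sequence read entry of the kind described above could be laid out as in the following sketch. The class names, field names, and the `is_pathogen_read` helper are hypothetical and are introduced only for illustration; the disclosure does not prescribe a particular data layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignmentResult:
    """Outcome of aligning one read against a reference construct (hypothetical layout)."""
    mapped: bool
    reference_id: Optional[str] = None  # e.g., "human" or the name of a pathogen
    position: Optional[int] = None      # 0-based start in the reference, if mapped


@dataclass
class SequenceReadEntry:
    """One sequence read with a field per alignment stage; names are illustrative."""
    sequence: str
    human_alignment: Optional[AlignmentResult] = None
    pathogen_alignment: Optional[AlignmentResult] = None
    competitive_alignment: Optional[AlignmentResult] = None

    def is_pathogen_read(self) -> bool:
        """True only if competitive alignment assigned the read to a non-human reference."""
        result = self.competitive_alignment
        return bool(result and result.mapped and result.reference_id != "human")


# Example entry: unmapped against human, mapped to a pathogen in both later stages.
entry = SequenceReadEntry(sequence="ACGTACGTAC")
entry.human_alignment = AlignmentResult(mapped=False)
entry.pathogen_alignment = AlignmentResult(True, "EBV", 1204)
entry.competitive_alignment = AlignmentResult(True, "EBV", 1204)
```

Keeping one result field per alignment stage lets a single record carry the full history of a read, which is one way to avoid a separate sequence alignment data store.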
In some implementations, the non-persistent memory 111 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments, the memory stores additional modules and data structures not described above. In some embodiments, one or more of the above identified elements is stored in a computer system, other than that of system 100, that is addressable by system 100 so that system 100 may retrieve all or a portion of such data when needed.
RNA sequencing-based pathogen detection –
In various implementations, one or more of the above identified elements are stored in one or more of the previously mentioned memory devices, and correspond to a set of instructions for performing a function described above. The above identified modules, data, or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures, datasets, or modules, and thus various subsets of these modules and data may be combined or otherwise re-arranged in various implementations. In some implementations, the non-persistent memory 1111 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments, the memory stores additional modules and data structures not described above. In some embodiments, one or more of the above identified elements is stored in a computer system, other than that of visualization system 1100, that is addressable by visualization system 1100 so that visualization system 1100 may retrieve all or a portion of such data when needed.
Although
For instance, as depicted in
While systems in accordance with the present disclosure have been disclosed with reference to
Many of the embodiments described below, in conjunction with
In block 304, individual sequence reads 128, in electronic form, are aligned against a reference human data construct 132, e.g., a reference human genome or reference human exome, using sequence alignment module 130. In some embodiments, the alignment is performed with an index-based alignment algorithm, e.g., a hash-based sequence alignment algorithm. The index-based alignment algorithm runs more quickly than a conventional local alignment algorithm, but generally with lower performance such that, overall, fewer sequence reads will be correctly mapped to a position within the reference human data construct. There are two advantages to the use of an index-based alignment algorithm at this step: first, the alignment is less computationally burdensome, resulting in a quicker and more efficient computational process, and second, fewer sequence reads with significant identity to both the human reference construct and to an oncogenic pathogen reference construct are aligned to the human reference construct and, thus, removed from the data set prior to subsequent alignment to the oncogenic pathogen reference construct, resulting in improved sensitivity for the detection of oncogenic pathogen-derived sequence reads. The result of block 304 is a partitioning of the sequencing data 124 into a first subset of sequence reads 306 (e.g., aligned sequences 140) that definitively map to the human reference construct and a second subset of sequence reads 308 (e.g., unaligned sequences 142) that do not definitively map to the human reference construct.
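By way of non-limiting illustration, the partitioning performed in block 304 can be sketched with a toy hash-based (k-mer index) aligner. This is a simplified stand-in, not the alignment module of the disclosure: the function names, the seed-and-verify strategy, the `max_mismatches` stringency parameter, and the short reference string are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def maps_to_reference(read, reference, index, k, max_mismatches):
    """Seed with the read's first k-mer, then verify the whole read at each candidate."""
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                return True
    return False


def partition_reads(reads, reference, k=4, max_mismatches=0):
    """Split reads into (aligned, unaligned) with respect to one reference."""
    index = build_kmer_index(reference, k)
    aligned, unaligned = [], []
    for read in reads:
        if maps_to_reference(read, reference, index, k, max_mismatches):
            aligned.append(read)
        else:
            unaligned.append(read)
    return aligned, unaligned


# Toy data only: a made-up "human" reference and four short reads.
human_reference = "ACGTTGCAAGGCTTACCGATAGGCTA"
reads = ["TGCAAGGC", "CCGATAGG", "GGGGTTTT", "TGCAAGCC"]
aligned, unaligned = partition_reads(reads, human_reference)
```

With `max_mismatches=0` the read `TGCAAGCC`, which differs from the reference at one base, stays in the unaligned subset and therefore remains available for pathogen alignment; relaxing the parameter to 1 would assign it to the human subset instead.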
In block 310, individual sequence reads 142 in the second subset of sequence reads 308 are aligned against a plurality of oncogenic pathogen reference constructs 134, e.g., reference genomes or reference exomes for a plurality of oncogenic pathogens. In some embodiments, the alignment is performed with an index-based alignment algorithm, e.g., a hash-based sequence alignment algorithm. The index-based alignment algorithm runs more quickly and efficiently than a conventional local alignment algorithm.
In some embodiments, where both the alignment against the human reference construct and the alignment against the oncogenic pathogen reference constructs are performed using the same sequence alignment algorithm, a parameter of the sequence alignment algorithm is defined more stringently during the alignment against the human reference construct than during the alignment against the oncogenic pathogen reference constructs. In this fashion, more sequences that align to both the human reference construct and one or more of the oncogenic pathogen reference constructs are identified because (i) they are not removed from the analysis by being assigned to subset 306 of sequence reads that definitively align to the human reference construct, and are therefore not aligned against the oncogenic pathogen reference constructs, and (ii) are identified as aligning to an oncogenic pathogen reference construct because of the lower stringency requirements for assignment of a positive alignment. Subsequently, these sequences can be further queried to determine whether they align better to the human reference construct or the oncogenic pathogen reference construct, as described below.
In other embodiments, sequence reads 306 that are identified as aligning to the human reference construct (e.g., aligned sequence reads 140) are also aligned against one or more of the oncogenic pathogen reference constructs 134. In some embodiments, sequence reads 306 are aligned against all of the oncogenic pathogen reference constructs in the same fashion that unmapped sequence reads 308 are aligned to the oncogenic pathogen reference constructs. In some embodiments, sequence reads 306 that are identified as aligning to the human reference construct are aligned against just a subset of oncogenic pathogen reference constructs, e.g., primary oncogenic pathogen reference constructs, in the same fashion that unmapped sequence reads 308 are aligned to the primary target oncogenic pathogen reference constructs. In some embodiments, sequence reads 306 are aligned against all or a subset of the oncogenic pathogen reference constructs using a different alignment algorithm, e.g., one that runs faster than, but may be less sensitive than, the alignment algorithm used to align unmapped sequence reads 308 against the oncogenic pathogen reference constructs.
In some embodiments, alignment of sequence reads 308 against the plurality of oncogenic pathogen reference constructs is performed in two steps. First, each of the sequence reads is aligned (312) against a sub-plurality of reference constructs for one or more primary target oncogenic pathogens. Second, each sequence read that did not align to any one of the sub-plurality of reference constructs is aligned against the other oncogenic pathogen reference constructs in the plurality of oncogenic pathogen reference constructs. In some embodiments, where a hybridization probe set is used to enrich target nucleic acids from the biological sample, the hybridization probe set includes a sub-set of probes complementary to nucleic acid sequences from the one or more primary target oncogenic pathogens, e.g., but does not include probes complementary to other oncogenic pathogens. The result of block 310 is partitioning of sequence reads 308 into a third subset of sequence reads 313 that do not map to either the human reference construct or any of the oncogenic pathogen reference constructs (e.g., unaligned sequence reads 146) and a fourth subset of sequence reads that align to at least one of the oncogenic pathogen reference constructs (e.g., aligned sequence reads 144).
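By way of non-limiting illustration, the two-step alignment against primary target constructs and then the remaining constructs can be sketched as follows. The function name, the `aligns` predicate (here a plain substring test standing in for a real aligner), and the made-up reference strings are assumptions for the example only.

```python
def two_step_alignment(reads, primary_refs, other_refs, aligns):
    """Try primary target references first; fall back to the other references.

    `aligns(read, ref_seq)` is a caller-supplied predicate standing in for an aligner.
    Returns (mapped, unmapped), where mapped is a list of (read, reference_name).
    """
    def first_hit(read, refs):
        return next((name for name, seq in refs.items() if aligns(read, seq)), None)

    mapped, unmapped = [], []
    for read in reads:
        hit = first_hit(read, primary_refs)
        if hit is None:
            # Only reads that missed every primary target reach the second step.
            hit = first_hit(read, other_refs)
        if hit is None:
            unmapped.append(read)
        else:
            mapped.append((read, hit))
    return mapped, unmapped


# Toy data only: these strings are invented and are not real pathogen genomes.
primary_refs = {
    "primary_target_1": "ATGCATGGAGATACACCTAC",
    "primary_target_2": "GGCCTTAAGGTTCCAAGGTT",
}
other_refs = {"other_pathogen_1": "TTTAGGGCCCATATCGCGAT"}
reads = ["GGAGATAC", "GGCCCATA", "CCCCCCCC"]

mapped, unmapped = two_step_alignment(
    reads, primary_refs, other_refs, aligns=lambda read, ref: read in ref
)
```

In this sketch the unmapped list plays the role of the third subset of sequence reads 313 and the mapped list plays the role of the fourth subset.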
In some embodiments, sequence reads that are putatively mapped to at least one of the oncogenic pathogen reference constructs (e.g., aligned sequence reads 144) are then competitively aligned against the at least one oncogenic pathogen reference construct 134 and the human reference construct 132, to determine which reference construct each sequence read aligns to better. In some embodiments, the competitive alignment is performed with a local sequence alignment algorithm, e.g., which aligns each nucleotide, rather than an index-based alignment algorithm. Although local sequence alignment algorithms require more computational resources, the algorithm is more sensitive and therefore performs better than an index-based sequence alignment algorithm on average. Advantageously, because the majority of the original sequencing data has been removed by assignment to mapped human reads 306 (e.g., aligned sequence reads 140) or unmapped reads 313 (e.g., unaligned sequence reads 146), e.g., using less computationally taxing alignment algorithms, this process facilitates high confidence assignment of oncogenic pathogen sequence reads 318 more quickly than if all of the sequencing data was aligned to the oncogenic pathogen reference constructs, providing a more efficient computational process (e.g., the set of aligned sequence reads 144 is much smaller than the set of all sequence reads 128 for a subject).
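By way of non-limiting illustration, competitive alignment with a local alignment algorithm can be sketched using a Smith-Waterman score with a linear gap penalty. The scoring values, the function names, the rule of leaving a read unassigned on a tied top score, and the toy reference strings are assumptions for the example and are not taken from the disclosure.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def competitive_assign(read, references):
    """Name of the single best-scoring reference, or None if the top score is tied."""
    scores = {name: smith_waterman_score(read, seq) for name, seq in references.items()}
    top = max(scores.values())
    winners = [name for name, score in scores.items() if score == top]
    return winners[0] if len(winners) == 1 else None


# Toy data only: invented reference strings, not real genomes.
references = {
    "human": "ACGTTGCAAGGCTTACCGATAGGCTA",
    "pathogen": "TTGACCGATTGGCTAAC",
}
read = "CCGATTGG"  # exact match in "pathogen", one mismatch in "human"
assignment = competitive_assign(read, references)
```

Because the dynamic program scores every base of the read against every base of each reference, its cost grows with the product of the two lengths, which is why it is applied only to the small set of putatively pathogen-derived reads.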
The method includes tracking sequence reads identified as aligning to one or more oncogenic pathogen reference constructs. The sequence reads that are finally aligned to each oncogenic pathogen following the competitive alignment (316), e.g., mapped oncogenic pathogen reads 318, are counted, e.g., using an oncogenic pathogen identification module, and the results are stored in oncogenic pathogen alignment tracking data store 152 as counts 156 for each pathogen. In some embodiments, as depicted in box 320, sequence counts 156 for the alignment data are normalized, e.g., to account for pull-down, amplification, and/or sequencing bias (e.g., mappability, GC bias, etc.). See, for example, Schwartz et al., 2011, “Detection and Removal of Biases in the Analysis of Next-Generation Sequencing Reads,” PLoS ONE 6(1): e16685, doi:10.1371/journal.pone.0016685; and Benjamini and Speed, 2012, “Summarizing and correcting the GC content bias in high-throughput sequencing,” Nucleic Acids Research 40(10) e72, each of which is hereby incorporated by reference.
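By way of non-limiting illustration, one simple form of count normalization is sketched below: reads per kilobase of targeted pathogen sequence per million sequenced reads. This is a swapped-in, deliberately simple scheme and is not the bias-correction method of the references cited above; the function name, the target lengths, and the counts are hypothetical.

```python
def normalize_counts(counts, target_lengths_bp, total_reads):
    """Reads per kilobase of targeted sequence per million sequenced reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    normalized = {}
    for pathogen, count in counts.items():
        kilobases = target_lengths_bp[pathogen] / 1_000
        millions_of_reads = total_reads / 1_000_000
        normalized[pathogen] = count / kilobases / millions_of_reads
    return normalized


# Hypothetical counts and targeted lengths, for illustration only.
counts = {"EBV": 500, "HPV16": 40}
target_lengths_bp = {"EBV": 5_000, "HPV16": 2_000}
normalized = normalize_counts(counts, target_lengths_bp, total_reads=2_000_000)
```

Dividing by targeted length and sequencing depth makes counts comparable between pathogens with different probe coverage and between samples sequenced to different depths.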
A determination (322) is then made as to whether a threshold number of sequences aligning to each of the one or more oncogenic pathogen reference constructs have been identified. If a threshold number of sequences aligning to a respective oncogenic pathogen reference construct have been identified, the subject is classified (326) as afflicted by the respective oncogenic pathogen. If a threshold number of sequences aligning to a respective oncogenic pathogen reference construct have not been identified, the subject is classified (324) as not afflicted by the respective oncogenic pathogen.
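By way of non-limiting illustration, the threshold determination (322) can be sketched as a per-pathogen comparison. The function name, the per-pathogen threshold table, the default threshold, and the counts are hypothetical values chosen for the example; the sketch treats a count equal to the threshold as meeting it.

```python
def classify_pathogen_status(counts, thresholds, default_threshold=10):
    """Label each pathogen 'positive' when its count meets or exceeds its threshold."""
    status = {}
    for pathogen, count in counts.items():
        threshold = thresholds.get(pathogen, default_threshold)
        status[pathogen] = "positive" if count >= threshold else "negative"
    return status


# Hypothetical counts and thresholds, for illustration only.
counts = {"EBV": 120, "HPV16": 3, "MCPyV": 10}
thresholds = {"EBV": 50, "HPV16": 20}  # MCPyV falls back to the default of 10
status = classify_pathogen_status(counts, thresholds)
```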
In some embodiments, the classification for each respective oncogenic pathogen is used to inform classification of the subject’s cancer, e.g., to determine a type of cancer, a primary origin of the cancer, a prognosis for the cancer, and/or a recommendation for treating the cancer. Non-limiting examples of oncogenic pathogens that are known to be associated with specific cancers are shown below in Table 1. For additional information on known associations between oncogenic pathogens and cancers see, for example, Flora and Bonanni, 2011, “The prevention of infection-associated cancers,” Carcinogenesis 32(6), pp. 787-795, which is hereby incorporated by reference.
Helicobacter pylori
Streptococcus bovis
Salmonella typhi
Chlamydophila pneumoniae
Schistosoma haematobium
Schistosoma japonicum
As used herein, the term “human gut microbiome” refers to all of the microorganisms living in the human digestive tract, a subset of which have been found to be oncogenic. For example, pathogens that have been hypothesized to cause, or are correlated with, colon or colorectal cancers include sulfidogenic bacteria (e.g., Fusobacterium, Desulfovibrio, and Bilophila wadsworthia), Streptococcus bovis, and Fusobacterium nucleatum. For further information, see Dahmus et al., 2018, J Gastrointest Oncol., 9(4), pp. 769-77, which is hereby incorporated by reference herein.
In some embodiments, the classification for each respective oncogenic pathogen is used to generate a clinical report that indicates whether the subject is afflicted with an oncogenic pathogen. In some embodiments, the clinical report provides additional information about the subject’s cancer, e.g., a type of cancer, a primary origin of the cancer, a stage of the cancer, a tumor burden for the subject, a prognosis for the subject, a recommended treatment for the cancer, etc. An example of such a clinical report is shown in
Now that an overview of the disclosed methods has been provided in conjunction with
In some embodiments, method 5000 is performed, at least partially, at a computer system (e.g., computer system 100 in
Although method 5000 includes steps of obtaining nucleic acids from a biological sample from a subject and hybridizing the nucleic acid to a probe set, in some embodiments the disclosed methods begin by obtaining sequence data from the isolated nucleic acids, as illustrated in
In some embodiments, method 5000 includes obtaining (5002) an amount of nucleic acid from a biological sample of the subject, where the amount of nucleic acid includes nucleic acid from the subject and potentially nucleic acid from at least one oncogenic pathogen in a plurality of oncogenic pathogens. In some embodiments, the plurality of oncogenic pathogens includes one or more members of the papillomavirus family, one or more members of the herpes virus family, and/or one or more members of the murine polyomavirus group (5010).
Generally, the biological sample of the subject is a biopsy, e.g., a sample of cancerous tissue from the subject. Methods for obtaining samples of cancerous tissue are known in the art, and are dependent upon the type of cancer being sampled. For example, bone marrow biopsies and isolation of circulating tumor cells can be used to obtain samples of blood cancers; endoscopic biopsies can be used to obtain samples of cancers of the digestive tract, bladder, and lungs; needle biopsies (e.g., fine-needle aspiration, core needle aspiration, vacuum-assisted biopsy, and image-guided biopsy) can be used to obtain samples of subdermal tumors; skin biopsies (e.g., shave biopsy, punch biopsy, incisional biopsy, and excisional biopsy) can be used to obtain samples of dermal cancers; and surgical biopsies can be used to obtain samples of cancers affecting internal organs of a patient. In some embodiments, the biological sample is a solid biopsy (5030). In some embodiments, the solid biopsy is a macro-dissected formalin fixed paraffin embedded (FFPE) tissue section (5032). In some embodiments, the biological sample comprises blood or saliva (5034). In some embodiments, the subject has cancer (5036).
Similarly, methods for isolating nucleic acids from biological samples are known in the art, and are dependent upon the type of nucleic acid being isolated, e.g., DNA or RNA, and the type of sample from which the nucleic acids are being isolated. For instance, many techniques for DNA isolation, e.g., genomic DNA isolation, from a tissue sample are known in the art, such as organic extraction, silica adsorption, and anion exchange chromatography. Likewise, many techniques for RNA isolation, e.g., mRNA isolation, from a tissue sample are known in the art. Examples include acid guanidinium thiocyanate-phenol-chloroform extraction (see, for example, Chomczynski and Sacchi, 2006, Nat Protoc, 1(2):581-85, which is hereby incorporated by reference herein) and silica bead/glass fiber adsorption (see, for example, Poeckh, T. et al., 2008, Anal Biochem., 373(2):253-62, which is hereby incorporated by reference herein). The selection of any particular DNA or RNA isolation technique for use in conjunction with the embodiments described herein is well within the skill of the person having ordinary skill in the art, who will consider the tissue type, the state of the tissue, e.g., fresh, frozen, formalin-fixed, paraffin-embedded (FFPE), and the type of nucleic acid analysis that is to be performed.
In some embodiments, the plurality of oncogenic pathogens includes one or more oncogenic viruses (5004). For example, in some embodiments, the plurality of oncogenic pathogens includes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or more oncogenic viruses. In some embodiments, each oncogenic pathogen in the plurality of oncogenic pathogens is an oncogenic virus (5006). In some embodiments, an oncogenic pathogen in the plurality of oncogenic pathogens is an oncogenic virus listed in Table 1 (5008). For further information on oncogenic viruses see, for example, de Flora, 2011, Carcinogenesis 32:787-95, which is incorporated by reference herein.
In some embodiments, the plurality of oncogenic pathogens includes a member of the papillomavirus family of viruses. Papillomaviruses are non-enveloped DNA viruses, for which several hundred species have been identified; see, for example, Van Doorslaer K. et al., J Gen Virol., 99(8):989-990 (2018), which is incorporated by reference herein. In some embodiments, the member of the papillomavirus family is human papillomavirus (HPV) (5012). In some embodiments, the human papillomavirus is HPV16, HPV18, HPV31, HPV33, HPV35, HPV39, HPV45, HPV51, HPV52, HPV56, HPV58, HPV59, or HPV68 (5014). For more information on the various species of human papillomavirus see, for example, Chouhy D. et al., 2013, J Gen Virol., 94(11):2480-88, which is incorporated by reference herein. In some embodiments, the one or more human papillomaviruses includes HPV16 or HPV18 (5016), both of which are known to be associated with human cancers; see, for example, Saraiya M. et al., 2015, J Natl Cancer Inst., 107(6), which is incorporated by reference herein.
In some embodiments, the plurality of oncogenic pathogens includes a member of the herpes virus family. Herpesviridae are enveloped, monopartite, double-stranded, linear DNA viruses; see, for example, Mettenleiter et al., 2008, “Animal Viruses: Molecular Biology,” Caister Academic Press, Chapter 9 “Molecular Biology of Animal Herpesviruses,” which is incorporated by reference herein. Nine species of herpesviridae are known to infect humans, including herpes simplex viruses 1 and 2 (HSV-1 and HSV-2), varicella-zoster virus (VZV), Epstein-Barr virus (EBV), human cytomegalovirus (HCMV), human herpesvirus 6A and 6B (HHV-6A and HHV-6B), human herpesvirus 7 (HHV-7), and Kaposi’s sarcoma-associated herpesvirus (KSHV). Many of these species have been associated with human cancers. For example, Epstein-Barr virus (EBV) has been linked to several human neoplasms, including Burkitt’s lymphoma, sinonasal angiocentric T-cell lymphoma, immunosuppressor-related non-Hodgkin’s lymphoma, Hodgkin’s lymphoma, nasopharyngeal carcinoma, and gastric carcinoma; see, for example, Rezk SA et al., Hum Pathol., 79:18-41 (2018), which is incorporated by reference herein. Human cytomegalovirus (HCMV) has been associated with oncomodulation and oncogenesis in various cancers, including glioma, colorectal cancer, prostate cancer, breast cancer, mucoepidermoid carcinoma, medulloblastoma, and neuroblastoma; see, for example, Herbein G., Viruses, 10(8):408 (2018), which is incorporated by reference herein. Kaposi’s sarcoma-associated herpesvirus (KSHV) has been associated with Kaposi’s sarcoma and primary effusion lymphoma; see, for example, Goncalves PH et al., Curr Opin HIV AIDS, 12(1):47-56 (2017), which is incorporated herein by reference.
Additionally, some studies have suggested a link between human herpesvirus 6A and 6B (HHV-6A and HHV-6B) and various cancers, including lymphomas, gliomas, gastrointestinal cancers, cervical cancer, and leukemia; for review see HHV-6 Foundation “HHV-6 & Cancer,” published online. Accordingly, in some embodiments, the one or more members of the herpes virus family includes Epstein-Barr virus (5018). In some embodiments, the member of the herpes virus family is Human cytomegalovirus (HCMV). In some embodiments, the member of the herpes virus family is Kaposi’s sarcoma-associated herpesvirus (KSHV). In some embodiments, the member of the herpes virus family is human herpesvirus 6 (e.g., HHV-6A and/or HHV-6B).
In some embodiments, the plurality of oncogenic pathogens includes a member of the polyomavirus family of viruses. Polyomaviruses are non-enveloped, double-stranded, circular DNA viruses; see, for example, Moens et al., 2017, Journal of General Virology, 98:1159-60, which is incorporated by reference herein. Merkel cell polyomavirus (MCPyV), a member of the polyomavirus family, has been associated with Merkel cell carcinomas; see, for example, Rotondo et al., 2017, Clin Cancer Res., 23(14):3929-34, which is incorporated by reference herein. Accordingly, in some embodiments, the one or more members of the polyomavirus family includes Merkel cell polyomavirus (5020).
In some embodiments, the plurality of oncogenic pathogens includes one or more oncogenic bacteria (5022). Several bacteria have been linked to various cancers, including Bacteroides fragilis (colon cancer), Borrelia burgdorferi (MALT lymphoma), Campylobacter jejuni (immunoproliferative small intestinal disease (IPSID)), Chlamydia pneumoniae (lung MALT lymphoma), Chlamydia trachomatis (cervical cancer), Chlamydophila psittaci (ocular/adnexal lymphoma), Clostridium spp. (colon cancer), Helicobacter bilis (gallbladder and biliary tract cancers), Helicobacter bizzozeronii (gastric MALT lymphoma), Helicobacter felis (gastric MALT lymphoma), Helicobacter heilmannii (gastric MALT lymphoma), Helicobacter hepaticus (biliary cancer), Helicobacter pylori (stomach cancer), Helicobacter salomonis (gastric MALT lymphoma), Helicobacter suis (gastric MALT lymphoma), Mycoplasma spp. (stomach, colon, ovarian, and lung cancers), Neisseria gonorrhoeae (bladder and prostate cancer), Cutibacterium acnes (bladder and prostate cancer), Salmonella enterica serovar Paratyphi (biliary cancer), Salmonella enterica serovar Typhimurium (biliary cancer), and Treponema pallidum (bladder and prostate cancer). See, for example, Sinkovics, 2012, Int. J. Oncol. 40(2):305-49; and Chang and Parsonnet, 2010, Clin. Microbiol. Rev. 23(4):837-57, each of which is incorporated by reference herein. In some embodiments, the oncogenic bacterium is an oncogenic bacterium listed in Table 1 (5024).
In some embodiments, the plurality of oncogenic pathogens includes one or more oncogenic trematodes (5026). Several trematodes have been linked to various cancers, including Schistosoma haematobium (bladder cancer), Opisthorchis viverrini (bile duct cancer), and Clonorchis sinensis (bile duct cancer). See, for example, Bouvard et al., 2009, Lancet Oncol. 10(4):321-22. In some embodiments, the oncogenic trematode is an oncogenic trematode listed in Table 1.
Yet other types of oncogenic pathogens have been identified, including protozoan parasites (e.g., Toxoplasma gondii, Cryptosporidium parvum, Trichomonas vaginalis, Theileria, and Plasmodium falciparum), tapeworms (e.g., Echinococcus granulosus and Taenia solium), liver flukes (e.g., Fasciola gigantica and Platynosomum fastosum), and roundworms (e.g., Strongyloides stercoralis, Heterakis gallinarum, and Trichuris muris). For more information on other oncogenic parasites see, for example, Machicado and Marcos, 2016, Int. J. Cancer 138(12):2915-21, which is incorporated by reference herein.
In some embodiments, the methods described herein include enriching nucleic acids isolated from the biological sample for target sequences associated with cancer classification. Advantageously, enriching for target sequences prior to sequencing the nucleic acids significantly reduces the costs and time associated with sequencing, facilitates multiplex sequencing by allowing multiple samples to be mixed together for a single sequencing reaction, and significantly reduces the computational burden of aligning the resulting sequence reads, as a result of significantly reducing the total amount of nucleic acids analyzed from each sample. Accordingly, in some embodiments, method 5000 includes hybridizing (5038) the amount of nucleic acid to a probe set, where the probe set includes a plurality of nucleic acid probes for a plurality of human genomic loci and a respective set of nucleic acid probes for genomic loci of each respective oncogenic pathogen in the plurality of oncogenic pathogens.
Generally, the probes include DNA, RNA, or a modified nucleic acid structure with a base sequence that is complementary to a locus of interest. Accordingly, when the probe is designed to hybridize to an mRNA molecule isolated from the biological sample, the probe will include a nucleic acid sequence that is complementary to the coding strand of the gene from which the transcript originated, i.e., the probe will include an antisense sequence of the gene. However, when the probe is designed to hybridize to a locus in a gDNA molecule or cDNA molecule, the probe can contain a sequence that is complementary to either strand, because the molecules in the gDNA or cDNA library are double stranded. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 15 consecutive bases of a locus of interest. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 20, 25, 30, 40, 50, 75, 100, 150, 200, or more consecutive bases of a locus of interest.
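By way of non-limiting illustration, the requirement that a probe be identical or complementary to at least 15 consecutive bases of a locus can be checked as in the following sketch. The function names and the toy locus are assumptions for the example; "complementary" is interpreted here as the reverse complement, and only the DNA alphabet is handled.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (antisense strand, written 5' to 3')."""
    return seq.translate(_COMPLEMENT)[::-1]


def probe_covers_locus(probe, locus, min_run=15):
    """True if the probe has >= min_run consecutive bases identical or complementary to the locus."""
    targets = (locus, reverse_complement(locus))
    for i in range(len(probe) - min_run + 1):
        window = probe[i:i + min_run]
        if any(window in target for target in targets):
            return True
    return False


# Toy data only: an invented 28-base locus and an antisense probe covering 20 of its bases.
locus = "ATGGCGTACGTTAGCCGATAGGCTAACG"
antisense_probe = reverse_complement(locus[3:23])
covers = probe_covers_locus(antisense_probe, locus)
```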
In some embodiments, the probes include additional nucleic acid sequences that do not share any homology to the loci of interest. For example, in some embodiments, the probes also include nucleic acid sequences containing an identifier sequence, e.g., a unique molecular identifier (UMI), e.g., that is unique to a particular sample or subject. Examples of identifier sequences are described, for example, in Kivioja et al., 2011, Nat. Methods 9(1), pp. 72-74 and Islam et al., 2014, Nat. Methods 11(2), pp. 163-66, which are incorporated by reference herein. Similarly, in some embodiments, the probes also include primer nucleic acid sequences useful for amplifying the nucleic acid molecule of interest, e.g., using PCR. In some embodiments, the probes also include a capture sequence designed to hybridize to an anti-capture sequence for recovering the nucleic acid molecule of interest from the sample.
Likewise, in some embodiments, the probes each include a non-nucleic acid affinity moiety covalently attached to a nucleic acid molecule that is complementary to the loci of interest, for recovering the nucleic acid molecule of interest. Non-limiting examples of non-nucleic acid affinity moieties include biotin, digoxigenin, and dinitrophenol. In some embodiments, the probe is attached to a solid-state surface or particle, e.g., a dip-stick or magnetic bead, for recovering the nucleic acid of interest. In some embodiments, the methods described herein include amplifying (5060) the nucleic acids that bound to the probe set prior to further analysis, e.g., sequencing. Methods for amplifying nucleic acids, e.g., by PCR, are well known in the art.
The human genomic loci can include gene loci, e.g., exon or intron loci, as well as non-coding loci, e.g., regulatory loci and other non-coding loci, which have been found to be associated with cancer. In some embodiments, the plurality of human genomic loci include at least 25, 50, 100, 150, 200, 250, 300, 350, 400, 500, 750, 1000, 2500, 5000, or more human genomic loci. In one embodiment, the plurality of human genomic loci include at least fifty human genomic loci (5040). In one embodiment, the plurality of human genomic loci includes at least fifty human genomic loci selected from
In some embodiments, the probe set includes probes to genomic loci in one or more oncogenic pathogens selected from alphapapillomavirus (APV), gammaherpesvirus (GHV), HBV genotype A, HPV16, HPV18, HPV33, EBV, MCPyV, Bacteroides fragilis, Helicobacter pylori, Serratia marcescens, and Chlamydia trachomatis. Examples of loci in genes encoded by each of these oncogenic pathogens are provided in Table 2. In some embodiments, the probe set includes probes to at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 35, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 125, at least 150, at least 175, or all of the loci listed in Table 2. In some embodiments, the respective set of nucleic acid probes for the genomic loci of each respective oncogenic pathogen in the plurality of oncogenic pathogens include probes collectively representing at least four of the portions of viral and/or bacterial genomes listed in Table 2 (5062). In some embodiments, the respective set of nucleic acid probes for the genomic loci of each respective oncogenic pathogen in the plurality of oncogenic pathogens include probes collectively representing at least ten of the portions of viral and/or bacterial genomes listed in Table 2 (5064). In some embodiments, the respective set of nucleic acid probes for the genomic loci of each respective oncogenic pathogen in the plurality of oncogenic pathogens include probes collectively representing all of the portions of viral genomes listed in Table 2. A portion or all of the probes listed may be used for DNA-sequencing and/or for RNA-sequencing.
In one example, probes targeting alphapapillomavirus, HBV, HPV16, HPV18, HPV33, EBV (or human gammaherpesvirus 4), human gammaherpesvirus 8, MCPyV, Bacteroides fragilis, Helicobacter pylori, Serratia marcescens, and Chlamydia trachomatis are used for DNA-sequencing and probes targeting alphapapillomavirus, gammaherpesvirus, HBV, HPV16, HPV18, HPV33, EBV, MCPyV, Bacteroides fragilis, Helicobacter pylori, and Chlamydia trachomatis are used for RNA-sequencing.
[Table 2, bacterial entries by organism: B. fragilis (11 rows), H. pylori (15 rows), S. marcescens (8 rows), C. trachomatis (7 rows)]
The methods described herein include obtaining a plurality of sequence reads, in electronic form, of nucleic acids isolated from the biological sample from the subject. In some embodiments, the sequence reads are obtained from a nucleic acid sample that has been enriched for target sequences, as described above. Advantageously, as described above, sequencing a nucleic acid sample that has been enriched for target nucleic acids, rather than all nucleic acids isolated from a biological sample, significantly reduces the average time and cost of the sequencing reaction. Accordingly, in some embodiments, method 5000 includes obtaining (5070) a plurality of sequence reads (e.g., sequence reads 128) of the nucleic acid hybridized to the probe set, e.g., as described above.
In some embodiments, the sequence reads have an average length of at least fifty nucleotides (5072). In other embodiments, the sequence reads have an average length of at least 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, or more nucleotides.
In some embodiments, the plurality of sequence reads are DNA sequence reads (5074). That is, the nucleic acids isolated from the biological sample are DNA molecules, e.g., genomic DNA (gDNA) molecules or fragments (such as cell-free DNA) thereof.
In some embodiments, the plurality of sequence reads are RNA sequence reads (5076). That is, the nucleic acids isolated from the biological sample are RNA molecules, e.g., mRNA. In some embodiments, RNA sequence reads are obtained directly from the isolated RNA, e.g., by direct RNA sequencing. Methods for direct RNA sequencing are well known in the art. See, for example, Ozsolak et al., 2009, Nature 461:814-18, and Garalde et al., 2018, Nat Methods, 15(3):201-206, which are incorporated by reference herein.
In other embodiments, RNA sequence reads are obtained through a cDNA intermediate. Accordingly, in some embodiments, the isolated RNA is used to create a cDNA library via cDNA synthesis. In some embodiments, whether for direct RNA sequencing or for cDNA library construction, the isolated RNA is first enriched for a desired type of RNA (e.g., mRNA) or species (e.g., specific mRNA transcripts).
Methods of enriching for desired RNA molecules are also well known in the art. For example, mRNA molecules can be enriched, e.g., relative to other RNA molecules in a total RNA preparation, using oligo-dT affinity techniques (see, for example, Rio et al., 2010, Cold Spring Harb Protoc., 2010(7), which is incorporated by reference herein). Specific mRNA transcripts can also be isolated, e.g., using hybridization probes that specifically bind to one or more mRNA sequences of interest.
cDNA library construction from isolated mRNAs is also well known in the art. In some embodiments, cDNA library construction is performed by first-strand DNA synthesis from the isolated mRNA using a reverse transcriptase, followed by second-strand synthesis using a DNA polymerase. Example methods for cDNA synthesis are described in McConnell and Watson, 1986, FEBS Lett. 195(1-2), pp. 199-202; Lin and Ying, 2003, Methods Mol Biol. 221, pp. 129-143, and Oh et al., 2003, Exp Mol Med. 35(6), pp. 586-90, which are incorporated by reference herein.
Methods for mRNA sequencing are well known in the art. In some embodiments, the mRNA sequencing is performed by whole exome sequencing (WES). Generally, WES is performed by isolating RNA from a tissue sample, optionally selecting for desired sequences and/or depleting unwanted RNA molecules, generating a cDNA library, and then sequencing the cDNA library, for example, using next generation sequencing (NGS) techniques. For a review of the use of whole exome sequencing techniques in cancer diagnosis, see, Serratì et al., 2016, Onco Targets Ther. 9, pp. 7355-7365, which is incorporated by reference herein.
RNA-Seq is a methodology used for RNA profiling based on next-generation sequencing that enables the measurement and comparison of gene expression patterns across a plurality of subjects. In some embodiments, millions of short strings, called ‘sequence reads,’ are generated from sequencing random positions of cDNA prepared from the input RNAs that are obtained from tumor tissue of a subject. These reads can then be computationally mapped to a reference genome to reveal a ‘transcriptional map’, where the number of sequence reads aligned to each gene gives a measure of its level of expression (e.g., abundance). Next-generation sequencing is disclosed in Shendure and Ji, 2008, “Next-generation DNA sequencing,” Nat. Biotechnology 26, pp. 1135-1145, which is incorporated by reference herein. RNA-Seq is disclosed in Nagalakshmi et al., 2008, “The transcriptional landscape of the yeast genome defined by RNA sequencing,” Science 320, pp. 1344-1349; and Finotello and Di Camillo, 2014, “Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis,” Briefings in Functional Genomics 14(2), pp. 130-142, which are incorporated by reference herein. Briefly, RNA molecules isolated from a biological sample are initially fragmented and reverse-transcribed into complementary DNAs (cDNAs). The obtained cDNAs are then amplified and subjected to next-generation DNA sequencing (NGS). In principle, any NGS technology can be used for RNA-Seq. In some embodiments, the Illumina sequencer (see the Internet at illumina.com) is used. See, Wang et al., 2009, “RNA-Seq: a revolutionary tool for transcriptomics,” Nat Rev Genet., 10(1):57-63, which is incorporated by reference herein. The millions of short reads generated for each such sample are then mapped to a reference genome, and the number of reads aligned to each gene, called ‘counts’, gives a digital measure of gene expression levels in the sample under investigation.
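The ‘counts’ described above can be illustrated with a minimal sketch. It assumes each sequence read has already been assigned to at most one gene by an upstream aligner; the input format, the function name, and the gene labels are hypothetical and are not part of the disclosure.

```python
from collections import Counter
from typing import Iterable, Optional


def count_reads_per_gene(assignments: Iterable[Optional[str]]) -> Counter:
    """Tally reads per gene; None marks a read that aligned to no gene."""
    return Counter(gene for gene in assignments if gene is not None)


# Toy input: one gene label (or None) per sequence read.
counts = count_reads_per_gene(["geneA", "geneB", None, "geneA", "geneA"])
```

In this toy input, three reads are tallied for the hypothetical "geneA" and one for "geneB"; the unassigned read is ignored.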
Methods for next generation sequencing, which can be used for either DNA or RNA sequencing, are well known in the art. These include sequencing-by-synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators.
The methods for detecting oncogenic pathogens described herein proceed through a computational subtractive process in which sequences that definitively align to a human reference genome are identified and removed from the dataset before the remaining sequence reads are aligned against oncogenic pathogen reference constructs (e.g., as illustrated in steps 304 and 310 in
In some embodiments, an index-based alignment algorithm is used to decrease the computational time needed to align the sequence reads to the human reference genome. Index-based algorithms construct auxiliary data structures for the read sequences, the reference sequence, or both, and use these structures, which are less complex than the raw sequence, when searching for matches between the read sequences and the reference sequence. Three examples of index-based alignment algorithms are (i) algorithms that use hash tables, (ii) algorithms that are based on suffix trees, and (iii) algorithms based on merge sorting. See, for example, Li and Homer, 2010, Brief Bioinform. 11(5):473-83, which is incorporated by reference herein. Such algorithms are used to exclude large parts of the human reference genome from the expensive dynamic programming comparison used to align a sequence read to the human genome. See, Canzar and Salzberg, 2018, “Short Read Mapping: An Algorithmic Tour,” Proc IEEE Inst. Electr Electron Eng., 105(3), 436-458, which is hereby incorporated by reference.
In one embodiment, the alignment (5082) of the sequence reads against the human reference genome uses a hash-based algorithm. For instance, in some embodiments sequence reads are mapped to the human reference genome using a hash-based algorithm and then aligned using a dynamic programming algorithm. Hash-based algorithms rely on generation of a hash table index of the reference sequence (e.g., a human reference genome), based on k-mers of a particular seed length of the sequence. Query sequences (e.g., sequence reads) are then broken into k-mers of the same length, and the algorithm uses the hash table index to identify regions in the reference sequence that share multiple k-mers with a query sequence. See, for example, Lee WP et al., 2014 PLoS One, 9(3):e90581. Examples of hash-based alignment algorithms include BLAST, MAQ, ZOOM, RMAP, CloudBurst, Eland, mrFAST/mrsFAST, SHRiMP, MOM, MOSAIK, PASS, ProbeMatch, SOAP, SRmapper, and STAMPY. Accordingly, in some embodiments, the alignment of the respective sequence read includes (5084) using a hash table of the human reference genome, where the hash table uses a seed length that is at least sixteen nucleotides in length to hash a plurality of reference seeds drawn from the human reference genome. In some embodiments, the hash table uses a seed length that is from 10 nucleotides to 30 nucleotides in length. In some embodiments, the hash table uses a seed length that is from 15 nucleotides to 25 nucleotides in length. In some embodiments, the seed length is between 18 nucleotides and 22 nucleotides (5088). In some embodiments, the seed length is 20 nucleotides (5090). In yet other embodiments, the hash table uses a seed length that is at least 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or more nucleotides in length. 
In some embodiments, the hash table uses a rolling window hash, in which the plurality of reference seeds overlap each other on the human reference genome (5086).
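The seed hash table and rolling window described above can be sketched as follows. This is a simplified illustration and not the implementation of any named aligner: it uses an in-memory dictionary, exact seed matches only, and a toy seed length of four nucleotides so the example stays readable. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_seed_index(reference: str, seed_length: int = 20) -> Dict[str, List[int]]:
    """Map every overlapping seed (k-mer) of the reference to its start positions."""
    index = defaultdict(list)
    # Rolling window: advance one nucleotide at a time so that seeds overlap.
    for start in range(len(reference) - seed_length + 1):
        index[reference[start:start + seed_length]].append(start)
    return dict(index)


def candidate_locations(read: str, index: Dict[str, List[int]],
                        seed_length: int = 20) -> Dict[int, int]:
    """For each implied read start on the reference, count the read seeds supporting it."""
    votes = defaultdict(int)
    for offset in range(len(read) - seed_length + 1):
        seed = read[offset:offset + seed_length]
        for ref_pos in index.get(seed, []):
            votes[ref_pos - offset] += 1
    return dict(votes)


# Toy example with a seed length of 4 instead of 20.
toy_index = build_seed_index("ACGTACGGTCA", seed_length=4)
toy_votes = candidate_locations("ACGGTC", toy_index, seed_length=4)
```

Here all three seeds of the toy read point to reference position 4, so that position is the only candidate passed on to the more expensive alignment step.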
Hash-based mapping algorithms require less computation time to identify possible alignments of a sequence read to a reference genome than global alignment algorithms, because the algorithm does not search for each nucleotide individually. However, this can result in the identification of several putative mappings for the sequence read in the reference genome. Accordingly, the system then determines which, if any, of the putative mappings represents a true alignment with the sequence read (e.g., using a dynamic programming algorithm as disclosed in Canzar and Salzberg, 2018, “Short Read Mapping: An Algorithmic Tour,” Proc IEEE Inst. Electr Electron Eng., 105(3), 436-458, which is hereby incorporated by reference). Accordingly, in some embodiments, the alignment (5082) of the sequence reads against the human reference genome includes (i) identifying one or more locations of the human reference genome that match a respective sequence read (mappings) using the hash table, (ii) determining, for each respective location of the one or more locations, a similarity score based upon a minimum edit distance between the respective location and the respective sequence read (e.g., using a dynamic programming algorithm), and (iii) making a determination as to whether the respective sequence read aligns to the human reference genome using at least the best similarity score for the one or more locations of the human reference genome (5092).
In some embodiments, the determination as to whether the sequence read aligns to any particular locus in the reference genome is made by ranking the putative matches to the sequence read and requiring that the highest ranked alignment be significantly better than the other putative matches before a positive match is assigned. In some embodiments, the one or more (putatively matched) locations (in the reference genome) include a plurality of locations that are ranked by their minimum edit distance, thereby forming a ranked list of minimum edit distances, where the respective sequence read is determined to align to the human reference genome when the smallest minimum edit distance is smaller than the second smallest minimum edit distance in the ranked list of minimum edit distances by a threshold amount (5094). Minimum edit distance is the minimum number of operations (insertions, deletions, and substitutions) required to convert one string to another. Methods for determining minimum edit distance are known in the art. For example, see, Mantaci S. et al., Int. J. of Approximate Reasoning, 47:109-24, which is incorporated by reference herein.
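The ranking rule described above can be sketched with a standard dynamic programming edit distance. The margin of two edits, the function names, and the treatment of a read with a single candidate location (accepted here) are assumptions made for illustration; the disclosure does not fix these values.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion from a
                            curr[j - 1] + 1,            # insertion into a
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def is_confident_alignment(read: str, candidates: List[str], margin: int = 2) -> bool:
    """True when the best candidate beats the runner-up by at least `margin` edits."""
    distances = sorted(edit_distance(read, c) for c in candidates)
    if not distances:
        return False
    if len(distances) == 1:
        return True  # assumption: a lone candidate has no competitor to beat
    return distances[1] - distances[0] >= margin
```

For example, a read that matches one candidate exactly and differs from the next best by two substitutions would pass with a margin of two, while a runner-up only one edit away would leave the read unassigned.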
In some embodiments, minimum similarity standards are required in order for the system to positively match the sequence read to any locus in the reference genome when using a hash-based alignment algorithm. For instance, in some embodiments, a minimal number of seeds derived from the sequence read must match within a particular locus in the reference genome, ensuring that the putative alignment represents alignment of the entire sequence read, as opposed to just a portion of the sequence read, e.g., corresponding to a single seed length of sequence. Accordingly, in some embodiments, the determining (5082) draws a plurality of sequence read seeds from the respective sequence read and performs the identifying (i; 5092) and the determining (ii; 5092) for each sequence read seed in the plurality of sequence read seeds, and the making (iii; 5092) requires at least three sequence read seeds in the plurality of sequence read seeds to map to a same candidate location of the human reference genome in order for the respective sequence read to be considered aligned to the human reference genome.
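A minimal sketch of the three-seed requirement follows. It assumes seed hits are supplied as (offset within the read, position on the reference) pairs and that two hits support the same candidate location when they imply the same read start; both the input format and the function name are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def supported_locations(seed_hits: Iterable[Tuple[int, int]],
                        min_seeds: int = 3) -> List[int]:
    """Return candidate read-start positions backed by at least `min_seeds` seed hits."""
    votes = Counter(ref_pos - read_offset for read_offset, ref_pos in seed_hits)
    return sorted(loc for loc, n in votes.items() if n >= min_seeds)


# Three seeds agree on reference start 100; only two agree on start 400.
hits = [(0, 100), (5, 105), (10, 110), (0, 400), (5, 405)]
kept = supported_locations(hits)
```

In this toy input only position 100 is retained, because position 400 is supported by just two seeds.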
In some embodiments, the alignment (5082) of the sequence reads against the human reference genome uses an algorithm based on suffix trees or a suffix array. Examples of these types of algorithms include MUMmer, MUMmerGPU, Vmatch, PacBio Aligner, Bowtie, Bowtie 2, BWA, and BWA-SW. See, for example, Langmead and Salzberg, 2012, “Fast gapped-read alignment with Bowtie 2,” Nature Methods 9(4):357-359, which is hereby incorporated by reference.
In other embodiments, the alignment (5082) of the sequence reads against the human reference genome uses an algorithm based on merge sorting. Examples of these types of algorithms include Slider and SliderII.
In some embodiments, the alignment of sequence reads against the human reference genome uses SARUMAN, GPU-RMAP, BarraCUDA, SOAP3, SOAP3-dp, CUSHAW, CUSHAW2-GPU, Burrows-Wheeler transform algorithm, a hashing algorithm, pigeonhole, MAQ, RMAP, SOAP, Hobbes, ZOOM, FastHASH, RazerS, RazerS 3, BFAST SEME, SHRiMP, BWT-SW, BWA, Bowtie, BLASR, Bowtie 2, BWA-SW, GEM, or SOAP2. For further discussion of these alignment algorithms, see Canzar and Salzberg, 2018, “Short Read Mapping: An Algorithmic Tour,” Proc IEEE Inst. Electr Electron Eng., 105(3), 436-458, which is hereby incorporated by reference.
As illustrated in
Publicly accessible databases of microbial and viral genomes are known to those of skill in the art. For instance, the National Center for Biotechnology Information (NCBI) curates publicly accessible databases of microbial genomes, including archaea genomes and bacterial genomes. Likewise, the NCBI also curates publicly accessible databases of viral genomes. In some embodiments, a publicly accessible genome database, such as an NCBI database, is used for identifying sequence reads originating from oncogenic pathogens in the sequence reads that were not mapped to the human reference genome (e.g., unaligned sequence reads 142 as shown in
In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes at least 10 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes at least 100 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes at least 1000 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes at least 10,000 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes at least 100,000 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes at least 1,000,000 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes from 10 pathogen genomes to 2,000,000 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes from 100 pathogen genomes to 2,000,000 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes from 1000 pathogen genomes to 2,000,000 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes from 10,000 pathogen genomes to 2,000,000 pathogen genomes. In some embodiments, the set of pathogenic genomic constructs against which the unmapped sequence reads are aligned includes from 100,000 pathogen genomes to 2,000,000 pathogen genomes.
In some embodiments, unmapped sequence reads 308 are first aligned (312) against primary target sequences, e.g., sequences from the genome or exome of an oncogenic pathogen for which a probe was included in the probe set used to enrich nucleic acids isolated from the biological sample from the subject prior to sequencing. In some embodiments, the primary target sequences only include sequences corresponding to the sequences (or complement thereof) of the probes included in the enrichment probe set. In other embodiments, the primary target sequences include whole reference genomes or exomes for the oncogenic pathogens of primary interest.
In some embodiments, after aligning the unmapped sequence reads 308 against the primary target sequences, any remaining sequence reads (e.g., those sequence reads that also did not map to the primary target sequences) are then aligned against a larger database containing reference sequences (e.g., partial or complete reference genomes or exomes, such as the microbial and viral genome databases maintained by the NCBI) for a plurality of other pathogens (e.g., as illustrated in step 314 of
In other embodiments, all of unmapped sequence reads 308 are aligned (314) against a database of reference sequences (e.g., partial or complete reference genomes or exomes) that include the plurality of oncogenic pathogens (e.g., as illustrated in step 314 of
In some embodiments, in a similar fashion as described above with reference to the alignment of sequence reads against the reference human genome, alignment of the remaining unmapped sequence reads 308 to the database of reference sequences can be sped up by using an index-based sequence alignment algorithm, e.g., an algorithm that uses hash tables, an algorithm that is based on a suffix tree, or an algorithm based on merge sorting.
In one embodiment, the alignment (5098) of the sequence reads against reference constructs for the oncogenic pathogens uses a hash-based alignment algorithm. Accordingly, in some embodiments, method 5000 includes using (5100) a corresponding oncogenic pathogen hash table of the reference genome of the respective oncogenic pathogen, where the corresponding hash table uses a seed length that is at least sixteen nucleotides in length to hash a plurality of reference seeds drawn from the reference genome of the respective oncogenic pathogen. In some embodiments, the hash table uses a seed length that is from 10 nucleotides to 30 nucleotides in length. In some embodiments, the hash table uses a seed length that is from 15 nucleotides to 25 nucleotides in length. In some embodiments, the seed length is between 18 nucleotides and 22 nucleotides. In some embodiments, the seed length is 20 nucleotides. In yet other embodiments, the hash table uses a seed length that is at least 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or more nucleotides in length. In some embodiments, the hash table uses a rolling window hash, in which the plurality of reference seeds overlap each other on each oncogenic pathogen reference construct.
Hash-based alignment algorithms require less computation time to identify possible alignments of a sequence read to a reference genome, because the algorithm does not search for each nucleotide individually. However, this can result in the identification of several putative matches for the sequence read in the reference construct. Accordingly, the system then determines which, if any, of the putative matches represents a true alignment with the sequence read. Accordingly, in some embodiments, the alignment (5098) of the sequence reads against the reference constructs for the oncogenic pathogens includes calculating a corresponding similarity score between the respective sequence read and putative matching loci in the reference genomes for the oncogenic pathogens. In some embodiments, the determination includes ranking the putative matches to the sequence read and requiring that the highest ranked alignment be sufficiently better than the other putative matches before a positive match is assigned. In some embodiments, the one or more (putatively matched) locations (in the reference genome) include a plurality of locations that are ranked by their minimum edit distance, thereby forming a ranked list of minimum edit distances, where the respective sequence read is determined to align to the oncogenic pathogen reference construct when the smallest minimum edit distance is smaller than the second smallest minimum edit distance in the ranked list of minimum edit distances by a threshold amount. In other embodiments, the sequence read is putatively assigned to match to the locus in an oncogenic pathogen reference genome with the highest similarity score to the sequence read, e.g., regardless of whether that similarity score is significantly better than a similarity score for a second locus from an oncogenic pathogen reference construct. However, in some embodiments, a minimal threshold similarity must be met before any match is assigned.
The result of the alignment against the oncogenic pathogen reference constructs is the partitioning of the remaining sequencing reads into those sequence reads that map to an oncogenic pathogen reference construct and those sequence reads that do not map to an oncogenic pathogen reference construct (e.g., unaligned sequence reads 146).
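The overall subtractive flow, in which human reads are removed first and the remainder is partitioned by pathogen alignment, can be sketched as below. The two predicate arguments stand in for real aligners, and the toy predicates use exact substring matching purely for illustration; all names and sequences are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def subtractive_partition(
    reads: Iterable[str],
    aligns_to_human: Callable[[str], bool],
    aligns_to_pathogen: Callable[[str], bool],
) -> Tuple[List[str], List[str], List[str]]:
    """Split reads into human-aligned, pathogen-aligned, and unaligned groups."""
    human, pathogen, unaligned = [], [], []
    for read in reads:
        if aligns_to_human(read):
            human.append(read)       # subtracted before any pathogen alignment
        elif aligns_to_pathogen(read):
            pathogen.append(read)
        else:
            unaligned.append(read)
    return human, pathogen, unaligned


# Toy references and exact-substring "aligners" (illustrative stand-ins only).
TOY_HUMAN = "AAAACCCCGGGG"
TOY_PATHOGEN = "TTTTGATTACA"
groups = subtractive_partition(
    ["AACCCC", "GATTAC", "NNNNNN"],
    lambda r: r in TOY_HUMAN,
    lambda r: r in TOY_PATHOGEN,
)
```

Only reads that fail the human check are ever tested against the pathogen reference, which mirrors the order of operations described above.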
As shown in
Accordingly, in some embodiments, the alignment (5098) of the sequence reads (e.g., aligned sequence reads 144) against reference constructs for the oncogenic pathogens includes (i) calculating a corresponding similarity score between the respective sequence read and the respective reference genome of the oncogenic pathogen in the plurality of oncogenic pathogens, (ii) labeling the respective sequence read as aligning with the human reference genome when the best similarity score between the respective sequence read and the human reference genome exceeds the similarity score between the respective sequence read and the respective reference genome of the oncogenic pathogen in the plurality of oncogenic pathogens, and (iii) labeling the respective sequence read as aligning with a particular oncogenic pathogen in the plurality of oncogenic pathogens when the similarity score between the respective sequence read and the reference genome of the particular oncogenic pathogen exceeds the best similarity score between the respective sequence read and the human reference genome (5102), e.g., forming set 148 of aligned sequence reads.
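The competitive labeling in (ii) and (iii) can be sketched as a comparison of precomputed similarity scores. The text does not say how ties are handled; this sketch assumes a tie is resolved in favor of the human reference genome. The function name and the example scores are hypothetical.

```python
from typing import Dict


def label_read(best_human_score: float, pathogen_scores: Dict[str, float]) -> str:
    """Label a read 'human' or with the name of the best-scoring pathogen."""
    if not pathogen_scores:
        return "human"
    best_pathogen, best_score = max(pathogen_scores.items(), key=lambda kv: kv[1])
    # Assumption: the pathogen must strictly exceed the human score; ties go to human.
    if best_score > best_human_score:
        return best_pathogen
    return "human"


label = label_read(31.0, {"EBV": 58.0, "HPV16": 12.0})
```

With these example scores the read is labeled "EBV", since 58.0 exceeds the best human score of 31.0.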
In some embodiments, the similarity scores determined for the alignment between the sequence read and an oncogenic pathogen, as well as the similarity score determined for the alignment between the sequence read and the human reference genome, are the same similarity score determined when aligning the sequence read against the oncogenic pathogen reference construct and human reference genome, e.g., using a hash-based algorithm.
In some embodiments, the similarity scores determined for the alignment between the sequence read and an oncogenic pathogen, as well as the similarity score determined for the alignment between the sequence read and the human reference genome, are not the same similarity score determined when aligning the sequence read against the oncogenic pathogen reference construct and human reference genome, e.g., using a hash-based algorithm. Rather, in some embodiments, the sequence read is re-aligned to the human reference genome and the oncogenic pathogen reference construct using a local sequence alignment algorithm, which thereby generates a similarity score. A local sequence alignment algorithm compares subsequences of different lengths in the query sequence (e.g., sequence read) to subsequences in the subject sequence (e.g., reference construct) to create the best alignment for each portion of the query sequence. In contrast, global sequence alignment algorithms align the entirety of the sequences, e.g., end to end. Examples of local sequence alignment algorithms include the Smith-Waterman algorithm (see, for example, Smith and Waterman, J Mol. Biol., 147(1):195-97 (1981), which is incorporated herein by reference), Lalign (see, for example, Huang and Miller, Adv. Appl. Math, 12:337-57 (1991), which is incorporated by reference herein), and PatternHunter (see, for example, Ma B. et al., Bioinformatics, 18(3):440-45 (2002), which is incorporated by reference herein).
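As an illustration of how a local alignment yields a similarity score, the following is a compact Smith-Waterman scoring routine. It returns only the best local score (no traceback) and uses a linear gap penalty; the scoring values (+2 match, -1 mismatch, -2 gap) are assumptions chosen for the example and are not taken from the disclosure.

```python
def smith_waterman_score(query: str, subject: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between query and subject (linear gap penalty)."""
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, 1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # Clamping at zero lets an alignment start anywhere, which makes it local.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best
```

Because cell values never drop below zero, a read that matches only part of a reference still receives the full score of its matching portion, which is the property that distinguishes local from end-to-end (global) alignment.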
The result of the competitive alignment step described above is the formation of a sub-plurality of sequence reads 318 that have been positively mapped to an oncogenic pathogen reference construct.
In some embodiments, as shown in
In some embodiments, the hash-based alignment algorithm allows for alignment of a sequence read to an oncogenic pathogen at a family level, e.g., irrespective of the strain of the oncogenic pathogen from which the sequence originates. This is because hash-based algorithms, e.g., that use edit distance as a parameter, allow for intermediate non-alignment of the query and reference sequences in positive matches. However, in some cases, the identity of the particular strain of the oncogenic pathogen informs the optimal treatment regime for an afflicted subject. Accordingly, in some embodiments, as shown in
In some embodiments, classification of the pathogen strain is performed by competitive alignment of the sequence read against a plurality of reference constructs for the various strains of the oncogenic pathogen. Generally, the competitive alignment is performed by aligning the sequence read to each reference construct, and determining a similarity score for the alignment. The similarity scores are then compared, and the sequence read is assigned to the strain corresponding to the highest similarity score. In some embodiments, the competitive alignment is performed using a local sequence alignment algorithm. As described above, local sequence alignment algorithms (such as the Smith-Waterman algorithm, Lalign, and PatternHunter), require more computational resources than hash-based mapping algorithms, but are more precise than hash-based mapping algorithms.
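Strain assignment by highest similarity score can be sketched as follows. The scoring function here is a deliberately simple stand-in (best ungapped match count over all placements) rather than a true local alignment algorithm, and the strain names and sequences are hypothetical toy data.

```python
from typing import Callable, Dict


def toy_score(read: str, reference: str) -> int:
    """Stand-in similarity: best count of matching bases over all ungapped placements."""
    n = len(read)
    return max(
        (sum(a == b for a, b in zip(read, reference[i:i + n]))
         for i in range(len(reference) - n + 1)),
        default=0,  # reference shorter than the read
    )


def assign_strain(read: str, strain_refs: Dict[str, str],
                  score: Callable[[str, str], int] = toy_score) -> str:
    """Assign the read to the strain whose reference gives the highest score."""
    # On a tie, max() keeps the first strain encountered; that choice is an assumption.
    return max(strain_refs, key=lambda name: score(read, strain_refs[name]))


strain_refs = {"strainA": "ACGTACGTAA", "strainB": "ACGTTCGTAA"}
call = assign_strain("GTTCGT", strain_refs)
```

The toy read matches "strainB" at all six positions but "strainA" at only five, so it is assigned to "strainB". Any scoring function with the same signature could be substituted for the stand-in.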
Accordingly, in some embodiments, the alignment (5098) of the sequence reads against reference constructs for the oncogenic pathogens is performed against a first database that includes at least one reference construct for HPV, at least one reference construct for EBV, and at least one reference construct for MCPyV, e.g., using an index-based alignment algorithm (such as a hash-based alignment algorithm). After one or more sequence reads are aligned to either the HPV reference construct, the EBV reference construct, or the MCPyV reference construct, a competitive alignment is performed between the sequence read and reference constructs for different strains of the HPV, EBV, or MCPyV, e.g., using a second database. In some embodiments, the first database includes at least reference constructs for HPV16, HPV18, and HPV33. In other embodiments, the first database only includes a reference construct for one of HPV16, HPV18, and HPV33. In some embodiments, the first database includes a consensus reference construct for two or more of HPV16, HPV18, and HPV33.
As shown in
Accordingly, in some embodiments, method 5000 includes tracking (5104) for each respective oncogenic pathogen in the plurality of oncogenic pathogens, a number of sequence reads in the plurality of sequence reads that both (i) fail to align to the human reference genome and (ii) align to a reference genome of a respective oncogenic pathogen (e.g., sequence reads 318, as depicted in
Then, method 5000 includes using (5106) the sequence read count for each oncogenic pathogen in the plurality of oncogenic pathogens to ascertain whether the subject is afflicted with an oncogenic pathogen (e.g., as illustrated in step 322 of
Generally, a biological sample from a subject that is afflicted with an oncogenic pathogen results in the identification of from one hundred to several hundred sequence reads that map to the oncogenic pathogen reference construct, using the methods described herein. However, these methods can correctly identify infection at much lower numbers of corresponding sequence reads, e.g., at ten sequence reads or fewer. Accordingly, in some embodiments, the threshold number of sequence reads is between seven and twenty-five sequence reads (5110). In one embodiment, the threshold number of sequence reads is ten sequence reads (5112). In some embodiments, the threshold number of sequence reads is 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 sequence reads.
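Applying a read-count threshold can be sketched as below. The sketch assumes a pathogen is called when its count meets or exceeds the threshold; whether the comparison is inclusive is not stated above, so that choice, the function name, and the example counts are assumptions.

```python
from typing import Dict, List


def call_pathogens(read_counts: Dict[str, int], threshold: int = 10) -> List[str]:
    """Return the pathogens whose mapped read count reaches the threshold."""
    return sorted(name for name, n in read_counts.items() if n >= threshold)


calls = call_pathogens({"EBV": 240, "HPV16": 3, "MCPyV": 10})
```

With the default threshold of ten reads, the example counts yield calls for EBV and MCPyV but not HPV16.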
In some embodiments, the method further identifies which strain of the oncogenic pathogen the subject has been afflicted with. For example, in some embodiments, method 5000 determines that the subject is afflicted with the oncogenic virus, and method 5000 includes using the sequence reads that map to a reference genome of the oncogenic virus to determine a strain of the oncogenic virus from among a plurality of strains of the oncogenic virus. For instance, in some embodiments, the using determines that the subject is afflicted with the member of the papillomavirus family, and the method includes using the sequence reads that map to a reference genome of the member of the papillomavirus family to determine a strain of the member of the papillomavirus family from among a plurality of strains of the papillomavirus family (5116). In some embodiments, the strain of the member of the papillomavirus family is HPV16, HPV18, HPV31, HPV33, HPV35, HPV39, HPV45, HPV51, HPV52, HPV56, HPV58, HPV59 or HPV68 (5118).
Similarly, in some embodiments, the using determines that the subject is afflicted with the member of the herpes virus family, and the method includes using the sequence reads that map to a reference genome of the member of the herpes virus family to determine a strain of the member of the herpes virus family from among a plurality of strains of the herpes virus family (5120). In some embodiments, the plurality of strains of the herpes virus family includes the Epstein-Barr virus (5122).
Similarly, in some embodiments, the using determines that the subject is afflicted with the member of the murine polyomavirus group, and the method includes using the sequence reads that map to a reference genome of the member of the murine polyomavirus group to determine a strain of the murine polyomavirus group from among a plurality of strains of the murine polyomavirus group (5124). In some embodiments, the strain in the plurality of strains of the murine polyomavirus group is Merkel cell polyomavirus (5126).
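By way of non-limiting illustration, strain determination from mapped sequence reads can be sketched as follows. The sketch assumes, for illustration only, that each sequence read has already been assigned to the strain-specific reference genome to which it maps best, and that the strain supported by the most reads is reported when it meets a minimum read count. The function name assign_strain and the min_reads parameter are hypothetical.

```python
from collections import Counter


def assign_strain(strain_hits, min_reads=10):
    """Pick the strain supported by the most mapped reads.

    strain_hits: iterable with one strain name per mapped read (hypothetical
    input format). Returns None when no strain reaches min_reads.
    """
    counts = Counter(strain_hits)
    if not counts:
        return None
    strain, n = counts.most_common(1)[0]
    return strain if n >= min_reads else None


# Toy example: most reads map to the HPV16 reference genome.
print(assign_strain(["HPV16"] * 40 + ["HPV18"] * 5))  # HPV16
```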
In some embodiments, no reference construct for the strain of the oncogenic pathogen the subject is afflicted with will exist. Accordingly, in some embodiments, de novo assembly of the sequence read data is performed to identify the strain of the pathogen. Specifically, in some embodiments, the using determines that the subject is afflicted with a first oncogenic pathogen in the plurality of oncogenic pathogens, and the method also includes: subjecting the sequence reads for the first oncogenic pathogen in the plurality of sequence reads to de novo assembly thereby reconstructing a consensus sequence of a genome of the first oncogenic pathogen; comparing the genome of the first oncogenic pathogen to the respective reference genome of each strain in one or more known strains of the first oncogenic pathogen; and identifying the first oncogenic pathogen in the subject as a new strain of the first oncogenic pathogen when a homology between the genome of the first oncogenic pathogen and the reference genome of each strain in one or more known strains of the first oncogenic pathogen fails to satisfy a homology criterion (5128). Generally, the homology criterion is between about 80% and about 100%. In one embodiment, the homology criterion is 90% (5130). In other embodiments, the homology criterion is about 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, or 95%.
Another aspect of the present disclosure provides methods for discriminating between a first cancer condition and a second cancer condition in a subject, where the first cancer condition is associated with infection by a first oncogenic pathogen and the second cancer condition is associated with an oncogenic pathogen-free status. The method includes obtaining a dataset for the subject, the dataset including a plurality of abundance values, where each respective abundance value in the plurality of abundance values quantifies a level of expression of a corresponding gene, in a plurality of genes, in a cancerous tissue from the subject. The method then includes inputting the dataset to a classifier trained according to any one of the methodologies described herein.
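By way of non-limiting illustration, the homology comparison described above can be sketched as follows. The sketch does not perform de novo assembly. It assumes, for illustration only, that the consensus sequence and each known-strain reference genome have already been aligned to a common length, and it uses position-wise percent identity as a simple stand-in for the homology measure, which the present disclosure does not limit to any particular calculation. The function names and the toy sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of matching positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def is_new_strain(consensus, known_strains, criterion=90.0):
    """True when the consensus fails the criterion against every known strain.

    known_strains: dict mapping strain name -> pre-aligned reference sequence.
    """
    return all(
        percent_identity(consensus, ref) < criterion
        for ref in known_strains.values()
    )


# Toy sequences (not real genomes): strain_2 differs at 2 of 10 positions.
consensus = "ACGTACGTAC"
known = {"strain_1": "ACGTACGTAC", "strain_2": "ACGTACGTTT"}
print(is_new_strain(consensus, known))                          # False
print(is_new_strain(consensus, {"strain_2": known["strain_2"]}))  # True
```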
Another aspect of the present disclosure provides nucleic acid probes for discriminating between a first cancer condition and a second cancer condition in a human subject, where the first cancer condition is associated with an oncogenic pathogen infection and the second cancer condition is associated with an oncogenic pathogen-free status. The nucleic acid probes have nucleic acid sequences that are complementary or identical to sequences of the genes identified as differentially expressed in cancers associated with an oncogenic pathogen infection.
Another aspect of the present disclosure provides a method for discriminating between a first cancer condition and a second cancer condition in a subject with a first type of cancer, where the first cancer condition is associated with infection by a first oncogenic pathogen and the second cancer condition is associated with an oncogenic pathogen-free status. The method includes obtaining a dataset for the subject, the dataset having a plurality of abundance values (e.g., relative mRNA expression values), where each respective abundance value in the plurality of abundance values quantifies a level of expression of a corresponding gene, in a discriminating gene set, in a cancerous tissue from the subject. The method then includes inputting the dataset to a classifier trained to discriminate between at least the first cancer condition and the second cancer condition based on abundance values for the discriminating gene set in a cancerous tissue of a subject, thereby determining the cancer condition of the subject.
In some embodiments, the first type of cancer is breast cancer, lung cancer, prostate cancer, colorectal cancer, renal cancer, uterine cancer, pancreatic cancer, esophagus cancer, head/neck cancer, ovarian cancer, hepatobiliary cancer, cervical cancer, thyroid cancer, or bladder cancer.
In some embodiments, the dataset further includes a variant allele count for one or more variant alleles at one or more loci in the genome of the cancerous tissue from the subject.
In some embodiments, the first cancer condition is associated with infection by a first oncogenic pathogen selected from the group consisting of Epstein-Barr virus (EBV), hepatitis B virus (HBV), hepatitis C virus (HCV), human papilloma virus (HPV), human T-cell lymphotropic virus (HTLV-1), Kaposi’s sarcoma-associated herpesvirus (KSHV), and Merkel cell polyomavirus (MCV).
In some embodiments, the first cancer condition is selected from the group consisting of cervical cancer associated with human papilloma virus (HPV), head and neck cancer associated with HPV, gastric cancer associated with Epstein-Barr virus (EBV), nasopharyngeal cancer associated with EBV, Burkitt lymphoma associated with EBV, Hodgkin lymphoma associated with EBV, liver cancer associated with hepatitis B virus (HBV), liver cancer associated with hepatitis C virus (HCV), Kaposi sarcoma associated with Kaposi’s sarcoma-associated herpesvirus (KSHV), adult T-cell leukemia/lymphoma associated with human T-cell lymphotropic virus (HTLV-1), and Merkel cell carcinoma associated with Merkel cell polyomavirus (MCV).
In some embodiments, the first cancer condition is associated with infection by a human papillomavirus (HPV) oncogenic virus and the second cancer condition is associated with an HPV-free status, and the discriminating gene set includes at least five genes selected from the genes listed in Table 21. In some embodiments, the first cancer condition is cervical cancer associated with infection by a human papillomavirus (HPV). In some embodiments, the first cancer condition is head and neck cancer associated with infection by a human papillomavirus (HPV). In some embodiments, the discriminating gene set includes at least ten genes selected from the genes listed in Table 21. In some embodiments, the discriminating gene set includes at least twenty genes selected from the genes listed in Table 21. In some embodiments, the discriminating gene set includes all twenty-four of the genes listed in Table 21. In some embodiments, the dataset also includes a variant allele count for TP53 (ENSG00000141510) and CDKN2A (ENSG00000147889) in the genome of the cancerous tissue from the subject.
In some embodiments, the method also includes treating the subject for cervical cancer by, when the classifier result indicates that the human cancer patient is infected with an HPV oncogenic virus, administering a first therapy tailored for treatment of cervical cancer associated with an HPV infection, and when the classifier result indicates that the human cancer patient is not infected with an HPV oncogenic virus, administering a second therapy tailored for treatment of cervical cancer not associated with an HPV infection. In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection includes a therapeutic vaccine or an adoptive cell therapy. In some embodiments, the second therapy tailored for treatment of cervical cancer not associated with an HPV infection is chemotherapy. In some embodiments, the chemotherapy includes co-administration of cisplatin and a second therapeutic agent selected from the group consisting of 5-fluorouracil, paclitaxel, and bevacizumab.
In some embodiments, the method also includes treating the subject for head and neck cancer by, when the classifier result indicates that the human cancer patient is infected with an HPV oncogenic virus, administering a first therapy tailored for treatment of head and neck cancer associated with an HPV infection, and when the classifier result indicates that the human cancer patient is not infected with an HPV oncogenic virus, administering a second therapy tailored for treatment of head and neck cancer not associated with an HPV infection. In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection includes a therapeutic vaccine, an immune checkpoint inhibitor, or a PI3K inhibitor. In some embodiments, the second therapy tailored for treatment of head and neck cancer not associated with an HPV infection includes chemotherapy. In some embodiments, the chemotherapy includes administration of cisplatin, and the second therapy also includes concurrent radiotherapy or postoperative chemoradiation.
In some embodiments, the first cancer condition is associated with infection by an Epstein-Barr virus (EBV) oncogenic virus and the second cancer condition is associated with an EBV-free status, and the discriminating gene set includes at least five genes selected from the genes listed in Table 4. In some embodiments, the first cancer condition is gastric cancer associated with infection by an Epstein-Barr virus (EBV). In some embodiments, the discriminating gene set includes all nine genes listed in Table 4. In some embodiments, the dataset also includes a variant allele count for TP53 (ENSG00000141510) and PIK3CA (ENSG00000121879) in the genome of the cancerous tissue from the subject.
In some embodiments, the method also includes treating the subject for gastric cancer by, when the classifier result indicates that the human cancer patient is infected with an EBV oncogenic virus, administering a first therapy tailored for treatment of gastric cancer associated with an EBV infection, and when the classifier result indicates that the human cancer patient is not infected with an EBV oncogenic virus, administering a second therapy tailored for treatment of gastric cancer not associated with an EBV infection. In some embodiments, the first therapy tailored for treatment of gastric cancer associated with an EBV infection includes an immune checkpoint inhibitor. In some embodiments, the second therapy tailored for treatment of gastric cancer not associated with an EBV infection includes chemotherapy. In some embodiments, the chemotherapy includes administration of a therapeutic agent selected from the group consisting of paclitaxel, carboplatin, cisplatin, 5-fluorouracil, and oxaliplatin.
In some embodiments, the method also includes treating the subject for cancer by, when the classifier result indicates that the human cancer patient is infected with the first oncogenic pathogen, administering a first therapy tailored for treatment of the first type of cancer associated with infection by the first oncogenic pathogen, and when the classifier result indicates that the human cancer patient is not infected with the first oncogenic pathogen, administering a second therapy tailored for treatment of the first type of cancer associated with an oncogenic pathogen-free status.
In some embodiments, the classifier was trained by a method including (1) obtaining a dataset comprising, for each respective subject in a plurality of subjects of a species: (i) a corresponding plurality of abundance values, wherein each respective abundance value in the corresponding plurality of abundance values quantifies a level of expression of a corresponding gene, in a plurality of genes, in a tumor sample of the respective subject, and (ii) an indication of cancer condition of the respective subject, wherein the indication of cancer condition identifies whether the respective subject has the first cancer condition or the second cancer condition, and wherein the plurality of subjects includes a first subset of subjects that are afflicted with the first cancer condition and a second subset of subjects that are afflicted with the second condition; (2) identifying the discriminating gene set using the corresponding plurality of abundance values and respective indication of the cancer condition of respective subjects in the plurality of subjects, wherein the discriminating gene set comprises a subset of the plurality of genes; and (3) using the respective abundance values for the discriminating gene set and the respective indication of cancer condition across the plurality of subjects to train a classifier to discriminate between the first cancer condition and the second cancer condition as a function of respective abundance values for the discriminating gene set.
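By way of non-limiting illustration, steps (1) through (3) above can be sketched as follows. The present disclosure does not limit the gene selection method or the classifier to any particular algorithm; this sketch assumes, for illustration only, that genes are ranked by the absolute difference in mean abundance between the two subsets of subjects and that a nearest-centroid rule serves as the classifier. All function names, gene names (GENE_A, GENE_B, GENE_C), and abundance values are hypothetical toy data.

```python
from statistics import mean


def select_genes(samples, labels, k):
    """Step (2): rank genes by |mean(condition 1) - mean(condition 0)|; keep top k."""
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    score = {
        g: abs(mean(s[g] for s in pos) - mean(s[g] for s in neg))
        for g in samples[0]
    }
    return sorted(score, key=score.get, reverse=True)[:k]


def train_centroids(samples, labels, gene_set):
    """Step (3): compute one mean abundance vector (centroid) per condition."""
    centroids = {}
    for cond in (0, 1):
        group = [s for s, y in zip(samples, labels) if y == cond]
        centroids[cond] = [mean(s[g] for s in group) for g in gene_set]
    return centroids


def classify(sample, gene_set, centroids):
    """Assign the condition whose centroid is closest (squared Euclidean distance)."""
    def dist(cond):
        return sum((sample[g] - c) ** 2 for g, c in zip(gene_set, centroids[cond]))
    return min(centroids, key=dist)


# Step (1): toy dataset; label 1 = first cancer condition, 0 = second.
samples = [
    {"GENE_A": 9.0, "GENE_B": 1.0, "GENE_C": 5.0},
    {"GENE_A": 8.0, "GENE_B": 2.0, "GENE_C": 5.2},
    {"GENE_A": 2.0, "GENE_B": 7.0, "GENE_C": 5.1},
    {"GENE_A": 1.0, "GENE_B": 8.0, "GENE_C": 4.9},
]
labels = [1, 1, 0, 0]

gene_set = select_genes(samples, labels, k=2)
centroids = train_centroids(samples, labels, gene_set)
new_subject = {"GENE_A": 7.5, "GENE_B": 2.5, "GENE_C": 5.0}
print(gene_set, classify(new_subject, gene_set, centroids))
```

In this toy dataset GENE_C barely differs between the two subsets and is therefore excluded from the discriminating gene set, which illustrates that the discriminating gene set is a subset of the plurality of genes.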
In some embodiments, the disclosure provides methods for discriminating between a first cancer condition and a second cancer condition in a human subject, where the first cancer condition is associated with infection by an oncogenic pathogen and the second cancer condition is associated with an oncogenic pathogen-free status. Generally, the methods include obtaining abundance data, e.g., relative expression levels, for a plurality of genes that are differentially expressed in cancerous tissue associated with one or more oncogenic pathogen infections and the same type of cancerous tissue that is not associated with an oncogenic pathogen infection. The abundance data is then input into a classifier that is trained to discriminate between the first cancer condition and the second cancer condition, at least in part, based on the abundance of the genes that are differentially expressed in the two types of cancerous tissues. Examples of the training of such classifiers are shown in
Many of the embodiments described below, in conjunction with
In some embodiments, these methods include obtaining (1302) a sample of the cancerous tissue. Methods for obtaining samples of cancerous tissue are known in the art and are dependent upon the type of cancer being sampled. For example, bone marrow biopsies and isolation of circulating tumor cells can be used to obtain samples of blood cancers; endoscopic biopsies can be used to obtain samples of cancers of the digestive tract, bladder, and lungs; needle biopsies (e.g., fine-needle aspiration, core needle aspiration, vacuum-assisted biopsy, and image-guided biopsy) can be used to obtain samples of subdermal tumors; skin biopsies (e.g., shave biopsy, punch biopsy, incisional biopsy, and excisional biopsy) can be used to obtain samples of dermal cancers; and surgical biopsies can be used to obtain samples of cancers affecting internal organs of a patient.
In some embodiments, mRNA is then isolated (1304) from the sample of the cancerous tissue. Many techniques for RNA isolation from a tissue sample are known in the art, for example, acid guanidinium thiocyanate-phenol-chloroform extraction (see, for example, Chomczynski and Sacchi, Nat Protoc, 1(2):581-85 (2006), the content of which is incorporated herein by reference, in its entirety, for all purposes) and silica bead/glass fiber adsorption (see, for example, Poeckh, T. et al., Anal Biochem., 373(2):253-62 (2008), the content of which is incorporated herein by reference, in its entirety, for all purposes). The selection of any particular RNA isolation technique for use in conjunction with the embodiments described herein is well within the skill of the person having ordinary skill in the art, who will consider the tissue type, the state of the tissue, e.g., fresh, frozen, formalin-fixed, paraffin-embedded (FFPE), and the type of nucleic acid analysis that is to be performed with the RNA sample.
In some embodiments, RNA is isolated from blood samples and/or tissue sections (e.g., a tumor biopsy) using commercially available reagents, for example, proteinase K, TURBO DNase-I, and/or RNA clean XP beads. In some embodiments, the isolated RNA is subjected to a quality control protocol to determine the concentration and/or quantity of the RNA molecules, including the use of a fluorescent dye and a fluorescence microplate reader, standard spectrofluorometer, or filter fluorometer.
In some embodiments, expression data is obtained directly from the isolated mRNA, e.g., by direct RNA sequencing (314). Methods for direct RNA sequencing are well known in the art. See, for example, Ozsolak F., et al., Nature 461:814-18 (2009), and Garalde, D.R., et al., Nat Methods, 15(3):201-206 (2018), the contents of which are incorporated herein by reference, in their entireties, for all purposes.
In other embodiments, expression data is obtained through a cDNA intermediate. Accordingly, in some embodiments, the isolated RNA is used to create a cDNA library via cDNA synthesis (310). In some embodiments, cDNA libraries are prepared from isolated RNA, and the resulting cDNA molecules are purified and size-selected using commercially available reagents, for example Roche KAPA Hyper Beads. In another example, a New England Biolabs (NEB) kit may be used.
In some embodiments, cDNA library preparation includes ligation of adapters onto the cDNA molecules. For example, UDI adapters, such as Roche SeqCap dual end adapters, or UMI adapters (for example, full length or stubby Y adapters) may be ligated to the cDNA molecules. Adapters are nucleic acid molecules that may serve as barcodes to identify cDNA molecules according to the sample from which they were derived and/or to facilitate the downstream bioinformatics processing and/or the next generation sequencing reaction. The sequence of nucleotides in the adapters may be specific to a sample in order to distinguish samples. The adapters may facilitate the binding of the cDNA molecules to anchor oligonucleotide molecules on the sequencer flow cell and may serve as a seed for the sequencing process by providing a starting point for the sequencing reaction.
cDNA libraries may be amplified and purified using reagents, for example, Axygen MAG PCR clean up beads. Then the concentration and/or quantity of the cDNA molecules may be quantified using a fluorescent dye and a fluorescence microplate reader, standard spectrofluorometer, or filter fluorometer.
In some embodiments, whether for direct RNA sequencing or for cDNA library construction, the isolated RNA is first enriched (1308) for a desired type of RNA (e.g., mRNA) or species (e.g., specific mRNA transcripts). Methods of enriching for desired RNA molecules are also well known in the art. For example, mRNA molecules can be enriched, e.g., relative to other RNA molecules in a total RNA preparation, using oligo-dT affinity techniques (see, for example, Rio, D.C., et al., Cold Spring Harb Protoc., 2010 Jul 1;2010(7), the content of which is incorporated herein by reference, in its entirety, for all purposes). Specific mRNA transcripts can also be isolated, e.g., using hybridization probes that specifically bind to one or more mRNA sequences of interest.
In some embodiments, cDNA libraries are pooled and treated with reagents to reduce off-target capture, for example Human COT-1 and/or IDT xGen Universal Blockers, before being dried in a vacufuge. Pools may then be resuspended in a hybridization mix, for example, IDT xGen Lockdown, and probes may be added to each pool, for example, IDT xGen Exome Research Panel v1.0 probes, IDT xGen Exome Research Panel v2.0 probes, other IDT probe panels, Roche probe panels, or other probes. Pools may be incubated in an incubator, PCR machine, water bath, or other temperature-modulating device to allow probes to hybridize. Pools may then be mixed with Streptavidin-coated beads or another means for capturing hybridized cDNA-probe molecules, especially cDNA molecules representing exons of the human genome. In another embodiment, polyA capture may be used. Pools may be amplified and purified once more using commercially available reagents, for example, the KAPA HiFi Library Amplification kit and Axygen MAG PCR clean up beads, respectively.
cDNA library construction from isolated mRNAs is also well known in the art. In some embodiments, cDNA library construction is performed by first-strand DNA synthesis from the isolated mRNA using a reverse transcriptase, followed by second-strand synthesis using a DNA polymerase. Example methods for cDNA synthesis are described in McConnell and Watson, 1986, FEBS Lett. 195(1-2), pp. 199-202; Lin and Ying, 2003, Methods Mol Biol. 221, pp. 129-143, and Oh et al., 2003, Exp Mol Med. 35(6), pp. 586-90, the contents of which are hereby incorporated herein by reference, in their entireties, for all purposes.
The cDNA library may also be analyzed to determine the fragment size of cDNA molecules, which may be done through gel electrophoresis techniques and may include the use of a device such as a LabChip GX Touch. Pools may be cluster amplified using a kit (for example, Illumina Paired-end Cluster Kits with PhiX-spike in). In one example, the cDNA library preparation and/or whole exome capture steps may be performed with an automated system, using a liquid handling robot (for example, a SciClone NGSx).
The library amplification may be performed on a device, for example, an Illumina C-Bot2, and the resulting flow cell containing amplified target-captured cDNA libraries may be sequenced on a next generation sequencer, for example, an Illumina HiSeq 4000 or an Illumina NovaSeq 6000 to a unique on-target depth selected by the user, for example, 300x, 400x, 500x, 10,000x, etc. The next generation sequencer may generate a FASTQ, BCL, or other file for each patient sample or each flow cell.
If two or more patient samples are processed simultaneously on the same sequencer flow cell, reads from multiple patient samples may be contained in the same BCL file initially and then divided into a separate FASTQ file for each patient. A difference in the sequence of the adapters used for each patient sample could serve the purpose of a barcode to facilitate associating each read with the correct patient sample and placing it in the correct FASTQ file.
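By way of non-limiting illustration, the association of sequence reads with patient samples by adapter sequence can be sketched as follows. This is a simplified sketch: it assumes, for illustration only, that the sample-specific barcode occupies a fixed number of leading bases of each read and that barcodes are matched exactly, whereas sequencer vendor software typically reads index sequences separately and may tolerate mismatches. The function name demultiplex, the barcode sequences, and the sample names are hypothetical.

```python
def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group reads by patient sample using an exact-match leading barcode.

    Reads whose barcode is not recognized are collected under 'undetermined'.
    The barcode is trimmed from each read before it is stored.
    """
    by_sample = {}
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample.setdefault(sample, []).append(insert)
    return by_sample


# Toy reads: the first four bases are the hypothetical sample barcode.
barcodes = {"AACC": "patient_1", "GGTT": "patient_2"}
reads = ["AACCTTGACA", "GGTTCATGCA", "AACCGGGTTT", "TTTTACGTAC"]
for sample, inserts in demultiplex(reads, barcodes, barcode_len=4).items():
    print(sample, inserts)
```

Each resulting group would correspond to the contents of one per-patient FASTQ file in the workflow described above.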
Methods for mRNA sequencing are well known in the art. In some embodiments, the mRNA sequencing is performed by whole exome sequencing (WES). Generally, WES is performed by isolating RNA from a tissue sample, optionally selecting for desired sequences and/or depleting unwanted RNA molecules, generating a cDNA library, and then sequencing the cDNA library (1312), for example, using next generation sequencing (NGS) techniques. For a review of the use of whole exome sequencing techniques in cancer diagnosis, see, Serratì et al., 2016, Onco Targets Ther. 9, pp. 7355-7365, the content of which is hereby incorporated herein by reference, in its entirety, for all purposes.
Next generation sequencing methods are also well known in the art, including synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators.
In some embodiments, the sequence reads may be aligned to a reference exome or reference genome using methods known in the art to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene. Non-limiting examples of well-known software for assembling and managing transcriptome information from RNA-seq data include TopHat and Cufflinks, see, Trapnell et al., 2012, Nat Protoc. 7(3), pp. 562-578, the content of which is hereby incorporated herein by reference, in its entirety, for all purposes. See, also, Hintzsche et al., 2016, Int J Genomics 7983236, the content of which is hereby incorporated herein by reference, in its entirety, for all purposes.
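By way of non-limiting illustration, the use of alignment position information can be sketched as follows. The sketch assumes, for illustration only, 1-based inclusive coordinates and an ungapped alignment, so that read length equals end position minus beginning position plus one. The function names, the gene names, and the coordinates are hypothetical.

```python
def aligned_length(begin, end):
    """Length of the aligned region, assuming 1-based inclusive coordinates."""
    return end - begin + 1


def overlapping_genes(begin, end, gene_regions):
    """Names of genes whose (start, end) region overlaps the aligned read."""
    return [
        gene
        for gene, (g_start, g_end) in gene_regions.items()
        if begin <= g_end and end >= g_start
    ]


# Toy gene regions on a single hypothetical reference sequence.
regions = {"GENE_A": (50, 120), "GENE_B": (200, 300)}
print(aligned_length(101, 175))              # 75
print(overlapping_genes(101, 175, regions))  # ['GENE_A']
```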
In other embodiments, expression data is generated by hybridization (1313) of the cDNA library, e.g., using a microarray. The use of microarray-based gene profiling to identify differential gene expression following pathogen infection is known in the art. For example, see, Adomas et al., 2008, Tree Physiol. 28(6), pp. 885-897, the content of which is hereby incorporated herein by reference, in its entirety, for all purposes. Similarly, in other embodiments, yet other methods for quantifying expression based on a cDNA library are used, for example, quantitative real-time PCR (RT-qPCR). See, for example, Wagner, 2013, Methods Mol Biol. 1027, pp. 19-45, the content of which is hereby incorporated herein by reference, in its entirety, for all purposes.
As illustrated with respect to
In some embodiments, the method includes obtaining a dataset for the subject, the dataset including a plurality of abundance values, where each respective abundance value in the plurality of abundance values quantifies a level of expression of a corresponding gene, in a plurality of genes, in a cancerous tissue from the subject. In some embodiments, the obtained abundance values are determined according to any of the methodologies described with respect to sub-method 1301. In some embodiments, the abundance data is pre-generated and communicated to computer system 1100 over a network, e.g., using network interface 1104. Method 1300 then includes inputting (1316) the dataset to a classifier trained for discriminating between a first cancer condition and a second cancer condition in a human subject, where the first cancer condition is associated with infection by an oncogenic pathogen and the second cancer condition is associated with an oncogenic pathogen-free status. Examples of such classifiers are provided above in conjunction with
In some embodiments, method 1300 also includes inputting a variant allele count for one or more variant alleles at one or more loci in the genome of the cancerous tissue from the subject into the classifier. That is, in some embodiments, the classifier is also trained against data relating to the presence or absence of one or more variant alleles in subjects with cancers that are either associated with an oncogenic pathogen infection or not associated with an oncogenic pathogen infection. In some embodiments, the one or more variant alleles are selected from variant alleles in a gene selected from the group consisting of TP53 (ENSG00000141510), CDKN2A (ENSG00000147889), and PIK3CA (ENSG00000121879).
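By way of non-limiting illustration, combining abundance values with variant allele counts into a single classifier input can be sketched as follows. The sketch assumes, for illustration only, that the classifier accepts a flat list of numbers, that the abundance values come first in the order of the discriminating gene set, and that a missing variant allele count is treated as zero. The function name build_features, the expression gene names, and all values are hypothetical; the variant gene names are those recited above.

```python
def build_features(abundance, variant_counts, gene_set,
                   variant_genes=("TP53", "CDKN2A", "PIK3CA")):
    """Concatenate abundance values and variant allele counts into one vector.

    A gene absent from variant_counts contributes a count of 0 (an assumption).
    """
    expression_part = [abundance[g] for g in gene_set]
    variant_part = [variant_counts.get(g, 0) for g in variant_genes]
    return expression_part + variant_part


# Toy subject: two hypothetical expression genes and one observed TP53 variant.
abundance = {"GENE_A": 7.5, "GENE_B": 2.5}
variant_counts = {"TP53": 1}
print(build_features(abundance, variant_counts, ["GENE_A", "GENE_B"]))
# [7.5, 2.5, 1, 0, 0]
```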
In some embodiments, the subject is afflicted with breast cancer, lung cancer, prostate cancer, colorectal cancer, renal cancer, uterine cancer, pancreatic cancer, esophagus cancer, head/neck cancer, ovarian cancer, hepatobiliary cancer, cervical cancer, thyroid cancer, or bladder cancer.
In some embodiments, the first cancer condition is associated with infection by a first oncogenic pathogen selected from Epstein-Barr virus (EBV), hepatitis B virus (HBV), hepatitis C virus (HCV), human papilloma virus (HPV), human T-cell lymphotropic virus (HTLV-1), Kaposi’s sarcoma-associated herpesvirus (KSHV), and Merkel cell polyomavirus (MCV).
More specifically, in some embodiments, the first cancer condition is selected from cervical cancer associated with human papilloma virus (HPV), head and neck cancer associated with HPV, gastric cancer associated with Epstein-Barr virus (EBV), nasopharyngeal cancer associated with EBV, Burkitt lymphoma associated with EBV, Hodgkin lymphoma associated with EBV, liver cancer associated with hepatitis B virus (HBV), liver cancer associated with hepatitis C virus (HCV), Kaposi sarcoma associated with Kaposi’s sarcoma-associated herpesvirus (KSHV), adult T-cell leukemia/lymphoma associated with human T-cell lymphotropic virus (HTLV-1), and Merkel cell carcinoma associated with Merkel cell polyomavirus (MCV). For a summary of cancer conditions known to be associated with an oncogenic viral infection, see, de Flora, 2011, “The prevention of infection-associated cancers,” Carcinogenesis 32, pp. 787-795.
Accordingly, when the first cancer condition is a particular type of cancer associated with a particular oncogenic pathogen, the second cancer condition is the same particular type of cancer associated with no infection of the particular oncogenic pathogen. For example, when the first cancer condition is cervical cancer associated with a human papilloma virus (HPV) infection, the second cancer condition is cervical cancer that is not associated with a human papilloma virus (HPV) infection. Further, as described above, the classifier used to discriminate between the two cancer conditions is trained against a dataset including at least gene abundance values (e.g., mRNA expression profiles) from subjects known to have cervical cancer associated with a human papilloma virus (HPV) infection and from subjects known to have cervical cancer that is not associated with a human papilloma virus (HPV) infection.
In some embodiments, the method further includes treating the subject with either a first therapy (1322) tailored for treatment of the first cancer condition, associated with the oncogenic pathogen infection, or a second therapy (1324) tailored for treatment of the second cancer condition, not associated with the oncogenic pathogen infection.
Accordingly, in one embodiment, a method is provided for treating a cancer in a human cancer patient. The method includes determining whether the patient is infected with an oncogenic pathogen linked to the pathology of the cancer by obtaining a dataset for the patient, the dataset including a plurality of abundance values, and inputting the dataset into a classifier trained to discriminate between at least a first cancer condition associated with an infection of the oncogenic pathogen and a second cancer condition that is not associated with an infection of the oncogenic pathogen. Each abundance value in the dataset quantifies a level of expression of a corresponding gene found to be differentially expressed in cancers associated with an infection of the oncogenic pathogen and cancers that are not associated with an infection of the oncogenic pathogen. In some embodiments, the genes for which abundance values are used to discriminate between cancer conditions for any particular type of cancer are selected according to any of the selection methodologies described above with reference to
In some embodiments, when the subject is determined to have a first cancer condition, associated with an oncogenic pathogen infection, the method includes assigning and/or administering immunotherapy to the subject. In some embodiments, when the subject is determined to have a second cancer condition, that is not associated with an oncogenic pathogen infection, the method includes assigning and/or administering chemotherapy to the subject.
As summarized in Table 20, several clinical trials are ongoing for the treatment of virally associated tumors. Accordingly, in some embodiments, the methods described herein include assigning and/or administering a treatment for a particular cancer associated with a particular oncogenic viral infection, as listed in Table 20. For example, in some embodiments, upon a determination that the subject has a phase 3 cervical cancer associated with an HPV infection, the subject is assigned and/or administered a therapeutically effective dosing regimen of axalimogene filolisbac, which is a live attenuated Listeria monocytogenes transfected with plasmids encoding the HPV-16 E7 protein fused to a truncated fragment of the Lm protein listeriolysin O.
In some embodiments, the methods described herein relate to classification and/or treatment of cancers known to be associated with a human papillomavirus (HPV) infection. As reported in Example 8 below, the twenty-four genes listed in Table 21, and shown in
In one embodiment, a method is provided for discriminating between a first cancer condition and a second cancer condition in a human subject, wherein the first cancer condition is associated with infection by a human papillomavirus (HPV) oncogenic virus and the second cancer condition is associated with an HPV-free status. The method includes obtaining a dataset for the subject, e.g., as described above with reference to
In some embodiments, the first cancer condition is cervical cancer associated with an HPV infection, and the second cancer condition is cervical cancer that is not associated with an HPV infection. In some embodiments, the first cancer condition is head and neck cancer associated with an HPV infection, and the second cancer condition is head and neck cancer that is not associated with an HPV infection. In some embodiments, the head and neck cancer is a specific form of head and neck cancer, e.g., hypopharyngeal cancer, laryngeal cancer, lip and oral cavity cancer, metastatic squamous neck cancer with occult primary, nasopharyngeal cancer, oropharyngeal cancer, paranasal sinus and nasal cavity cancer, or salivary gland cancer.
In some embodiments, the plurality of genes includes at least ten of the genes listed in Table 21. In some embodiments, the plurality of genes includes at least fifteen of the genes listed in Table 21. In some embodiments, the plurality of genes includes at least twenty of the genes listed in Table 21. In some embodiments, the plurality of genes includes all of the genes listed in Table 21. In some embodiments, the plurality of genes includes one or more genes that are not listed in Table 21, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more of the genes not listed in Table 21. In some embodiments, the plurality of genes includes no more than 20 genes. In some embodiments, the plurality of genes includes no more than 25 genes. In some embodiments, the plurality of genes includes no more than 50 genes. In some embodiments, the plurality of genes includes no more than 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, or 300 genes.
In some embodiments, the dataset also includes a variant allele count for one or more alleles at one or more loci in the genome of the cancerous tissue from the subject. In some embodiments, the variant allele count is either 1, representing a state in which the subject carries the variant allele, or 0, representing a state in which the subject does not carry the variant allele. In some embodiments, the variant allele is a germline variant, originating from the germ line of the subject. In some embodiments, the variant allele is a cancer-derived variant, originating from the cancerous tissue. In some embodiments, the variant allele is located in the TP53 (ENSG00000141510) or CDKN2A (ENSG00000147889) gene.
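By way of illustration only, a dataset combining abundance values with binary variant allele counts could be assembled as sketched below. The feature order, the function name `build_feature_vector`, and the numeric values are hypothetical and are not taken from the disclosure; only the 0/1 encoding and the TP53 and CDKN2A loci follow the text above.

```python
from typing import Dict, List, Set

# Hypothetical feature order: a handful of genes plus two variant-status flags.
GENE_FEATURES: List[str] = ["CDKN2A", "SMC1B", "EFNB3", "KCNS1", "CCND1"]
VARIANT_FEATURES: List[str] = ["TP53", "CDKN2A"]


def build_feature_vector(abundance: Dict[str, float],
                         variant_genes: Set[str]) -> List[float]:
    """Concatenate abundance values with 0/1 variant allele counts.

    abundance: gene symbol -> expression level (units are an assumption).
    variant_genes: genes in which the subject carries a variant allele.
    """
    missing = [g for g in GENE_FEATURES if g not in abundance]
    if missing:
        raise KeyError(f"missing abundance values for: {missing}")
    expression = [float(abundance[g]) for g in GENE_FEATURES]
    # 1 = subject carries the variant allele, 0 = subject does not.
    flags = [1.0 if g in variant_genes else 0.0 for g in VARIANT_FEATURES]
    return expression + flags


# Made-up values for a single subject who carries a TP53 variant.
example = build_feature_vector(
    {"CDKN2A": 7.2, "SMC1B": 3.1, "EFNB3": 0.4, "KCNS1": 2.0, "CCND1": 1.5},
    {"TP53"},
)
```

The resulting vector is what would be passed to a trained classifier; keeping the feature order fixed between training and prediction is the main design constraint.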
In some embodiments, the classifier is trained for determining the HPV status of a test subject having an HPV-associated cancer selected from cervical cancer, head and neck squamous cell carcinoma, ovarian cancer, penile cancer, pharyngeal cancer, anal cancer, vaginal cancer, and vulvar cancer. In some embodiments, the classifier is trained for determining the HPV status of a test patient having a specific HPV-associated cancer, e.g., cervical cancer, head and neck squamous cell carcinoma, ovarian cancer, penile cancer, pharyngeal cancer, anal cancer, vaginal cancer, or vulvar cancer. However, as classifier training is generally improved by increasing the size of the training dataset, in some embodiments, the classifier is trained against data from patients that have two or more types of HPV-associated cancers, e.g., two, three, four, five, six, seven, or all eight of cervical cancer, head and neck squamous cell carcinoma, ovarian cancer, penile cancer, pharyngeal cancer, anal cancer, vaginal cancer, and vulvar cancer. In a particular embodiment, exemplified by Example 8, the classifier is trained against subjects having either head and neck squamous cell carcinoma or cervical cancer. However, in some embodiments, a classifier trained against patients having one or more types of HPV-associated cancer is useful for determining the HPV status of a patient having a different type of HPV-associated cancer.
In some embodiments, the features of the classifier include abundance values for a plurality of genes selected from those listed in Table 21, e.g., KRT86, CRISPLD1, DSG1, SESN3, ADAMTS20, IRX1, SMC1B, CDKN2A, EFNB3, CXCL14, ZFR2, RNF212, MKRN3, SYCP2, MYL1, MYO3A, RNASE10, GALNT13, C19orf26, MUC4, PCDHGB1, CCND1, LCE1F, and KCNS1. As reported below, e.g., in reference to Example 8, these twenty-four genes were found to be differentially expressed, dependent upon the HPV status of the subject, in at least eight of the ten training sets formed from expression data of cervical or head and neck cancers with known HPV statuses in The Cancer Genome Atlas (TCGA). However, the skilled artisan will appreciate that, in some instances, the use of different training data sets may yield different results, e.g., one or more of these genes may not be informative in at least 80% of training folds and/or one or more genes found not to be informative in at least 80% of training folds in the study reported in Example 8 may be informative. These differences may arise, for example, when different criteria are used to select the training population, e.g., different inclusion and/or exclusion criteria such as cancer type, personal characteristics (e.g., age, gender, ethnicity, family history, smoking status, etc.), or simply by using a smaller or larger data set.
Accordingly, in some embodiments, the features of the classifier include at least five of the genes listed in Table 21. In some embodiments, the features of the classifier include at least ten of the genes listed in Table 21. In some embodiments, the features of the classifier include at least fifteen of the genes listed in Table 21. In some embodiments, the features of the classifier include at least twenty of the genes listed in Table 21. In some embodiments, the features of the classifier include all twenty-four of the genes listed in Table 21. In some embodiments, the features of the classifier include 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or all 24 of the genes listed in Table 21. Further, in some embodiments, the features of the classifier include the abundance values for one or more genes not listed in Table 21. In some embodiments, the features of the classifier include abundance values for 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more genes not listed in Table 21. In some embodiments, the features of the classifier include the abundance values for 1-10 genes not listed in Table 21. In some embodiments, the features of the classifier include the abundance values for 1-5 genes not listed in Table 21. In other embodiments, the features of the classifier do not include the abundance values for any genes not listed in Table 21.
Further, the skilled artisan will also appreciate that some features, e.g., abundance values for a particular gene, will be more informative than other features in a particular classifier. One measure of the predictive power of respective features in a classifier based on multiple features is the regression coefficient calculated for the features during training of the model. Regression coefficients describe the relationship between each feature and the response of the model. The coefficient value represents the mean change in the response given a one-unit increase in the feature value. As such, at least for variables of the same type, the magnitude, e.g., absolute value, of a regression coefficient is correlated with the importance of the feature in the model. That is, the higher the magnitude of the regression coefficient, the more important the variable is to the model. For instance, as reported in Example 8, in a particular support vector machine (SVM) classifier trained against the abundance values of all twenty-four of the genes listed in Table 21, as well as a variant allele status for the TP53 and CDKN2A genes, only six of the 24 genes had regression coefficients with magnitudes of at least 0.5: CDKN2A (1.13), SMC1B (1.02), EFNB3 (-0.97), KCNS1 (0.74), CCND1 (-0.65), and RNF212 (0.517).
As such, the skilled artisan may select a feature set that includes fewer than all of the genes listed in Table 21 based, at least in part, upon the importance of the respective features in one or more classification models. For instance, in some embodiments, one or more genes with lower predictive power in a classification model may be left out during classifier training. For example, in some embodiments, the features of the classifier include at least the gene expression features listed in Table 23 with a regression coefficient having a magnitude of at least 0.5, e.g., CDKN2A, SMC1B, EFNB3, KCNS1, CCND1, and RNF212. In some embodiments, the features of the classifier include at least the gene expression features listed in Table 23 with a regression coefficient having a magnitude of at least 0.4. In some embodiments, the features of the classifier include at least the gene expression features listed in Table 23 with a regression coefficient having a magnitude of at least 0.3. In some embodiments, the features of the classifier include at least the gene expression features listed in Table 23 with a regression coefficient having a magnitude of at least 0.2. In some embodiments, the features of the classifier include at least the gene expression features listed in Table 23 with a regression coefficient having a magnitude of at least 0.1.
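The threshold-based selection described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `select_by_magnitude` is hypothetical, and the dictionary holds only the coefficients quoted in this description rather than the full contents of Table 23.

```python
from typing import Dict, List

# Coefficients quoted in the description; the remaining Table 23 entries
# are not reproduced here.
COEFFICIENTS: Dict[str, float] = {
    "CDKN2A": 1.13,
    "SMC1B": 1.02,
    "EFNB3": -0.97,
    "KCNS1": 0.74,
    "CCND1": -0.65,
    "RNF212": 0.517,
    "CXCL14": -0.29,
}


def select_by_magnitude(coefs: Dict[str, float], threshold: float) -> List[str]:
    """Return genes whose coefficient magnitude meets the threshold,
    ordered from most to least influential."""
    ranked = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [gene for gene, coef in ranked if abs(coef) >= threshold]


selected = select_by_magnitude(COEFFICIENTS, 0.5)
relaxed = select_by_magnitude(COEFFICIENTS, 0.2)
```

Ranking on the absolute value matters here because EFNB3 and CCND1 carry negative coefficients yet are among the most informative features.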
Similarly, the size of the feature set may be affected by which features are included and/or excluded. For instance, in some embodiments, if particular features having high predictive power are included in a classification model, fewer total features may be included in the model. For instance, in some embodiments, if the abundance values for SMC1B, CDKN2A, and EFNB3 are included in the model, the abundance values for no more than two of the other genes whose abundance values are used as features in Table 23 need to be included in the model. Accordingly, in some embodiments, the features of the classifier include abundance values for SMC1B, CDKN2A, and EFNB3, and at least two other genes whose abundance values are used as features in Table 23. In some embodiments, the features of the classifier include abundance values for SMC1B, CDKN2A, and EFNB3, and at least five other genes whose abundance values are used as features in Table 23. In some embodiments, the features of the classifier include abundance values for SMC1B, CDKN2A, and EFNB3, and at least ten other genes whose abundance values are used as features in Table 23. In some embodiments, the features of the classifier include abundance values for SMC1B, CDKN2A, and EFNB3, and at least fifteen other genes whose abundance values are used as features in Table 23.
Similarly, in some embodiments, if features having high predictive power are excluded from the classification model, more of the other features may be included in the model. For instance, in some embodiments, if the abundance values for one or more of SMC1B, CDKN2A, and EFNB3 are not included in the model, the abundance values for at least fifteen of the other genes whose abundance values are used as features in Table 23 are included in the model. In some embodiments, if the abundance values for one or more of SMC1B, CDKN2A, and EFNB3 are not included in the model, the abundance values for at least twenty of the other genes whose abundance values are used as features in Table 23 are included in the model. In some embodiments, if the abundance values for one or more of SMC1B, CDKN2A, and EFNB3 are not included in the model, the abundance values for at least 15, 16, 17, 18, 19, 20, or all 21 of the other genes whose abundance values are used as features in Table 23 are included in the model.
Of course, other metrics are also available for evaluating the importance of a feature in a model, such as standardized regression coefficients and the change in R-squared when comparing the output of a model having the feature to the output of a model that is identical except that it lacks the feature.
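As one illustration of such a metric, a standardized regression coefficient can be obtained by rescaling a raw coefficient by the ratio of the feature's standard deviation to the response's standard deviation. The sketch below assumes a linear model with a continuous response; the function name and the toy values are hypothetical.

```python
from statistics import stdev
from typing import Sequence


def standardized_coefficient(raw_coef: float,
                             feature_values: Sequence[float],
                             response_values: Sequence[float]) -> float:
    """Rescale a raw coefficient to standard-deviation units so that
    features measured on different scales can be compared."""
    sd_y = stdev(response_values)
    if sd_y == 0:
        raise ValueError("response has zero variance")
    return raw_coef * stdev(feature_values) / sd_y


# Toy data: the response is exactly twice the feature.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
beta = standardized_coefficient(2.0, x, y)

# The same feature expressed in units 100 times smaller has a raw
# coefficient of 0.02 but the same standardized coefficient.
x_scaled = [v * 100.0 for v in x]
beta_scaled = standardized_coefficient(0.02, x_scaled, y)
```

The two results agree, which is the point of standardization: the measure does not depend on the units in which an abundance value happens to be recorded.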
When selecting a feature set, the skilled artisan will also consider the degree to which features are correlated to each other. Correlation is a statistical measure of how linearly dependent two variables are upon each other. As such, two correlated features provide duplicative information to a predictive model, which can be detrimental to a classifier. Accordingly, there are several reasons why a correlated feature may be excluded from a model. For instance, removing a correlated feature will make the algorithm faster, because the larger the number of features in a classifier, the more computations must be performed. Removing a correlated feature may also remove harmful bias, arising from the correlation, from a model. Finally, removing a correlated feature may make the model more interpretable.
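For illustration, pairwise correlation between features can be computed as a Pearson coefficient over the training samples, as in the sketch below. The gene names GENE_A, GENE_B, and GENE_C and their values are placeholders, not data from the disclosure, and the 0.6 cutoff is used only as an example threshold.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def correlated_pairs(expression: Dict[str, Sequence[float]],
                     threshold: float = 0.6) -> List[Tuple[str, str, float]]:
    """List feature pairs whose correlation magnitude meets the threshold."""
    genes = sorted(expression)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expression[g1], expression[g2])
            if abs(r) >= threshold:
                pairs.append((g1, g2, r))
    return pairs


# Placeholder abundance values across five hypothetical samples.
toy_expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0, 9.9],
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = correlated_pairs(toy_expression)
```

Here only GENE_A and GENE_B are flagged, since GENE_B is nearly a linear function of GENE_A while GENE_C is only weakly related to either.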
As such, the skilled artisan may select a feature set that includes fewer than all of the genes listed in Table 21 based, at least in part, upon the correlation between respective features in one or more classification models. In some embodiments, the selection to remove one or the other feature of a correlated feature set is informed by predictive powers of the two features, e.g., their respective regression coefficients. For example, the gene expression values for ENSG00000105278 (CXCL14) and ENSG00000077935 (SMC1B) are highly correlated in the feature set listed in Table 21 (correlation = 0.718983175). Accordingly, in some embodiments, the feature set omits one of CXCL14 and SMC1B. In some embodiments, CXCL14, rather than SMC1B, is excluded from the feature set because, as reported in Table 23, SMC1B has a higher regression coefficient (1.02) than CXCL14 (-0.29) in the SVM model described in Example 8.
As reported in Table 24, ten pairs of gene expression features have a correlation of at least 0.6. Accordingly, in some embodiments, a feature in at least one pair of features having a correlation of at least 0.6 is excluded from the model. In some embodiments, a feature in at least two pairs of features having a correlation of at least 0.6 is excluded from the model. In other embodiments, a feature in at least 3, 4, 5, 6, 7, 8, 9, or all 10 pairs of features having a correlation of at least 0.6 is excluded from the model. In some embodiments, an excluded feature is the feature in a pair of highly correlated features having the lower regression coefficient reported in Table 23. For instance, with reference to Table 24, the features having the lower regression coefficient in each highly correlated pair (e.g., corresponding to a correlation of at least 0.6) are:
However, in some embodiments, this selection process does not allow both features of a highly correlated pair of features to be excluded from the feature set, e.g., on the basis that both genes are the least informative feature in at least one of the highly correlated pairs of features. Thus, in some embodiments, one or more of SYCP2, MYO3A, and KCNS1 are not excluded from the feature set. Similarly, in some embodiments, this selection process does not allow highly informative features, e.g., features with regression coefficients of at least 0.5, to be excluded from the feature set. Thus, in some embodiments, one or both of RNF212 and KCNS1 are not excluded from the feature set.
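The selection rules discussed above (exclude the less informative member of a highly correlated pair, never exclude both members of a pair, and never exclude a highly informative feature) can be sketched as follows. This is one possible reading under stated assumptions, not the disclosed procedure: the function name is hypothetical, GENE_X and GENE_Y are placeholders with made-up values, and only the CXCL14/SMC1B figures are taken from this description.

```python
from typing import Dict, Iterable, Set, Tuple


def prune_correlated(pairs: Iterable[Tuple[str, str, float]],
                     coefs: Dict[str, float],
                     corr_threshold: float = 0.6,
                     keep_threshold: float = 0.5) -> Set[str]:
    """Return the set of features to exclude from the feature set."""
    excluded: Set[str] = set()
    for g1, g2, r in pairs:
        if abs(r) < corr_threshold:
            continue  # pair is not highly correlated
        weaker, stronger = sorted((g1, g2), key=lambda g: abs(coefs[g]))
        if stronger in excluded:
            continue  # never exclude both members of a pair
        if abs(coefs[weaker]) >= keep_threshold:
            continue  # never exclude a highly informative feature
        excluded.add(weaker)
    return excluded


# CXCL14/SMC1B values are quoted in the description; GENE_X/GENE_Y are
# placeholders added to exercise the "highly informative" guard.
example_pairs = [
    ("CXCL14", "SMC1B", 0.718983175),
    ("GENE_X", "GENE_Y", 0.65),
]
example_coefs = {"CXCL14": -0.29, "SMC1B": 1.02, "GENE_X": 0.55, "GENE_Y": 0.9}
excluded = prune_correlated(example_pairs, example_coefs)
```

In this toy run CXCL14 is excluded in favor of SMC1B, while GENE_X is retained despite being the weaker member of its pair because its coefficient magnitude exceeds the 0.5 guard.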
Accordingly, in one embodiment, the feature set includes abundance values for at least KRT86, CRISPLD1, SESN3, ADAMTS20, IRX1, SMC1B, CDKN2A, EFNB3, CXCL14, MKRN3, SYCP2, MYL1, MYO3A, RNASE10, GALNT13, C19orf26, MUC4, PCDHGB1, CCND1, LCE1F, and KCNS1.
Similarly, in one embodiment, the feature set includes abundance values for at least KRT86, CRISPLD1, SESN3, ADAMTS20, IRX1, SMC1B, CDKN2A, EFNB3, CXCL14, RNF212, MKRN3, MYL1, RNASE10, GALNT13, C19orf26, MUC4, PCDHGB1, CCND1, LCE1F, and KCNS1.
Similarly, in one embodiment, the feature set includes abundance values for at least KRT86, CRISPLD1, SESN3, ADAMTS20, IRX1, SMC1B, CDKN2A, EFNB3, CXCL14, RNF212, MKRN3, SYCP2, MYL1, MYO3A, RNASE10, GALNT13, C19orf26, MUC4, PCDHGB1, CCND1, LCE1F, and KCNS1.
In some embodiments, as described above referring to
In some embodiments, the classifier has a specificity of at least 70% and a sensitivity of at least 70% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 75% and a sensitivity of at least 75% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 80% and a sensitivity of at least 80% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 85% and a sensitivity of at least 85% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 90% and a sensitivity of at least 90% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 95% and a sensitivity of at least 95% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher for a validation data set of at least 50 data constructs. 
In some embodiments, the classifier has a sensitivity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher for a validation data set of at least 50 data constructs.
In some embodiments, the classifier has a specificity of at least 70% and a sensitivity of at least 70% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 75% and a sensitivity of at least 75% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 80% and a sensitivity of at least 80% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 85% and a sensitivity of at least 85% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 90% and a sensitivity of at least 90% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 95% and a sensitivity of at least 95% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher for a validation data set of at least 100 data constructs. 
In some embodiments, the classifier has a sensitivity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher for a validation data set of at least 100 data constructs.
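The performance criteria recited above can be checked against a held-out validation set as sketched below. The labels and predictions are synthetic, the function names are hypothetical, and label 1 is assumed to denote the pathogen-associated cancer condition.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction counts differ")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 0:
                tn += 1
            else:
                fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("validation set must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


def meets_criteria(y_true: Sequence[int], y_pred: Sequence[int],
                   min_sensitivity: float, min_specificity: float,
                   min_size: int = 50) -> bool:
    """True if the validation set is large enough and both metrics pass."""
    if len(y_true) < min_size:
        return False
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return sens >= min_sensitivity and spec >= min_specificity


# Synthetic validation set of 50 data constructs: 20 positive, 30 negative.
y_true = [1] * 20 + [0] * 30
y_pred = [1] * 18 + [0] * 2 + [0] * 27 + [1] * 3
sens, spec = sensitivity_specificity(y_true, y_pred)
```

For these labels, 18 of 20 positives and 27 of 30 negatives are classified correctly, so both metrics equal 90% and the 85%/85% criterion is met while the 95%/95% criterion is not.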
In some embodiments, the method further includes assigning therapy and/or administering therapy to the subject based on the classification of the cancer condition, e.g., based on whether or not the subject’s cancer is associated with an HPV viral infection.
Accordingly, in one embodiment, a method is provided for treating cervical cancer in a human cancer patient. The method includes determining whether the human cancer patient is infected with a human papillomavirus (HPV) oncogenic virus by obtaining a dataset for the human cancer patient, the dataset including a plurality of abundance values where each respective abundance value in the plurality of abundance values quantifies a level of expression of a corresponding gene, in a plurality of genes, and the plurality of genes includes at least five genes selected from the genes listed in Table 21. The method then includes inputting the dataset to a classifier trained to discriminate between at least a first cervical cancer condition associated with HPV infection and a second cervical cancer condition associated with an HPV-free status based on the abundance values of the plurality of genes, in a cancerous tissue of the subject. In some embodiments, the classifier is trained according to a methodology described above, referring to
In some embodiments, the plurality of genes includes at least ten of the genes listed in Table 21. In some embodiments, the plurality of genes includes at least fifteen of the genes listed in Table 21. In some embodiments, the plurality of genes includes at least twenty of the genes listed in Table 21. In some embodiments, the plurality of genes includes all of the genes listed in Table 21. In some embodiments, the plurality of genes includes one or more genes that are not listed in Table 21, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more of the genes not listed in Table 21. In some embodiments, the plurality of genes includes no more than 20 genes. In some embodiments, the plurality of genes includes no more than 25 genes. In some embodiments, the plurality of genes includes no more than 50 genes. In some embodiments, the plurality of genes includes no more than 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, or 300 genes.
In some embodiments, the dataset also includes a variant allele count for one or more alleles at one or more loci in the genome of the cancerous tissue from the subject. In some embodiments, the variant allele count is either 1, representing a state in which the subject carries the variant allele, or 0, representing a state in which the subject does not carry the variant allele. In some embodiments, the variant allele is a germline variant, originating from the germ line of the subject. In some embodiments, the variant allele is a cancer-derived variant, originating from the cancerous tissue. In some embodiments, the variant allele is located in the TP53 (ENSG00000141510) or CDKN2A (ENSG00000147889) gene.
In some embodiments, as described above referring to
In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection is a therapeutic vaccine. In some embodiments, the therapeutic vaccine is selected from axalimogene filolisbac (Advaxis), TG4001 (Transgene), GX-188E (Genexine), VGX-3100 (Inovio), MEDI-0457 (Inovio), INO-3106 (Inovio), TA-CIN (Cancer Research Technology), TA-HPV (Cancer Research Technology), ISA-101 (Isa), and PepCan (University of Arkansas).
In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection is an adoptive cell therapy. In some embodiments, adoptive cell therapy includes the administration of HPV-specific T cells, for example, as described for clinical trial ID NCT02379520 or NCT03197025 (Baylor College of Medicine).
In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection is an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is nivolumab (Bristol-Myers Squibb).
In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection is a PI3K inhibitor. In some embodiments, the PI3K inhibitor is AMG319 (Amgen) or BKM120 (Novartis).
Similarly, in one embodiment, a method is provided for treating head and neck cancer in a human cancer patient. The method includes determining whether the human cancer patient is infected with a human papillomavirus (HPV) oncogenic virus by obtaining a dataset for the human cancer patient, the dataset including a plurality of abundance values where each respective abundance value in the plurality of abundance values quantifies a level of expression of a corresponding gene, in a plurality of genes, and the plurality of genes includes at least five genes selected from the genes listed in Table 21. The method then includes inputting the dataset to a classifier trained to discriminate between at least a first head and neck cancer condition associated with HPV infection and a second head and neck cancer condition associated with an HPV-free status based on the abundance values of the plurality of genes, in a cancerous tissue of the subject. In some embodiments, the classifier is trained according to a methodology described above, referring to
In some embodiments, the plurality of genes includes at least ten of the genes listed in Table 21. In some embodiments, the plurality of genes includes at least fifteen of the genes listed in Table 21. In some embodiments, the plurality of genes includes at least twenty of the genes listed in Table 21. In some embodiments, the plurality of genes includes all of the genes listed in Table 21. In some embodiments, the plurality of genes includes one or more genes that are not listed in Table 21, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more of the genes not listed in Table 21. In some embodiments, the plurality of genes includes no more than 20 genes. In some embodiments, the plurality of genes includes no more than 25 genes. In some embodiments, the plurality of genes includes no more than 50 genes. In some embodiments, the plurality of genes includes no more than 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, or 300 genes.
In some embodiments, the dataset also includes a variant allele count for one or more alleles at one or more loci in the genome of the cancerous tissue from the subject. In some embodiments, the variant allele count is either 1, representing a state in which the subject carries the variant allele, or 0, representing a state in which the subject does not carry the variant allele. In some embodiments, the variant allele is a germline variant, originating from the germ line of the subject. In some embodiments, the variant allele is a cancer-derived variant, originating from the cancerous tissue. In some embodiments, the variant allele is located in the TP53 (ENSG00000141510) or CDKN2A (ENSG00000147889) gene.
In some embodiments, as described above referring to
In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection is a therapeutic vaccine. In some embodiments, the therapeutic vaccine is selected from axalimogene filolisbac (Advaxis), TG4001 (Transgene), GX-188E (Genexine), VGX-3100 (Inovio), MEDI-0457 (Inovio), INO-3106 (Inovio), TA-CIN (Cancer Research Technology), TA-HPV (Cancer Research Technology), ISA-101 (Isa), and PepCan (University of Arkansas).
In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection is an adoptive cell therapy. In some embodiments, adoptive cell therapy includes the administration of HPV-specific T cells, for example, as described for clinical trial ID NCT02379520 or NCT03197025 (Baylor College of Medicine).
In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection is an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is nivolumab (Bristol-Myers Squibb).
In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection is a PI3K inhibitor. In some embodiments, the PI3K inhibitor is AMG319 (Amgen) or BKM120 (Novartis).
In some embodiments, the present disclosure provides probes for binding, enriching, and/or detecting nucleic acid molecules, e.g., mRNA transcripts that are isolated from a cancerous tissue sample from a subject and/or cDNA molecules prepared from those mRNA transcripts, that are informative of whether the subject has a first cancer condition associated with an HPV oncogenic viral infection or a second cancer condition that is not associated with an HPV oncogenic viral infection. Generally, the probes include DNA, RNA, or a modified nucleic acid structure with a base sequence that is complementary to a nucleic acid molecule of interest. Accordingly, when the probe is designed to hybridize to an mRNA molecule isolated from the cancerous tissue, the probe will include a nucleic acid sequence that is complementary to the coding strand of the gene from which the transcript originated, i.e., the probe will include an antisense sequence of the gene. However, when the probe is designed to hybridize to a cDNA molecule, the probe can contain either a sequence that is complementary to the coding sequence of the gene of interest (an antisense sequence) or a sequence that is identical to the coding sequence of the gene of interest (a sense sequence), because the molecules in the cDNA library are double stranded.
In some embodiments, the probes include additional nucleic acid sequences that do not share any homology to the gene sequence of interest. For example, in some embodiments, the probes also include nucleic acid sequences containing an identifier sequence, e.g., a unique molecular identifier (UMI), e.g., that is unique to a particular cancerous tissue sample or cancer patient. Examples of identifier sequences are described, for example, in Kivioja et al., 2011, Nat. Methods 9(1), pp. 72-74 and Islam et al., 2014, Nat. Methods 11(2), pp. 163-66, the contents of which are hereby incorporated herein by reference, in their entireties, for all purposes. Similarly, in some embodiments, the probes also include primer nucleic acid sequences useful for amplifying the nucleic acid molecule of interest, e.g., using PCR. In some embodiments, the probes also include a capture sequence designed to hybridize to an anti-capture sequence for recovering the nucleic acid molecule of interest from the sample.
Likewise, in some embodiments, the probe includes a non-nucleic acid affinity moiety covalently attached to a nucleic acid molecule that is complementary to the gene of interest, for recovering the nucleic acid molecule of interest. Non-limiting examples of non-nucleic acid affinity moieties include biotin, digoxigenin, and dinitrophenol. In some embodiments, the probe is attached to a solid-state surface or particle, e.g., a dip-stick or magnetic bead, for recovering the nucleic acid of interest.
Accordingly, in one embodiment, the disclosure provides a plurality of nucleic acid probes for discriminating between a first cancer condition and a second cancer condition in a human subject, where the first cancer condition is associated with infection by a human papillomavirus (HPV) oncogenic virus and the second cancer condition is associated with an HPV-free status. The plurality of nucleic acid probes includes at least five nucleic acid probes, and each of the at least five nucleic acid probes includes a respective nucleic acid sequence that is identical or complementary to at least 10 consecutive bases of an RNA transcript of a different respective gene selected from the genes listed in Table 21.
In some embodiments, the plurality of nucleic acid probes includes at least ten probes with sequences that are complementary to or identical to sequences from different genes listed in Table 21. In some embodiments, the plurality of nucleic acid probes includes at least fifteen probes with sequences that are complementary to or identical to sequences from different genes listed in Table 21. In some embodiments, the plurality of nucleic acid probes includes at least twenty probes with sequences that are complementary to or identical to sequences from different genes listed in Table 21. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that are complementary to or identical to sequences from all of the genes listed in Table 21. In some embodiments, the plurality of nucleic acid probes includes 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24 probes with sequences that are complementary to or identical to sequences from different genes listed in Table 21.
In some embodiments, the plurality of nucleic acid probes includes one or more probes that bind to a sequence of a gene that is not listed in Table 21. In some embodiments, the plurality of nucleic acid probes includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more probes that bind to a sequence of a gene that is not listed in Table 21. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that bind to no more than 20 genes. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that bind to no more than 25 genes. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that bind to no more than 50 genes. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that bind to no more than 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, or 300 genes.
In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 15 consecutive bases of an RNA transcript of interest, e.g., a transcript from a gene listed in Table 21. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 30 consecutive bases of an RNA transcript of interest, e.g., a transcript from a gene listed in Table 21. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 50 consecutive bases of an RNA transcript of interest, e.g., a transcript from a gene listed in Table 21. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, or more consecutive bases of an RNA transcript of interest, e.g., a transcript from a gene listed in Table 21.
In some embodiments, the methods described herein relate to classification and/or treatment of cancers known to be associated with an Epstein-Barr virus (EBV) infection. As reported in Example 4, below, the twenty-four genes listed in Table 22, and shown in
In one embodiment, a method is provided for discriminating between a first cancer condition and a second cancer condition in a human subject, wherein the first cancer condition is associated with infection by an Epstein-Barr virus (EBV) oncogenic virus and the second cancer condition is associated with an EBV-free status. The method includes obtaining a dataset for the subject, e.g., as described above with reference to
In some embodiments, the plurality of genes includes all of the genes listed in Table 22. In some embodiments, the plurality of genes includes one or more genes that are not listed in Table 22, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more of the genes not listed in Table 22. In some embodiments, the plurality of genes includes no more than 20 genes. In some embodiments, the plurality of genes includes no more than 25 genes. In some embodiments, the plurality of genes includes no more than 50 genes. In some embodiments, the plurality of genes includes no more than 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, or 300 genes.
In some embodiments, the dataset also includes a variant allele count for one or more alleles at one or more loci in the genome of the cancerous tissue from the subject. In some embodiments, the variant allele count is either 1, representing a state in which the subject carries the variant allele, or 0, representing a state in which the subject does not carry the variant allele. In some embodiments, the variant allele is a germline variant, originating from the germ line of the subject. In some embodiments, the variant allele is a cancer-derived (somatic) variant, originating from the cancerous tissue. In some embodiments, the variant allele is located in the TP53 (ENSG00000141510) or PIK3CA (ENSG00000121879) gene.
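By way of non-limiting illustration, the combination of expression abundance values with binary variant allele counts can be represented as a single feature vector. The following is a minimal sketch under stated assumptions: the function names, the fixed ordering of features, and the placeholder expression gene names are hypothetical and are not taken from the disclosure; only the TP53 and PIK3CA gene names come from the text above.

```python
# Illustrative sketch only. Function names and feature ordering are
# hypothetical; TP53 and PIK3CA are the variant genes named in the text.
VARIANT_GENES = ("TP53", "PIK3CA")


def encode_variant_status(mutated_genes, variant_genes=VARIANT_GENES):
    """Return 1 for each gene in which a variant allele was called, else 0."""
    mutated = set(mutated_genes)
    return [1 if gene in mutated else 0 for gene in variant_genes]


def build_feature_vector(abundance, expression_genes, mutated_genes):
    """Concatenate abundance values (in a fixed gene order) with variant flags."""
    missing = [gene for gene in expression_genes if gene not in abundance]
    if missing:
        raise KeyError(f"missing abundance values for: {missing}")
    expression_part = [float(abundance[gene]) for gene in expression_genes]
    return expression_part + encode_variant_status(mutated_genes)


# Placeholder gene names and values, for illustration only.
example = build_feature_vector(
    abundance={"GENE_A": 3.5, "GENE_B": 0.0},
    expression_genes=["GENE_A", "GENE_B"],
    mutated_genes=["PIK3CA"],
)
```

Keeping the gene order fixed matters because a trained classifier associates each position in the vector with one feature.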
In some embodiments, the classifier is trained for determining the EBV status of a test subject having an EBV-associated cancer selected from Burkitt’s lymphoma, sinonasal angiocentric T-cell lymphoma, non-Hodgkin’s lymphoma, Hodgkin’s lymphoma, nasopharyngeal carcinoma, and gastric cancer. In some embodiments, the classifier is trained for determining the EBV status of a test patient having a specific EBV-associated cancer, e.g., Burkitt’s lymphoma, sinonasal angiocentric T-cell lymphoma, non-Hodgkin’s lymphoma, Hodgkin’s lymphoma, nasopharyngeal carcinoma, or gastric cancer. However, as classifier training is generally improved by increasing the size of the training dataset, in some embodiments, the classifier is trained against data from patients that have two or more types of EBV-associated cancers, e.g., two, three, four, five, or all six of Burkitt’s lymphoma, sinonasal angiocentric T-cell lymphoma, non-Hodgkin’s lymphoma, Hodgkin’s lymphoma, nasopharyngeal carcinoma, and gastric cancer. In a particular embodiment, exemplified by Example 4, the classifier is trained against patients having gastric cancer. However, in some embodiments, a classifier trained against patients having one or more types of EBV-associated cancer is useful for determining the EBV status of a patient having a different type of EBV-associated cancer.
In some embodiments, the features of the classifier include abundance values for a plurality of genes selected from those listed in Table 22, e.g., SCNN1A, CDX1, KCNK15, PRKCG, KRT7, NKD2, GPR158, CLDN3, and ZNF683. As reported below, e.g., in reference to Example 4, these nine genes were found to be differentially expressed, dependent upon the EBV status of the subject, in at least 80% of the gastric cancer training sets in The Cancer Genome Atlas (TCGA). However, the skilled artisan will appreciate that, in some instances, the use of different training data sets may yield different results, e.g., one or more of these genes may not be informative in at least 80% of training folds and/or one or more genes found not to be informative in at least 80% of training folds in the study reported in Example 4 may be informative. These differences may arise, for example, when different criteria are used to select the training population, e.g., different inclusion and/or exclusion criteria such as cancer type, personal characteristics (e.g., age, gender, ethnicity, family history, smoking status, etc.), or simply by using a smaller or larger data set.
Accordingly, in some embodiments, the features of the classifier include at least five of the genes listed in Table 22. In some embodiments, the features of the classifier include at least six of the genes listed in Table 22. In some embodiments, the features of the classifier include at least seven of the genes listed in Table 22. In some embodiments, the features of the classifier include at least eight of the genes listed in Table 22. In some embodiments, the features of the classifier include all nine of the genes listed in Table 22. Further, in some embodiments, the features of the classifier also include the abundance values for one or more genes not listed in Table 22. In some embodiments, the features of the classifier include the abundance value for 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more genes not listed in Table 22. In some embodiments, the features of the classifier include the abundance values for 1-10 genes not listed in Table 22. In some embodiments, the features of the classifier include 1-5 genes not listed in Table 22. In other embodiments, the features of the classifier do not include the abundance values for any genes not listed in Table 22.
Further, the skilled artisan will also appreciate that some features, e.g., abundance values for a particular gene, will be more informative than other features in a particular classifier. One measure of the predictive power of respective features in a classifier based on multiple features is the regression coefficient calculated for the features during training of the model. Regression coefficients describe the relationship between each feature and the response of the model. The coefficient value represents the mean change in the response given a one-unit increase in the feature value. As such, at least for variables of the same type, the magnitude, e.g., absolute value, of a regression coefficient is correlated with the importance of the feature in the model. That is, the higher the magnitude of the regression coefficient, the more important the variable is to the model. For instance, as reported in Example 4, in a particular support vector machine (SVM) classifier trained against the abundance values of all nine of the genes listed in Table 22, as well as a variant allele status for the TP53 and PIK3CA genes, only four of the nine genes had regression coefficients with magnitudes of at least 0.75: SCNN1A (-1.26), KCNK15 (-1.04), KRT7 (-0.94), and CLDN3 (-1.68).
As such, the skilled artisan may select a feature set that includes fewer than all of the genes listed in Table 22 based, at least in part, upon the importance of the respective features in one or more classification models. For instance, in some embodiments, one or more genes with lower predictive power in a classification model may be left out during classifier training. For example, in some embodiments, the features of the classifier include at least the gene expression features listed in Table 23 with a regression coefficient magnitude of at least 0.75, e.g., SCNN1A (-1.26), KCNK15 (-1.04), KRT7 (-0.94), and CLDN3 (-1.68). In some embodiments, the features of the classifier include at least the gene expression features listed in Table 23 with a regression coefficient magnitude of at least 0.6.
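By way of non-limiting illustration, selecting features by the magnitude of their regression coefficients can be sketched as follows. The four coefficient values are those quoted above for the Example 4 classifier; coefficients for the remaining genes are not reproduced here and are therefore omitted rather than guessed. The function name and the second threshold of 1.0 are hypothetical and serve only to show the effect of a stricter cutoff.

```python
# Illustrative sketch only. The coefficients below are the four values quoted
# in the text; the function name and the 1.0 threshold are hypothetical.
REPORTED_COEFFICIENTS = {
    "SCNN1A": -1.26,
    "KCNK15": -1.04,
    "KRT7": -0.94,
    "CLDN3": -1.68,
}


def select_by_magnitude(coefficients, threshold):
    """Return features with |coefficient| >= threshold, largest magnitude first."""
    kept = [(name, value) for name, value in coefficients.items()
            if abs(value) >= threshold]
    kept.sort(key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in kept]


at_least_0_75 = select_by_magnitude(REPORTED_COEFFICIENTS, 0.75)
at_least_1_00 = select_by_magnitude(REPORTED_COEFFICIENTS, 1.0)
```

The comparison uses the absolute value because all four reported coefficients are negative; comparing the signed values against a positive threshold would select nothing.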
Similarly, the size of the feature set may be affected by which features are included and/or excluded. For instance, in some embodiments, if particular features having high predictive power are included in a classification model, fewer total features may be included in the model. For instance, in some embodiments, if the abundance values for SCNN1A, KCNK15, KRT7, and CLDN3 are included in the model, the abundance values for no more than one of the other genes listed in Table 22 need to be included in the model. Accordingly, in some embodiments, the features of the classifier include abundance values for SCNN1A, KCNK15, KRT7, and CLDN3, and at least one other gene listed in Table 22. In some embodiments, the features of the classifier include abundance values for SCNN1A, KCNK15, KRT7, and CLDN3, and at least two other genes listed in Table 22. In some embodiments, the features of the classifier include abundance values for SCNN1A, KCNK15, KRT7, and CLDN3, and at least three other genes listed in Table 22. In some embodiments, the features of the classifier include abundance values for SCNN1A, KCNK15, KRT7, and CLDN3, and at least four other genes listed in Table 22.
Similarly, in some embodiments, if features having high predictive power are excluded from the classification model, more of the other features may be included in the model. For instance, in some embodiments, if the abundance values for one or more of SCNN1A, KCNK15, KRT7, and CLDN3 are not included in the model, the abundance values for at least four of the other genes listed in Table 22 are included in the model. In some embodiments, if the abundance values for one or more of SCNN1A, KCNK15, KRT7, and CLDN3 are not included in the model, the abundance values for all five of the other genes listed in Table 22 are included in the model.
Of course, other metrics are also available for evaluating the importance of a feature in a model, such as standardized regression coefficients and the change in R-squared when comparing the output of a model having the feature to the output of a model that is identical except that it lacks the feature.
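By way of non-limiting illustration, the two alternative importance metrics named above can be computed as follows. This is a minimal sketch assuming the conventional definitions (a standardized coefficient equals the raw coefficient multiplied by the ratio of the feature's standard deviation to the response's standard deviation, and R-squared equals one minus the ratio of residual to total sum of squares). The function names and the toy values are hypothetical.

```python
from statistics import stdev


def standardized_coefficient(coef, feature_values, response_values):
    """Rescale a raw coefficient to standard-deviation units: coef * sd(x) / sd(y)."""
    return coef * stdev(feature_values) / stdev(response_values)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


def delta_r_squared(observed, predicted_full, predicted_reduced):
    """Change in R-squared between a model with a feature and one lacking it."""
    return r_squared(observed, predicted_full) - r_squared(observed, predicted_reduced)
```

A larger drop in R-squared when a feature is removed suggests that the feature contributes more to the fit, although correlated features can share credit and make the drop for any one of them look small.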
When selecting a feature set, the skilled artisan will also consider the degree to which features are correlated to each other. Correlation is a statistical measure of how linearly dependent two variables are upon each other. As such, two correlated features provide duplicative information to a predictive model, which can be detrimental to a classifier. There are several reasons why a correlated feature may be excluded from a model. For instance, removing a correlated feature will make the algorithm faster, as the larger the number of features in a classifier the more computations that need to be made. Removing a correlated feature may also remove harmful bias, arising from the correlation, from a model. Finally, removing a correlated feature may make the model more interpretable. As such, the skilled artisan may select a feature set that includes fewer than all of the genes listed in Table 22 based, at least in part, upon the correlation between respective features in one or more classification models. For example, statistical analysis of the SVM model trained in Example 4 revealed that the gene expression values for ENSG00000135480 (KRT7) and ENSG00000124249 (KCNK15) were highly correlated (0.650). Accordingly, in some embodiments, the abundance value for one of KRT7 and KCNK15 is excluded from the feature set.
For example, in one embodiment, the feature set includes abundance values for at least SCNN1A, CDX1, KCNK15, PRKCG, NKD2, GPR158, CLDN3, and ZNF683. In another embodiment, the feature set includes abundance values for at least SCNN1A, CDX1, PRKCG, KRT7, NKD2, GPR158, CLDN3, and ZNF683.
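By way of non-limiting illustration, excluding one member of a correlated feature pair can be sketched as follows. This is a minimal sketch, not the analysis performed in Example 4: the expression values are toy numbers invented for the illustration, the function names and the greedy priority-ordered strategy are hypothetical, and the 0.65 threshold is chosen only because it matches the correlation value quoted above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no linear relationship
    return cov / math.sqrt(var_x * var_y)


def prune_correlated(features, threshold, priority):
    """Keep features in priority order, skipping any that correlate with a kept one."""
    kept = []
    for name in priority:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy expression values across five samples (invented for illustration).
toy = {
    "KRT7": [1, 2, 3, 4, 5],
    "KCNK15": [2, 4, 6, 8, 11],
    "CLDN3": [5, 1, 4, 2, 3],
}
keep_kcnk15 = prune_correlated(toy, 0.65, ["KCNK15", "KRT7", "CLDN3"])
keep_krt7 = prune_correlated(toy, 0.65, ["KRT7", "KCNK15", "CLDN3"])
```

The priority order decides which member of a correlated pair survives, so the two calls mirror the two alternative feature sets described above, one retaining KCNK15 and the other retaining KRT7.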
In some embodiments, as described above referring to
In some embodiments, the classifier has a specificity of at least 70% and a sensitivity of at least 70% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 75% and a sensitivity of at least 75% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 80% and a sensitivity of at least 80% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 85% and a sensitivity of at least 85% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 90% and a sensitivity of at least 90% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 95% and a sensitivity of at least 95% for a validation data set of at least 50 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher for a validation data set of at least 50 data constructs. 
In some embodiments, the classifier has a sensitivity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher for a validation data set of at least 50 data constructs.
In some embodiments, the classifier has a specificity of at least 70% and a sensitivity of at least 70% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 75% and a sensitivity of at least 75% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 80% and a sensitivity of at least 80% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 85% and a sensitivity of at least 85% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 90% and a sensitivity of at least 90% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 95% and a sensitivity of at least 95% for a validation data set of at least 100 data constructs, e.g., where none of the data constructs in the validation data set were used in the training of the classifier. In some embodiments, the classifier has a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher for a validation data set of at least 100 data constructs. 
In some embodiments, the classifier has a sensitivity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher for a validation data set of at least 100 data constructs.
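By way of non-limiting illustration, the sensitivity and specificity criteria recited above can be evaluated on a held-out validation data set as follows. This is a minimal sketch in which label 1 is assumed to denote the pathogen-associated condition and label 0 the pathogen-free condition; the function names and the toy labels are hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("validation set must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


def meets_criteria(true_labels, predicted_labels, min_sensitivity,
                   min_specificity, min_size):
    """Check performance thresholds on a validation set of at least min_size."""
    if len(true_labels) < min_size:
        return False
    sens, spec = sensitivity_specificity(true_labels, predicted_labels)
    return sens >= min_sensitivity and spec >= min_specificity


# Toy validation set: 20 positives (18 detected), 30 negatives (27 rejected).
truth = [1] * 20 + [0] * 30
calls = [1] * 18 + [0] * 2 + [0] * 27 + [1] * 3
```

With these toy labels both sensitivity and specificity are 90%, so the set satisfies a 90%/90% criterion at a minimum size of 50 data constructs but not at a minimum size of 100.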
In some embodiments, the method further includes assigning therapy and/or administering therapy to the subject based on the classification of the cancer condition, e.g., based on whether or not the subject’s cancer is associated with an EBV viral infection.
Accordingly, in one embodiment, a method is provided for treating gastric cancer in a human cancer patient. The method includes determining whether the human cancer patient is infected with an Epstein-Barr virus (EBV) oncogenic virus by obtaining a dataset for the human cancer patient, the dataset including a plurality of abundance values where each respective abundance value in the plurality of abundance values quantifies a level of expression of a corresponding gene, in a plurality of genes, and the plurality of genes includes at least five genes selected from the genes listed in Table 22. The method then includes inputting the dataset to a classifier trained to discriminate between at least a first gastric cancer condition associated with an EBV infection and a second gastric cancer condition associated with an EBV-free status based on the abundance values of the plurality of genes, in a cancerous tissue of the subject. In some embodiments, the classifier is trained according to a methodology described above, referring to
In some embodiments, the plurality of genes includes all of the genes listed in Table 22. In some embodiments, the plurality of genes includes one or more genes that are not listed in Table 22, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more of the genes not listed in Table 22. In some embodiments, the plurality of genes includes no more than 20 genes. In some embodiments, the plurality of genes includes no more than 25 genes. In some embodiments, the plurality of genes includes no more than 50 genes. In some embodiments, the plurality of genes includes no more than 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, or 300 genes.
In some embodiments, the dataset also includes a variant allele count for one or more alleles at one or more loci in the genome of the cancerous tissue from the subject. In some embodiments, the variant allele count is either 1, representing a state in which the subject carries the variant allele, or 0, representing a state in which the subject does not carry the variant allele. In some embodiments, the variant allele is a germline variant, originating from the germ line of the subject. In some embodiments, the variant allele is a cancer-derived (somatic) variant, originating from the cancerous tissue. In some embodiments, the variant allele is located in the TP53 (ENSG00000141510) or PIK3CA (ENSG00000121879) gene.
In some embodiments, as described above referring to
In some embodiments, the first therapy tailored for treatment of gastric cancer associated with an EBV infection is an adoptive cell therapy. In some embodiments, the adoptive cell therapy is ATA 129 (Atara), EBVST (Tessa), or CMD-003 (Cell Medica).
In some embodiments, the first therapy tailored for treatment of gastric cancer associated with an EBV infection is an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is pembrolizumab (Merck) or nivolumab (Bristol-Myers Squibb).
In some embodiments, the first therapy tailored for treatment of gastric cancer associated with an EBV infection is a BTK inhibitor. In some embodiments, the BTK inhibitor is ibrutinib (Pharmacyclics).
In some embodiments, the present disclosure provides probes for binding, enriching, and/or detecting nucleic acid molecules, e.g., mRNA transcripts that are isolated from a cancerous tissue sample from a subject and/or cDNA molecules prepared from those mRNA transcripts, that are informative of whether the subject has a first cancer condition associated with an EBV oncogenic viral infection or a second cancer condition that is not associated with an EBV oncogenic viral infection. Generally, the probes include DNA, RNA, or a modified nucleic acid structure with a base sequence that is complementary to a nucleic acid molecule of interest. Accordingly, when the probe is designed to hybridize to an mRNA molecule isolated from the cancerous tissue, the probe will include a nucleic acid sequence that is complementary to the coding strand of the gene from which the transcript originated, e.g., the probe will include an antisense sequence of the gene. However, when the probe is designed to hybridize to a cDNA molecule, the probe can contain either a sequence that is complementary to the coding sequence of the gene of interest (an antisense sequence) or a sequence that is identical to the coding sequence of the gene of interest (a sense sequence), because the molecules in the cDNA library are double stranded.
In some embodiments, the probes include additional nucleic acid sequences that do not share any homology to the gene sequence of interest. For example, in some embodiments, the probes also include nucleic acid sequences containing an identifier sequence, e.g., a unique molecular identifier (UMI), e.g., that is unique to a particular cancerous tissue sample or cancer patient. Examples of identifier sequences are described, for example, in Kivioja et al., 2011, Nat. Methods 9(1), pp. 72-74 and Islam et al., 2014, Nat. Methods 11(2), pp. 163-66, the contents of which are incorporated herein by reference, in their entireties, for all purposes. Similarly, in some embodiments, the probes also include primer nucleic acid sequences useful for amplifying the nucleic acid molecule of interest, e.g., using PCR. In some embodiments, the probes also include a capture sequence designed to hybridize to an anti-capture sequence for recovering the nucleic acid molecule of interest from the sample.
Likewise, in some embodiments, the probe includes a non-nucleic acid affinity moiety covalently attached to a nucleic acid molecule that is complementary to the gene of interest, for recovering the nucleic acid molecule of interest. Non-limiting examples of non-nucleic acid affinity moieties include biotin, digoxigenin, and dinitrophenol. In some embodiments, the probe is attached to a solid-state surface or particle, e.g., a dip-stick or magnetic bead, for recovering the nucleic acid of interest.
Accordingly, in one embodiment, the disclosure provides a plurality of nucleic acid probes for discriminating between a first cancer condition and a second cancer condition in a human subject, where the first cancer condition is associated with infection by an Epstein-Barr virus (EBV) oncogenic virus and the second cancer condition is associated with an EBV-free status. The plurality of nucleic acid probes includes at least five nucleic acid probes, and each of the at least five nucleic acid probes includes a respective nucleic acid sequence that is identical or complementary to at least 10 consecutive bases of an RNA transcript of a different respective gene selected from the genes listed in Table 22.
In some embodiments, the plurality of nucleic acid probes includes at least ten probes with sequences that are complementary to or identical to sequences from different genes listed in Table 22. In some embodiments, the plurality of nucleic acid probes includes 2, 3, 4, 5, 6, 7, 8, or 9 probes with sequences that are complementary to or identical to sequences from different genes listed in Table 22.
In some embodiments, the plurality of nucleic acid probes includes one or more probes that bind to a sequence of a gene that is not listed in Table 22. In some embodiments, the plurality of nucleic acid probes includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more probes that bind to a sequence of a gene that is not listed in Table 22. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that bind to no more than 20 genes. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that bind to no more than 25 genes. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that bind to no more than 50 genes. In some embodiments, the plurality of nucleic acid probes includes probes with sequences that bind to no more than 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, or 300 genes.
In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 15 consecutive bases of an RNA transcript of interest, e.g., a transcript from a gene listed in Table 22. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 30 consecutive bases of an RNA transcript of interest, e.g., a transcript from a gene listed in Table 22. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 50 consecutive bases of an RNA transcript of interest, e.g., a transcript from a gene listed in Table 22. In some embodiments, each probe in the plurality of probes includes a nucleic acid sequence that is identical or complementary to at least 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, or more consecutive bases of an RNA transcript of interest, e.g., a transcript from a gene listed in Table 22.
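By way of non-limiting illustration, the criterion that a probe be identical or complementary to at least a given number of consecutive bases of an RNA transcript can be checked computationally. The following is a minimal sketch, not the disclosed probe design procedure: the function names and the example sequences are hypothetical, and complementarity is assumed to mean reverse complementarity in the usual antiparallel sense.

```python
# Illustrative sketch only; names and sequences are hypothetical.
_COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA or RNA sequence, in DNA letters."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_shared_run(probe, transcript):
    """Length of the longest run of consecutive bases common to both sequences."""
    probe = probe.upper().replace("U", "T")
    transcript = transcript.upper().replace("U", "T")
    best = 0
    prev = [0] * (len(transcript) + 1)
    for p in probe:
        curr = [0] * (len(transcript) + 1)
        for j, t in enumerate(transcript, start=1):
            if p == t:
                curr[j] = prev[j - 1] + 1  # extend the run ending one base earlier
                best = max(best, curr[j])
        prev = curr
    return best


def probe_matches(probe, transcript, min_bases=10):
    """True if the probe is identical or complementary to >= min_bases in a row."""
    sense = longest_shared_run(probe, transcript)
    antisense = longest_shared_run(reverse_complement(probe), transcript)
    return max(sense, antisense) >= min_bases


# Hypothetical transcript fragment and a 15-base sense probe drawn from it.
TRANSCRIPT = "AUGGCCAAGCUUGGAUCCGAUCGA"
SENSE_PROBE = "GCCAAGCTTGGATCC"
ANTISENSE_PROBE = reverse_complement(SENSE_PROBE)
```

An antisense probe of this kind would hybridize to the mRNA itself, while either orientation could recover a double-stranded cDNA copy, which is why the check accepts both.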
In some embodiments, the methods and systems described herein are performed in conjunction with sequencing of RNA molecules isolated from a biological sample of a patient. In some embodiments, a FASTQ file, or equivalent file format, of the sequencing data is the output of such a sequencing reaction.
In some embodiments, each FASTQ file contains reads that may be paired-end or single reads, and may be short-reads or long-reads, where each read shows one detected sequence of nucleotides in an mRNA molecule that was isolated from the patient sample, inferred by using the sequencer to detect the sequence of nucleotides contained in a cDNA molecule generated from the isolated mRNA molecules during library preparation. Each read in the FASTQ file is also associated with a quality rating. The quality rating may reflect the likelihood that an error occurred during the sequencing procedure that affected the associated read.
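By way of non-limiting illustration, reading FASTQ records and interpreting their quality ratings can be sketched as follows. This is a minimal sketch assuming the common four-line record layout and Phred+33 ASCII encoding, in which a quality score Q corresponds to an estimated error probability of 10^(-Q/10); other encodings exist, and the function names and the example record are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a four-line-per-record stream."""
    while True:
        header = handle.readline()
        if header == "":
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record")
        yield header[1:].strip(), sequence, quality


def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical single-record FASTQ text.
EXAMPLE = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

Paired-end data would typically arrive as two such files, or as interleaved records, read in step with one another.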
Each FASTQ file may be processed by a bioinformatics pipeline. In various embodiments, the bioinformatics pipeline may filter FASTQ data. Filtering FASTQ data may include correcting sequencer errors and removing (trimming) low quality sequences or bases, adapter sequences, contaminations, chimeric reads, overrepresented sequences, biases caused by library preparation, amplification, or capture, and other errors. Entire reads, individual nucleotides, or multiple nucleotides that are likely to have errors may be discarded based on the quality rating associated with the read in the FASTQ file, the known error rate of the sequencer, and/or a comparison between each nucleotide in the read and one or more nucleotides in other reads that have been aligned to the same location in the reference genome. Filtering may be done in part or in its entirety by various software tools. FASTQ files may be analyzed for rapid assessment of quality control and reads, for example, by a sequencing data QC software such as AfterQC, Kraken, RNA-SeQC, FastQC (see Illumina, BaseSpace Labs or https://www.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/fastqc.html), or another similar software program. For paired-end reads, reads may be merged.
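By way of non-limiting illustration, one simple form of quality-based filtering, trimming low-quality bases from the 3' end of a read and discarding reads that become too short, can be sketched as follows. This is a minimal sketch and not the behavior of any of the named software tools; the function names, the quality cutoff of 20, and the minimum length of 3 are hypothetical values chosen for the illustration.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Drop bases from the 3' end until a base with quality >= min_quality is found."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def filter_reads(reads, min_quality=20, min_length=3):
    """Trim each (sequence, qualities) read and keep those still >= min_length."""
    kept = []
    for sequence, qualities in reads:
        trimmed_seq, trimmed_qual = trim_3prime(sequence, qualities, min_quality)
        if len(trimmed_seq) >= min_length:
            kept.append((trimmed_seq, trimmed_qual))
    return kept


# Hypothetical reads with already-decoded integer quality scores.
reads = [
    ("ACGTAC", [30, 30, 30, 30, 10, 5]),   # low-quality tail is trimmed
    ("GGCA", [35, 8, 6, 4]),               # too short after trimming
]
filtered = filter_reads(reads)
```

Adapter removal, error correction, and contamination screening are separate steps that this sketch does not attempt.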
For each FASTQ file, each read in the file may be aligned to the location in the reference genome having a sequence that best matches the sequence of nucleotides in the read. There are many software programs designed to align reads, for example, Bowtie, Burrows Wheeler Aligner (BWA), programs that use a Smith-Waterman algorithm, etc. Alignment may be directed using a reference genome (for example, GRCh38, hg38, GRCh37, other reference genomes developed by the Genome Reference Consortium, etc.) by comparing the nucleotide sequences in each read with portions of the nucleotide sequence in the reference genome to determine the portion of the reference genome sequence that is most likely to correspond to the sequence in the read. The alignment may take RNA splice sites into account. The alignment may generate a SAM file, which stores the locations of the start and end of each read in the reference genome and the coverage (number of reads) for each nucleotide in the reference genome. The SAM files may be converted to BAM files, BAM files may be sorted, and duplicate reads may be marked for deletion.
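By way of non-limiting illustration, the scoring recurrence underlying a Smith-Waterman local alignment can be sketched as follows. This is a minimal sketch using a linear gap penalty; the function name and the scoring values (match +2, mismatch -1, gap -2) are hypothetical. It computes only the best local alignment score, whereas the aligners named above rely on indexed data structures and additional heuristics to handle whole genomes and splice junctions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_matching_region(read, regions):
    """Return the name of the candidate reference region with the highest score."""
    return max(regions, key=lambda name: smith_waterman_score(read, regions[name]))


# Hypothetical candidate reference regions.
candidates = {"region_1": "TTTTACGTACGTTTTT", "region_2": "GGGGGGGGCCCCCCCC"}
```

Choosing the region with the highest score mirrors the idea of assigning each read to the reference location whose sequence best matches it.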
In one example, kallisto software may be used for alignment and RNA read quantification (see Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525-527 (2016), doi:10.1038/nbt.3519). In an alternative embodiment, RNA read quantification may be conducted using another software, for example, Sailfish or Salmon (see Rob Patro, Stephen M. Mount, and Carl Kingsford (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature Biotechnology (doi:10.1038/nbt.2862) or Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.). These RNA-seq quantification methods may not require alignment. There are many software packages that may be used for normalization, quantitative analysis, and differential expression analysis of RNA-seq data.
For each gene, the raw RNA read count for a given gene may be calculated. The raw read counts may be saved in a tabular file for each sample, where columns represent genes and each entry represents the raw RNA read count for that gene. In one example, kallisto alignment software calculates raw RNA read counts as a sum of the probability, for each read, that the read aligns to the gene. Raw counts are therefore not integers in this example.
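As a minimal sketch of why such raw counts need not be integers, the following example sums, for each gene, the per-read probabilities of assignment. The per-read probabilities, the gene names, and the function name `expected_counts` are hypothetical; the actual estimation performed by kallisto is more involved and is not reproduced here.

```python
from collections import defaultdict


def expected_counts(read_assignments):
    """Sum, per gene, the probability that each read originates from that gene.

    `read_assignments` is a list with one dict per read, mapping a gene name
    to the probability that the read belongs to that gene.
    """
    totals = defaultdict(float)
    for probabilities in read_assignments:
        for gene, probability in probabilities.items():
            totals[gene] += probability
    return dict(totals)


# Hypothetical toy input: three reads, two of which are ambiguous.
toy_reads = [
    {"GENE_A": 1.0},
    {"GENE_A": 0.5, "GENE_B": 0.5},
    {"GENE_A": 0.75, "GENE_B": 0.25},
]
counts = expected_counts(toy_reads)
```

In this toy case the three reads contribute fractional counts to both genes, and the per-gene totals still sum to the number of reads.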
Raw RNA read counts may then be normalized to correct for GC content and gene length, for example, using full quantile normalization and adjusted for sequencing depth, for example, using the size factor method. In one example, RNA read count normalization is conducted according to the methods disclosed in U.S. Pat. App. No. 16/581,706 or PCT19/52801, titled Methods of Normalizing and Correcting RNA Expression Data and filed Sep. 24, 2019, which are incorporated by reference herein in their entirety. The rationale for normalization is that the number of copies of each cDNA molecule in the sequencer may not reflect the distribution of mRNA molecules in the patient sample. For example, during library preparation, amplification, and capture steps, certain portions of mRNA molecules may be over- or under-represented due to artifacts that arise during various aspects of priming of reverse transcription caused by random hexamers, amplification (PCR enrichment), rRNA depletion, and probe binding, and due to errors produced during sequencing that may be attributable to the GC content, read length, gene length, and other characteristics of sequences in each nucleic acid molecule. Each raw RNA read count for each gene may be adjusted to eliminate or reduce over- or under-representation caused by any biases or artifacts of NGS sequencing protocols. Normalized RNA read counts may be saved in a tabular file for each sample, where columns represent genes and each entry represents the normalized RNA read count for that gene.
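For illustration, the sequencing-depth adjustment can be sketched as below, assuming that "the size factor method" takes the commonly used median-of-ratios form; that assumption, the function names, and the toy counts are not taken from the incorporated applications. The sketch addresses sequencing depth only and does not perform the GC content or gene length correction.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    `counts` maps a sample name to a list of raw counts, with genes in the
    same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean per gene, using only genes observed in every sample.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append((g, sum(math.log(v) for v in values) / len(values)))
    if not log_geo_means:
        raise ValueError("no gene has a positive count in every sample")
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm for g, lgm in log_geo_means]
        factors[s] = math.exp(median(log_ratios))
    return factors


def normalize(counts):
    """Divide every raw count by the size factor of its sample."""
    factors = size_factors(counts)
    return {s: [v / factors[s] for v in values] for s, values in counts.items()}


# Hypothetical toy data: sample_2 was sequenced at twice the depth of sample_1.
toy_counts = {"sample_1": [10, 20, 30, 0], "sample_2": [20, 40, 60, 5]}
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

After normalization, genes whose raw counts differed only because of sequencing depth have equal values across the two toy samples.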
A transcriptome value set may refer to either normalized RNA read counts or raw RNA read counts, as described above.
In some embodiments, the results of the classification described above, e.g., of whether or not the subject is afflicted with a particular oncogenic pathogen, are used to further classify a cancer status of the subject. For instance, in some embodiments, additional types of information derived from the same biological sample, a different biological sample for the individual, and/or a personal survey of the subject, are combined with the classification results to provide diagnosis, prognosis, or treatment recommendations for the subject. These additional types of information can include one or more of genomic information (e.g., sequencing information such as germline or cancer variant allele identification, copy number variation, chromosomal aberration data, etc.), exome information (e.g., gene expression data), epigenetic information (e.g., methylation data, and histone modification data), proteomic information (e.g., protein expression data), metabolome information (e.g., data on the metabolism of the subject), and personal characteristics (e.g., age, weight, smoking status, familial disease history, etc.). For instance, as shown in
Methods for classifying the cancer status of an individual are known in the art. For instance, U.S. Provisional Application Serial No. 62/855,750, filed May 31, 2019, and incorporated by reference herein, describes various methods for combining different types of data about a subject in order to classify the cancer status of the subject. In some embodiments, the methods for detecting the presence of an oncogenic pathogen described herein are combined with any of the methods for classifying the cancer status of a subject, as described in USSN 62/855,750.
In some embodiments, the methods for detecting the presence of an oncogenic pathogen described herein are integrated (5150) with a test to determine whether the subject has a type of cancer. In some embodiments, the test determines whether the subject has a type of cancer selected from one or more of breast cancer, lung cancer, prostate cancer, colorectal cancer, renal cancer, uterine cancer, pancreatic cancer, esophagus cancer, head/neck cancer, ovarian cancer, hepatobiliary cancer, cervical cancer, thyroid cancer, or bladder cancer. In some embodiments, the test determines a likelihood that the subject has a particular type of cancer, e.g., a likelihood that the subject has breast cancer, lung cancer, prostate cancer, colorectal cancer, renal cancer, uterine cancer, pancreatic cancer, esophagus cancer, head/neck cancer, ovarian cancer, hepatobiliary cancer, cervical cancer, thyroid cancer, or bladder cancer.
In some embodiments, the methods for detecting the presence of an oncogenic pathogen described herein are integrated with a test to classify a stage of a cancer in the subject, e.g., whether the subject’s cancer is stage I, stage II, stage III, or stage IV cancer. In some embodiments, the test determines the stage of a breast cancer, lung cancer, prostate cancer, colorectal cancer, renal cancer, uterine cancer, pancreatic cancer, esophagus cancer, head/neck cancer, ovarian cancer, hepatobiliary cancer, cervical cancer, thyroid cancer, or bladder cancer.
In some embodiments, the methods for detecting the presence of an oncogenic pathogen described herein are integrated with a test to classify a prognosis for a cancer in a subject, e.g., a survival rate without treatment, a survival rate with treatment, a disease-free survival rate, a cancer recurrence rate, etc. In some embodiments, the prognosis is a 1-year, 2-year, 3-year, 4-year, 5-year, or 10-year prognosis, e.g., a ten-year disease-free survival rate.
In some embodiments, the methods for detecting the presence of an oncogenic pathogen described herein are integrated with a test to determine a recommended treatment for a cancer in a subject. In some embodiments, the recommended treatment is dependent upon whether or not the subject is afflicted with a particular oncogenic pathogen. Examples of such conditional therapies are provided below in conjunction with
In some embodiments, when the subject is determined to have a first cancer condition, associated with an oncogenic pathogen infection, the method includes assigning and/or administering immunotherapy to the subject. In some embodiments, when the subject is determined to have a second cancer condition, that is not associated with an oncogenic pathogen infection, the method includes assigning and/or administering chemotherapy to the subject.
As summarized in Table 3, several clinical trials are ongoing for the treatment of virally associated tumors. Accordingly, in some embodiments, the methods described herein include assigning and/or administering a treatment for a particular cancer associated with a particular oncogenic viral infection, as listed in Table 3. For example, in some embodiments, upon a determination that the subject has a phase 3 cervical cancer associated with an HPV infection, the subject is assigned and/or administered a therapeutically effective dosing regimen of axalimogene filolisbac, which is a live attenuated Listeria monocytogenes transfected with plasmids encoding the HPV-16E7 protein fused to a truncated fragment of the Lm protein listeriolysin O.
Similarly, in one embodiment, a method is provided for treating cervical cancer in a human cancer patient. The method includes determining whether the human cancer patient is infected with a human papillomavirus (HPV) oncogenic virus by using a sequence read computational subtraction process described herein. The method then includes assigning or administering treatment for the cervical cancer, based on whether or not the subject is afflicted with an HPV oncogenic virus. When it is determined that the human cancer patient is infected with an HPV oncogenic virus, a first therapy is assigned or administered that is tailored for treatment of cervical cancer associated with an HPV infection. When it is determined that the human cancer patient is not infected with an HPV oncogenic virus, a second therapy is assigned or administered that is tailored for treatment of cervical cancer not associated with an HPV infection.
In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection is a therapeutic vaccine. In some embodiments, the therapeutic vaccine is selected from axalimogene filolisbac (Advaxis), TG4001 (Transgene), GX-188E (Genexine), VGX-3100 (Inovio), MEDI-0457 (Inovio), INO-3106 (Inovio), TA-CIN (Cancer Research Technology), TA-HPV (Cancer Research Technology), ISA-101 (Isa), and PepCan (University of Arkansas).
In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection is an adoptive cell therapy. In some embodiments, adoptive cell therapy includes the administration of HPV-specific T cells, for example, as described for clinical trial ID NCT02379520 or NCT03197025 (Baylor College of Medicine).
In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection is an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is nivolumab (Bristol-Myers Squibb).
In some embodiments, the first therapy tailored for treatment of cervical cancer associated with an HPV infection is a PI3K inhibitor. In some embodiments, the PI3K inhibitor is AMG319 (Amgen) or BKM120 (Novartis).
Similarly, in one embodiment, a method is provided for treating head and neck cancer in a human cancer patient. The method includes determining whether the human cancer patient is infected with a human papillomavirus (HPV) oncogenic virus by using a sequence read computational subtraction process described herein. The method then includes assigning or administering treatment for the head and neck cancer, based on whether or not the subject is afflicted with an HPV oncogenic virus. When it is determined that the human cancer patient is infected with an HPV oncogenic virus, a first therapy is assigned or administered that is tailored for treatment of head and neck cancer associated with an HPV infection. When it is determined that the human cancer patient is not infected with an HPV oncogenic virus, a second therapy is assigned or administered that is tailored for treatment of head and neck cancer not associated with an HPV infection.
In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection is a therapeutic vaccine. In some embodiments, the therapeutic vaccine is selected from axalimogene filolisbac (Advaxis), TG4001 (Transgene), GX-188E (Genexine), VGX-3100 (Inovio), MEDI-0457 (Inovio), INO-3106 (Inovio), TA-CIN (Cancer Research Technology), TA-HPV (Cancer Research Technology), ISA-101 (Isa), and PepCan (University of Arkansas).
In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection is an adoptive cell therapy. In some embodiments, adoptive cell therapy includes the administration of HPV-specific T cells, for example, as described for clinical trial ID NCT02379520 or NCT03197025 (Baylor College of Medicine).
In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection is an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is nivolumab (Bristol-Myers Squibb).
In some embodiments, the first therapy tailored for treatment of head and neck cancer associated with an HPV infection is a PI3K inhibitor. In some embodiments, the PI3K inhibitor is AMG319 (Amgen) or BKM120 (Novartis).
In some embodiments, the method further includes assigning therapy and/or administering therapy to the subject based on the classification of the cancer condition, e.g., based on whether or not the subject’s cancer is associated with an EBV viral infection.
Accordingly, in one embodiment, a method is provided for treating gastric cancer in a human cancer patient. The method includes determining whether the human cancer patient is infected with an Epstein-Barr virus (EBV) oncogenic virus by using a sequence read computational subtraction process described herein. The method then includes assigning or administering treatment for the gastric cancer, based on whether or not the subject is afflicted with an EBV oncogenic virus. When it is determined that the human cancer patient is infected with an EBV oncogenic virus, a first therapy is assigned or administered that is tailored for treatment of gastric cancer associated with an EBV infection. When it is determined that the human cancer patient is not infected with an EBV oncogenic virus, a second therapy is assigned or administered that is tailored for treatment of gastric cancer not associated with an EBV infection.
In some embodiments, the first therapy tailored for treatment of gastric cancer associated with an EBV infection is an adoptive cell therapy. In some embodiments, the adoptive cell therapy is ATA 129 (Atara), EBVST (Tessa), or CMD-003 (Cell Medica).
In some embodiments, the first therapy tailored for treatment of gastric cancer associated with an EBV infection is an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is pembrolizumab (Merck) or nivolumab (Bristol-Myers Squibb).
In some embodiments, the first therapy tailored for treatment of gastric cancer associated with an EBV infection is a BTK inhibitor. In some embodiments, the BTK inhibitor is ibrutinib (Pharmacyclics).
In some embodiments, the method further includes assigning therapy and/or administering therapy to the subject based on the classification of the cancer condition, e.g., based on whether or not the subject’s cancer is associated with a Merkel cell polyomavirus (MCPyV) infection.
Accordingly, in one embodiment, a method is provided for treating a carcinoma in a human cancer patient. The method includes determining whether the human cancer patient is infected with a Merkel cell polyomavirus (MCPyV) oncogenic virus by using a sequence read computational subtraction process described herein. The method then includes assigning or administering treatment for the carcinoma, based on whether or not the subject is afflicted with a MCPyV oncogenic virus. When it is determined that the human cancer patient is infected with a MCPyV oncogenic virus, a first therapy is assigned or administered that is tailored for treatment of Merkel cell carcinoma associated with a MCPyV infection. When it is determined that the human cancer patient is not infected with a MCPyV oncogenic virus, a second therapy is assigned or administered that is tailored for treatment of carcinoma not associated with a MCPyV infection.
In some embodiments, the treatment tailored to Merkel cell carcinoma is determined based on the stage of the Merkel cell carcinoma. For instance, the National Cancer Institute recommends treating stage I or stage II Merkel cell carcinoma by surgery to remove the tumor, with or without lymph node dissection, and radiation therapy after surgery. In contrast, the National Cancer Institute recommends treating stage III Merkel cell carcinoma by one or more of wide local excision with or without lymph node dissection, radiation therapy, immunotherapy for tumors that cannot be removed by surgery, e.g., immune checkpoint inhibitor therapy using pembrolizumab, a chemotherapy being evaluated in a clinical trial for Merkel cell carcinoma, and an immunotherapy being evaluated in a clinical trial for Merkel cell carcinoma, e.g., nivolumab. Similarly, the National Cancer Institute recommends treating stage IV Merkel cell carcinoma by one or more of immunotherapy, e.g., immune checkpoint inhibitor therapy using pembrolizumab or avelumab, chemotherapy, surgery or radiation therapy as palliative treatment to relieve symptoms and improve quality of life, and an immunotherapy being evaluated in a clinical trial for Merkel cell carcinoma, e.g., nivolumab and ipilimumab. Accordingly, in some embodiments, particularly when the cancer is classified as stage III or stage IV cancer, when it is determined that the human cancer patient is afflicted with a MCPyV oncogenic virus, the patient is assigned or administered immune checkpoint inhibitor therapy, for example an anti-PD-1 (e.g., nivolumab, pembrolizumab, or cemiplimab), an anti-PD-L1 (e.g., atezolizumab, avelumab, or durvalumab), or an anti-CTLA-4 (e.g., ipilimumab) monoclonal antibody, and when it is determined that the human cancer patient is not afflicted with a MCPyV oncogenic virus, a therapy is assigned or administered that does not include immune checkpoint inhibitor therapy.
In some embodiments, the methods described herein further include generating (5132) a clinical report for the subject, the clinical report indicating whether the subject is afflicted with an oncogenic pathogen in the plurality of oncogenic pathogens, e.g., using patient reporting module 160.
In some embodiments, the status of the cancer condition is selected from cervical cancer associated with human papillomavirus (HPV), head and neck cancer associated with HPV, gastric cancer associated with Epstein-Barr virus (EBV), nasopharyngeal cancer associated with EBV, Burkitt lymphoma associated with EBV, Hodgkin lymphoma associated with EBV, liver cancer associated with hepatitis B virus (HBV), liver cancer associated with hepatitis C virus (HCV), Kaposi sarcoma associated with Kaposi’s sarcoma-associated herpesvirus (KSHV), adult T-cell leukemia/lymphoma associated with human T-cell lymphotropic virus (HTLV-1), and Merkel cell carcinoma associated with Merkel cell polyomavirus (MCPyV). For a summary of cancer conditions known to be associated with an oncogenic pathogen infection, see, for example, De Flora, Carcinogenesis 32:787-95 (2011), which is incorporated herein by reference.
In some embodiments, the subject has cancer, and the clinical report further indicates a type of the cancer, where the indicated type of the cancer is dependent upon whether the subject is afflicted with an oncogenic pathogen in the plurality of oncogenic pathogens (5134). In some embodiments, the type of cancer is selected from breast cancer, lung cancer, prostate cancer, colorectal cancer, renal cancer, uterine cancer, pancreatic cancer, esophagus cancer, head/neck cancer, ovarian cancer, hepatobiliary cancer, cervical cancer, thyroid cancer, or bladder cancer. For example, in one embodiment, when the subject (i) has a B-cell lymphoma and (ii) is afflicted with Epstein-Barr virus, the clinical report indicates that the type of cancer is Epstein-Barr virus-positive mucocutaneous ulcer (EBVMCU) (5136). Similarly, approximately 10-15% of all cases of diffuse large B-cell lymphoma (DLBCL) are associated with the Epstein-Barr virus (EBV). Accordingly, in one embodiment, when the subject (i) has DLBCL and (ii) is afflicted with Epstein-Barr virus, the clinical report indicates that the type of cancer is Epstein-Barr virus-positive DLBCL (EBV+ DLBCL).
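As a non-limiting sketch of how a reported cancer type may depend on pathogen status, the two examples above can be expressed as a lookup keyed on the base diagnosis and the detected pathogen. The table `REFINED_TYPE`, the function `reported_cancer_type`, and the string labels are hypothetical and are not the interface of patient reporting module 160.

```python
# Hypothetical rule table built only from the two examples given above.
REFINED_TYPE = {
    ("B-cell lymphoma", "EBV"): "Epstein-Barr virus-positive mucocutaneous ulcer (EBVMCU)",
    ("DLBCL", "EBV"): "EBV+ DLBCL",
}


def reported_cancer_type(base_type, detected_pathogens):
    """Return the cancer type to print on the report.

    Falls back to the base diagnosis when no rule matches any detected pathogen.
    """
    for pathogen in sorted(detected_pathogens):
        refined = REFINED_TYPE.get((base_type, pathogen))
        if refined is not None:
            return refined
    return base_type


ebv_positive = reported_cancer_type("DLBCL", {"EBV"})
ebv_negative = reported_cancer_type("DLBCL", set())
```

A rule table of this kind keeps the dependency between pathogen status and reported type explicit, and leaves the base diagnosis unchanged whenever no pathogen-specific rule applies.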
Other, non-limiting examples of oncogenic pathogens that are known to be associated with specific cancers, such that detection of nucleic acid sequences from these pathogens informs a cancer diagnosis, are shown in Table 1. For additional information on known associations between oncogenic pathogens and cancers see, for example, De Flora and Bonanni, 2011, “The prevention of infection-associated cancers,” Carcinogenesis 32(6), pp. 787-795, which is hereby incorporated by reference.
In some embodiments, the subject has metastatic cancer, and the clinical report further indicates a primary origin of the metastatic cancer, where the indicated primary origin of the metastatic cancer is dependent upon whether the subject is afflicted with an oncogenic pathogen in the plurality of oncogenic pathogens (5138). For example, in some embodiments, when the subject (i) has metastatic squamous cell carcinoma (SCC) and (ii) is afflicted with human papillomavirus, the clinical report indicates that the primary origin of the metastatic cancer is the oropharynx (5140). Another example where the association of an oncogenic pathogen with the cancer informs assignment of the primary origin of the cancer is the presence of HPV in any gynecological cancer, which indicates that the primary origin of the cancer is the cervix. Similarly, the presence of Merkel cell polyomavirus in a melanoma indicates that the primary origin of the cancer is a Merkel cell.
In some embodiments, the subject has cancer, and the clinical report further indicates a recommended treatment modality for the cancer, where the recommended treatment modality for the cancer is dependent upon whether the subject is afflicted with an oncogenic pathogen in the plurality of oncogenic pathogens (5142). For example, Epstein-Barr virus (EBV) is associated with approximately 10-15% of all cases of diffuse large B-cell lymphoma (DLBCL). Expression studies of EBV+ and EBV- DLBCL cases show that many genes associated with pathways that are targeted in various cancer therapies (e.g., NF-κB targets, cell cycle regulation genes, anti-apoptosis genes, tumor progression genes, cell proliferation genes, immune response genes, pro-apoptotic genes, etc.) are differentially regulated in EBV+ DLBCL, relative to EBV- DLBCL. Accordingly, it has been proposed that EBV+ and EBV- DLBCL should be treated differently (see, for example, Ok C.Y., et al., Blood, 122(3):328-40, which is incorporated herein by reference). Accordingly, in some embodiments, the subject has lymphoma, and the clinical report indicates: when the subject is determined not to be afflicted with human papillomavirus, that the recommended therapy modality is a chemotherapy or an immunotherapy; and when the subject is determined to be afflicted with human papillomavirus, that the recommended therapy modality is anti-viral therapy (5144). In some embodiments, the subject has lymphoma, and the clinical report indicates: when the subject is determined not to be afflicted with H. pylori, that the recommended therapy modality is a chemotherapy or an immunotherapy; and when the subject is determined to be afflicted with H. pylori, that the recommended therapy modality is antibiotics (5146).
In another embodiment, the subject has gastric cancer, and the clinical report indicates that when the subject is afflicted with EBV, the recommended therapy is immunotherapy (e.g., immune checkpoint inhibitor therapy), and when the subject is not afflicted with EBV, the recommended therapy is chemotherapy (e.g., docetaxel, doxorubicin hydrochloride, 5-fluorouracil, fluorouracil, trifluridine and tipiracil hydrochloride, mitomycin C). In yet other embodiments, the recommended treatment modality for a subject afflicted with an oncogenic pathogen is selected from the combination of those diagnoses and treatments shown above in Table 3. Generally, current treatment guidelines for various cancers are maintained by various organizations, including the National Cancer Institute and Merck & Co., in the Merck Manual.
Further, several bacterial species, although not known to contribute to the development of cancer, have been found to confer resistance against specific cancer therapies. For instance, certain bacteria (e.g., Serratia marcescens) express enzymes (e.g., the long isoform of cytidine deaminase) capable of metabolizing gemcitabine into an inactive form. See, for instance, Geller LT et al., Science, 357(6356):1156-60 (2017), which is hereby incorporated by reference. Similarly, certain bacteria (e.g., Bacteroides fragilis) were found to interfere with the efficacy of immune checkpoint inhibitors, such as anti-CTLA-4 monoclonal antibodies. Accordingly, in some embodiments, following identification of a nucleic acid sequence from a bacterium known to confer resistance against a specific cancer therapy, the report generated for the subject indicates that a treatment modality other than the cancer therapy inhibited by the identified bacterium is recommended.
In some embodiments, the subject has cancer, and the clinical report further indicates a prognosis for the cancer, where the prognosis for the cancer is dependent upon whether the subject is afflicted with an oncogenic pathogen in the plurality of oncogenic pathogens (5148). For instance, in some embodiments the cancer can be effectively treated by eradicating the underlying oncogenic pathogen infection. In such cases, the prognosis for the cancer patient may be better than for a similar cancer that is not being driven by affliction with an oncogenic pathogen. In contrast, in some embodiments, a cancer associated with an oncogenic pathogen is not as readily treatable as a similar cancer that is not associated with an oncogenic pathogen. In such cases, the prognosis for the cancer patient may be worse than for a cancer patient that is not afflicted with the oncogenic pathogen. As an example of the former case, survival rates for oropharyngeal squamous cell carcinoma (OSCC) associated with HPV are much higher than for OSCC that is not associated with HPV.
In some embodiments, in addition to detecting oncogenic pathogens, the systems and methods described herein can also detect non-oncogenic pathogens. For example, in some embodiments, the systems and methods described herein can be used to detect a pathogen that causes an acute disorder, for example, respiratory illnesses (for example, SARS-CoV-1, SARS-CoV-2, MERS-CoV, Coronavirus HKU1, Coronavirus NL63, Coronavirus 229E, Coronavirus OC43, Influenza A, Influenza A H1, Influenza A H1-2009, Influenza A H1N1, Influenza A H3, Influenza B, Influenza C, Parainfluenza virus 1, Parainfluenza virus 2, Parainfluenza virus 3, Parainfluenza virus 4, Rhinovirus/Enterovirus, Adenovirus, Respiratory Syncytial Virus, Respiratory Syncytial Virus A, Respiratory Syncytial Virus B, Human Metapneumovirus, Bocavirus, Human Bocavirus, Chlamydophila pneumoniae, Mycoplasma pneumoniae, Legionella pneumophila, Bordetella, Bordetella holmesii, Bordetella pertussis, Streptococcus pneumoniae, Coxiella burnetii, Staphylococcus aureus, Klebsiella pneumoniae, Moraxella catarrhalis, Haemophilus influenzae, Pneumocystis jirovecii, Enterovirus D68, Epstein-Barr virus (EBV), Mumps, Measles, Cytomegalovirus, Human herpesvirus 6 (HHV-6), Varicella zoster virus (VZV), Parechovirus, etc.), gastroenteritis (for example, norovirus, rotavirus, Escherichia coli/E. coli, Salmonella, Campylobacter, parasites, etc.), meningitis (for example, Streptococcus pneumoniae, Neisseria meningitidis, Haemophilus influenzae type B/Hib), viral hemorrhagic fever (for example, arenaviruses, bunyaviruses, filoviruses, flaviviruses, etc.), cholera (Vibrio cholerae), malaria (including Plasmodium falciparum, P. vivax, P. ovale, P. malariae, P. knowlesi), tuberculosis (including Mycobacterium tuberculosis), measles (including paramyxovirus), pertussis (including Bordetella pertussis), etc.
In some embodiments, the systems and methods described herein can be used to detect a pathogen associated with a chronic disease or other type of disease, for example, hepatitis B virus, hepatitis C virus, human immunodeficiency virus (HIV), pathogens associated with liver disease (including hepatitis A, B, C, D, E virus), Lyme disease, tuberculosis, sexually transmitted diseases, antibiotic resistant bacteria (MRSA, C. difficile), etc. In some embodiments, a method described herein is performed to determine whether a subject is afflicted with an oncogenic pathogen and, at the same time, whether the subject is afflicted with a pathogen that causes an acute disorder or chronic disease. In this fashion, detection of a non-oncogenic pathogen in a sample from a subject with cancer can be reported as an incidental finding. For example, in some embodiments, such a report would alert a physician treating the subject that sequence reads of the pathogen unrelated to the cancer were detected and the patient may need additional testing to confirm the infection. This could catch chronic infections at an early stage, give the patient more treatment options, avoid organ failure and/or compromised immune system in the patient, etc.
Table 27 provides taxonomic identifiers for some of the respiratory pathogens listed above. The taxonomic identifiers can be used to find nucleic acid (genetic) sequences associated with these pathogens in one of several publicly available databases, such as the NCBI Virus database accessible online at ncbi.nlm.nih.gov/labs/virus/vssi/#/. In various embodiments, the diagnostic test used to detect the presence of a pathogen may detect portions of a genetic sequence associated with the pathogen.
The Cancer Genome Atlas (TCGA) is a publicly available dataset comprising more than two petabytes of genomic data for over 11,000 cancer patients, including clinical information about the cancer patients, metadata about the samples (e.g., the weight of a sample portion, etc.) collected from such patients, histopathology slide images from sample portions, and molecular information derived from the samples (e.g., mRNA/miRNA expression, protein expression, copy number, etc.). The TCGA dataset includes data on 33 different cancers: breast (breast ductal carcinoma, breast lobular carcinoma), central nervous system (glioblastoma multiforme, lower grade glioma), endocrine (adrenocortical carcinoma, papillary thyroid carcinoma, paraganglioma & pheochromocytoma), gastrointestinal (cholangiocarcinoma, colorectal adenocarcinoma, esophageal cancer, liver hepatocellular carcinoma, pancreatic ductal adenocarcinoma, and stomach cancer), gynecologic (cervical cancer, ovarian serous cystadenocarcinoma, uterine carcinosarcoma, and uterine corpus endometrial carcinoma), head and neck (head and neck squamous cell carcinoma, uveal melanoma), hematologic (acute myeloid leukemia, thymoma), skin (cutaneous melanoma), soft tissue (sarcoma), thoracic (lung adenocarcinoma, lung squamous cell carcinoma, and mesothelioma), and urologic (chromophobe renal cell carcinoma, clear cell kidney carcinoma, papillary kidney carcinoma, prostate adenocarcinoma, testicular germ cell cancer, and urothelial bladder carcinoma).
In order to test the viral detection method described herein, sequencing data was generated from total nucleic acid isolated from a tumor biopsy of a cervical cancer patient. Briefly, tumor total nucleic acid was extracted from formalin-fixed paraffin-embedded (FFPE) tumor tissue sections that were proteinase K digested. Total nucleic acid was extracted using a source-specific magnetic bead protocol. Total nucleic acid was utilized for all DNA library construction. RNA was purified from the total nucleic acid by DNaseI digestion and magnetic bead purification. Nucleic acids were quantified using commercial DNA or RNA quantification kits.
One hundred nanograms (ng) of isolated DNA was mechanically sheared to an average size of 200 base pairs (bp) using an ultrasonicator. DNA libraries were then prepared using a commercial DNA library preparation kit (e.g., a KAPA Hyper Prep Kit), and hybridized to a targeted probe set (e.g., similar to the probe set shown in
The 65 million sequence reads were then aligned to a human reference genome using the Scalable Nucleotide Alignment Program (SNAP) sequence alignment algorithm (Zaharia M., et al., arXiv:1111.5572v1 [cs.DS] 23 Nov. 2011, the content of which is incorporated by reference herein), which was completed in 383 seconds. Parameters and statistics for the alignment, as described in Zaharia et al., are shown in Table 4, below. Of the 65 million sequence reads, 93,781 reads were not aligned to the reference human genome.
The 93,781 reads that were not mapped to the human reference genome were then aligned to a comprehensive bacterial genome database (curated by the NCBI) using SNAP. This process took 517 seconds. In contrast, aligning all 65 million of the original sequence reads would have taken nearly 100 hours at the same rate. The 93,781 reads that were not mapped to the human reference genome were also aligned to a comprehensive viral genome database (curated by the NCBI) using SNAP. This process took 152 seconds. In contrast, aligning all 65 million of the original sequence reads would have taken nearly 30 hours at the same rate. Parameters and statistics for the alignment, as described in Zaharia et al., are shown in Tables 5 and 6, below.
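The projected run times above follow from scaling the measured time for the 93,781 unaligned reads to all of the original reads. The arithmetic can be checked with the short sketch below, which assumes exactly 65,000,000 total reads (the text gives only "65 million") and a constant per-read alignment rate; the function name `projected_hours` is hypothetical.

```python
TOTAL_READS = 65_000_000   # assumed exact value for "65 million"
UNALIGNED_READS = 93_781   # reads not aligned to the human reference


def projected_hours(seconds_for_subset,
                    subset_reads=UNALIGNED_READS,
                    total_reads=TOTAL_READS):
    """Scale a measured alignment time to the full read set, in hours."""
    seconds_per_read = seconds_for_subset / subset_reads
    return seconds_per_read * total_reads / 3600


bacterial_hours = projected_hours(517)  # bacterial database alignment
viral_hours = projected_hours(152)      # viral database alignment
speedup = TOTAL_READS / UNALIGNED_READS
```

Under these assumptions the projections come to roughly 99.5 hours for the bacterial database and roughly 29.3 hours for the viral database, consistent with the "nearly 100 hours" and "nearly 30 hours" figures, and the subtraction step reduces the pathogen alignment workload by a factor of about 690.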
The species of each aligned bacterial and viral sequence was determined and the number of sequence reads from each species was totaled. The final sequence read counts for each species identified are shown below in Tables 7 and 8.
Acidovorax_delafieldii
Bacteroides_fragilis
Bradyrhizobium_sp._STM_3809
Burkholderia_mallei
Candidatus_Pelagibacter_ubique
Corynebacterium_bovis
Cutibacterium_acnes
Escherichia_coli
Gordonia_alkanivorans
Mesorhizobium_alhagi
Mesorhizobium_amorphae
Microbacterium_laevaniformans
Micrococcus_luteus
Propionibacterium_sp._409-HC1
Propionibacterium_sp._434-HC2
Pseudomonas_aeruginosa
Pseudomonas_amygdali
Sphingomonas_sp._KC8
Sphingomonas_sp._S17
Staphylococcus_warneri
Verminephrobacter_aporrectodeae
Xanthomonas_citri
As shown in Table 8, the method identified 15429 Human papillomavirus (HPV) reads, 3982 Alphapapillomavirus 7 reads, and 148 Escherichia virus phiX174 reads, in addition to a low level of three other viruses: Enterobacteria phage phiX174 sensu lato, Escherichia virus alpha3, and Escherichia virus phiK. Because the number of reads for the former, but not the latter, group of viruses satisfied a predetermined threshold of at least 10 sequence reads, the cervical cancer is characterized as afflicted with Human papillomavirus (HPV) and Alphapapillomavirus 7 viral infections. Notably, Human papillomavirus (HPV) and Alphapapillomavirus 7 are known to be associated with human cancers, such that this information could be used to inform treatment of the cervical cancer. The Escherichia virus phiX174 reads can be discounted because the virus is a common contaminant in genome sequencing experiments (see, for example, Mukherjee S., et al., Stand. Genomic Sci. 10:18 (2015)), and does not infect human cells. Notably, this example highlights a case where alignment to only a panel of targeted species of oncogenic pathogens would have missed the less common Alphapapillomavirus 7 viral infection, particularly because two strains of papillomavirus were detected in this subject.
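The thresholding step described above can be illustrated as follows. This is a minimal sketch rather than the implementation used in the example: the function name and the contaminant set are hypothetical, and the counts for the three low-level viruses are placeholders, since the text states only that they fell below the threshold.

```python
MIN_READS = 10  # predetermined threshold stated in the text
KNOWN_CONTAMINANTS = {"Escherichia virus phiX174"}  # discounted as a sequencing contaminant


def call_infections(read_counts, min_reads=MIN_READS, contaminants=KNOWN_CONTAMINANTS):
    """Return species whose read count meets the threshold, excluding contaminants."""
    return sorted(
        species
        for species, count in read_counts.items()
        if count >= min_reads and species not in contaminants
    )


viral_read_counts = {
    "Human papillomavirus": 15429,
    "Alphapapillomavirus 7": 3982,
    "Escherichia virus phiX174": 148,
    # Placeholder values (assumed): the actual low-level counts are not given in the text.
    "Enterobacteria phage phiX174 sensu lato": 3,
    "Escherichia virus alpha3": 2,
    "Escherichia virus phiK": 1,
}

calls = call_infections(viral_read_counts)
print(calls)
```

With these inputs, only the two papillomavirus entries are reported, which matches the characterization given above.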
In order to test the viral detection method described herein, sequencing data was generated from total nucleic acid isolated from a tumor biopsy of an HNSCC cancer patient. Briefly, tumor total nucleic acid was extracted from formalin-fixed paraffin-embedded (FFPE) tumor tissue sections that were proteinase K digested. Total nucleic acid was extracted using a source-specific magnetic bead protocol. Total nucleic acid was utilized for all DNA library construction. RNA was purified from the total nucleic acid by DNaseI digestion and magnetic bead purification. Nucleic acids were quantified using commercial DNA or RNA quantification kits.
One hundred nanograms (ng) of isolated DNA was mechanically sheared to an average size of 200 base pairs (bp) using an ultrasonicator. DNA libraries were then prepared using a commercial DNA library preparation kit (e.g., a KAPA Hyper Prep Kit), and hybridized to a targeted probe set (e.g., similar to the probe set shown in
The 83 million sequence reads were then aligned to a human reference genome using the Scalable Nucleotide Alignment Program (SNAP) sequence alignment algorithm (Zaharia M., et al., arXiv:1111.5572v1 [cs.DS] 23 Nov. 2011, the content of which is incorporated by reference herein), which was completed in 366 seconds. Parameters and statistics for the alignment, as described in Zaharia et al., are shown in Table 9, below. Of the 83 million sequence reads, 414,645 reads were not aligned to the reference human genome.
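The selection of reads that failed to align to the human reference can be sketched as below. The sketch assumes the aligner output is available as SAM-formatted text, in which bit 0x4 of the FLAG field marks an unmapped segment; the function name and the example records are hypothetical and are not taken from the data described herein.

```python
from typing import Iterable, Iterator

UNMAPPED_FLAG = 0x4  # SAM FLAG bit: segment unmapped


def unmapped_records(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield SAM alignment records whose FLAG marks the read as unmapped."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & UNMAPPED_FLAG:
            yield line.rstrip("\n")


# Hypothetical records: read2 is unmapped (FLAG 4); read1 and read3 are mapped.
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
    "read3\t16\tchr2\t200\t60\t5M\t*\t0\t0\tGGCAT\tIIIII",
]

kept = list(unmapped_records(example_sam))
print(len(kept), "unmapped read(s) retained for pathogen alignment")
```

Only the retained records would then be passed to the bacterial and viral database alignments, which is what keeps those alignments short.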
The 414,645 reads that were not mapped to the human reference genome were then aligned to a comprehensive bacterial genome database (curated by the NCBI) using SNAP. This process took 464 seconds. In contrast, aligning all 83 million of the original sequence reads would have taken more than 25 hours at the same rate. The 414,645 reads that were not mapped to the human reference genome were also aligned to a comprehensive viral genome database (curated by the NCBI) using SNAP. This process took 195 seconds. In contrast, aligning all 83 million of the original sequence reads would have taken more than 10 hours at the same rate. Parameters and statistics for the alignments, as described in Zaharia et al., are shown in Tables 10 and 11, below.
The species of each aligned bacterial and viral sequence was determined and the number of sequence reads from each species was totaled. The final sequence read counts for each species identified are shown below in Tables 12 and 13.
Acidovorax_delafieldii
Burkholderia_mallei
Candidatus_Pelagibacter_ubique
Microbacterium_laevaniformans
Micrococcus_luteus
Propionibacterium_sp._409-HC1
Propionibacterium_sp._434-HC2
Vibrio_tubiashii
As shown in Table 13, the method identified 1469 Human gammaherpesvirus 4 reads and 52 Escherichia virus phiX174 reads, in addition to a low level of three other viruses. Because the number of reads for the former, but not the latter, group of viruses satisfied a predetermined threshold of at least 10 sequence reads, the HNSCC is characterized as afflicted with Human papillomavirus (HPV) and Alphapapillomavirus 9. Notably, Human papillomavirus (HPV) and Alphapapillomavirus 9 are known to be associated with human cancers, such that this information could be used to inform treatment of the HNSCC. The Escherichia virus phiX174 reads can be discounted because the virus is a common contaminant in genome sequencing experiments (see, for example, Mukherjee S., et al., Stand. Genomic Sci. 10:18 (2015)), and does not infect human cells.
In order to test the viral detection method described herein, sequencing data was generated from total nucleic acid isolated from a tumor biopsy of a colorectal cancer patient. Briefly, tumor total nucleic acid was extracted from formalin-fixed paraffin-embedded (FFPE) tumor tissue sections that were proteinase K digested. Total nucleic acid was extracted using a source-specific magnetic bead protocol. Total nucleic acid was utilized for all DNA library construction. RNA was purified from the total nucleic acid by DNaseI digestion and magnetic bead purification. Nucleic acids were quantified using commercial DNA or RNA quantification kits.
One hundred nanograms (ng) of isolated DNA was mechanically sheared to an average size of 200 base pairs (bp) using an ultrasonicator. DNA libraries were then prepared using a commercial DNA library preparation kit (e.g., a KAPA Hyper Prep Kit), and hybridized to a targeted probe set (e.g., similar to the probe set shown in
The 76 million sequence reads were then aligned to a human reference genome using the Scalable Nucleotide Alignment Program (SNAP) sequence alignment algorithm (Zaharia M., et al., arXiv:1111.5572v1 [cs.DS] 23 Nov. 2011, the content of which is incorporated by reference herein), which was completed in 394 seconds. Parameters and statistics for the alignment, as described in Zaharia et al., are shown in Table 14, below. Of the 76 million sequence reads, 92,523 reads were not aligned to the reference human genome.
The 92,523 reads that were not mapped to the human reference genome were then aligned to a comprehensive bacterial genome database (curated by the NCBI) using SNAP. This process took 603 seconds. In contrast, aligning all 76 million of the original sequence reads would have taken nearly 140 hours at the same rate. The 92,523 reads that were not mapped to the human reference genome were also aligned to a comprehensive viral genome database (curated by the NCBI) using SNAP. This process took 183 seconds. In contrast, aligning all 76 million of the original sequence reads would have taken more than 40 hours at the same rate. Parameters and statistics for the alignments, as described in Zaharia et al., are shown in Tables 15 and 16, below.
The species of each aligned bacterial and viral sequence was determined and the number of sequence reads from each species was totaled. The final sequence read counts for each species identified are shown below in Tables 17 and 18.
Acidovorax_delafieldii
Burkholderia_mallei
Candidatus_Pelagibacter_ubique
Microbacterium_laevaniformans
Micrococcus_luteus
Propionibacterium_sp._409-HC1
Propionibacterium_sp._434-HC2
Vibrio_tubiashii
As shown in Table 18, the method identified 1469 Human gammaherpesvirus 4 (also known as Epstein-Barr virus, EBV) reads and 52 Escherichia virus phiX174 reads, in addition to a low level of three other viruses. Because the number of reads for the former, but not the latter, group of viruses satisfied a predetermined threshold of at least 10 sequence reads, the colorectal cancer is characterized as afflicted with EBV. Notably, EBV is associated with at least Hodgkin lymphoma, Burkitt’s lymphoma, and nasopharyngeal cancers. Accordingly, this information could be used to inform treatment of the colorectal cancer. The Escherichia virus phiX174 reads can be discounted because the virus is a common contaminant in genome sequencing experiments (see, for example, Mukherjee S., et al., Stand. Genomic Sci. 10:18 (2015)), and does not infect human cells.
In order to evaluate the improvement in oncogenic pathogen detection provided by using capture probes against one or more viral targets, the bioinformatics method described herein was applied to data generated from molecular biopsy assays that did and did not include such capture probes. As shown in Table 19, inclusion of capture probes against sequences from oncogenic pathogens improved HPV detection by greater than 400% (average detection without oncogenic capture probes = 0.0167; average detection with oncogenic capture probes = 0.686).
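The size of the improvement can be worked out directly from the two averages quoted above. The following sketch is illustrative only; the variable names are hypothetical, and the units of the "average detection" values are taken as given in the text.

```python
avg_without_probes = 0.0167  # average detection without oncogenic capture probes
avg_with_probes = 0.686      # average detection with oncogenic capture probes

fold_change = avg_with_probes / avg_without_probes
percent_increase = (fold_change - 1.0) * 100.0

print(f"fold change: {fold_change:.1f}x; percent increase: {percent_increase:.0f}%")
```

On these figures the ratio is roughly 41-fold, which comfortably satisfies the "greater than 400%" improvement stated above.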
Assay 2 sequences the entire coding region (exome) of the human genome. It is optimized for formalin fixed paraffin embedded (FFPE) tumor tissue samples. The FFPE tumor tissue is matched to a normal blood or saliva sample to ensure fidelity of somatic variant calling. Assay 2 is designed to identify actionable oncologic variants as well as neoantigens across the exome thus enabling immuno-oncology applications.
Assay 3 is a non-invasive, liquid biopsy panel of 105 genes focused on oncogenic and resistance mutations in cell-free DNA (cfDNA). The assay provides approximately 20,000x DNA sequencing coverage over the target sequences. This panel is designed to provide clinical decision support for solid tumors.
Assay 4 combines a 595 gene somatic and germline DNA sequencing panel with RNA-sequencing. For solid tumors, it uses an FFPE tumor sample with a matched normal saliva or blood sample. For circulating hematologic malignancies, a blood or bone marrow sample is used. The assay is designed to identify actionable oncologic variants and is capable of detecting both somatic and germline single nucleotide polymorphisms (SNPs), indels less than 100 bp, copy number variants, and rearrangements in a targeted subset of clinically actionable genes via a single DNA sample. Further information on Assay 4 is provided in Beaubier N, et al., Oncotarget, 10(24):2384-96 (2019), which is incorporated by reference herein. Assays 5 and 6 integrate target probes against the oncogenic pathogen genes listed in Table 2 into the framework of Assay 4.
Referring to
In accordance with block 1302 of
In accordance with block 1304 of
Referring to block 1306 of
In accordance with block 1310 of
In accordance with block 1312 of
The RNA sequencing data was then normalized using gene length data, guanine-cytosine (GC) content data, and depth of sequencing data, by normalizing the gene length data for at least one gene to reduce systematic bias, normalizing the GC content data for the at least one gene to reduce systematic bias, and normalizing the depth of sequencing data for each sample, as described in U.S. Provisional Application Serial No. 62/735,349 and U.S. Pat. Application Serial No. 16/581,706, the contents of which are hereby incorporated herein by reference, in their entireties, for all purposes. The RNA sequencing data was also corrected against a standard gene expression dataset by comparing the sequence data for at least one gene in the gene expression dataset to sequence data in the standard gene expression dataset, as described in U.S. Provisional Application Serial No. 62/735,349 and U.S. Pat. Application Serial No. 16/581,706. The normalized and corrected RNA expression data for the twenty-four genes identified in Table 21, as well as the patient’s CDKN2A and TP53 allele statuses, were then input into the HPV detection classifier trained in Example 3, to determine the HPV viral status of the patient.
Referring to
In accordance with block 1204 of
Next, in accordance with block 1218 of
In some embodiments, additional genes were included in the discriminating set of genes based on the presence or absence of mutations (e.g., the number of mutations) in the additional genes. In this example, as detailed in
Next, in accordance with block 1242 of
In a fourth model, the classifier used was this same SVM classifier, in which the training was on the 427 subjects using the TCGA gene abundance levels for the genes listed in
To validate the model, the trained SVM classifier reported in
Each of the 133 validation subjects was run against the trained SVM whose performance is reported in
This example confirms viral infections are generally associated with an upregulation of immune responses. This example further shows that viral detection based on whole transcriptome data is a useful clinical tool in its own right, and further can be combined with existing diagnostic methods to provide insights about the viral status and tumor microenvironment in a single test.
Referring to
In accordance with block 1204 of
Next, in accordance with block 1218 of
In some embodiments, additional genes were included in the discriminating set of genes based on the presence or absence of mutations (e.g., the number of mutations) in the additional genes. In this example, as detailed in
Next, in accordance with block 1242 of
In a fourth model, the classifier used was this same SVM classifier, in which the training was on the 212 subjects and using the TCGA gene abundance levels for the genes listed in
To validate the model, the trained SVM classifier reported in
Each of the 55 validation subjects was run against the trained SVM whose performance is reported in
In this example, patient samples were processed through RNA whole exome short-read next generation sequencing (NGS) to generate RNA sequencing data, and the RNA sequencing data were processed by a bioinformatics pipeline to generate an RNA-seq expression profile for each patient sample. Specifically, solid tumor total nucleic acid (DNA and RNA) was extracted from macro-dissected FFPE tissue sections and digested by proteinase K to eliminate proteins. RNA was purified from the total nucleic acid by TURBO DNase-I to eliminate DNA, followed by a reaction cleanup using RNA clean XP beads to remove enzymatic proteins. The isolated RNA was subjected to a quality control protocol using RiboGreen fluorescent dye to determine concentration of the RNA molecules.
Library preparation was performed using the KAPA Hyper Prep Kit in which 100 ng of RNA was heat fragmented in the presence of magnesium to an average size of 200 bp. The libraries were then reverse transcribed into cDNA and Roche SeqCap dual end adapters were ligated onto the cDNA. cDNA libraries were then purified and subjected to size selection using KAPA Hyper Beads. Libraries were then PCR amplified for 10 cycles and purified using Axygen MAG PCR clean up beads. Quality control was performed using a PicoGreen fluorescent kit to determine cDNA library concentration. cDNA libraries were then pooled into 6-plex hybridization reactions. Each pool was treated with Human COT-1 and IDT xGen Universal Blockers before being dried in a vacufuge. RNA pools were then resuspended in IDT xGen Lockdown hybridization mix, and IDT xGen Exome Research Panel v1.0 probes were added to each pool. Pools were incubated to allow probes to hybridize. Pools were then mixed with Streptavidin-coated beads to capture the hybridized molecules of cDNA. Pools were amplified and purified once more using the KAPA HiFi Library Amplification kit and Axygen MAG PCR clean up beads, respectively. A final quality control step involving PicoGreen pool quantification and LabChip GX Touch was performed to assess pool fragment size. Pools were cluster amplified using Illumina Paired-end Cluster Kits with a PhiX spike-in on Illumina C-Bot2, and the resulting flow cell containing amplified target-captured cDNA libraries was sequenced on an Illumina HiSeq 4000 to an average unique on-target depth of 500x to generate a FASTQ file.
In this example, the cDNA library preparation was performed with an automated system, using a liquid handling robot (SciClone NGSx).
Each FASTQ file contained a list of paired-end reads generated by the Illumina sequencer, each of which was associated with a quality rating. The reads in each FASTQ file were processed by a bioinformatics pipeline. FASTQ files were analyzed using FASTQC for rapid assessment of quality control and reads. For each FASTQ file, each read in the file was aligned to a reference genome (GRCh37) using kallisto alignment software. This alignment generated a SAM file. Each SAM file was converted to BAM, the BAM files were sorted, and duplicates were marked for deletion.
For each gene, the raw RNA read count was calculated by the kallisto alignment software as the sum, over all reads, of the probability that the read aligns to the gene. Raw counts are therefore not integers in this example. The raw read counts were saved in a tabular file for each patient, where columns represented genes and each entry represented the raw RNA read count for that gene.
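The reason the raw counts are fractional can be seen in a small illustration. The sketch below is not kallisto's implementation, which estimates these probabilities internally; it only sums hypothetical per-read assignment probabilities by gene, and the gene names, probabilities, and function name are invented for the example.

```python
from collections import defaultdict


def expected_gene_counts(read_assignments):
    """Sum, for each gene, the probability that each read originated from it."""
    counts = defaultdict(float)
    for per_gene_probability in read_assignments:
        for gene, probability in per_gene_probability.items():
            counts[gene] += probability
    return dict(counts)


# Hypothetical reads: each maps genes to assignment probabilities that sum to 1.
reads = [
    {"GENE_A": 1.0},                    # unambiguous read
    {"GENE_A": 0.5, "GENE_B": 0.5},     # read split evenly between two genes
    {"GENE_B": 0.25, "GENE_C": 0.75},   # read favoring GENE_C
]

counts = expected_gene_counts(reads)
print(counts)
```

Because ambiguous reads contribute fractional weight to more than one gene, the per-gene totals are generally non-integer while still summing to the number of reads.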
Raw RNA read counts were then normalized to correct for GC content and gene length using full quantile normalization and adjusted for sequencing depth via the size factor method. Normalized RNA read counts were saved in a tabular file for each patient, where columns represented genes and each entry represented the normalized RNA read count for that gene.
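The sequencing-depth adjustment can be sketched as follows. The text does not specify which size factor estimator was used; this sketch assumes the commonly used median-of-ratios estimate, omits the GC content and gene length corrections, and uses hypothetical sample names, gene names, and counts.

```python
import math
from statistics import median


def size_factors(counts_by_sample):
    """Median-of-ratios size factors (an assumed choice of estimator)."""
    samples = list(counts_by_sample)
    genes = list(counts_by_sample[samples[0]])
    # Log geometric mean per gene, using only genes expressed in every sample.
    log_geo_means = {}
    for gene in genes:
        values = [counts_by_sample[s][gene] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[gene] = sum(math.log(v) for v in values) / len(values)
    return {
        s: math.exp(median(
            math.log(counts_by_sample[s][g]) - lgm for g, lgm in log_geo_means.items()
        ))
        for s in samples
    }


# Hypothetical counts: sample_2 was sequenced twice as deeply as sample_1.
raw = {
    "sample_1": {"GENE_A": 10.0, "GENE_B": 20.0, "GENE_C": 40.0},
    "sample_2": {"GENE_A": 20.0, "GENE_B": 40.0, "GENE_C": 80.0},
}

factors = size_factors(raw)
normalized = {s: {g: c / factors[s] for g, c in genes.items()} for s, genes in raw.items()}
print(factors)
```

After dividing by the size factors, the two hypothetical samples have matching values for each gene, which is the intended effect of a depth adjustment.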
All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
The present invention can be implemented as a computer program product that comprises a computer program mechanism embedded in a non-transitory computer readable storage medium. For instance, the computer program product could contain the program modules shown in any combination in
Many modifications and variations of this invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific embodiments described herein are offered by way of example only. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.
This application claims priority to U.S. Pat. Application No. 16/802,126, filed on Feb. 26, 2020, and U.S. Provisional Pat. Application No. 62/978,067, filed on Feb. 18, 2020, the contents of which are hereby incorporated by reference in their entireties for all purposes.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2021/018619 | 2/18/2021 | WO |
Number | Date | Country | |
---|---|---|---|
62978067 | Feb 2020 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 16802126 | Feb 2020 | US |
Child | PCT/US2021/018619 | | WO |