Thyroid cancer incidence has increased substantially in the United States in recent decades, with evidence to support both an increase in detection and a true increase in occurrence. Thyroid nodules are palpable in 5% of adults and are visualized with contemporary imaging in more than one-third of adults. Malignancy is present in only 5% to 15% of all thyroid nodules, and definitive diagnosis is achieved by surgical histopathology on resected tissue. Unfortunately, thyroid surgery is associated with discomfort, scarring, inconvenience, direct and indirect costs, potential lifelong medication, and occasional surgical complications. Efforts to exclude cancer with clinical assessment alone are admittedly imperfect, and laboratory testing of serum thyroid stimulating hormone levels and thyroid imaging with radionuclides or ultrasonography identify benignity with high confidence in only 4% to 26% of nodules. Forty years ago, the application of cytology to thyroid nodule specimens obtained by fine-needle aspiration (FNA) biopsy had a substantial effect on patient management by reducing surgery by one half and doubling the proportion of cancer among patients who underwent surgery. However, approximately one-third of thyroid nodule cytology findings today are cytologically indeterminate, with estimated risks of malignancy ranging from 5% to 30%. Consequently, approximately three quarters of patients with cytologically indeterminate thyroid nodules have been referred for surgery, even though 80% ultimately prove to have benign nodules.
The present disclosure describes enhanced technologies for characterizing genomic information, including improved methods for the measurement of RNA transcriptome expression and sequencing of nuclear and mitochondrial RNAs, measurement of changes in genomic copy number, including loss of heterozygosity, and the development of enhanced bioinformatics and machine learning strategies, resulting in a more robust genomic test.
An aspect of the present disclosure provides a method for processing or analyzing a tissue sample of a subject, comprising: (a) subjecting a first portion of the tissue sample to cytological analysis that indicates that the first portion of the tissue sample is cytologically indeterminate; (b) upon identifying the first portion of the tissue sample as being cytologically indeterminate, assaying by sequencing, array hybridization, or nucleic acid amplification a plurality of gene expression products from a second portion of the tissue sample to yield a first data set; (c) in a programmed computer, using a trained algorithm that comprises one or more classifiers to process the first data set from (b) to generate a classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant, wherein the one or more classifiers comprises an ensemble classifier integrated with at least one index selected from the group consisting of: a follicular content index, a Hürthle cell index, and a Hürthle neoplasm index; and (d) outputting a report indicative of the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant.
In some embodiments, the plurality of gene expression products include two or more of sequences corresponding to mRNA transcripts, mitochondrial transcripts, and chromosomal loss of heterozygosity. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a specificity of at least about 60%. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a specificity of at least about 68%. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a specificity of at least about 70%. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a sensitivity of at least about 90%.
In some embodiments, the one or more classifiers comprises the ensemble classifier integrated with the follicular content index, the Hürthle cell index, and the Hürthle neoplasm index. In some embodiments, the one or more classifiers further comprises one or more upstream classifiers, wherein the one or more upstream classifiers are selected from the group consisting of: a parathyroid classifier, a medullary thyroid cancer (MTC) classifier, a variant detection classifier, and a fusion transcript detection classifier. In some embodiments, the one or more classifiers comprises a parathyroid classifier that identifies a presence or an absence of a parathyroid tissue in the second portion of the tissue sample. In some embodiments, upon identification of the absence of the parathyroid tissue in the second portion of the tissue sample by the parathyroid classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a medullary thyroid cancer (MTC) classifier that identifies a presence or an absence of a medullary thyroid cancer (MTC) in the second portion of the tissue sample. In some embodiments, upon identification of the absence of the MTC in the second portion of the tissue sample by the MTC classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a variant detection classifier that identifies a presence or an absence of a BRAF mutation in the second portion of the tissue sample. In some embodiments, the BRAF mutation is a BRAF V600E mutation.
In some embodiments, upon identification of the absence of the BRAF mutation in the second portion of the tissue sample by the variant detection classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a fusion transcript detection classifier that identifies a presence or an absence of a RET/PTC gene fusion in the second portion of the tissue sample. In some embodiments, the RET/PTC gene fusion is a RET/PTC1 or RET/PTC3 gene fusion. In some embodiments, upon identification of the absence of the RET/PTC gene fusion in the second portion of the tissue sample by the fusion transcript detection classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the follicular content index identifies follicular content in the second portion of the tissue sample.
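The gating described above, in which the main classification is generated only after each upstream classifier reports an absence, may be sketched as follows. This is a minimal illustration under stated assumptions and not the disclosed implementation: the feature names, thresholds, toy detectors, and the `CascadeResult` type are hypothetical, and what is reported when an upstream classifier detects its target is not specified here, so the sketch simply records which classifier stopped the cascade.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical representation of the first data set: feature name -> value.
Profile = Dict[str, float]


@dataclass
class UpstreamClassifier:
    name: str
    detects: Callable[[Profile], bool]  # True when the target is present


@dataclass
class CascadeResult:
    classification: Optional[str]  # set only when every upstream check is negative
    stopped_by: Optional[str]      # name of the upstream classifier that fired, if any


def run_cascade(profile: Profile,
                upstream: List[UpstreamClassifier],
                ensemble: Callable[[Profile], str]) -> CascadeResult:
    """Run upstream classifiers in order; call the ensemble only on all absences."""
    for clf in upstream:
        if clf.detects(profile):
            return CascadeResult(classification=None, stopped_by=clf.name)
    return CascadeResult(classification=ensemble(profile), stopped_by=None)


# Toy detectors with made-up feature names and thresholds (assumptions).
UPSTREAM = [
    UpstreamClassifier("parathyroid", lambda p: p.get("parathyroid_score", 0.0) > 0.5),
    UpstreamClassifier("MTC", lambda p: p.get("mtc_score", 0.0) > 0.5),
    UpstreamClassifier("BRAF V600E", lambda p: p.get("braf_v600e_flag", 0.0) > 0.0),
    UpstreamClassifier("RET/PTC fusion", lambda p: p.get("ret_ptc_flag", 0.0) > 0.0),
]


def toy_ensemble(profile: Profile) -> str:
    # Stand-in for the ensemble classifier; the cut-off of 0.0 is arbitrary.
    score = profile.get("ensemble_score", 0.0)
    return "benign" if score < 0.0 else "suspicious for malignancy"


if __name__ == "__main__":
    print(run_cascade({"ensemble_score": -1.2}, UPSTREAM, toy_ensemble))
    print(run_cascade({"mtc_score": 0.9, "ensemble_score": -1.2}, UPSTREAM, toy_ensemble))
```

Ordering the upstream classifiers in a list keeps the gating logic separate from the individual detectors, so a detector can be added or removed without touching `run_cascade`.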
In some embodiments, the ensemble classifier analyzes, in the first data set, sequence information corresponding to at least 500 genes of Table 3. In some embodiments, the ensemble classifier analyzes, in the first data set, sequence information corresponding to at least 1000 genes of Table 3. In some embodiments, the ensemble classifier analyzes, in the first data set, sequence information corresponding to 1115 genes of Table 3.
In some embodiments, the method further comprises (e) upon identifying the second portion of the tissue sample as being suspicious for malignancy, or malignant (i) processing the first data set to identify one or more genetic aberrations in one or more genes listed in
In some embodiments, the tissue sample is a thyroid tissue sample. In some embodiments, the tissue sample is a needle aspirate sample. In some embodiments, the needle aspirate sample is a fine needle aspirate sample. In some embodiments, the malignancy is thyroid cancer.
Another aspect of the present disclosure provides a method for processing or analyzing a tissue sample of a subject, comprising: (a) subjecting a first portion of the tissue sample to cytological analysis that indicates that the first portion of the tissue sample is cytologically indeterminate; (b) upon identifying the first portion of the tissue sample as being cytologically indeterminate, assaying by sequencing, array hybridization, or nucleic acid amplification a plurality of gene expression products from a second portion of the tissue sample to yield a first data set, wherein the plurality of gene expression products include two or more of sequences corresponding to mRNA transcripts, mitochondrial transcripts, and chromosomal loss of heterozygosity; (c) in a programmed computer, using a trained algorithm that comprises one or more classifiers to process the first data set from (b) to generate a classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant; and (d) outputting a report indicative of the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant.
In some embodiments, the one or more classifiers comprises an ensemble classifier integrated with at least one index selected from the group consisting of: a follicular content index, a Hürthle cell index, and a Hürthle neoplasm index. In some embodiments, the one or more classifiers comprises an ensemble classifier integrated with a follicular content index, a Hürthle cell index, and a Hürthle neoplasm index.
In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a specificity of at least about 60%. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a specificity of at least about 68%. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a specificity of at least about 70%. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a sensitivity of at least about 90%.
In some embodiments, the one or more classifiers further comprises one or more upstream classifiers, wherein the one or more upstream classifiers are selected from the group consisting of: a parathyroid classifier, a medullary thyroid cancer (MTC) classifier, a variant detection classifier, and a fusion transcript detection classifier. In some embodiments, the one or more classifiers comprises a parathyroid classifier that identifies a presence or an absence of a parathyroid tissue in the second portion of the tissue sample. In some embodiments, upon identification of the absence of the parathyroid tissue in the second portion of the tissue sample by the parathyroid classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a medullary thyroid cancer (MTC) classifier that identifies a presence or an absence of a medullary thyroid cancer (MTC) in the second portion of the tissue sample. In some embodiments, upon identification of the absence of the MTC in the second portion of the tissue sample by the MTC classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a variant detection classifier that identifies a presence or an absence of a BRAF mutation in the second portion of the tissue sample. In some embodiments, the BRAF mutation is a BRAF V600E mutation.
In some embodiments, upon identification of the absence of the BRAF mutation in the second portion of the tissue sample by the variant detection classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a fusion transcript detection classifier that identifies a presence or an absence of a RET/PTC gene fusion in the second portion of the tissue sample. In some embodiments, the RET/PTC gene fusion is a RET/PTC1 or RET/PTC3 gene fusion. In some embodiments, upon identification of the absence of the RET/PTC gene fusion in the second portion of the tissue sample by the fusion transcript detection classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the follicular content index identifies follicular content in the second portion of the tissue sample.
In some embodiments, the one or more classifiers of the trained algorithm comprises an ensemble classifier, wherein the ensemble classifier analyzes, in the first data set, sequence information corresponding to at least 500 genes of Table 3. In some embodiments, the one or more classifiers of the trained algorithm comprises an ensemble classifier, wherein the ensemble classifier analyzes, in the first data set, sequence information corresponding to at least 1000 genes of Table 3. In some embodiments, the one or more classifiers of the trained algorithm comprises an ensemble classifier, wherein the ensemble classifier analyzes, in the first data set, sequence information corresponding to 1115 genes of Table 3.
In some embodiments, the method further comprises (e) upon identifying the second portion of the tissue sample as being suspicious for malignancy, or malignant (i) processing the first data set to identify one or more genetic aberrations in one or more genes listed in
In some embodiments, the tissue sample is a thyroid tissue sample. In some embodiments, the tissue sample is a needle aspirate sample. In some embodiments, the needle aspirate sample is a fine needle aspirate sample. In some embodiments, the malignancy is thyroid cancer.
Another aspect of the present disclosure provides a method for processing or analyzing a tissue sample of a subject, comprising: (a) subjecting a first portion of the tissue sample to cytological analysis that indicates that the first portion of the sample is cytologically indeterminate; (b) upon identifying the first portion of the tissue sample as being cytologically indeterminate, assaying by sequencing, array hybridization, or nucleic acid amplification a plurality of gene expression products from a second portion of the tissue sample to yield a first data set; (c) in a programmed computer, using a trained algorithm that comprises one or more classifiers to process the first data set from (b) to generate a classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant with a specificity of at least about 60%; and (d) outputting a report indicative of the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant.
In some embodiments, the one or more classifiers comprises an ensemble classifier integrated with at least one index selected from the group consisting of: a follicular content index, a Hürthle cell index, and a Hürthle neoplasm index. In some embodiments, the one or more classifiers comprises an ensemble classifier integrated with a follicular content index, a Hürthle cell index, and a Hürthle neoplasm index. In some embodiments, the plurality of gene expression products include two or more of sequences corresponding to mRNA transcripts, mitochondrial transcripts, and chromosomal loss of heterozygosity.
In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a specificity of at least about 68%. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a specificity of at least about 70%. In some embodiments, the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant has a sensitivity of at least about 90%.
In some embodiments, the one or more classifiers further comprises one or more upstream classifiers, wherein the one or more upstream classifiers are selected from the group consisting of: a parathyroid classifier, a medullary thyroid cancer (MTC) classifier, a variant detection classifier, and a fusion transcript detection classifier. In some embodiments, the one or more classifiers comprises a parathyroid classifier that identifies a presence or an absence of a parathyroid tissue in the second portion of the tissue sample. In some embodiments, upon identification of the absence of the parathyroid tissue in the second portion of the tissue sample by the parathyroid classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a medullary thyroid cancer (MTC) classifier that identifies a presence or an absence of a medullary thyroid cancer (MTC) in the second portion of the tissue sample. In some embodiments, upon identification of the absence of the MTC in the second portion of the tissue sample by the MTC classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a variant detection classifier that identifies a presence or an absence of a BRAF mutation in the second portion of the tissue sample. In some embodiments, the BRAF mutation is a BRAF V600E mutation.
In some embodiments, upon identification of the absence of the BRAF mutation in the second portion of the tissue sample by the variant detection classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the one or more classifiers comprises a fusion transcript detection classifier that identifies a presence or an absence of a RET/PTC gene fusion in the second portion of the tissue sample. In some embodiments, the RET/PTC gene fusion is a RET/PTC1 or RET/PTC3 gene fusion. In some embodiments, upon identification of the absence of the RET/PTC gene fusion in the second portion of the tissue sample by the fusion transcript detection classifier, the at least one classifier of the one or more classifiers generates the classification of the second portion of the tissue sample as benign, suspicious for malignancy, or malignant. In some embodiments, the follicular content index identifies follicular content in the second portion of the tissue sample.
In some embodiments, the one or more classifiers of the trained algorithm comprises an ensemble classifier, wherein the ensemble classifier analyzes, in the first data set, sequence information corresponding to at least 500 genes of Table 3. In some embodiments, the one or more classifiers of the trained algorithm comprises an ensemble classifier, wherein the ensemble classifier analyzes, in the first data set, sequence information corresponding to at least 1000 genes of Table 3. In some embodiments, the one or more classifiers of the trained algorithm comprises an ensemble classifier, wherein the ensemble classifier analyzes, in the first data set, sequence information corresponding to 1115 genes of Table 3.
In some embodiments, the method further comprises (e) upon identifying the second portion of the tissue sample as being suspicious for malignancy, or malignant (i) processing the first data set to identify one or more genetic aberrations in one or more genes listed in
In some embodiments, the tissue sample is a thyroid tissue sample. In some embodiments, the tissue sample is a needle aspirate sample. In some embodiments, the needle aspirate sample is a fine needle aspirate sample. In some embodiments, the malignancy is thyroid cancer.
Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.
Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.
Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
All publications and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.
The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:
While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.
The term “subject,” as used herein, generally refers to any animal or living organism. Animals can be mammals, such as humans, non-human primates, rodents such as mice and rats, dogs, cats, pigs, sheep, rabbits, and others. Animals can be fish, reptiles, or others. Animals can be neonatal, infant, adolescent, or adult animals. Humans can be more than about 1, 2, 5, 10, 20, 30, 40, 50, 60, 65, 70, 75, or about 80 years of age. The subject may have or be suspected of having a disease, such as cancer. The subject may be a patient, such as a patient being treated for a disease, such as a cancer patient. The subject may be predisposed to a risk of developing a disease such as cancer. The subject may be in remission from a disease, such as cancer. The subject may be healthy.
The term “disease,” as used herein, generally refers to any abnormal or pathologic condition that affects a subject. Examples of a disease include cancer, such as, for example, thyroid cancer, parathyroid cancer, lung cancer, skin cancer, and others. The disease may be treatable or non-treatable. The disease may be terminal or non-terminal. The disease can be a result of inherited genes, environmental exposures, or any combination thereof. The disease can be cancer, a genetic disease, a proliferative disorder, or others as described herein.
The terms “sequence variant,” “sequence variation,” “sequence alteration,” and “allelic variant,” as used herein, generally refer to a specific change or variation in relation to a reference sequence, such as a genomic deoxyribonucleic acid (DNA) reference sequence, a coding DNA reference sequence, or a protein reference sequence, or others. The reference DNA sequence can be obtained from a reference database. A sequence variant may affect function. A sequence variant may not affect function. A sequence variant can occur at the DNA level in one or more nucleotides, at the ribonucleic acid (RNA) level in one or more nucleotides, at the protein level in one or more amino acids, or any combination thereof. The reference sequence can be obtained from a database such as the NCBI Reference Sequence (RefSeq) database. Specific changes that can constitute a sequence variation can include a substitution, a deletion, an insertion, an inversion, or a conversion in one or more nucleotides or one or more amino acids. A sequence variant may be a point mutation. A sequence variant may be a fusion gene. A fusion pair or a fusion gene may result from a sequence variant, such as a translocation, an interstitial deletion, a chromosomal inversion, or any combination thereof. A sequence variation can constitute variability in the number of repeated sequences, such as triplications, quadruplications, or others. For example, a sequence variation can be an increase or a decrease in a copy number associated with a given sequence (i.e., copy number variation, or CNV). A sequence variation can include two or more sequence changes in different alleles or two or more sequence changes in one allele. A sequence variation can include two different nucleotides at one position in one allele, such as a mosaic. A sequence variation can include two different nucleotides at one position in one allele, such as a chimera. A sequence variant may be present in a malignant tissue.
A sequence variant may be present in a benign tissue. Absence of a variant may indicate that a tissue or sample is benign. As an alternative, absence of a variant may not indicate that a tissue or sample is benign.
The term “disease diagnostic,” as used herein, generally refers to diagnosing or screening for a disease, stratifying a risk of occurrence of a disease, monitoring progression or remission of a disease, formulating a treatment regime for the disease, or any combination thereof. A disease diagnostic can include a) obtaining information from one or more tissue samples from a subject, b) making a determination about whether the subject has a particular disease based on the information or tissue sample obtained, c) stratifying the risk of occurrence of the disease in the subject, d) confirming whether a subject has the disease, is developing the disease, or is in disease remission, or any combination thereof. The disease diagnostic may inform a particular treatment or therapeutic intervention for the disease. The disease diagnostic may also provide a score indicating, for example, the severity or grade of a disease such as cancer, or the likelihood of an accurate diagnosis, such as via a p-value, a corrected p-value, or a statistical confidence indicator. The disease diagnostic may also indicate a particular type of a disease. For example, a disease diagnostic for thyroid cancer may indicate a subtype such as follicular adenoma (FA), nodular hyperplasia (NHP), lymphocytic thyroiditis (LCT), Hürthle cell adenoma (HA), follicular carcinoma (FC), papillary thyroid carcinoma (PTC), follicular variant of papillary carcinoma (FVPTC), medullary thyroid carcinoma (MTC), Hürthle cell carcinoma, anaplastic thyroid carcinoma (ATC), renal carcinoma (RCC), breast carcinoma (BCA), melanoma (MMN), B cell lymphoma (BCL), parathyroid (PTA), or hyperplasia papillary carcinoma (HPC).
Some techniques for using preoperative genomic information for thyroid nodule differential diagnosis may involve the use of messenger RNA (“mRNA”) transcript expression levels to categorize cytologically indeterminate FNAs as either benign or suspicious. Altered messenger RNA expression can occur for several reasons, including complex upstream interactions that occur because of sequence changes in key core genes or in relevant peripheral genes, the effect of epigenetic changes that occur without DNA sequence alterations, and both internal and external modifiers, such as inflammation and lifestyle or environment. Previously, in a cohort with a 24% prevalence of malignancy, a genome expression classifier (“GEC”) accurately identified 90% of malignancies (i.e., sensitivity) and 52% of benign nodules (i.e., specificity) with indeterminate Bethesda III or IV cytology. It intentionally favored high sensitivity over specificity to ensure the accuracy and safety of a benign genomic result. In the GEC, a machine learning-derived classification algorithm uses messenger RNA transcript expression levels to categorize cytologically indeterminate samples as either benign or suspicious. A test, as described in the present disclosure, that has improved specificity for identification of benign nodules and maintained high sensitivity for malignancy detection may spare even more patients from surgery with an accurate benign genomic result (negative predictive value [NPV]) and increase the cancer yield among those with a suspicious result (positive predictive value [PPV]).
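The relationship between sensitivity, specificity, prevalence, and the predictive values mentioned above can be illustrated with a short calculation. The sketch below applies Bayes' rule to the three GEC figures quoted in this section (24% prevalence, 90% sensitivity, 52% specificity); the resulting PPV and NPV follow arithmetically from those figures and are not values reported in this disclosure. The function name and the second scenario, which substitutes a specificity of 68% while holding the other two inputs fixed, are illustrative assumptions.

```python
from typing import Tuple


def predictive_values(sensitivity: float,
                      specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied at the given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Figures quoted above for the GEC cohort.
    ppv, npv = predictive_values(sensitivity=0.90, specificity=0.52, prevalence=0.24)
    print(f"specificity 52%: PPV = {ppv:.3f}, NPV = {npv:.3f}")

    # Hypothetical scenario: same sensitivity and prevalence, higher specificity.
    ppv2, npv2 = predictive_values(sensitivity=0.90, specificity=0.68, prevalence=0.24)
    print(f"specificity 68%: PPV = {ppv2:.3f}, NPV = {npv2:.3f}")
```

With sensitivity and prevalence held fixed, raising specificity lowers the false-positive fraction, which raises PPV, and also raises the true-negative fraction, which raises NPV slightly.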
The present disclosure describes enhanced technologies for characterizing genomic information, including improved methods for the measurement of RNA transcriptome expression and sequencing of nuclear and mitochondrial RNAs, measurement changes in genomic copy number, including loss of heterozygosity, and the development of enhanced bioinformatics and machine learning strategies, resulting in a more robust genomic test.
The present disclosure provides methods for processing or analyzing a tissue sample of a subject to generate a classification of the tissue sample as benign, suspicious for malignancy, or malignant. Such methods may comprise obtaining a plurality of gene expression products from a cytologically indeterminate tissue sample and using an algorithm to analyze the gene expression products to classify the tissue sample as benign, suspicious for malignancy, or malignant. In some cases, the plurality of gene expression products comprises sequences corresponding to mRNA transcripts, mitochondrial transcripts, chromosomal loss of heterozygosity, DNA variants and/or fusion transcripts. In some examples, the method uses a trained algorithm that comprises one or more classifiers and is implemented by one or more programmed computer processors to analyze the gene expression products to generate a classification of the tissue sample as benign, suspicious for malignancy, or malignant. The algorithm may be a trained algorithm (e.g., an algorithm that is trained on at least 10, 100, 200, or 500 reference samples). Reference samples may be obtained from subjects having been diagnosed with the disease or from healthy subjects. The trained algorithm may analyze the sequence information of gene expression products corresponding to about 10,000 genes. The trained algorithm may analyze the sequence information of gene expression products corresponding to at least 500, 600, 700, 800, 900, 1000, 1100, or 1200 genes of Table 3.
As set forth in the present disclosure, an expression level of one or more genes of gene expression products can be obtained by assaying for an expression level. Assaying may comprise array hybridization, nucleic acid sequencing, nucleic acid amplification, or others. Assaying may comprise sequencing, such as DNA or RNA sequencing. Such sequencing may be by next generation (NextGen) sequencing, such as high throughput sequencing or whole genome sequencing (e.g., Illumina). Such sequencing may include enrichment. Assaying may comprise reverse transcription polymerase chain reaction (RT-PCR). Assaying may utilize markers, such as primers, that are selected for each of the one or more genes of the first or second sets of genes.
Additional methods for determining gene expression levels may include but are not limited to one or more of the following: additional cytological assays, assays for specific proteins or enzyme activities, assays for specific expression products including protein or RNA or specific RNA splice variants, in situ hybridization, whole or partial genome expression analysis, microarray hybridization assays, serial analysis of gene expression (SAGE), enzyme-linked immunosorbent assays (ELISA), mass spectrometry, immunohistochemistry, blotting, sequencing, RNA sequencing, DNA sequencing (e.g., sequencing of complementary deoxyribonucleic acid (cDNA) obtained from RNA), next generation (Next-Gen) sequencing, nanopore sequencing, pyrosequencing, or Nanostring sequencing. Gene expression product levels may be normalized to an internal standard such as total messenger ribonucleic acid (mRNA) or the expression level of a particular gene.
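By way of a non-limiting illustration, the normalization to an internal standard mentioned above may be sketched as follows. This is a minimal sketch only: the function names, the gene names, the counts, and the per-million scaling factor are hypothetical and are not taken from the present disclosure.

```python
from typing import Dict


def normalize_to_reference(levels: Dict[str, float], reference_gene: str) -> Dict[str, float]:
    """Divide every gene's level by the level of one chosen reference gene."""
    ref = levels[reference_gene]
    if ref <= 0:
        raise ValueError("reference gene must have a positive expression level")
    return {gene: value / ref for gene, value in levels.items()}


def normalize_to_total(levels: Dict[str, float], scale: float = 1_000_000.0) -> Dict[str, float]:
    """Scale every gene's level by the sample total (a per-million style scaling)."""
    total = sum(levels.values())
    if total <= 0:
        raise ValueError("total expression must be positive")
    return {gene: value * scale / total for gene, value in levels.items()}


# Hypothetical raw levels; the gene names are placeholders
raw = {"GENE_A": 200.0, "GENE_B": 50.0, "REF_GENE": 100.0}
by_reference = normalize_to_reference(raw, "REF_GENE")
by_total = normalize_to_total(raw)
```

Either form makes levels comparable between samples of different input amounts. Which internal standard is appropriate for a given assay is a design choice that this sketch does not address.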
The methods disclosed herein may include extracting and analyzing protein or nucleic acid (RNA or DNA) from one or more samples from a subject. Nucleic acids can be extracted from the entire sample obtained or can be extracted from a portion. In some cases, the portion of the sample not subjected to nucleic acid extraction may be analyzed by cytological examination or immunohistochemistry. Methods for RNA or DNA extraction from biological samples can include for example phenol-chloroform extraction (such as guanidinium thiocyanate phenol-chloroform extraction), ethanol precipitation, spin column-based purification, or others.
The sample obtained from the subject may be cytologically ambiguous or suspicious (or indeterminate). In some cases, the sample may be suggestive of the presence of a disease. The volume of sample obtained from the subject may be small, such as about 100 microliters, 50 microliters, 10 microliters, 5 microliters, 1 microliter or less. The sample may comprise a low quantity or quality of polynucleotides, such as a tissue sample with degraded or partially degraded RNA. For example, an FNA sample may yield low quantity or quality of polynucleotides. In such examples, the RNA Integrity Number (RIN) value of the sample may be about 9.0 or less. In some examples, the RIN value may be about 6.0 or less.
In some cases, the methods disclosed herein further comprise processing the gene expression products using a curated panel of sequences associated with variants and/or fusions, which includes well-validated variants and variants whose clinical significance is emerging (such as, for example, the Xpression Atlas), to provide further genomic information on samples identified as being suspicious for malignancy or malignant, the method comprising identifying any one of the genetic aberrations disclosed in in one or more genes listed in
The genetic aberrations may be validated or may have emerging clinical significance. The risk of malignancy may characterize one or more genetic aberrations as (1) highly associated with malignant nodules, (2) associated with both benign and malignant nodules, or (3) as having insufficient published evidence to characterize such risk. One or more genetic aberrations in one or more genes listed in
The methods disclosed herein may comprise identifying one or more genetic aberrations in a sample that are indicative of a histological subtype. Histological subtypes may include classical papillary thyroid carcinoma (cPTC), infiltrative follicular variant of papillary thyroid carcinoma (infiltrative FVPTC), noninvasive encapsulated FVPTC (EFVPTC), follicular thyroid carcinoma (FTC), and/or follicular adenomas (FA).
The methods disclosed herein may comprise identifying one or more genetic aberrations in a sample to indicate prognosis associated with the genetic aberration. Prognostic information may comprise TNM stage and American Thyroid Association (ATA) risk. The TNM Staging System is based on the extent of the tumor (T), the extent of spread to the lymph nodes (N), and the presence of metastasis (M). The T category describes the original (primary) tumor. The TNM stage may comprise stages 1-4. The ATA risk of recurrence staging system may comprise risk categories 1-3, which may correspond to low, intermediate, or high risk categories. The 761 nucleotide variant panel may have a positive percent agreement (PPA) rate of at least 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or more. The 130 fusion panel may have a PPA rate of at least 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or more. Identification of one or more genetic aberrations may increase the risk of malignancy reported by one or more classifiers as used in the methods disclosed herein. Identification of one or more genetic aberrations may not increase the risk of malignancy reported by one or more classifiers as used in the methods disclosed herein. A reported risk of malignancy generated by one or more classifiers of the present disclosure may not be reduced in some cases where no genetic aberrations in one or more genes listed in
A sample obtained from a subject can comprise tissue, cells, cell fragments, cell organelles, nucleic acids, genes, gene fragments, expression products, gene expression products, gene expression product fragments or any combination thereof. A sample can be heterogeneous or homogenous. A sample can comprise blood, urine, cerebrospinal fluid, seminal fluid, saliva, sputum, stool, lymph fluid, tissue, or any combination thereof. A sample can be a tissue-specific sample such as a sample obtained from a thyroid, skin, heart, lung, kidney, breast, pancreas, liver, muscle, smooth muscle, bladder, gall bladder, colon, intestine, brain, esophagus, or prostate.
A sample of the present disclosure can be obtained by various methods, such as, for example, fine needle aspiration (FNA), core needle biopsy, vacuum assisted biopsy, incisional biopsy, excisional biopsy, punch biopsy, shave biopsy, skin biopsy, or any combination thereof.
FNA, also referred to as fine needle aspirate biopsy (FNAB), or needle aspirate biopsy (NAB), is a method of obtaining a small amount of tissue from a subject. FNA can be less invasive than a tissue biopsy, which may require surgery and hospitalization of the subject to obtain the tissue biopsy. The needle of an FNA method can be inserted into a tissue mass of a subject to obtain an amount of sample for further analysis. In some cases, two needles can be inserted into the tissue mass. The FNA sample obtained from the tissue mass may be acquired by one or more passages of the needle across the tissue mass. In some cases, the FNA sample can comprise less than about 6×10⁶, 5×10⁶, 4×10⁶, 3×10⁶, 2×10⁶, 1×10⁶ cells or less. The needle can be guided to the tissue mass by ultrasound or other imaging device. The needle can be hollow to permit recovery of the FNA sample through the needle by aspiration or vacuum or other suction techniques.
Samples obtained using methods disclosed herein, such as an FNA sample, may comprise a small sample volume. A sample volume may be less than about 500 microliters (uL), 400 uL, 300 uL, 200 uL, 100 uL, 75 uL, 50 uL, 25 uL, 20 uL, 15 uL, 10 uL, 5 uL, 1 uL, 0.5 uL, 0.1 uL, 0.01 uL or less. The sample volume may be less than about 1 uL. The sample volume may be less than about 5 uL. The sample volume may be less than about 10 uL. The sample volume may be less than about 20 uL. The sample volume may be between about 1 uL and about 10 uL. The sample volume may be between about 10 uL and about 25 uL.
Samples obtained using methods disclosed herein, such as an FNA sample, may comprise small sample weights. The sample weight, such as a tissue weight, may be less than about 100 milligrams (mg), 75 mg, 50 mg, 25 mg, 20 mg, 15 mg, 10 mg, 9 mg, 8 mg, 7 mg, 6 mg, 5 mg, 4 mg, 3 mg, 2 mg, 1 mg, 0.5 mg, 0.1 mg or less. The sample weight may be less than about 20 mg. The sample weight may be less than about 10 mg. The sample weight may be less than about 5 mg. The sample weight may be between about 5 mg and about 20 mg. The sample weight may be between about 1 mg and about 5 mg.
Samples obtained using methods disclosed herein, such as FNA, may comprise small numbers of cells. The number of cells of a single sample may be less than about 10×10⁶, 5.5×10⁶, 5×10⁶, 4.5×10⁶, 4×10⁶, 3.5×10⁶, 3×10⁶, 2.5×10⁶, 2×10⁶, 1.5×10⁶, 1×10⁶, 0.5×10⁶, 0.2×10⁶, 0.1×10⁶ cells or less. The number of cells of a single sample may be less than about 5×10⁶ cells. The number of cells of a single sample may be less than about 4×10⁶ cells. The number of cells of a single sample may be less than about 3×10⁶ cells. The number of cells of a single sample may be less than about 2×10⁶ cells. The number of cells of a single sample may be between about 1×10⁶ and about 5×10⁶ cells. The number of cells of a single sample may be between about 1×10⁶ and about 10×10⁶ cells.
Samples obtained using methods disclosed herein, such as FNA, may comprise small amounts of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). The amount of DNA or RNA in an individual sample may be less than about 500 nanograms (ng), 400 ng, 300 ng, 200 ng, 100 ng, 75 ng, 50 ng, 45 ng, 40 ng, 35 ng, 30 ng, 25 ng, 20 ng, 15 ng, 10 ng, 5 ng, 1 ng, 0.5 ng, 0.1 ng, or less. The amount of DNA or RNA may be less than about 40 ng. The amount of DNA or RNA may be less than about 25 ng. The amount of DNA or RNA may be less than about 15 ng. The amount of DNA or RNA may be between about 1 ng and about 25 ng. The amount of DNA or RNA may be between about 5 ng and about 50 ng.
RNA yield or RNA amount of a sample can be measured in nanogram to microgram amounts. Examples of apparatus that can be used to measure nucleic acid yield in the laboratory include a NANODROP® spectrophotometer, a QUBIT® fluorometer, and a QUANTUS™ fluorometer. The accuracy of a NANODROP® measurement may decrease significantly with very low RNA concentration. Quality of data obtained from the methods described herein can be dependent on RNA quantity. Meaningful gene expression or sequence variant data or others can be generated from samples having a low or un-measurable RNA concentration as measured by NANODROP®. In some cases, gene expression or sequence variant data or others can be generated from a sample having an immeasurable RNA concentration.
The methods as described herein can be performed using samples with low quantity or quality of polynucleotides, such as DNA or RNA. A sample with low quantity or quality of RNA can be for example a degraded or partially degraded tissue sample. A sample with low quantity or quality of RNA may be a fine needle aspirate (FNA) sample. The RNA quality of a sample can be measured by a calculated RNA Integrity Number (RIN) value. The RIN value is assigned by an algorithm that rates the integrity of RNA measurements. The algorithm can assign an RIN value of 1 to 10, where an RIN value of 10 can indicate completely intact RNA. A sample as described herein that comprises RNA can have an RIN value of about 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0 or less. In some cases, a sample comprising RNA can have an RIN value equal to or less than about 8.0. In some cases, a sample comprising RNA can have an RIN value equal to or less than about 6.0. In some cases, a sample comprising RNA can have an RIN value equal to or less than about 4.0. In some cases, a sample can have an RIN value of less than about 2.0.
A sample, such as an FNA sample, may be obtained from a subject by another individual or entity, such as a healthcare (or medical) professional or robot. A medical professional can include a physician, nurse, medical technician or other. In some cases, a physician may be a specialist, such as an oncologist, surgeon, or endocrinologist. A medical technician may be a specialist, such as a cytologist, phlebotomist, radiologist, pulmonologist or others. A medical professional may obtain a sample from a subject for testing or refer the subject to a testing center or laboratory for the submission of the sample. The medical professional may indicate to the testing center or laboratory the appropriate test or assay to perform on the sample, such as methods of the present disclosure including determining gene sequence data, gene expression levels, sequence variant data, or any combination thereof.
In some cases, a medical professional need not be involved in the initial diagnosis of a disease or the initial sample acquisition. An individual, such as the subject, may alternatively obtain a sample through the use of an over-the-counter kit. The kit may contain a collection unit or device for obtaining the sample as described herein, a storage unit for storing the sample ahead of sample analysis, and instructions for use of the kit.
A sample can be obtained a) pre-operatively, b) post-operatively, c) after a cancer diagnosis, d) during routine screening following remission or cure of disease, e) when a subject is suspected of having a disease, f) during a routine office visit or clinical screen, g) following the request of a medical professional, or any combination thereof. Multiple samples at separate times can be obtained from the same subject, such as before treatment for a disease commences and after treatment ends, such as monitoring a subject over a time course. Multiple samples can be obtained from a subject at separate times to monitor the absence or presence of disease progression, regression, or remission in the subject.
The methods as described herein may include cytological analysis of samples. Examples of cytological analysis include cell staining techniques and/or microscope examination performed by any number of methods and suitable reagents including but not limited to: eosin-azure (EA) stains, hematoxylin stains, CYTO-STAIN™, Papanicolaou stain, eosin, Nissl stain, toluidine blue, silver stain, azocarmine stain, neutral red, or Janus green. More than one stain can be used in combination with other stains. In some cases, cells are not stained at all. Cells can be fixed and/or permeabilized with, for example, methanol, ethanol, glutaraldehyde or formaldehyde prior to or during the staining procedure. In some cases, the cells may not be fixed. Staining procedures can also be utilized to measure the nucleic acid content of a sample, for example with ethidium bromide, hematoxylin, Nissl stain or any other nucleic acid stain.
Microscope examination of cells in a sample can include smearing cells onto a slide by standard methods for cytological examination. Liquid based cytology (LBC) methods may be utilized. In some cases, LBC methods provide for an improved approach of cytology slide preparation, more homogenous samples, increased sensitivity and specificity, or improved efficiency of handling of samples, or any combination thereof. In LBC methods, samples can be transferred from the subject to a container or vial containing a LBC preparation solution such as for example CYTYC THINPREP®, SUREPATH™, or MONOPREP® or any other LBC preparation solution. Additionally, the sample may be rinsed from the collection device with LBC preparation solution into the container or vial to ensure substantially quantitative transfer of the sample. The solution containing the sample in LBC preparation solution may then be stored and/or processed by a machine or by one skilled in the art to produce a layer of cells on a glass slide. The sample may further be stained and examined under the microscope in the same way as a conventional cytological preparation.
Samples can be analyzed by immuno-histochemical staining. Immuno-histochemical staining can provide analysis of the presence, location, and distribution of specific molecules or antigens by use of antibodies in a sample (e.g. cells or tissues). Antigens can be small molecules, proteins, peptides, nucleic acids or any other molecule capable of being specifically recognized by an antibody. Samples may be analyzed by immuno-histochemical methods with or without a prior fixing and/or permeabilization step. In some cases, the antigen of interest may be detected by contacting the sample with an antibody specific for the antigen and then non-specific binding may be removed by one or more washes. The specifically bound antibodies may then be detected by an antibody detection reagent such as for example a labeled secondary antibody, or a labeled avidin/streptavidin. The antigen specific antibody can be labeled directly. Suitable labels for immunohistochemistry include but are not limited to fluorophores such as fluorescein and rhodamine, enzymes such as alkaline phosphatase and horseradish peroxidase, or radionuclides such as ³²P and ¹²⁵I. Gene product markers that may be detected by immuno-histochemical staining include but are not limited to Her2/Neu, Ras, Rho, EGFR, VEGFR, UbcH10, RET/PTC1, cytokeratin 20, calcitonin, GAL-3, thyroid peroxidase, or thyroglobulin.
Metrics associated with classifying a tissue sample as disclosed herein, such as sequences corresponding to mRNA transcripts, mitochondrial transcripts, and/or chromosomal loss of heterozygosity, need not be a characteristic of every cell of a sample found to comprise the tissue classification. Thus, the methods disclosed herein can be useful for classifying a tissue sample, e.g. as benign, suspicious for malignancy, or malignant for cancer, within a tissue where less than all cells within the sample exhibit a complete pattern of the gene expression levels or sequence variant data, or other data indicative of tissue classification. The gene expression levels, sequence variant data, or others may be either completely present, partially present, or absent within affected cells, as well as unaffected cells of the sample. The gene expression levels, sequence variant data, or others may be present in variable amounts within affected cells. The gene expression levels, sequence variant data, or others may be present in variable amounts within unaffected cells. In some cases, the gene expression levels of a first set of genes or the presence of one or more sequence variants in a second set of genes that correlates with a risk of malignancy occurrence can be positively detected. In some instances, positive detection can occur in at least 70%, 75%, 80%, 85%, 90%, 95%, or 100% of cells drawn from a sample. In some cases, the gene expression levels of a first set of genes or the presence of one or more sequence variants in a second set of genes can be absent. In some instances, absence of detection can occur in at least 70%, 75%, 80%, 85%, 90%, 95%, or 100% of cells of a corresponding normal or benign, non-disease sample.
Routine cytological or other assays may indicate a sample as negative (without disease), diagnostic (positive diagnosis for disease, such as cancer), ambiguous or suspicious (e.g., indeterminate) (suggestive of the presence of a disease, such as cancer), or non-diagnostic (providing inadequate information concerning the presence or absence of disease). The methods as described herein may confirm results from the routine cytological assessments or may provide an original assessment similar to a routine cytological assessment in the absence of one. The methods as described herein may classify a sample as malignant or benign, including samples found to be ambiguous, suspicious, or indeterminate. The methods may further stratify samples, such as samples known to be malignant, into low risk and medium-to-high risk groups of disease occurrence, including samples found to be ambiguous, suspicious, or indeterminate.
Suitable reagents for conducting array hybridization, nucleic acid sequencing, nucleic acid amplification or other amplification reactions include, but are not limited to, DNA polymerases, markers such as forward and reverse primers, deoxynucleotide triphosphates (dNTPs), and one or more buffers. Such reagents can include a primer that is selected for a given sequence of interest, such as the one or more genes of the first set of genes and/or second set of genes.
In such amplification reactions, one primer of a primer pair can be a forward primer complementary to a first sequence of a target polynucleotide molecule (e.g. the one or more genes of the first or second sets) and the other primer of the primer pair can be a reverse primer complementary to a second sequence of the target polynucleotide molecule, and a target locus can reside between the first sequence and the second sequence.
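The arrangement of a primer pair around a target locus can be illustrated with a short sketch. The sketch assumes a convention that is not stated in the disclosure: the template is given as its top strand written 5′ to 3′, the forward primer matches the top strand (and so anneals to the bottom strand), and the reverse primer is complementary to the top strand. All sequences and the function names are placeholders chosen for illustration.

```python
from typing import Optional, Tuple

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def locate_target(template: str, forward: str, reverse: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the locus lying between the two primer sites.

    Returns None if either primer site is not found on the template.
    """
    top = template.upper()
    fwd_at = top.find(forward.upper())
    if fwd_at < 0:
        return None
    locus_start = fwd_at + len(forward)
    # The reverse primer anneals to the top strand, so its site on the
    # top strand reads as the primer's reverse complement.
    rev_site = reverse_complement(reverse)
    rev_at = top.find(rev_site, locus_start)
    if rev_at < 0:
        return None
    return locus_start, rev_at


# Placeholder sequences: flank + forward site + locus + reverse site + flank
forward_primer = "ACGTGC"
reverse_primer = "TGCCAA"
template = "TT" + "ACGTGC" + "AAGGCTTACCGA" + "TTGGCA" + "TGCAA"

span = locate_target(template, forward_primer, reverse_primer)
locus = template[span[0]:span[1]] if span else None
```

In this toy template the recovered locus is the segment placed between the two primer sites when the template was assembled.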
The length of the forward primer and the reverse primer can depend on the sequence of the target polynucleotide (e.g. the one or more genes of the first or second sets) and the target locus. In some cases, a primer can be greater than or equal to about 5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 65, 70, 75, 80, 85, 90, 95, or about 100 nucleotides in length. As an alternative, a primer can be less than about 100, 95, 90, 85, 80, 75, 70, 65, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, or about 6 nucleotides in length. In some cases, a primer can be about 15 to about 20, about 15 to about 25, about 15 to about 30, about 15 to about 40, about 15 to about 45, about 15 to about 50, about 15 to about 55, about 15 to about 60, about 20 to about 25, about 20 to about 30, about 20 to about 35, about 20 to about 40, about 20 to about 45, about 20 to about 50, about 20 to about 55, about 20 to about 60, about 20 to about 80, or about 20 to about 100 nucleotides in length.
Primers can be designed according to known parameters for avoiding secondary structures and self-hybridization, such as primer dimer pairs. Different primer pairs can anneal and melt at about the same temperatures, for example, within 1° C., 2° C., 3° C., 4° C., 5° C., 6° C., 7° C., 8° C., 9° C. or 10° C. of another primer pair.
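As a rough illustration of checking that primer pairs melt at about the same temperature, the sketch below estimates melting temperature with the Wallace rule, Tm = 2(A+T) + 4(G+C) in °C, which is a coarse approximation for short oligonucleotides and is not a method named in the disclosure. Summarizing a pair by the lower Tm of its two primers is likewise an assumption, and the primer sequences are placeholders.

```python
from typing import List, Tuple


def wallace_tm(primer: str) -> int:
    """Estimate melting temperature (deg C) as 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def pair_tm(forward: str, reverse: str) -> int:
    """Assumption: summarize a primer pair by the lower Tm of its two primers."""
    return min(wallace_tm(forward), wallace_tm(reverse))


def pairs_within(pairs: List[Tuple[str, str]], max_delta_c: float) -> bool:
    """True if every pair's Tm lies within max_delta_c of every other pair's Tm."""
    tms = [pair_tm(f, r) for f, r in pairs]
    if not tms:
        return True
    return (max(tms) - min(tms)) <= max_delta_c


# Hypothetical primer pairs (placeholder sequences)
panel = [
    ("ACGTGCAAGGCTTACCGA", "TGCCAATCGGTAAGCCTT"),
    ("GGATCCAAGCTTGACGTA", "TTGACCGGTACCATGCAA"),
]
compatible = pairs_within(panel, max_delta_c=5.0)
```

The `max_delta_c` argument corresponds to the tolerance described above (for example, 1° C. to 10° C.). A nearest-neighbor thermodynamic model would give better estimates than the Wallace rule for longer primers.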
The target locus can be about 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 100, 150, 200, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 650, 700, 750, 800, 850, 900 or 1000 nucleotides from the 3′ ends or 5′ ends of the plurality of template polynucleotides.
Markers (i.e., primers) for the methods described can be one or more of the same primer. In some instances, the markers can be one or more different primers such as about 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more different primers. In such examples, each primer of the one or more primers can comprise a different target or template specific region or sequence, such as the one or more genes of the first or second sets.
One or more primers can comprise a fixed panel of primers. The one or more primers can comprise at least one or more custom primers. The one or more primers can comprise at least one or more control primers. The one or more primers can comprise at least one or more housekeeping gene primers. In some instances, the one or more custom primers anneal to a target specific region or complements thereof. The one or more primers can be designed to amplify or to perform primer extension, reverse transcription, linear extension, non-exponential amplification, exponential amplification, PCR, or any other amplification method of one or more target or template polynucleotides.
Primers can incorporate additional features that allow for the detection or immobilization of the primer but do not alter a basic property of the primer (e.g., acting as a point of initiation of DNA synthesis). For example, primers can comprise a nucleic acid sequence at the 5′ end which does not hybridize to a target nucleic acid, but which facilitates cloning or further amplification, or sequencing of an amplified product. For example, the sequence can comprise a primer binding site, such as a PCR priming sequence, a sample barcode sequence, or a universal primer binding site or others.
A universal primer binding site or sequence can attach a universal primer to a polynucleotide and/or amplicon. Universal primers can include −47F (M13F), alfaMF, AOX3′, AOX5′, BGHr, CMV-30, CMV-50, CVMf, LACrmt, lambda gt10F, lambda gt10R, lambda gt11F, lambda gt11R, M13 rev, M13Forward(−20), M13Reverse, male, p10SEQPpQE, pA-120, pet4, pGAP Forward, pGLRVpr3, pGLpr2R, pKLAC14, pQEFS, pQERS, pucU1, pucU2, reversA, seqIREStam, seqIRESzpet, seqori, seqPCR, seqpIRES-, seqpIRES+, seqpSecTag, seqpSecTag+, seqretro+PSI, SP6, T3-prom, T7-prom, and T7-termInv. As used herein, attach can refer to both or either covalent interactions and noncovalent interactions. Attachment of the universal primer to the universal primer binding site may be used for amplification, detection, and/or sequencing of the polynucleotide and/or amplicon.
The trained algorithm of the present disclosure can be trained using a set of samples, such as a sample cohort. The sample cohort can comprise about 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000 or more independent samples. The sample cohort can comprise about 100 independent samples. The sample cohort can comprise about 200 independent samples. The sample cohort can comprise between about 100 and about 700 independent samples. The independent samples can be from subjects having been diagnosed with a disease, such as cancer, from healthy subjects, or any combination thereof.
The sample cohort can comprise samples from about 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000 or more different individuals. The sample cohort can comprise samples from about 100 different individuals. The sample cohort can comprise samples from about 200 different individuals. The different individuals can be individuals having been diagnosed with a disease, such as cancer, healthy individuals, or any combination thereof.
The sample cohort can comprise samples obtained from individuals living in at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, or 80 different geographical locations (e.g., sites spread out across a nation, such as the United States, across a continent, or across the world). Geographical locations include, but are not limited to, test centers, medical facilities, medical offices, post office addresses, cities, counties, states, nations, or continents. In some cases, a classifier that is trained using sample cohorts from the United States may need to be re-trained for use on sample cohorts from other geographical regions (e.g., India, Asia, Europe, Africa, etc.).
The trained algorithm may comprise one or more classifiers selected from the group consisting of a parathyroid classifier, a medullary thyroid cancer (MTC) classifier, a variant detection classifier, a fusion transcript detection classifier, an ensemble classifier, a follicular content index, and one or more Hürthle classifiers (e.g., a Hürthle cell index and/or a Hürthle neoplasm index). The ensemble classifier may be integrated with one or more indices selected from the group consisting of a follicular content index, a Hürthle cell index, and a Hürthle neoplasm index. A parathyroid classifier may identify a presence or an absence of parathyroid tissue in the tissue sample. An MTC classifier may identify a presence or an absence of MTC in the tissue sample. A variant detection classifier may identify a presence or an absence of a BRAF mutation (such as BRAF V600E) in the tissue sample. A fusion transcript detection classifier may identify a presence or an absence of a RET/PTC gene fusion (such as a RET/PTC1 and/or RET/PTC3 gene fusion) in the tissue sample. A follicular content index may identify follicular content in the tissue sample. A classifier may identify one or more TRK gene fusions and one or more RET alterations (e.g., a RET gene fusion).
The ensemble classifier may comprise 10,000 or more genes with a set of 1000 or more core genes. The 10,000 or more genes may improve the ensemble classifier stability against variability. The core genes may drive the prediction behavior of the ensemble model. The ensemble classifier may comprise or consist of 12 independent classifiers. The 12 independent classifiers may comprise or consist of 6 elastic net logistic regression models and 6 support vector machine models. The 6 elastic net logistic regression models may each differ from one another according to the gene sets disclosed in Table 2. The 6 support vector machine models may each differ from one another according to the gene sets disclosed in Table 2. The ensemble classifier may analyze the sequence information of gene expression products corresponding to about 10,000 genes. The ensemble classifier may analyze the sequence information of gene expression products corresponding to at least 500, 600, 700, 800, 900, 1000, 1100, or 1200 genes of Table 3.
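The structure of such an ensemble, twelve members drawn from two model families and six gene sets, may be sketched as below. This is a structural illustration only, under several assumptions not found in the disclosure: each member is reduced to a linear scorer with made-up weights, member scores are squashed to the range 0 to 1, members are combined by an unweighted mean, and the result is reduced to a two-way call at a threshold of 0.5. The gene names and gene sets are placeholders, not the gene sets of Table 2.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Member:
    kind: str                  # "elastic_net_lr" or "svm" (a label only in this sketch)
    weights: Dict[str, float]  # hypothetical per-gene weights over this member's gene set
    bias: float = 0.0

    def score(self, expression: Dict[str, float]) -> float:
        # Linear decision value over the member's own gene set
        z = self.bias + sum(w * expression.get(g, 0.0) for g, w in self.weights.items())
        z = max(-50.0, min(50.0, z))        # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))   # squash to (0, 1) so members are comparable


def ensemble_score(members: List[Member], expression: Dict[str, float]) -> float:
    """Assumption: combine members by an unweighted mean of their scores."""
    return sum(m.score(expression) for m in members) / len(members)


def call(members: List[Member], expression: Dict[str, float], threshold: float = 0.5) -> str:
    """Assumption: a two-way call at a fixed threshold on the mean score."""
    return "suspicious" if ensemble_score(members, expression) >= threshold else "benign"


# Six hypothetical gene sets, each used by one elastic net model and one SVM
gene_sets = [["G1", "G2"], ["G1", "G3"], ["G2", "G3"], ["G1"], ["G2"], ["G3"]]
members = [
    Member(kind, {g: 1.0 for g in genes})
    for kind in ("elastic_net_lr", "svm")
    for genes in gene_sets
]

high = call(members, {"G1": 2.0, "G2": 2.0, "G3": 2.0})
low = call(members, {"G1": -2.0, "G2": -2.0, "G3": -2.0})
```

Because each member sees a different gene set, noise in any single gene affects only some members, which is one plausible reading of how an ensemble could gain stability against variability. In practice the members would be fitted models rather than hand-set weights.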
In some embodiments, the specificity of the present method is at least 60%, 65%, 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or more.
In some embodiments, the sensitivity of the present method is at least 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or more.
In some embodiments, the specificity is greater than or equal to 60%. In some embodiments, the negative predictive value (NPV) is greater than or equal to 95%. In some embodiments, the NPV is at least 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.
Sensitivity typically refers to TP/(TP+FN), where TP is true positive and FN is false negative. In this context, sensitivity is the number of Continued Indeterminate results divided by the total number of malignant results based on adjudicated histopathology diagnosis. Specificity typically refers to TN/(TN+FP), where TN is true negative and FP is false positive. In this context, specificity is the number of actual benign results divided by the total number of benign results based on adjudicated histopathology diagnosis. Positive Predictive Value (PPV) may be determined by TP/(TP+FP). Negative Predictive Value (NPV) may be determined by TN/(TN+FN).
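The four definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the disclosure: the class and function names are hypothetical, and the counts are taken from the Bethesda III/IV validation results reported in this disclosure (41 of 45 malignant samples called suspicious, 99 of 145 nonmalignant samples called benign).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts against adjudicated histopathology (hypothetical container)."""
    tp: int  # malignant on histopathology, test positive
    fn: int  # malignant on histopathology, test negative
    tn: int  # benign on histopathology, test negative
    fp: int  # benign on histopathology, test positive


def sensitivity(c: Counts) -> float:
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    return c.tn / (c.tn + c.fp)


def ppv(c: Counts) -> float:
    return c.tp / (c.tp + c.fp)


def npv(c: Counts) -> float:
    return c.tn / (c.tn + c.fn)


def summarize(c: Counts) -> dict:
    """Return the four metrics as percentages rounded to one decimal place."""
    return {
        "sensitivity": round(100 * sensitivity(c), 1),
        "specificity": round(100 * specificity(c), 1),
        "ppv": round(100 * ppv(c), 1),
        "npv": round(100 * npv(c), 1),
    }


# Counts implied by the reported primary test set: 41/45 malignant, 99/145 nonmalignant.
primary = Counts(tp=41, fn=4, tn=99, fp=46)
print(summarize(primary))
```

With these counts the sketch reproduces the reported sensitivity, specificity, PPV, and NPV of 91.1%, 68.3%, 47.1%, and 96.1%.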
A biological sample may be identified as cancerous with an accuracy of greater than 75%, 80%, 85%, 90%, 95%, 99% or more. In some embodiments, the biological sample is identified as cancerous with a sensitivity of greater than 90%. In some embodiments, the biological sample is identified as cancerous with a specificity of greater than 60%. In some embodiments, the biological sample is identified as cancerous or benign with a sensitivity of greater than 90% and a specificity of greater than 60%. In some embodiments, the accuracy is calculated using a trained algorithm.
Results of the expression analysis of the subject methods may provide a statistical confidence level that a given diagnosis is correct. In some embodiments, such statistical confidence level is above 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 99.5%.
A trained algorithm may produce a unique output each time it is run. For example, using a different sample or plurality of samples with the same classifier can produce a unique output each time the classifier is run. Using the same sample or plurality of samples with the same classifier can produce a unique output each time the classifier is run. Using the same samples to train a classifier more than one time may result in unique outputs each time the classifier is run.
Characteristics of a sample (e.g., sequence information corresponding to mRNA expression, mitochondrial transcripts, genetic variants and/or fusion transcripts) can be analyzed using an algorithm that comprises one or more classifiers and which is trained using one or more annotated reference sets. The identification can be performed by the classifier. More than one characteristic of a sample can be combined to generate a classification of the tissue sample. For example, sequence information corresponding to mRNA expression and mitochondrial transcripts can be combined and a classification can be generated from the combined data. The combining can be performed by the classifier. In another example, sequences obtained from a sample can be compared to a reference set to determine the presence of one or more sequence variants in a sample. In some cases, gene expression levels of one or more genes from a sample can be processed relative to expression levels of a reference set of genes that are used to train one or more classifiers to determine the presence of differential gene expression of one or more genes. A reference set can comprise one or more housekeeping genes. The reference set can comprise known sequence variants or expression levels of genes known to be associated with a particular disease or known to be associated with a non-disease state.
Classifiers of a trained algorithm can perform processing, combining, statistical evaluation, or further analysis of results, or any combination thereof. Separate reference sets may be provided for different features. For example, sequence variant data may be processed relative to a sequence variant data reference set. Gene expression level data may be processed relative to a gene expression level reference set. In some cases, multiple feature spaces may be processed with respect to the same reference set.
In some cases, sequence variants of a particular gene may or may not affect the gene expression level of that same gene. A sequence variant of a particular gene may affect the gene expression level of one or more different genes that may be located adjacent to or distal from the particular gene with the sequence variant. The presence of one or more sequence variants can have downstream effects on one or more genes. A sequence variant of a particular gene may perturb one or more signaling pathways, may cause ribonucleic acid (RNA) transcriptional regulation changes, may cause amplification of deoxyribonucleic acid (DNA), may cause multiple transcript copies to be produced, may cause excessive protein to be produced, or may cause single base pairs, multi-base pairs, partial genes or one or more genes to be removed from the sequence.
Data from the methods described, such as gene expression levels or sequence variant data can be further analyzed using feature selection techniques such as filters which can assess the relevance of specific features by looking at the intrinsic properties of the data, wrappers which embed the model hypothesis within a feature subset search, or embedded protocols in which the search for an optimal set of features is built into a classifier algorithm.
Filters useful in the methods of the present disclosure can include, for example, (1) parametric methods such as the use of two sample t-tests, analysis of variance (ANOVA) analyses, Bayesian frameworks, or Gamma distribution models; (2) model-free methods such as the use of Wilcoxon rank sum tests, between-within class sum of squares tests, rank products methods, random permutation methods, or threshold number of misclassification (TNoM), which involves setting a threshold point for fold-change differences in expression between two datasets and then detecting the threshold point in each gene that minimizes the number of misclassifications; or (3) multivariate methods such as bivariate methods, correlation based feature selection methods (CFS), minimum redundancy maximum relevance methods (MRMR), Markov blanket filter methods, and uncorrelated shrunken centroid methods. Wrappers useful in the methods of the present disclosure can include sequential search methods, genetic algorithms, or estimation of distribution algorithms. Embedded protocols can include random forest algorithms, weight vector of support vector machine algorithms, or weights of logistic regression algorithms.
Statistical evaluation of the results obtained from the methods described herein can provide a quantitative value or values indicative of one or more of the following: the classification of the tissue sample; the likelihood of diagnostic accuracy; the likelihood of disease, such as cancer; the likelihood of a particular disease, such as a tissue-specific cancer, for example, thyroid cancer; and the likelihood of the success of a particular therapeutic intervention. Thus a medical professional, who may not be trained in genetics or molecular biology, need not understand gene expression level or sequence variant data results. Rather, data can be presented directly to the medical professional in its most useful form to guide care or treatment of the subject. Statistical evaluation, combination of separate data results, and reporting useful results can be performed by the trained algorithm. Statistical evaluation of results can be performed using a number of methods including, but not limited to: the Student's t-test, the two-sided t-test, Pearson rank sum analysis, hidden Markov model analysis, analysis of q-q plots, principal component analysis, one-way analysis of variance (ANOVA), two-way ANOVA, and the like. Statistical evaluation can be performed by the trained algorithm.
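As one illustration of a parametric filter, the sketch below ranks genes by the absolute value of a two-sample (Welch) t statistic between malignant and benign samples and keeps the top-ranked genes. It is a minimal example under stated assumptions, not the feature selection used for the classifiers of this disclosure: the function names, the gene identifiers, and the toy expression values are all hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of expression values."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Degenerate case: no within-group variation in either group.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def rank_genes_by_t(expression, labels, top_k):
    """Filter-style feature selection.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of 0 (benign) / 1 (malignant), aligned with the samples
    Returns the top_k gene names ordered by decreasing |t|.
    """
    scores = {}
    for gene, values in expression.items():
        benign = [v for v, y in zip(values, labels) if y == 0]
        malignant = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(malignant, benign))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data: three benign samples followed by three malignant samples.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # strongly separated
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # identical group means
    "GENE_C": [3.0, 3.5, 2.5, 4.0, 4.5, 3.5],  # moderately separated
}
print(rank_genes_by_t(expression, labels, top_k=2))
```

A filter of this kind scores each gene independently of any classifier, which is what distinguishes it from the wrapper and embedded approaches described above.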
A disease, as disclosed herein, can include thyroid cancer. Thyroid cancer can include any subtype of thyroid cancer, including but not limited to, any malignancy of the thyroid gland such as papillary thyroid cancer (PTC), follicular thyroid cancer (FTC), follicular variant of papillary thyroid carcinoma (FVPTC), medullary thyroid carcinoma (MTC), follicular carcinoma (FC), Hürthle cell carcinoma (HC), and/or anaplastic thyroid cancer (ATC). In some cases, the thyroid cancer can be differentiated. In some cases, the thyroid cancer can be undifferentiated.
A thyroid tissue sample can be classified using the methods of the present disclosure as comprising one or more benign or malignant tissue types (e.g. a cancer subtype), including but not limited to follicular adenoma (FA), nodular hyperplasia (NHP), lymphocytic thyroiditis (LCT), and Hürthle cell adenoma (HA), follicular carcinoma (FC), papillary thyroid carcinoma (PTC), follicular variant of papillary carcinoma (FVPTC), medullary thyroid carcinoma (MTC), Hürthle cell carcinoma (HC), and anaplastic thyroid carcinoma (ATC), renal carcinoma (RCC), breast carcinoma (BCA), melanoma (MMN), B cell lymphoma (BCL), or parathyroid (PTA).
In the methods of the present disclosure, a subject may be monitored. For example, a subject may be diagnosed with cancer. This initial diagnosis may or may not involve the use of methods disclosed herein. The subject may be prescribed a therapeutic intervention such as a thyroidectomy for a subject suspected of having thyroid cancer. The results of the therapeutic intervention may be monitored on an ongoing basis by methods disclosed herein to detect the efficacy of the therapeutic intervention. In another example, a subject may be diagnosed with a benign tumor or a precancerous lesion or nodule, and the tumor, nodule, or lesion may be monitored on an ongoing basis by methods disclosed herein to detect any changes in the state of the tumor or lesion.
Methods disclosed herein may also be used to ascertain the potential efficacy of a specific therapeutic intervention prior to administering to a subject. For example, a subject may be diagnosed with cancer. A genomic sequence classifier (GSC) along with Xpression Atlas may indicate a presence of at least one variant associated with highly malignant tumors. In such cases, therapeutic intervention may be customized to the results obtained. A tumor sample may be obtained and cultured in vitro using methods known in the art.
The present disclosure provides computer systems that are programmed to implement methods of the disclosure.
The computer system 1101 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1105, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 1101 also includes memory or memory location 1110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1115 (e.g., hard disk), communication interface 1120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1125, such as cache, other memory, data storage and/or electronic display adapters. The memory 1110, storage unit 1115, interface 1120 and peripheral devices 1125 are in communication with the CPU 1105 through a communication bus (solid lines), such as a motherboard. The storage unit 1115 can be a data storage unit (or data repository) for storing data. The computer system 1101 can be operatively coupled to a computer network (“network”) 1130 with the aid of the communication interface 1120. The network 1130 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 1130 in some cases is a telecommunication and/or data network. The network 1130 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 1130, in some cases with the aid of the computer system 1101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1101 to behave as a client or a server.
The CPU 1105 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1110. The instructions can be directed to the CPU 1105, which can subsequently program or otherwise configure the CPU 1105 to implement methods of the present disclosure. Examples of operations performed by the CPU 1105 can include fetch, decode, execute, and writeback.
The CPU 1105 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1101 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
The storage unit 1115 can store files, such as drivers, libraries and saved programs. The storage unit 1115 can store user data, e.g., user preferences and user programs. The computer system 1101 in some cases can include one or more additional data storage units that are external to the computer system 1101, such as located on a remote server that is in communication with the computer system 1101 through an intranet or the Internet.
The computer system 1101 can communicate with one or more remote computer systems through the network 1130. For instance, the computer system 1101 can communicate with a remote computer system of a user (e.g., medical professional, or subject). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC's (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 1101 via the network 1130.
Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1101, such as, for example, on the memory 1110 or electronic storage unit 1115. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 1105. In some cases, the code can be retrieved from the storage unit 1115 and stored on the memory 1110 for ready access by the processor 1105. In some situations, the electronic storage unit 1115 can be precluded, and machine-executable instructions are stored on memory 1110.
The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
Aspects of the systems and methods provided herein, such as the computer system 1101, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
The computer system 1101 can include or be in communication with an electronic display 1135 that comprises a user interface (UI) 1140 for providing, for example, results of nucleic acid sequencing, analysis of nucleic acid sequencing data, characterization of nucleic acid sequencing samples, tissue characterizations, etc. Examples of UI's include, without limitation, a graphical user interface (GUI) and web-based user interface.
Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1105. The algorithm can, for example, initiate nucleic acid sequencing, process nucleic acid sequencing data, interpret nucleic acid sequencing results, characterize nucleic acid samples, characterize samples, etc.
This study describes the blinded clinical validation of a genomic sequence classifier (GSC), implemented in accordance with the methods described herein, on a prospective multicenter-derived set of patients with FNA samples whose referral to surgery and histopathological diagnosis were determined in the absence of genomic information.
The study was approved by institution-specific institutional review boards as well as by Liberty IRB (DeLand, Fla.; now Chesapeake IRB) and Copernicus Group Independent Review Board (Cary, N.C.). All patients provided written informed consent prior to participating in the study.
The following thyroid nodule FNA samples were included in the training set, with each sample set being independent from one another (Table 1):
ENHANCE Arm 1:
ENHANCE Arm 2:
A dedicated molecular sample was obtained when the cytology specimen was collected from a nodule ≥1 cm during clinical care. Arm 2 samples were all unoperated, were either Bethesda II or Bethesda III/IV and GEC benign, and lacked 2015 American Thyroid Association high suspicion sonographic pattern findings. Additionally, they had clinical follow-up (mean 23 months, range 17-32) and either a repeat FNA that was cytologically benign, or no growth (<50% increase in volume or <20% increase in 2 or more dimensions) and no development of high suspicion ultrasound findings after the initial FNA. Nodules were excluded from Arm 2 if repeat FNA was Bethesda V or VI or GEC suspicious, or if they underwent surgery. Arm 2 nodules served as truly benign samples, recognizing that GEC benign samples were underrepresented among operated Arm 1 samples.
VERA-CVP (non Cyto-I) Samples:
Samples described in the clinical validation of the Afirma GEC with sufficient materials remaining. Only Bethesda II, V, and VI samples with histopathology labels defined by an expert panel of pathologists were allowed in the training set. Sixty percent of these samples were randomly assigned to the training set.
VERA-Train:
Samples used in the training set of the Afirma GEC.
VERA-Extra:
Collected and associated with histopathology labels identically to VERA-CVP, but these samples were not used in the training or validation of the Afirma GEC.
CLIA-GEC B:
Samples from the CLIA stream that are GEC Benign. These samples do not have long term follow-up or a histopathology label. Their benign GEC prediction is used as a surrogate label in algorithm training.
Dedicated thyroid nodule FNA specimens and surgical histopathology from nodules 1 cm or larger were collected using a prospective and blinded protocol at 49 academic and community centers in the United States from patients 21 years or older. These samples, stored at −80° C., were previously used to validate the GEC. The details of their enrollment and prespecified inclusion and exclusion criteria have been reported elsewhere. Histopathology diagnoses were previously established by an expert panel of thyroid surgical histopathologists that were blinded to all clinical and molecular data. BRAF V600E DNA mutational reference status was established by testing DNA from all samples with the competitive allele-specific TaqMan polymerase chain reaction, as described below. This independent validation cohort was prespecified and divided into a primary test set comprised of all patients with Bethesda III and IV samples described in the clinical validation of the Afirma GEC with sufficient RNA remaining and a secondary test set comprised of all patients with Bethesda II, V, or VI samples described in the clinical validation of the Afirma GEC with sufficient RNA remaining and not randomly assigned to the training set, as described in Example 1 above.
Reference Methods:
BRAF V600E status—BRAF V600E status was determined from genomic DNA using Competitive Allele Specific TaqMan PCR (castPCR™, Thermo Fisher, Waltham, Mass.) for the BRAF 1799T>A mutation, as previously described. Briefly, genomic DNA was purified with the AllPrep Micro Kit (Qiagen, Hilden, Germany) and quantified with the Quant-iT PicoGreen dsDNA Assay Kit (Thermo Fisher, Waltham, Mass.). Five ng of DNA was tested with wild-type and mutant assays on an ABI7900HT. Samples were labelled BRAF V600E positive if the variant allele frequency was ≥5% and wild type if the allele frequency was <5%.
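The labelling rule at the end of the preceding paragraph can be written as a one-line threshold check. The sketch below is illustrative only: the function name is hypothetical, and it assumes the variant allele frequency has already been derived from the castPCR measurement and is expressed as a fraction between 0 and 1.

```python
def braf_v600e_reference_label(variant_allele_frequency: float,
                               threshold: float = 0.05) -> str:
    """Apply the >=5% variant allele frequency rule for the reference label.

    variant_allele_frequency is a fraction in [0, 1] (assumed input format).
    """
    if not 0.0 <= variant_allele_frequency <= 1.0:
        raise ValueError("variant allele frequency must be between 0 and 1")
    if variant_allele_frequency >= threshold:
        return "BRAF V600E positive"
    return "wild type"


print(braf_v600e_reference_label(0.12))   # at or above threshold
print(braf_v600e_reference_label(0.01))   # below threshold
```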
Medullary Thyroid Cancer—Histopathology diagnoses, including medullary thyroid cancer, were previously established by an expert panel of thyroid histopathologists while blinded to all clinical and molecular data.
The following steps were implemented to ensure the independent test set was securely blinded throughout algorithm development and validation.
First, each step was documented in a prespecified protocol and time-stamped on execution. Each team member was assigned a single role and allowed access only to information designated for that role. A randomly generated blinded identification number was assigned to each sample in the validation set by information technology engineers who operated independently of all other teams to ensure that all other personnel were unable to link clinical and genomic data. All historic information that may potentially reveal the clinical label on the independent test set was secured in a password-protected folder prior to the start of algorithm development. Information technology engineers conducted performance testing of the validation test set independently of all other teams.
RNA was purified with the AllPrep Micro kit (Qiagen, Hilden, Germany) as previously described. RNA was quantified using the QuantiFluor RNA System (Promega, Madison, Wis.). Fluorescence was read with a Tecan Infinite 200 Pro plate reader (Tecan, Männedorf, Switzerland). RNA Integrity Number was determined with the Bioanalyzer 2100 (Agilent, Santa Clara, Calif.).
Samples were randomized and plated into 96 well plates according to their random order. Each plate contained Universal Human Reference RNA (Agilent, Santa Clara, Calif.), a benign thyroid tissue control sample, a malignant thyroid tissue control sample, a medullary thyroid carcinoma tissue control sample and 6 FNAs that were run on every plate in the study. Additionally, 3 samples from each plate were randomly selected to be included as technical replicates.
15 ng of total RNA was transferred to a 96 well plate. The TruSeq RNA Access Library Preparation Kit (Illumina, San Diego, Calif.) was adapted for use on the Microlab STAR robotics platform (Hamilton, Reno, Nev.). During library preparation, total RNA is fragmented, reverse transcribed, end-repaired, and A-tailed, and Illumina adapters with individual indexes are ligated. Following PCR and AMPure XP (Beckman Coulter, Indianapolis, Ind.) cleanup, library size and quantity were determined with the Fragment Analyzer (Advanced Analytical, Ankeny, Iowa). 250 ng of 4 libraries were combined and sequentially captured with the human exome to remove ribosomal RNA, intronic, and intergenic sequences. Following PCR and AMPure XP (Beckman Coulter, Indianapolis, Ind.) cleanup, library size and quantity were determined with the Bioanalyzer 2100 (Agilent, Santa Clara, Calif.).
Libraries were normalized to 2 nM, pooled to 16 samples per sequencing run, and denatured according to the manufacturer's instructions. 1% phiX library (Illumina, San Diego, Calif.) was spiked into each sequencing run. Denatured and diluted libraries were loaded onto NextSeq 500 machines (Illumina, San Diego, Calif.) and sequenced with a NextSeq v2 High Output 150 cycle kit (Illumina, San Diego, Calif.) for paired end 2×76 cycle sequencing. Sequencing runs were required to have >75% of bases ≥Q30 and <1% phiX error rate.
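The run-level acceptance criteria stated above (>75% of bases ≥Q30 and <1% phiX error rate) amount to a simple gate. The following sketch assumes both values are available as percentages from the sequencer's run summary; the function and parameter names are hypothetical.

```python
def sequencing_run_passes(pct_bases_q30: float, phix_error_rate_pct: float) -> bool:
    """Return True when a run meets both stated acceptance criteria.

    pct_bases_q30:       percentage of bases with quality score >= Q30
    phix_error_rate_pct: phiX control error rate, in percent
    """
    return pct_bases_q30 > 75.0 and phix_error_rate_pct < 1.0


print(sequencing_run_passes(pct_bases_q30=88.4, phix_error_rate_pct=0.4))
print(sequencing_run_passes(pct_bases_q30=72.0, phix_error_rate_pct=0.4))
```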
RNA-seq data were used to generate gene expression counts, identify variants, detect fusion-pairs, and calculate loss of heterozygosity (LOH) statistics. Raw sequencing data (FASTQ files) were aligned to human reference genome assembly 37 (Genome Reference Consortium) using the STAR RNA-seq aligner. Expression counts were obtained with HTSeq and normalized using DESeq2, accounting for sequencing depth and gene-wise variability. Variants were identified using the GATK variant calling pipeline, and fusion-pairs were detected using STAR-Fusion. A loss of heterozygosity (LOH) statistic at the chromosome and genome level was developed using variants identified genome-wide. The statistic quantifies the magnitude of LOH by calculating the proportion of variants that have a variant allele frequency (VAF; fraction of reads carrying the alternative allele) away from 0.5 (<0.2 or >0.8), after pre-filtering of variants that have a VAF exactly at zero or one or that are located in cytoband regions exhibiting an abnormal excess of LOH signatures across all training samples.
To exclude low quality samples from downstream analysis, quality metrics were evaluated against pre-specified acceptance metrics for total numbers of sequenced and uniquely mapped reads, the overall proportion of exonic reads among mapped reads, the mean per-base coverage, the uniformity of base coverage, and base duplication and mismatch rates. All these QC metrics were generated using RNA-SeQC. Any sample that failed a QC metric was reprocessed from total RNA through library preparation and sequencing if sufficient RNA was available. Only samples passing the quality criteria were used for downstream analysis.
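The LOH statistic described above can be sketched as follows. This is an illustrative reading of the description, not the disclosed implementation: the `Variant` record, the function name, and the example cytoband labels are hypothetical, and the list of excluded cytobands is assumed to have been derived beforehand from the training samples.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Variant:
    chrom: str      # chromosome name, e.g. "chr1"
    cytoband: str   # cytoband label, e.g. "1p36.33"
    vaf: float      # fraction of reads carrying the alternative allele


def loh_statistic(variants: Iterable[Variant],
                  excluded_cytobands: FrozenSet[str] = frozenset(),
                  chrom: Optional[str] = None,
                  low: float = 0.2,
                  high: float = 0.8) -> Optional[float]:
    """Proportion of retained variants whose VAF is away from 0.5.

    Pre-filtering drops variants with VAF exactly 0 or 1 and variants in
    excluded cytobands. Pass chrom to get a chromosome-level statistic;
    leave it as None for the genome-level statistic.
    Returns None when no variant survives filtering.
    """
    kept = [
        v for v in variants
        if 0.0 < v.vaf < 1.0
        and v.cytoband not in excluded_cytobands
        and (chrom is None or v.chrom == chrom)
    ]
    if not kept:
        return None
    extreme = sum(1 for v in kept if v.vaf < low or v.vaf > high)
    return extreme / len(kept)


# Toy example with made-up variants.
toy = [
    Variant("chr1", "1p36.33", 0.50),
    Variant("chr1", "1p36.33", 0.10),
    Variant("chr1", "1q21.1", 0.90),
    Variant("chr2", "2p25.3", 0.45),
    Variant("chr2", "2p25.3", 1.00),   # dropped: VAF exactly one
    Variant("chr2", "2q37.3", 0.00),   # dropped: VAF exactly zero
    Variant("chr3", "3p26.3", 0.05),   # dropped: excluded cytoband
]
print(loh_statistic(toy, excluded_cytobands=frozenset({"3p26.3"})))
print(loh_statistic(toy, chrom="chr1"))
```

In the toy data, four variants survive filtering at the genome level and two of them have a VAF outside the 0.2 to 0.8 window, so the genome-level statistic is 0.5.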
Fine-needle aspiration samples (n=634) were used to build the GSC core ensemble model, as described in Example 1. The ensemble model consists of 12 independent classifiers: 6 are elastic net logistic regression models and 6 are support vector machines. The 6 models within each category differ from each other according to the gene sets used (Table 2).
To minimize overfitting and to accurately reflect classifier performance incorporating random noise, hyperparameter tuning and model selections were performed using repeated nested cross-validation. Hyperparameter tuning was performed within the inner layer of the cross-validation, and the classifier performance was summarized using the outer layer of the 5-fold cross-validation repeated 40 times. For each classifier, the decision boundary was chosen to optimize specificity, with a minimum requirement of 90% sensitivity to detect malignancy.
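The decision-boundary rule in the last sentence above (maximize specificity subject to at least 90% sensitivity) can be illustrated with a small search over candidate thresholds. This sketch covers only the boundary selection on a set of cross-validated scores, not the nested cross-validation itself. It assumes, as a convention not stated in the source, that higher scores indicate malignancy and that a sample is called suspicious when its score is at or above the threshold; all names and toy values are hypothetical.

```python
def choose_decision_boundary(scores, labels, min_sensitivity=0.90):
    """Pick the threshold with the highest specificity among those that
    keep sensitivity at or above min_sensitivity.

    scores: classifier scores, higher meaning more suspicious (assumed)
    labels: 1 for malignant, 0 for benign, aligned with scores
    Returns (threshold, sensitivity, specificity).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one malignant and one benign sample")

    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (t, sens, spec)
    # The lowest observed score always yields 100% sensitivity,
    # so a qualifying threshold always exists.
    return best


# Toy cross-validated scores: ten malignant samples, then ten benign samples.
malignant_scores = [0.90, 0.80, 0.85, 0.70, 0.75, 0.95, 0.65, 0.60, 0.88, 0.30]
benign_scores = [0.10, 0.20, 0.40, 0.50, 0.62, 0.35, 0.15, 0.25, 0.45, 0.55]
scores = malignant_scores + benign_scores
labels = [1] * 10 + [0] * 10
print(choose_decision_boundary(scores, labels))
```

Raising the threshold past the chosen value would gain specificity only by dropping sensitivity below the 90% floor, which is the trade-off the rule is designed to control.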
The locked ensemble model uses a total of 10,196 genes, among which are 1115 core genes (Table 3). These core genes drive the prediction behavior of the model, and the remaining genes improve classifier stability against assay variability.
In addition to the ensemble model described above, the Afirma GSC system includes 7 other components: a parathyroid cassette, a medullary thyroid cancer (MTC) cassette, a BRAFV600E cassette, RET/PTC1 and RET/PTC3 fusion detection modules, follicular content index, Hürthle cell index, and Hürthle neoplasm index. The first 4 are upstream of the ensemble classifier, targeting specific and rare patient subgroups (
Statistical analyses were performed using R statistical software version 3.2.3. Continuous variables were compared using the t test, and categorical variables were compared using the Fisher exact test. Test performance was evaluated using sensitivity, specificity, and NPV and PPV based on established methods. All confidence intervals are 2-sided 95% CIs and were computed using the exact binomial test. Test performance comparison between the GSC and GEC was done using the McNemar χ² test on the matched data set. Significance level in differential gene expression analysis is reported using a false discovery rate-adjusted P value. Two-sided P values less than 0.05 were used to declare significance.
FNA samples that previously validated the GEC were used to independently validate the GSC. The earlier GEC validation samples were derived from 4812 nodule aspirations prospectively collected from 3789 patients at 49 clinical sites in the United States over a 2-year period. Of the 210 validation samples with corresponding Bethesda III or IV cytology and blinded postoperative consensus histopathology diagnoses, 191 (91.0%) had sufficient residual RNA for GSC testing. These samples from cytologically indeterminate nodules constituted the blinded primary test set.
The previously established thyroid nodule cytological diagnosis was used again. Patient demographic characteristics and baseline data are shown in Table 4. Age, sex, clinical risk factors, nodule size, histology subtype (Table 5), number of FNA passes, prevalence of malignancy (Table 6), and proportion of samples collected at community centers did not differ significantly between the primary study population (n=191) and the GEC clinical validation cohort of samples (n=210), consistent with unbiased drop out.
a Statistical tests compared the samples included in the GSC validation with the 19 nodules from the GEC validation that were excluded because of insufficient RNA quantity. The 2 groups differ only in the number of fine-needle aspiration passes, which is not unexpected, as only samples with sufficient remaining RNA were included in the GSC evaluation.
The Standards for Reporting of Diagnostic Accuracy Studies was developed to improve the quality of reporting diagnostic accuracy studies.
a One sample has no result because of low follicular content and is not summarized in the table.
The GSC correctly identified 41 of 45 malignant samples as suspicious, yielding a sensitivity of 91.1% (95% CI, 79-98), and 99 of 145 nonmalignant samples were correctly identified as benign by the GSC, yielding a specificity of 68.3% (95% CI, 60-76). Among Bethesda III and IV samples, the NPV was 96.1% (95% CI, 90-99) and the PPV was 47.1% (95% CI, 36-58). Performance of the GSC was similar between Bethesda III and IV categories (Table 7).
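The four performance measures follow directly from the counts stated above: 41 true positives and 4 false negatives among 45 malignant samples, and 99 true negatives and 46 false positives among 145 nonmalignant samples. A minimal Python sketch of that arithmetic is given below; the class name `ConfusionCounts` and its attributes are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a 2x2 table of test result versus histopathology."""
    tp: int  # malignant, called suspicious
    fn: int  # malignant, called benign
    tn: int  # nonmalignant, called benign
    fp: int  # nonmalignant, called suspicious

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)


# Counts derived from the primary test set described in the text.
gsc = ConfusionCounts(tp=41, fn=45 - 41, tn=99, fp=145 - 99)

if __name__ == "__main__":
    print(f"sensitivity {gsc.sensitivity:.1%}")  # 91.1%
    print(f"specificity {gsc.specificity:.1%}")  # 68.3%
    print(f"NPV         {gsc.npv:.1%}")          # 96.1%
    print(f"PPV         {gsc.ppv:.1%}")          # 47.1%
```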
Among the 190 Bethesda III and IV samples, 17 (8.9%) were histologically Hürthle cell adenomas and 9 (4.7%) were Hürthle cell carcinomas, while 164 samples (86.3%) were histologically non-Hürthle. For samples with Hürthle histology, the sensitivity was 88.9% (95% CI, 52-100) and the specificity was 58.8% (95% CI, 33-82). For samples with non-Hürthle histology, the sensitivity was 91.7% (95% CI, 78-98) and the specificity was 69.5% (95% CI, 61-77).
A wide variety of malignant subtypes were correctly classified as suspicious (Table 8). Four false-negative cases occurred (Table 9). Patient age, sex, malignancy subtype, and nodule size by ultrasonography or on histopathology were each assessed for association with false-negative cases, and none was associated. The performance of the GSC in secondary analyses of nodules with Bethesda II, V, or VI cytopathology is reported in Table 7. Among the entire secondary analysis group, the GSC sensitivity was 100% (95% CI, 90-100) and the specificity was 73.1% (95% CI, 52-88).
a Among the Hürthle cell carcinomas, 7 showed capsular invasion and 2 showed vascular invasion. The false-negative case was previously false-negative on the gene expression classifier.20
b Among the follicular carcinomas, 3 showed capsular invasion and 4 were well-differentiated carcinomas not otherwise specified.
Genomic sequence classifier to gene expression classifier comparison on a per-samples basis: 190 Bethesda III/IV primary validation samples yielded both GSC and GEC results (
A 2016 meta-analysis reported the risks of malignancy among Bethesda III and IV thyroid nodules to be 17% (95% CI, 11-23) and 25% (95% CI, 20-29), respectively. To safely avoid unnecessary diagnostic surgery among these cytologically indeterminate nodules, a test with a high sensitivity and NPV for malignancy is required. This blinded clinical validation of the GSC in a prospectively collected, representative, universally operated, and histopathologically diagnosed cohort demonstrates the required high NPV across these ranges of cancer prevalence encountered in Bethesda III and IV nodules in clinical practice (
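The dependence of NPV on cancer prevalence follows from Bayes' rule. The Python sketch below assumes that the point estimates of sensitivity (41 of 45) and specificity (99 of 145) reported above hold unchanged at every prevalence, which is a simplification. The function name `predictive_values` is an illustrative assumption.

```python
def predictive_values(sens: float, spec: float, prev: float) -> tuple[float, float]:
    """Return (NPV, PPV) for a test applied at a given disease prevalence."""
    tp = sens * prev                  # true-positive fraction of all nodules
    fn = (1.0 - sens) * prev          # false-negative fraction
    tn = spec * (1.0 - prev)          # true-negative fraction
    fp = (1.0 - spec) * (1.0 - prev)  # false-positive fraction
    return tn / (tn + fn), tp / (tp + fp)


SENS = 41 / 45   # GSC sensitivity point estimate from the primary test set
SPEC = 99 / 145  # GSC specificity point estimate from the primary test set

if __name__ == "__main__":
    # Meta-analysis point estimates for Bethesda III (17%) and IV (25%).
    for prevalence in (0.17, 0.25):
        npv, ppv = predictive_values(SENS, SPEC, prevalence)
        print(f"prevalence {prevalence:.0%}: NPV {npv:.1%}, PPV {ppv:.1%}")
```

Under these assumptions the NPV is approximately 97% at 17% prevalence and approximately 96% at 25% prevalence.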
Test sensitivity of the GSC (91%; 95% CI, 79-98) compared with the GEC (89%; 95% CI, 76-96) was maintained, with the point estimate within the counterpart's 95% CI, and the McNemar χ2 test (df=1) on the matched sample set renders a test statistic of 0 (P>0.99). On the other hand, test specificity of the GSC (68%; 95% CI, 60-76) was significantly improved from the GEC (50%; 95% CI, 42-59), with the point estimate outside the counterpart's 95% CI, and the McNemar χ2 test (df=1) on the matched sample set renders a test statistic of 16.447 (P<0.001) (Table 10). In practice, this enhanced performance indicates that among Bethesda III and IV nodules that are histopathologically benign, at least one-third more will receive a benign result using the GSC compared with the GEC (
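The matched comparison can be illustrated with a sketch of the McNemar χ2 test, which depends only on the discordant pairs: samples that one test classified correctly and the other did not. The text does not report the discordant counts or state whether a continuity correction was applied. The sketch below assumes a continuity correction (the default of R's `mcnemar.test`), and the counts 32 and 6 are hypothetical values chosen only because they give a statistic of similar size to the one reported. The function name `mcnemar_test` is likewise an illustrative assumption.

```python
from math import erfc, sqrt


def mcnemar_test(b: int, c: int, correct: bool = True) -> tuple[float, float]:
    """McNemar chi-square test (df=1) on discordant pair counts b and c.

    b: pairs classified correctly by the first test only.
    c: pairs classified correctly by the second test only.
    Returns (statistic, two-sided P value).
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c)
    if correct:
        # Continuity correction, floored at zero so the statistic is 0
        # when the discordant counts differ by at most one.
        diff = max(diff - 1, 0)
    statistic = diff**2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical discordant counts, not taken from the text.
    stat, p = mcnemar_test(32, 6)
    print(f"statistic {stat:.3f}, P {p:.2g}")
```

Concordant pairs do not enter the statistic, which is why the test suits a comparison of two classifiers run on the same samples.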
While genomic data has been incorporated in clinical management decisions of multiple medical conditions for more than a decade, progress continues toward understanding the complexities of genomic and non-genomic pathways in the development and behavior of disease. Current evidence suggests that most common diseases are associated with small effects from a large number of genes and that most of these contributions are derived from transcriptionally active portions of the genome. This implies that diseases such as thyroid cancer are unlikely to be accounted for by the effects of a small number of genes. The fact that few genomic variants are associated with 100% penetrance toward malignant histology suggests that a complex interaction of multiple factors ultimately determines the benign or malignant nature of thyroid nodules. As the number of these factors expands, it becomes critical to use machine learning and statistical models to interpret their signals in a trained model to derive an accurate diagnosis.
Hürthle lesions exemplify the challenges inherent in complex biology and the opportunity to harness high dimensional genomic data for predictive model training and subsequent validation. Most Hürthle cell-dominant Bethesda III and IV thyroid nodules have historically undergone surgery given the potential for Hürthle cell carcinoma, yet most have proven to be histologically benign. The GEC identified these samples at a high NPV, but most were categorized as GEC suspicious. Current methods sought to maintain a high NPV while providing more benign results by including 2 dedicated classifiers to work with the core GSC classifier. Among the 26 Hürthle cell adenomas or Hürthle cell carcinomas reported here, the final GSC sensitivity was 88.9% and the specificity was 58.8%; the GEC sensitivity was 88.9% and the specificity was 11.8% among these same neoplasms. Thus, while the overall GSC sensitivity of 91.1% reported here is comparable with that of the GEC (by design), the improved overall GSC specificity of 68.3% results from significantly improved performances among both Hürthle and non-Hürthle specimen types. Given that most histologically benign Hürthle and non-Hürthle specimens are now both identified as GSC benign, GSC testing may further safely reduce unnecessary surgery among both specimen types.
A secondary analysis of 61 Bethesda II, V, or VI samples that also were included in the GEC validation study is included in Table 7. The consistency of these performance metrics with those in the Bethesda III and IV categories is reassuring and supportive of the findings in the primary analysis.
Methods and systems of the present disclosure may be combined with or modified by other methods or systems, such as, for example, those described in U.S. Pat. No. 8,541,170, U.S. Patent Publication No. 2018/0157789, and U.S. Patent Publication No. 2018/0016642, each of which is entirely incorporated herein by reference.
While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
This application is a continuation of International Patent Application No. PCT/US2018/043984, filed Jul. 26, 2018, which claims the benefit of U.S. Provisional Application No. 62/537,646, filed Jul. 27, 2017, and U.S. Provisional Application No. 62/664,820, filed Apr. 30, 2018, each of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62537646 | Jul 2017 | US
62664820 | Apr 2018 | US

Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US2018/043984 | Jul 2018 | US
Child | 16751606 | | US