The instant application contains a Sequence Listing which has been submitted electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created Jun. 11, 2021, is named “ACTG-7PCT_ST25.txt” and is 6,293 bytes in size.
This disclosure is related to the fields of molecular diagnostics, cancer genomics, and molecular biology.
Microsatellite instability (MSI) is a molecular phenotype indicative of underlying genomic hypermutability. The gain or loss of nucleotides from microsatellite tracts can arise from impairments in the mismatch repair (MMR) system, which limit the correction of spontaneous mutations in repetitive DNA sequences. MSI-affected tumors may, accordingly, be caused by mutational inactivation or epigenetic silencing of genes in the MMR pathway. MSI has been associated with improved prognosis. The ability of MSI to predict pembrolizumab response led to the first tumor-agnostic drug approval by the FDA in May 2017. Additional evidence showed an improved response for microsatellite instability-high (MSI-H) patients to the anti-PD-1 agents nivolumab and MEDI0680, the anti-PD-L1 agent durvalumab, and the anti-CTLA-4 agent ipilimumab. With these results, MSI-H has been approved as the molecular marker for immune checkpoint inhibitors.
MSI is typically detected through a PCR assay (MSI-PCR) with fragment analysis (FA), using the peak pattern of five microsatellite loci to determine the MSI status of individual samples. Samples with two or more unstable microsatellites are referred to as MSI-High, whereas samples with one or no unstable microsatellite detected are referred to as MSS. However, since each microsatellite locus should be evaluated by comparing paired tumor and normal tissue, the MSI-PCR assay is not always feasible for cases with limited tissue, especially samples containing few normal cells. Immunohistochemistry (IHC) is another typical assay that may be used for MSI status detection. It detects samples with MSI by testing MMR protein expression. However, MMR-IHC cannot always detect the loss of mutated proteins resulting from missense mutations and may show normal staining even for some protein-truncating mutations. Further, interpretation of both MSI-PCR and IHC data is manual and qualitative. There is a need in the art for a quantitative assay that determines MSI status efficiently and accurately for patients. Several next-generation sequencing (NGS) assays have been found feasible for determining MSI status. In general, NGS-based MSI testing offers automated analysis based on quantitative statistics, which reduces analysis time and inter-observer and inter-laboratory variation compared to the MSI-PCR assay. However, some NGS-based MSI-detection methods, such as MANTIS and MSIsensor, require a matched-normal sample for the evaluation. Other methods, e.g., MSIplus, do not require a matched-normal sample but may need further improvement, such as the addition of more microsatellite loci. There is still room for improving NGS-based MSI testing.
The present disclosure provides improved techniques for determining MSI status. The present disclosure uses a trained machine learning model to determine MSI status from large-panel clinical targeted NGS data accounting for at least six microsatellite loci, and preferably at least one hundred microsatellite loci. The trained machine learning model applies different weights to different features, e.g., peak width, peak height, peak location, and simple sequence repeat (SSR) type, to achieve high robustness and efficiency for MSI status detection from NGS data without a matched normal sample. Furthermore, by validating the trained machine learning model on an independent dataset of clinical samples across various cancer types, the trained machine learning model is shown to have high sensitivity and specificity for MSI status detection.
In one general aspect, the disclosure relates to a method of generating a model for predicting an MSI status, including:
In some embodiments, the MSI feature data is calculated relative to a baseline. In some embodiments, the baseline for calculating the MSI feature data is established from normal samples or samples with MSS status. In some embodiments, the baseline is established from the mean of each MSI feature of each SSR region across the normal samples. Preferably, the baseline is established from the mean peak width of each SSR region.
In some embodiments, the estimated MSI status data is retrieved from a cancer patient through a known assay method, including but not limited to an MSI-PCR assay, IHC, or NGS-based MSI testing including MANTIS, MSIsensor, MSIplus, or Large Panel NGS. In some embodiments, the MSI status is microsatellite stability (MSS) or MSI-H. In some embodiments, the MSI features include peak width, peak height, peak location, SSR type, or any combination thereof.
In some embodiments, the machine learning model includes but is not limited to regression-based models, tree-based models, Bayesian models, support vector machines, boosting models, or neural network-based models. In some embodiments, the machine learning model includes but is not limited to a logistic regression model, a random forest model, an extremely randomized trees model, a polynomial regression model, a linear regression model, a gradient descent model, and an extreme gradient boost model.
In some embodiments, the trained machine learning model includes a defined weight of each microsatellite locus. In some embodiments, the trained machine learning model includes a defined weight of the MSI feature in each microsatellite locus. The trained machine learning model is predictive of MSI status.
In some embodiments, the machine learning model has a cutoff value of 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, or 0.5.
In some embodiments, the estimated MSI status data or the computed MSI status data indicates microsatellite stability (MSS) or microsatellite instability-high (MSI-H).
In another general aspect, the disclosure relates to a computer-implemented method for determining MSI status, including:
In some embodiments, the computer-implemented method further includes step (f): outputting the computed MSI status data to an electronic storage medium or a display.
In some embodiments, the method further includes a step of identifying a treatment for a subject based on the computed MSI status data and/or administering a therapeutically effective amount of treatment to the subject.
In some embodiments, the treatment includes but is not limited to surgery, individual therapy, chemotherapy, radiation therapy, immunotherapy, or any combination thereof. In some embodiments, the immunotherapy includes administering a drug including but not limited to the anti-PD-1 agents pembrolizumab, nivolumab, and MEDI0680, the anti-PD-L1 agent durvalumab, and the anti-CTLA-4 agent ipilimumab.
In some embodiments, the number of microsatellite loci is at least 7, 10, 15, 20, 30, 40, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, or 600. In some embodiments, the microsatellite loci are identified by sequencing SSR regions in the chromosomal regions. In some embodiments, microsatellite loci are excluded due to low coverage, unstable peak call, high variability in peak width, or low weight. In some embodiments, a microsatellite locus with high variability in peak width has a variation in peak width greater than 2 in 5 replicate runs, 3 in 6 replicate runs, 3 in 7 replicate runs, 3 in 8 replicate runs, 3 in 9 replicate runs, or 4 in 10 replicate runs.
In some embodiments, the sample originates from a cell line, biopsy, primary tissue, frozen tissue, formalin-fixed paraffin-embedded (FFPE), liquid biopsy, blood, serum, plasma, buffy coat, body fluid, visceral fluid, ascites, paracentesis, cerebrospinal fluid, saliva, urine, tears, seminal fluid, vaginal fluid, aspirate, lavage, buccal swab, circulating tumor cell (CTC), cell-free DNA (cfDNA), circulating tumor DNA (ctDNA), DNA, RNA, nucleic acid, purified nucleic acid, purified DNA, or purified RNA.
In some embodiments, the sample is a clinical sample. In some embodiments, the sample originates from a diseased patient. In some embodiments, the sample originates from a patient having cancer, solid tumor, hematologic malignancy, rare genetic disease, complex disease, diabetes, cardiovascular disease, liver disease, or neurological disease. In some embodiments, the sample originates from a patient having Adenocarcinoma, Adenoid cystic carcinoma, Adrenal cortical carcinoma, Ampulla Vater cancer, Anal cancer, Appendix cancer, Basal ganglia glioma, Bladder cancer, Brain cancer, Brain tumor glioma, Breast cancer, Buccal cancer, Cervical cancer, Cholangiocarcinoma, Chondrosarcoma, Clear cell carcinoma, Colon cancer, Colorectal cancer, Cystic duct carcinoma, Dedifferentiated liposarcoma, Desmoid, Diffuse midline glioma, Endometrial cancer, Endometrioid adenocarcinoma, Epithelioid rhabdomyosarcoma, Esophageal cancer, Extraskeletal chondroblastic osteosarcoma, Eyelid sebaceous carcinoma, Fallopian tube cancer, Gallbladder cancer, Gastric Cancer, Gastrointestinal stromal tumor, Glioblastoma multiforme, Head and Neck Cancers, Hepatocellular carcinoma, High grade glioma, Hypopharyngeal Cancer, Intima sarcoma, Infantile fibrosarcoma, Invasive ductal carcinoma, Kidney cancer, Leiomyosarcoma, Liposarcoma, Liver angiosarcoma, Liver cancer, Lung cancer, Melanoma, Metastasis of unknown origin, Nasopharyngeal cancer, NSCLC adenocarcinoma, Oesophageal cancer, Oral Cancer, Oropharyngeal cancer, Osteosarcoma, Ovarian cancer, Pancreatic cancer, Papillary Thyroid Carcinoma, Peritoneal cancer, Primary peritoneal serous carcinoma, Prostate cancer, Rectal cancer, Renal cancer, Salivary gland cancer, Sarcomatoid Carcinoma, Sigmoid cancer, Sinus cancer, Skin cancer, Soft tissue sarcoma, Squamous cell carcinoma, Stomach adenocarcinoma, Submandibular gland cancer, Thymic cancer, Thymoma involvement, Thyroid cancer, Tongue cancer, Tonsillar cancer, Transitional cell carcinoma, Uterine cancer, Uterine sarcoma, or Uterus leiomyosarcoma. In some embodiments, the sample originates from a pregnant woman, a child, an adolescent, an elder, or an adult. In some embodiments, the sample is a research sample. In some embodiments, the sample originates from a group of samples. In some embodiments, the group of samples is from related species. In some embodiments, the group of samples is from different species.
In some embodiments, the machine learning model is trained by using a training set having MSI status data and MSI feature data.
In some embodiments, the NGS system includes but is not limited to the MiSeq, HiSeq, MiniSeq, iSeq, NextSeq, and NovaSeq sequencers manufactured by Illumina, Inc.; the Ion Personal Genome Machine (PGM), Ion Proton, Ion S5 series, and Ion GeneStudio S5 series manufactured by Life Technologies, Inc.; the BGISEQ series, DNBSEQ series, and MGISEQ series manufactured by BGI; and the MinION/PromethION sequencers manufactured by Oxford Nanopore Technologies.
In some embodiments, the sequencing reads are generated from nucleic acids that are amplified from the original sample or from nucleic acids captured by a bait. In some embodiments, the sequencing reads are generated from a sequencer that requires the addition of an adapter sequence. In some embodiments, the sequencing reads are generated from a method that includes but is not limited to hybrid capture, primer extension target enrichment, a molecular inversion probe-based method, or multiplex target-specific PCR.
In another general aspect, the disclosure relates to a system for determining MSI status. The system includes a data storage device storing instructions for determining characteristics of MSI status and a processor configured to execute the instructions to perform a method. Further, the method includes the following steps:
One or more embodiments are illustrated by way of example, and not by limitation, in the figures of the accompanying drawings, wherein elements having the same reference numeral designations represent like elements throughout. The drawings are not to scale unless otherwise disclosed.
The drawings are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn to scale for illustrative purposes. The dimensions and the relative dimensions do not necessarily correspond to actual reductions to the practice of the disclosure.
The making and using of the embodiments of the disclosure are discussed in detail below. It should be appreciated, however, that the embodiments provide many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the embodiments and do not limit the scope of the disclosure.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the art to which this disclosure belongs. As used herein, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.
As used herein, “microsatellite” means a tract of repetitive DNA in which certain DNA motifs are repeated. “Microsatellite loci” refers to the regions of the microsatellite. The terms “microsatellite” and “SSR,” as well as “microsatellite loci” and “SSR region,” are used interchangeably, respectively, where the context allows. In some embodiments of the disclosure, the type of microsatellite locus or SSR region refers to mono-, di-, tri-, tetra-, or pentanucleotide repeats or a certain complex nucleotide type in a nucleotide sequence. Preferably, the type of microsatellite locus or SSR region refers to mononucleotide with at least ten repeats, dinucleotide with at least six repeats, trinucleotide with at least five repeats, tetranucleotide with at least five repeats, pentanucleotide with at least five repeats, and the complex nucleotide type including but not limited to SEQ ID NOs: 1-37.
As used herein, “MSI status” or “MMR status” refers to the presence of “MSI” or “unstable microsatellite (loci),” a clonal or somatic change in the number of repeated DNA nucleotide units in microsatellites. The present disclosure estimates the MSI status as MSS or MSI-H. “MSI-H” refers to those in which the number of repeats present in microsatellite loci differs significantly from the number of repeats that are in the DNA of a normal cell. “MSS” refers to those who have no functional defects in DNA MMR and have no significant differences between tumor and normal cell in microsatellite loci.
As used herein, “cutoff value” or “threshold” refers to a numerical value or other representation whose value is used to arbitrate between two or more states of classification for a biological sample. In some embodiments of the disclosure, the cutoff value is set according to the training result of the machine learning model and is used to distinguish between MSI-H and MSS. If the MSI score is greater than the cutoff value, the MSI status is determined as MSI-H; or if the MSI score is less than the cutoff value, the MSI status is determined as MSS.
As used herein, “peak” refers to a microsatellite distribution pattern in the microsatellite loci. The peak may be analyzed using data generated by next-generation sequencing, where the number of distinct allele repeat lengths within each microsatellite locus is referred to as peak width, the read count of the most frequently observed allele is referred to as peak height, and the location difference between the peak height in each microsatellite locus of the tumor tissue and that of the reference genome is referred to as peak location. In some embodiments of the disclosure, peak width, peak height, or peak location is used as an MSI feature to estimate the MSI status.
As shown in
Generally, to understand the MSI status, a matched paired analysis would be performed to identify microsatellite loci in the tumor that differ from those in matched normal tissue. “Matched normal tissue” or “normal pair tissue” as used herein refers to normal tissue from the same patient. However, in some embodiments of the disclosure, the machine learning model detects MSI status from NGS data without matched normal tissue. A pooled normal sample is used to establish the mean of each MSI feature of each SSR region across the normal population as a baseline for MSI detection. Data from an individual clinical tumor tissue is compared to the peak pattern of the baseline data to determine the microsatellite status of each SSR region in that sample.
As used herein, “tumor purity” is the proportion of cancer cells in a tumor sample. Tumor purity impacts the accurate assessment of molecular and genomic features as assayed with NGS approaches. In some embodiments of the disclosure, the clinical sample has a tumor purity of at least 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100%. Preferably, the present disclosure is applied to samples with a tumor purity of at least 20%.
As used herein, “depth” or “total depth” refers to the number of sequencing reads per location. “Mean depth,” “mean total depth,” or “total mean depth” refers to the average number of reads across the entire sequencing region. Generally, the total mean depth has an impact on the performance of the NGS assay. The higher the mean total depth, the lower the variability in the measured variant frequency. In some embodiments of the disclosure, the mean depth of the sample across the entire sequencing region is at least 200x, 300x, 400x, 500x, 600x, 700x, 800x, 900x, 1000x, 2000x, 3000x, 4000x, 5000x, 6000x, 8000x, 10000x, or 20000x. Preferably, the mean depth of the sample across the entire sequencing region is at least 500x.
As used herein, “coverage” refers to the total depth at a given locus and can be used interchangeably with “depth.” In some embodiments of the disclosure, “low coverage” means a read depth lower than 5x, 10x, 15x, 20x, 25x, 30x, 35x, 40x, 45x, or 50x from a sample on a locus.
As used herein, “target base coverage” refers to the percentage of the sequenced region that is sequenced at a depth above a predefined value. The depth at which target base coverage is evaluated must be specified. In some embodiments, the target base coverage at 100x is 85%. That means 85% of the target sequenced bases are covered by at least 100x depth of sequencing reads. In some embodiments, the target base coverage at 30x, 40x, 50x, 60x, 70x, 80x, 90x, 100x, 125x, 150x, 175x, 200x, 300x, 400x, 500x, 750x, or 1000x is above 70%, 75%, 80%, 85%, 90%, or 95%.
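The definition of target base coverage can be illustrated with a minimal sketch. The function name and the per-base depth values below are hypothetical and chosen only for illustration; the comparison uses "at least" the stated depth, following the 100x example above.

```python
def target_base_coverage(depths, min_depth):
    """Percentage of target bases sequenced at a depth of at least min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


# Hypothetical per-base read depths for a 20-base target region.
example_depths = [520, 480, 310, 150, 95, 600, 700, 105, 100, 99,
                  230, 410, 385, 120, 80, 101, 250, 330, 440, 60]

# 16 of the 20 bases reach 100x, so the target base coverage at 100x is 80%.
coverage_at_100x = target_base_coverage(example_depths, 100)
```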
As used herein, “human subject” refers to those with formally diagnosed disorders, those without formally recognized disorders, those receiving medical attention, those at risk of developing the disorders, etc.
As used herein, “treat,” “treatment,” and “treating” includes therapeutic treatments, prophylactic treatments, and applications in which one reduces the risk that a subject will develop a disorder or other risk factor. Treatment does not require the complete curing of a disorder and encompasses embodiments in which one reduces symptoms or underlying risk factors.
As used herein, “therapeutically effective amount” means an amount of a therapeutically active molecule needed to elicit the desired biological or clinical effect. In preferred embodiments of the disclosure, “a therapeutically effective amount” is the amount of drug needed to treat cancer patients with MSI-H.
The present disclosure is further illustrated by the following Examples, which are provided for the purpose of demonstration rather than limitation.
Formalin-fixed paraffin-embedded (FFPE) samples were prepared from surgical or needle biopsy samples of cancer patients. Genomic DNA was extracted using the QIAamp DNA FFPE Tissue Kit (QIAGEN, Hilden, Germany). Eighty nanograms of DNA were amplified using multiplexed PCR targeting a panel of 440 genes and 1.8 Mbps. The samples were sequenced using the Ion Proton or Ion S5 Prime (Thermo Fisher Scientific, Waltham, Mass.) system with the Ion PI or 540 Chip (Thermo Fisher Scientific, Waltham, Mass.) following the manufacturer's recommended protocol. Raw sequence reads were processed by the manufacturer-provided software Torrent Variant Caller (TVC) v5.2, and .bam and .vcf files were generated.
(1) Candidate Loci Selection
Using the MIcroSAtellite identification tool (MISA, Beier, Thiel, Munch, Scholz, & Mascher, 2017), SSR regions in the chromosomal regions covered by the ACTOnco Panel assay were identified. A total of 600 SSR regions, including mononucleotide with at least ten repeats, dinucleotide with at least six repeats, trinucleotide with at least five repeats, tetranucleotide with at least five repeats, pentanucleotide with at least five repeats, and complex nucleotide type, were identified by MISA. The sequences of the complex SSR regions are provided in Table 1.
Note: The uppercase sequences in parentheses are the sequences repeated the number of times indicated by the number that follows. Lowercase sequences not in parentheses are sequences between two repetition regions within one identified locus.
We first examined the chromosomal location of each SSR region. A total of 34 SSR loci were found located on the X chromosome and were excluded.
To develop a robust MSI prediction algorithm for the ACTOnco assay, we sought to include in the prediction model only those SSR regions, among the remaining 566 candidate loci, that show reproducible peak patterns in clinical FFPE samples. To identify SSRs with good reproducibility across different sequencing runs, we examined the coverage and peak pattern of the 566 SSR regions in a set of 10 FFPE clinical samples across six replicate runs.
In order to include only highly confident reads on each SSR region for the prediction model, a minimum read depth of 30x from a sample on a locus was required. Additionally, to determine the total number of repeats of different lengths (peak width) on a SSR region, a minimum of 5% of allele frequency for a repeated length was required to be included. For example, for a sample on a locus with segments of mononucleotide repeats, if the allele frequencies are detected as 2% for 15 bases, 10% for 16 bases, 20% for 17 bases, 30% for 18 bases, 20% for 19 bases, 10% for 20 bases, and 8% for 21 bases, the total number of repeats of different lengths (peak width) will be 6 with the length of 15 bases uncounted.
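The worked example above can be reproduced with a short sketch. The function and constant names are hypothetical, and the input is assumed to be a mapping from repeat length to supporting read count at one locus; the 30x depth and 5% allele frequency thresholds are taken from the text.

```python
MIN_DEPTH = 30           # minimum read depth required at a locus
MIN_ALLELE_FREQ = 0.05   # minimum allele frequency for a repeat length to count


def peak_width(length_counts, min_depth=MIN_DEPTH, min_af=MIN_ALLELE_FREQ):
    """Count the repeat lengths whose allele frequency is at least min_af.

    length_counts maps repeat length (in bases) to the number of reads
    supporting that length. Returns None when the locus depth is below
    min_depth, so the locus can be treated as missing.
    """
    total = sum(length_counts.values())
    if total < min_depth:
        return None
    return sum(1 for count in length_counts.values() if count / total >= min_af)


# The mononucleotide example from the text, expressed as 100 reads:
# 2% at 15 bases, 10% at 16, 20% at 17, 30% at 18, 20% at 19, 10% at 20, 8% at 21.
example_locus = {15: 2, 16: 10, 17: 20, 18: 30, 19: 20, 20: 10, 21: 8}
example_width = peak_width(example_locus)  # 6; the 15-base length is not counted
```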
We excluded 138 SSR regions due to low coverage (<30 reads for the SSR region), unstable peak call (missing peak width data in any sequencing run), high variability in peak width (variation in peak width greater than 3 in 6 replicate runs), or low weight (MSI feature data accounting for approximately the last 5% of contributions to the prediction model). The remaining 428 microsatellite loci were used for the subsequent baseline establishment and model training.
(2) Baseline Establishment
A population baseline was established for all 428 loci. The mean peak width of 77 normal samples sequenced on the Ion Proton sequencer was used to establish one baseline, and the mean peak width of 81 normal samples sequenced on the Ion S5 Prime sequencer was used to establish another. The MSI baseline was established from the mean peak width of each SSR region across the normal population. The standard deviation of peak width was also calculated for each candidate locus. A given locus is considered unstable if the difference in peak width between a given clinical sample and the baseline falls outside two times the standard deviation. The total unstable loci percentage is calculated by dividing the number of unstable loci by the total number of loci used.
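The baseline and unstable-locus rule can be sketched as follows. This is a minimal illustration, not the assay's implementation: the function names, locus names, and peak widths are hypothetical, the sample standard deviation is used because the text does not say which estimator was applied, and loci with missing peak width are assumed to be left out of the denominator.

```python
from statistics import mean, stdev


def build_baseline(normal_widths):
    """Map each locus to (mean, standard deviation) of peak width in normals."""
    return {locus: (mean(w), stdev(w)) for locus, w in normal_widths.items()}


def unstable_fraction(sample_widths, baseline, n_sd=2.0):
    """Fraction of usable loci whose peak width deviates from the baseline
    mean by more than n_sd standard deviations."""
    used = 0
    unstable = 0
    for locus, width in sample_widths.items():
        if width is None or locus not in baseline:
            continue  # assumption: missing loci are not counted as used
        mu, sd = baseline[locus]
        used += 1
        if abs(width - mu) > n_sd * sd:
            unstable += 1
    return unstable / used if used else float("nan")


# Hypothetical peak widths for three loci across five normal samples.
normals = {
    "locus_A": [3, 3, 4, 3, 3],
    "locus_B": [5, 6, 5, 5, 6],
    "locus_C": [2, 2, 3, 2, 2],
}
baseline = build_baseline(normals)

# Hypothetical tumor sample; locus_C failed the depth filter.
tumor = {"locus_A": 7, "locus_B": 6, "locus_C": None}
fraction = unstable_fraction(tumor, baseline)  # locus_A unstable, locus_B stable
```

Multiplying the returned fraction by 100 gives the total unstable loci percentage described above.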
(3) MSI Prediction Model and Model Validation
A total of 122 colorectal cancer FFPE samples sequenced on the Ion Proton and Ion S5 Prime were used in training the machine learning model. Of those samples, 76 are MSS and 46 are MSI-H based on a 5-marker MSI-PCR detection system (Promega MSI Analysis System, version 1.2). For each sample, loci with a read depth of less than 30x were not considered in model training and were reported as missing information. Additionally, to determine the peak width on an SSR region, a minimum allele frequency of 5% for a repeat length (allele) was required for inclusion in training the model. The differences in peak width between the MSS baseline and the clinical samples were used for calculation in the following logistic regression model:
$$\text{MSI status (MSS/MSI-H)} = \beta_0 + \beta_1\,\text{loci}_1 + \beta_2\,\text{loci}_2 + \beta_3\,\text{loci}_3 + \dots + \beta_{428}\,\text{loci}_{428}$$

where each β is a weight.
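A minimal sketch of how such a model could be applied to one sample is given below. It assumes that the MSI score is the logistic (sigmoid) transform of the weighted sum, that loci with missing data contribute nothing to the sum, and that a score exactly equal to the cutoff is called MSS; none of these details is stated in the text. The weights, intercept, and locus names are hypothetical and not taken from the trained model.

```python
import math


def msi_score(width_diffs, weights, intercept):
    """Logistic-regression score between 0 and 1 for one sample.

    width_diffs maps locus -> (sample peak width - baseline mean peak width),
    or None when the locus is missing for this sample.
    """
    z = intercept
    for locus, beta in weights.items():
        diff = width_diffs.get(locus)
        if diff is None:
            continue  # assumption: missing loci are skipped
        z += beta * diff
    return 1.0 / (1.0 + math.exp(-z))


def call_status(score, cutoff):
    """MSI-H when the score exceeds the cutoff, otherwise MSS."""
    return "MSI-H" if score > cutoff else "MSS"


# Hypothetical weights for three loci and a hypothetical intercept.
weights = {"locus_1": 0.8, "locus_2": 0.5, "locus_3": 0.3}
intercept = -2.0

unstable_sample = {"locus_1": 3.0, "locus_2": 2.0, "locus_3": None}
stable_sample = {"locus_1": 0.0, "locus_2": 0.0, "locus_3": 0.0}

unstable_call = call_status(msi_score(unstable_sample, weights, intercept), 0.2)
stable_call = call_status(msi_score(stable_sample, weights, intercept), 0.2)
```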
We split the 122 training samples at a 7:3 ratio into training and testing sets, randomly reassigning samples to the two sets over 1000 iterations. Due to the small sample size, all 122 samples were used to set the cutoff value. The MSI score used for setting the cutoff value was calculated for each sample as the median of its MSI scores over the iterations in which it was selected as testing data. The ROC curve for the model performance is shown in
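The repeated random split and per-sample median can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, `fit_and_score` is a caller-supplied placeholder that trains on the training identifiers and returns a score for each testing identifier, and the split is a plain shuffle without stratification by MSI status, since the text says only that samples were randomly assigned.

```python
import random
from statistics import median


def median_test_scores(sample_ids, fit_and_score, n_iter=1000,
                       train_frac=0.7, seed=0):
    """Median held-out score per sample over repeated random splits.

    fit_and_score(train_ids, test_ids) must return a dict mapping each
    test id to its score from a model fitted on train_ids only.
    """
    rng = random.Random(seed)
    n_train = round(train_frac * len(sample_ids))
    collected = {s: [] for s in sample_ids}
    for _ in range(n_iter):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        train_ids, test_ids = shuffled[:n_train], shuffled[n_train:]
        for s, score in fit_and_score(train_ids, test_ids).items():
            collected[s].append(score)
    # Samples never drawn into a testing set are omitted from the result.
    return {s: median(scores) for s, scores in collected.items() if scores}
```

With 122 samples and a 7:3 ratio, each iteration uses 85 samples for training and 37 for testing. The resulting per-sample medians, together with the reference MSI status, could then be used to choose a cutoff, for example from an ROC curve.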
We next used an independent set of 439 clinical FFPE samples, including 30 MSI-H and 409 MSS samples, to validate the MSI model. Samples include but are not limited to lung cancer, colorectal cancer, breast cancer, ovarian cancer, pancreatic cancer, cholangiocarcinoma, gastric cancer, glioblastoma, sarcoma, cervical cancer, leiomyosarcoma, and liposarcoma. These samples were processed using the same method as described in Example 1 to sequence the 428 loci region to a mean sequencing depth of at least 500x and 85% of the target region reaching a target base coverage of 100x.
A total of three MSI-H cancer cell lines were used (where they come from) to determine the lowest tumor purity required to determine MSI status. These three cancer cell lines were diluted with their own matched normal cells to form a series of diluted samples with 100%, 80%, 50%, 40%, 30%, and 20% tumor content. The MSI score for each of these samples is shown in Table 5.
This application claims priority of Provisional Application No. 63/041,103, filed on Jun. 18, 2020, the content of which is incorporated herein in its entirety by reference.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2021/037969 | 6/18/2021 | WO |

Number | Date | Country
---|---|---
63041103 | Jun 2020 | US