Immune checkpoints generally refer to a set of inhibitory pathways hardwired into the immune system, which regulate the duration and amplitude of physiological immune responses. When activated, immune checkpoint molecules (e.g., PD-1) suppress the immune system in order to prevent it from attacking cells indiscriminately. Although immune checkpoints are generally effective, tumor cells may manipulate such mechanisms to prevent the immune system from eliminating them.
Immune checkpoint blockade therapy is a recent treatment to counter the mechanism of tumor cells. Immune checkpoint blockade therapies use medications such as immune checkpoint inhibitors to activate the immune system to recognize and eliminate cancerous cells. The immune checkpoint blockade therapies enable the immune system to properly recognize and eliminate tumor cells that present neoantigens via major histocompatibility complexes (MHC). Despite this early success, a large percentage of subjects do not respond to these therapies, due to complex tumor intrinsic and extrinsic mechanisms of tumor cells to resist and evade immune checkpoint blockade therapies. Elucidating the cause of such immune checkpoint blockade resistance has proven to be more challenging than initially anticipated.
One of the mechanisms causing immune checkpoint blockade resistance may include loss of heterozygosity in human leukocyte antigen (HLA) genes. A neoantigen corresponding to a mutated gene of a tumor cell can bind to an HLA protein encoded by a particular HLA allele and be presented on the cell surface. When the presented neoantigen is detected, the immune system can respond by deploying T cells that identify and eliminate the tumor cell. Thus, effectiveness of the immune system may depend on whether the neoantigen is presented on the tumor cell surface. Conversely, preventing the presentation of neoantigens can result in the T cells being unable to detect the corresponding tumor cells.
Various studies suggest that tumor cells often have loss of heterozygosity in HLA genes, such that the corresponding HLA proteins of the deleted HLA alleles are not available to present the neoantigens on tumor cell surfaces. For example, each human subject has six different HLA alleles capable of presenting a diverse set of antigens to the immune system. The germline sequence diversity of HLA alleles can impact tumor evolution by mediating the presentation of neoantigens to the immune system. This impact of HLA sequence diversity appears to be more pronounced in the presence of the immune checkpoint blockade therapies. As tumor cells mutate, somatic loss of heterozygosity in the HLA allelic regions can occur, thereby causing reduction in HLA sequence diversity. Such loss of heterozygosity of HLA alleles is increasingly being recognized as a cause of immune checkpoint blockade resistance by tumor cells.
Thus, detecting loss of heterozygosity of HLA alleles from sequencing data can be beneficial in anticipating immune checkpoint blockade resistance and developing a corresponding therapy for a given subject. However, conventional techniques can be deficient in accurately detecting loss of heterozygosity of HLA alleles. For example, a conventional technique for detecting HLA loss of heterozygosity can include performing a genome-wide interrogation for detecting copy numbers. In this technique, a decrease of copy numbers around the HLA genes may indicate its loss of heterozygosity. This conventional technique, however, can be unreliable in detecting HLA loss of heterozygosity from sequencing data, for at least the following reasons. First, the polymorphic nature of the mutated genes causes poor alignment of corresponding sequence reads to the reference genome. Second, the complexity of the sequence variation can obscure the specific HLA allele that has been deleted, which is information crucial for neoantigen therapy design.
Another conventional technique can include identifying copy number variations of HLA genes after alignment of sequence data to HLA allele-specific reference sequences. However, most allele-specific alignment techniques relied on by the conventional copy number variant algorithms fail to account for HLA-specific challenges such as differences in exome probe capture between alleles. Moreover, copy number variant algorithms can be notoriously poor for biological samples with low tumor purity and have trouble detecting subclonal deletions, thereby raising concerns regarding the sensitivity and accuracy of these algorithms. Thus, despite growing interest, conventional techniques end up relying on deletions of flanking regions surrounding the HLA allelic region as a proxy for HLA loss of heterozygosity, rather than developing an HLA loss of heterozygosity specific algorithm. In view of the above, it is challenging to accurately detect HLA loss of heterozygosity.
Moreover, validating performance of HLA loss of heterozygosity detection algorithms has been an additional challenge in the field. For example, a conventional technique includes assessing concordance between HLA loss of heterozygosity calls and copy number calls made by a standard copy number variant (CNV) algorithm in regions flanking each HLA gene. Another conventional technique includes designing primers to capture the regions surrounding the HLA genes and applying PCR to identify copy number loss of HLA alleles in subjects. However, neither of these approaches validates the identity of the specific HLA allele that is lost, nor addresses the accuracy of calls for low tumor purity samples or samples with HLA loss of heterozygosity subclonality.
In some embodiments, a method of detecting loss of heterozygosity in HLA alleles is provided. The method can include accessing a trained machine-learning model, which was trained using a training data set that included one or more sets of training features corresponding to an HLA allele identified in a tumor sample corresponding to a subject of a set of subjects. A first set of training features includes, for a genomic region of the HLA allele: (i) an adjusted B allele frequency that represents a ratio between a first B allele frequency of heterozygous alleles in the tumor sample that correspond to the genomic region and a second B allele frequency of heterozygous alleles in the genomic region and associated with one or more control samples; and (ii) a ratio between a first allele-specific coverage of the tumor sample that corresponds to the genomic region and a second allele-specific coverage of the one or more control samples that corresponds to the genomic region. A second set of training features includes, for the HLA allele, an indication of whether at least part of a flanking genomic region surrounding the HLA allele has been deleted.
The method can also include receiving sequence data corresponding to a biological sample of a particular subject. The method can also include using the machine-learning model to generate a result corresponding to a probability of whether a loss of heterozygosity exists in an HLA allele identified in the biological sample of the particular subject by processing the sequence data using the machine-learning model. The method can also include outputting the result.
Some embodiments of the present disclosure include a system including one or more data processors. In some embodiments, the system includes a non-transitory computer readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein. Some embodiments of the present disclosure include a computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein.
The terms and expressions which have been employed are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed. Thus, it should be understood that although the present invention as claimed has been specifically disclosed by some embodiments and optional features, modification and variation of the concepts herein disclosed may be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention as defined by the appended claims.
The present disclosure is described in conjunction with the appended figures:
As described above, accurate detection of HLA loss of heterozygosity can significantly improve accuracy and effectiveness of cancer immunotherapies, including immune checkpoint blockade therapies. Although allele-specific alignment techniques may be an improvement over other genomic interrogation techniques, variability in exome capture across alleles and the relatively short sequences of the HLA genes introduce additional challenges for identifying changes in copy number. Moreover, conventional techniques cannot accurately and comprehensively determine the limit of detection, sensitivity, and specificity of HLA loss of heterozygosity detection algorithms.
To address at least the above deficiencies of conventional systems, the present techniques can use a machine-learning approach to detect loss of heterozygosity in HLA alleles. A machine-learning model for identifying deletion of allele-specific HLAs (“the DASH model”) can be accessed. In particular, the DASH model can be trained using training data that include, for a subject of a set of subjects, the following features: (1) allele-specific features; (2) subject-specific features; and (3) whole-exome features. The allele-specific features can include, for a genomic region of an HLA allele: an adjusted B allele frequency that represents a ratio between a first B allele frequency of heterozygous alleles in the tumor sample that correspond to the genomic region and a second B allele frequency of heterozygous alleles in the genomic region and associated with one or more control samples; and a ratio between a first allele-specific coverage of the tumor sample that corresponds to the genomic region and a second allele-specific coverage of the one or more control samples that corresponds to the genomic region. By collectively using the above training features, the DASH model can be trained to accurately detect loss of heterozygosity in HLA alleles. In some instances, the allele-specific features correspond to a genomic region of an HLA allele that was identified as having a somatic mutation.
As referred to herein, a B allele frequency is a normalized measure of the allelic intensity ratio of two alleles (A and B), such that a B allele frequency of 1 or 0 indicates the complete absence of one of the two alleles (e.g., AA or BB), and a B allele frequency of 0.5 indicates the equal presence of both alleles (e.g., AB). For example, a first B allele frequency can indicate, for a given genomic position, an allelic intensity ratio between HLA-B*46:01:01 and HLA-B*13:01:01 that corresponds to a normal biological sample. A second B allele frequency can indicate, for the same genomic position, an allelic intensity ratio between HLA-B*46:01:01 and HLA-B*13:01:01 that corresponds to a tumor sample. The adjusted B allele frequency can be a ratio determined by dividing the first B allele frequency by the second B allele frequency (or vice versa).
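For illustration only, the B allele frequency and its adjusted form can be sketched as follows. The sketch assumes that allele-specific read counts stand in for allelic intensity and that the adjusted value divides the tumor frequency by the control frequency (the description also permits the inverse). The function names and read counts are hypothetical.

```python
def b_allele_frequency(reads_a: int, reads_b: int) -> float:
    """Fraction of reads supporting allele B at one heterozygous position.

    Assumption: read counts are used as a proxy for allelic intensity.
    """
    total = reads_a + reads_b
    if total == 0:
        raise ValueError("no reads cover this position")
    return reads_b / total


def adjusted_baf(tumor_baf: float, control_baf: float) -> float:
    """Ratio of the tumor BAF to the control BAF (assumed orientation)."""
    if control_baf == 0:
        raise ValueError("control BAF is zero; the ratio is undefined")
    return tumor_baf / control_baf


# Hypothetical counts: balanced alleles in the normal sample,
# reduced support for allele B in the tumor sample.
normal = b_allele_frequency(reads_a=100, reads_b=100)
tumor = b_allele_frequency(reads_a=150, reads_b=50)
ratio = adjusted_baf(tumor, normal)
print(normal, tumor, ratio)
```

An adjusted value near 1 suggests that the allelic balance in the tumor matches the control, whereas a value well away from 1 suggests an allelic imbalance.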
The subject-specific features can include an estimated tumor purity value and an estimated tumor ploidy value corresponding to a tumor sample of the subject. Tumor purity, as used herein, refers to a ratio of tumor cells to total cells in the sample. Tumor ploidy, as used herein, refers to an average copy number of the entire tumor genome. The whole-exome features can include, for the HLA allele, an indication of whether at least part of a flanking genomic region surrounding the HLA allele has been deleted.
The DASH model trained with the above training features may then be used to process sequence data and generate a result corresponding to a probability of whether a loss of heterozygosity exists in an HLA allele identified in the biological sample of the particular subject. The sequence data corresponding to a biological sample of a particular subject can be accessed. As used herein, sequence data refers to data corresponding to a biological sequence of nucleic acids (e.g., DNA, RNA) or of a protein (e.g., alanine, arginine). In some instances, sequence data includes one or more sequence reads. The sequence data can be generated by using whole genome sequencing or whole exome sequencing on the biological sample to generate a plurality of sequence reads. After the sequence data is generated, one or more HLA alleles can be identified from the sequence data.
Reference sequences corresponding to the identified HLA alleles can be retrieved, and the sequence reads can be aligned to the retrieved reference sequences. After alignment, allele-specific data for each of the identified HLA alleles corresponding to the sequence data can be identified. In some instances, the allele-specific data identifies a number of sequence reads that align to each genomic region corresponding to the identified HLA alleles.
The trained DASH model uses the allele-specific data for each of the identified HLA alleles as an input to generate the result. Other types of information corresponding to the identified HLA alleles (e.g., an indication of whether at least part of a flanking genomic region surrounding the HLA allele has been deleted) can be used as additional input to the trained DASH model. The trained DASH model that includes one or more gradient boosting algorithms can process the above features of the sequence data to generate the result. In some instances, the result is used to predict a decrease in efficacy of an immune checkpoint blockade therapy being administered to the particular subject.
Accordingly, some embodiments of the present disclosure provide a technical advantage over conventional systems by accurately detecting loss of heterozygosity in HLA alleles. For example, the DASH model can accurately detect loss of heterozygosity in HLA alleles by processing whole-exome sequencing data, which differs from conventional techniques that rely on sequence reads corresponding to the HLA alleles only. Taking into account that the HLA genes are relatively short and most deletions involve much larger genomic regions, the DASH model uses the entire whole-exome platform to incorporate sequence information from around the HLA genes as well as inside them. As a result, the detection accuracy of the DASH model can be validated with sensitivity levels at 100% for samples having tumor purity levels above 8% and specificity levels at 100% for samples across all tumor purity levels. Thus, the accurate detection of HLA loss of heterozygosity facilitates investigation of tumor cell mechanisms contributing to immune checkpoint blockade resistance and development of new cancer immunotherapies. Moreover, some embodiments of the present disclosure can use allele-specific features to accurately identify which genomic region of the HLA allele has been deleted, thereby detecting loss of heterozygosity with increased granularity.
The following examples are provided to introduce certain embodiments. In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of examples of the disclosure. However, it will be apparent that various examples may be practiced without these specific details. For example, devices, systems, structures, assemblies, methods, and other components may be shown as components in block diagram form in order not to obscure the examples in unnecessary detail. In other instances, well-known devices, processes, systems, structures, and techniques may be shown without necessary detail in order to avoid obscuring the examples. The figures and description are not intended to be restrictive. The terms and expressions that have been employed in this disclosure are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof. The word “example” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or design described herein as an “example” is not necessarily to be construed as preferred or advantageous over other embodiments or designs.
At step 110, whole exome library preparation and sequencing can be executed to generate sequence reads corresponding to each biological sample. In some instances, whole genome sequencing is performed to generate the sequence reads. DNA from tumor and peripheral blood mononuclear cells/adjacent samples can be used to construct whole-exome capture libraries, in which the libraries are built based on two whole-exome sequencing (WES) capture kits: Agilent SureSelect Human All Exon v5 plus untranslated regions and Agilent SureSelect Clinical Research Exome. Modifications can be made to sequencing protocols to yield an approximately 250 bp average library insert size (for example). In some instances, various polymerases are used to generate the sequence reads, including a KAPA HiFi DNA Polymerase and Herculase II DNA polymerase (for example). Sequencing can be performed at 20G sequencing depth for normal samples and 35G sequencing depth for tumor samples. Using the above example sequencing methods, >300× coverage exome-wide can be available for >20,000 genes, and >1000× coverage can be available for boosted regions covering over 500 cancer-associated genes, including HLA-A, -B and -C alleles.
At step 115, HLA genotyping (alternatively referred to as “HLA typing”) can be performed to identify one or more HLA alleles from the sequence reads. HLA types can be calculated up to 6 digits. In some instances, the sequence reads corresponding to tumor samples are processed to identify somatic mutations corresponding to the one or more HLA alleles. In some instances, additional types of data can be identified from the tumor and normal samples. For example, the sequencing data can be analyzed to identify allele-specific copy number alterations for the particular HLA allele type. Additionally or alternatively, the sequencing data can be analyzed to estimate tumor purity (alternatively referred to as tumor cellularity) and tumor ploidy.
At step 120, the sequence reads can be aligned to one or more reference sequences (e.g., a hs37d5 reference genome build). The subject-specific homologous alleles can be aligned to determine positions of difference between the alleles. Both single nucleotide variants (SNVs) and indels can be detected in the alignment. In some instances, only the first position of each indel can be considered to ensure SNVs can be appropriately weighted. HLA alleles with fewer than 5 positions of difference between them can be considered to be homozygous.
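For illustration only, the comparison of two subject-specific homologous alleles can be sketched as follows. The sketch assumes that the two alleles are supplied as equal-length, gap-padded strings from a pairwise alignment, with "-" marking a gap. The function names and the example sequences are hypothetical.

```python
def positions_of_difference(aligned_a: str, aligned_b: str) -> list[int]:
    """Return alignment columns at which two gapped allele sequences differ.

    SNV columns are each reported. A run of consecutive gap columns in the
    same sequence (one indel) contributes only its first column.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    diffs = []
    prev_gap = None  # which sequence carried a gap in the previous column
    for i, (a, b) in enumerate(zip(aligned_a, aligned_b)):
        if a == "-" and b != "-":
            gap = "a"
        elif b == "-" and a != "-":
            gap = "b"
        else:
            gap = None
        if gap is not None:
            if gap != prev_gap:  # first column of a new indel
                diffs.append(i)
        elif a != b:  # substitution
            diffs.append(i)
        prev_gap = gap
    return diffs


def treated_as_homozygous(aligned_a: str, aligned_b: str, min_diffs: int = 5) -> bool:
    """Alleles with fewer than min_diffs differing positions are treated as homozygous."""
    return len(positions_of_difference(aligned_a, aligned_b)) < min_diffs


# Hypothetical toy alignment: one SNV, one 2-base indel, one SNV.
allele_1 = "ACGT--ACGT"
allele_2 = "ACCTGGACGA"
diffs = positions_of_difference(allele_1, allele_2)
print(diffs, treated_as_homozygous(allele_1, allele_2))
```

Counting only the first column of each indel keeps a long insertion or deletion from outweighing the single-nucleotide differences.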
In some instances, the reference sequences corresponding to the identified HLA alleles are retrieved by querying an HLA-allele database. The HLA-allele database can retrieve the reference sequences corresponding to the identified HLA alleles using an imputation approach. The HLA-allele database can be initialized with a particular format, such as multiple sequence alignment (MSA) format from IMGTv312. To implement imputation, cDNA data can be used to impute missing exons in incompletely sequenced HLA alleles from a reference allele that has protein-level identity, as defined by an identical 4 digit nomenclature. In the event that no such allele exists, a reference sequence from the same HLA subtype can be retrieved from the HLA-allele database based on identical 2 digit nomenclature. If there are multiple options with identical 2 digit nomenclature, the first allele listed in the MSA can be used. To impute the intronic regions of each allele, the above approach can be taken using a gDNA file. The full length genomic sequences of each allele can thus be imputed by assembling the exons from the cDNA imputation step and the introns from the gDNA imputation step. In some instances, duplicate reads are removed. Additionally or alternatively, the Genome Analysis Toolkit (GATK) can be used to correct base quality scores and improve sequence alignment of the sequence reads.
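For illustration only, the fallback order used to select an imputation source can be sketched as follows. The sketch assumes colon-delimited allele names in which the first two fields carry the 4 digit nomenclature and the first field carries the 2 digit nomenclature. It also assumes that ties at the 4 digit level are broken by MSA order, which the description states only for the 2 digit level. The function names and the allele list are hypothetical.

```python
from typing import Optional


def field_prefix(allele: str, n_fields: int) -> str:
    """Return the first n colon-delimited fields, e.g. 'A*02:01' for n_fields=2."""
    return ":".join(allele.split(":")[:n_fields])


def choose_imputation_source(target: str, msa_order: list[str]) -> Optional[str]:
    """Pick a reference allele from which missing sequence can be imputed.

    Preference order: identical 4 digit name (two fields), then identical
    2 digit name (one field). Within a level, the first allele listed in the
    MSA is used. Returns None if no allele of the same subtype exists.
    """
    for n_fields in (2, 1):
        key = field_prefix(target, n_fields)
        for candidate in msa_order:
            if candidate != target and field_prefix(candidate, n_fields) == key:
                return candidate
    return None


# Hypothetical list of fully sequenced alleles, in MSA order.
complete_alleles = ["A*01:01:01:01", "A*02:01:01:01", "A*02:05:01", "A*02:06:01:01"]
print(choose_imputation_source("A*02:01:99", complete_alleles))
print(choose_imputation_source("A*02:07:01", complete_alleles))
print(choose_imputation_source("B*07:02:01", complete_alleles))
```

The same selection routine could be applied once against a cDNA-derived list for exons and once against a gDNA-derived list for introns.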
In some instances, any sequence reads that had soft clipping for more than 20% of their total length are excluded. Any reads that contained mismatches can be discarded in order to improve quality of coverage information. However, if a somatic mutation within the HLA alleles is identified, the stringency can be lifted to allow sequence reads with a single mismatch.
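For illustration only, the read-level filters described above can be sketched as follows. The sketch assumes that each read is described by a SAM-style CIGAR string and that a mismatch count is supplied by the caller. The function names and example reads are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_fraction(cigar: str) -> float:
    """Fraction of the read's bases that are soft clipped (CIGAR 'S')."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Operations M, I, S, = and X consume bases of the read.
    read_len = sum(n for n, op in ops if op in "MIS=X")
    clipped = sum(n for n, op in ops if op == "S")
    return clipped / read_len if read_len else 0.0


def keep_read(cigar: str, n_mismatches: int,
              somatic_mutation_in_allele: bool = False,
              max_clip: float = 0.20) -> bool:
    """Apply the soft-clipping and mismatch filters to one aligned read."""
    if soft_clipped_fraction(cigar) > max_clip:
        return False
    # Stringency is relaxed to one mismatch when the allele carries a
    # somatic mutation; otherwise no mismatches are tolerated.
    allowed = 1 if somatic_mutation_in_allele else 0
    return n_mismatches <= allowed


print(keep_read("100M", 0))        # clean read
print(keep_read("70M30S", 0))      # 30% soft clipped
print(keep_read("100M", 1))        # one mismatch, strict mode
print(keep_read("100M", 1, True))  # one mismatch, relaxed mode
```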
At step 125, allele-specific data of identified HLA alleles corresponding to each of the normal and tumor samples can be determined. The allele-specific data can identify a number of sequence reads that align to each genomic region corresponding to the identified HLA alleles. A copy number alteration of sequence reads for a particular genomic position may indicate a loss of heterozygosity of the corresponding HLA allele. For example, with respect to the B*13:01:01 HLA allele, a decreased number of sequence reads can be identified in genomic positions ranging between 1800-2000 in a tumor sample relative to the normal sample. Such a copy number alteration can indicate a loss of heterozygosity of the B*13:01:01 HLA allele.
As described above, the DASH model can be used to process sequence data and generate a result corresponding to a probability of whether a loss of heterozygosity exists in an HLA allele identified in the biological sample of the particular subject. To initiate the process, the sequence data corresponding to a biological sample of a particular subject can be generated by using whole genome sequencing or whole exome sequencing on the biological sample. The biological sample can be a tissue sample that may include DNA derived from tumor or healthy cells. In some instances, the biological sample includes cell-free DNA, some of which can have originated from healthy cells and some from tumor cells.
In some instances, HLA genotyping is performed on the plurality of sequence reads to identify one or more HLA alleles that correspond to the sequence data. Reference sequences corresponding to the identified HLA alleles can be retrieved, and the sequence reads can be aligned to the retrieved reference sequences. After alignment, allele-specific coverage for each genomic region can be determined for the identified HLA alleles corresponding to the sequence data. In some instances, allele-specific coverage identifies a number of sequence reads that align to each genomic position of the identified HLA allele.
The trained DASH model can be used to process the allele-specific data for each of the identified HLA alleles to generate the result corresponding to a probability of whether a loss of heterozygosity exists in each of the identified HLA alleles. Other types of information corresponding to the identified HLA alleles (e.g., an indication of whether at least part of a flanking genomic region surrounding the HLA allele has been deleted) can be used as input to the trained DASH model. The trained DASH model that includes one or more gradient boosting algorithms can process the above features of the sequence data to generate the result.
In some instances, the allele-specific data for each of the identified HLA alleles is used as an input to the trained DASH model. In some instances, the result is used to predict a decrease in efficacy of an immune checkpoint blockade therapy being administered to the particular subject.
The DASH model can be trained using the aligned sequence data and allele-specific coverages to accurately detect allele-specific loss of heterozygosity in HLA genes. Specifically, a training dataset for the DASH model can include features derived from paired tumor and normal samples (either adjacent tissue or peripheral blood mononuclear cells) from subjects. As described above, to identify the training features for the training data set, aligned sequence data and allele-specific coverages corresponding to HLA alleles can be identified by applying whole exome sequencing to paired tumor and normal samples corresponding to the subject.
In some instances, capture probes covering specific HLA alleles are applied in addition to the whole exome sequencing. Sequence reads corresponding to the HLA alleles for each subject can be mapped to each subject-specific HLA reference. From the alignment data, allele-specific coverages can be determined for each genomic region and training features corresponding to somatic variants of the HLA alleles can be identified. The training features can include a modified B allele frequency that accounts for differences in probe capture and consistency of allele-specific coverages across various HLA alleles. In some instances, the training features further include information corresponding to genomic regions surrounding the HLA alleles, as a vast majority of HLA loss of heterozygosity events are large deletions.
The DASH model can include one or more gradient-boosting algorithms, which can be trained to detect loss of heterozygosity in HLA alleles. Gradient boosting refers to a machine-learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models. The technique may build a model in a stage-wise fashion and generalize the model by allowing optimization of an arbitrary differentiable loss function. Gradient boosting combines weak learners into a single strong learner in an iterative fashion. As each weak learner is added, a new model is fitted to provide a more accurate estimate of the response variable. The new weak learners can be maximally correlated with the negative gradient of the loss function associated with the whole ensemble. Examples of the gradient boosting machines can include XGBoost and LightGBM. Additionally or alternatively, other types of machine-learning techniques can be used to build the model, including bagging procedures, boosting procedures, and/or random forest algorithms.
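For illustration only, the stage-wise fitting described above can be sketched in plain Python. The sketch is not the XGBoost or LightGBM implementation: it assumes a single input feature, a squared-error loss (for which the negative gradient is simply the residual), and one-split decision stumps as the weak learners. All names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - l_mean) ** 2 for r in left)
               + sum((r - r_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    if best is None:  # all inputs identical: predict the mean everywhere
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - p)^2 with respect to p is (y - p).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical toy data: a low feature value is labeled 1 (loss), a high value 0.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [1, 1, 1, 0, 0, 0]
model = fit_gradient_boosting(xs, ys)
print(round(predict(model, 0.15), 3), round(predict(model, 0.85), 3))
```

A production model would typically use a classification loss and multi-feature trees. The sketch only shows how each weak learner is fitted to the negative gradient of the ensemble's loss.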
An example training data set included a set of 720 heterozygous HLA genes, which were collected from 279 subjects across multiple tumor types. All features described above were generated for each heterozygous gene, and each case of HLA loss of heterozygosity was manually labeled. To train the DASH model, 500 heterozygous genes were separated for training and 220 heterozygous genes were kept separately for testing. With respect to model selection, the DASH model can include a gradient boosting algorithm (e.g., XGBoost) to learn how to detect HLA loss of heterozygosity in each pair of HLA alleles from the features described above. If HLA loss of heterozygosity was detected by the DASH model, the allele with the lower coverage was labeled as deleted. Though rare, a few cases with a bi-allelic deletion were detected. If the DASH model detects HLA loss of heterozygosity and the allele with higher coverage has an allele-specific coverage ratio below 0.5 for at least 25% of the bins, both HLA alleles are labeled as deleted.
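For illustration only, the rule for labeling deleted alleles can be sketched as follows. The sketch assumes that each allele is described by a list of per-bin allele-specific coverage ratios and that the allele with the lower mean ratio is the one with the lower coverage. The function name and the example values are hypothetical.

```python
from statistics import mean


def label_deleted_alleles(loh_detected: bool,
                          ratios_by_allele: dict[str, list[float]],
                          ratio_cutoff: float = 0.5,
                          min_fraction: float = 0.25) -> list[str]:
    """Return the alleles labeled as deleted for one heterozygous HLA gene.

    ratios_by_allele maps exactly two allele names to their per-bin
    allele-specific coverage ratios.
    """
    if len(ratios_by_allele) != 2:
        raise ValueError("expected exactly two alleles")
    if not loh_detected:
        return []
    # Assumption: the mean per-bin ratio identifies the lower-coverage allele.
    (low, _), (high, high_bins) = sorted(
        ratios_by_allele.items(), key=lambda item: mean(item[1]))
    deleted = [low]
    # Bi-allelic deletion: the higher-coverage allele is also depleted
    # (ratio below the cutoff) in at least min_fraction of its bins.
    depleted = sum(r < ratio_cutoff for r in high_bins) / len(high_bins)
    if depleted >= min_fraction:
        deleted.append(high)
    return deleted


single = label_deleted_alleles(True, {
    "B*13:01:01": [0.30, 0.40, 0.35, 0.30],
    "B*46:01:01": [1.00, 0.90, 1.10, 1.00],
})
both = label_deleted_alleles(True, {
    "allele_x": [0.2, 0.3, 0.2, 0.3],
    "allele_y": [0.4, 0.9, 1.0, 0.9],
})
print(single, both)
```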
From the allele-specific coverage data, the allele-specific features 205 can be determined. The allele-specific features 205 include the following:
In addition, the subject-specific features 210 can include the following:
Finally, the whole-exome features 215 can include the following:
The diagram 200 additionally shows a bar graph 220 demonstrating that the trained DASH model performs better than conventional techniques, even when the same sequencing data is used. In this example, biological samples with tumor purity below 20% are removed from the analysis. As shown in the bar graph 220, 100% sensitivity and 99.7% specificity levels of the trained DASH model (shown in green) are respectively greater than 91.8% sensitivity and 94.3% specificity levels of an LOHHLA algorithm (shown in blue), which is an existing conventional technique for detecting loss of heterozygosity as published in McGranahan, Nicholas et al. “Allele-Specific HLA Loss and Immune Escape in Lung Cancer Evolution.” Cell vol. 171, 6 (2017): 1259-1271.e11. doi: 10.1016/j.cell.2017.10.001. When all biological samples are considered (including those with tumor purity that is below 20%), the DASH model reaches 98.7% specificity and 92.9% sensitivity (F1 score=0.939), while the LOHHLA algorithm only achieves 94.3% specificity and 78.8% sensitivity (F1 score=0.777). The DASH model also outperforms other existing conventional techniques. For example, Sequenza detects HLA loss of heterozygosity at 92.9% specificity and 95.0% sensitivity (F1 score=0.848). Moreover, none of the above conventional techniques for detecting loss of heterozygosity was able to identify the specific allele that has been lost.
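For illustration only, the sensitivity, specificity, and F1 score reported above are related to confusion-matrix counts as sketched below. The counts in the example are hypothetical and are not taken from the evaluation described herein. Note that the F1 score depends on precision, so it cannot be derived from sensitivity and specificity alone.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute sensitivity, specificity, precision and F1 from confusion counts."""
    sensitivity = tp / (tp + fn)   # recall: fraction of true LOH genes detected
    specificity = tn / (tn + fp)   # fraction of non-LOH genes correctly cleared
    precision = tp / (tp + fp)     # fraction of LOH calls that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for 100 heterozygous genes, 10 of which carry LOH.
metrics = classification_metrics(tp=9, fp=5, tn=85, fn=1)
print({name: round(value, 3) for name, value in metrics.items()})
```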
The training data set 300 may also include a scatter plot 310 showing a relationship between allele-specific coverage ratio and tumor purity. HLA genes with loss of heterozygosity are shown in filled color and HLA genes without loss of heterozygosity are shown in partially-shaded color. In some instances, the allele-specific coverage ratio is determined by dividing the tumor coverage of an allele by the adjacent normal coverage of the same allele and normalizing by the coverage across the rest of the exome.
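For illustration only, the allele-specific coverage ratio can be sketched as follows. The sketch assumes that normalization is performed by dividing the tumor-to-normal ratio of the allele by the tumor-to-normal ratio of the coverage across the rest of the exome. The function name and the coverage values are hypothetical.

```python
def allele_specific_coverage_ratio(tumor_allele_cov: float,
                                   normal_allele_cov: float,
                                   tumor_exome_cov: float,
                                   normal_exome_cov: float) -> float:
    """Tumor/normal coverage of one allele, normalized by exome-wide coverage."""
    if min(normal_allele_cov, tumor_exome_cov, normal_exome_cov) <= 0:
        raise ValueError("normal and exome-wide coverages must be positive")
    allele_ratio = tumor_allele_cov / normal_allele_cov
    # Assumed normalization: correct for the overall depth difference
    # between the tumor and normal libraries.
    depth_ratio = tumor_exome_cov / normal_exome_cov
    return allele_ratio / depth_ratio


# Hypothetical values: the tumor library is sequenced 1.5x deeper overall,
# yet the allele has less coverage in the tumor than in the normal sample.
acr = allele_specific_coverage_ratio(150, 200, 600, 400)
print(acr)
```

A ratio near 1 indicates that the allele is covered as expected given sequencing depth, whereas a ratio well below 1 is consistent with loss of that allele in a fraction of the cells.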
The training data set 300 may also include boxplots 315 showing the difference in distribution between consistency in coverage for subjects with and without HLA loss of heterozygosity. The boxplots 315 can be used to capture an observation that an allele with consistently lower coverage than the alternate allele across the entire gene is likely to be deleted, whereas sporadically lower coverage may be due to stochastic variation (p=2.2e-14, paired T test). In addition, a difference of distribution can be shown between total sequencing depth for subjects with and without loss of heterozygosity. As noted above, the total coverage for subjects with HLA loss of heterozygosity is relatively lower than the total coverage for subjects without HLA loss of heterozygosity. Thus, in order to distinguish allelic imbalance driven by loss of heterozygosity from allelic imbalance driven by a large amplification of the alternate allele, a total coverage ratio that captures the combined coverage of the two alleles can be used. Subjects with HLA loss of heterozygosity have significantly lower total coverage ratios (p=0.0004, paired T test), tested against the null hypothesis that subjects with and without HLA loss of heterozygosity have the same distribution of total coverage ratios.
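For illustration only, the consistency of coverage and the total coverage ratio can be sketched as follows. Exact formulas are not given above, so the sketch assumes that consistency is the fraction of bins in which an allele has lower coverage than the alternate allele, and that the total coverage ratio is the tumor-to-normal ratio of the combined coverage of both alleles, normalized by exome-wide depth. The function names and values are hypothetical.

```python
def coverage_consistency(allele_bins: list[float],
                         alternate_bins: list[float]) -> float:
    """Fraction of bins in which the allele is covered less than the alternate."""
    if len(allele_bins) != len(alternate_bins) or not allele_bins:
        raise ValueError("expected two non-empty lists of equal length")
    lower = sum(a < b for a, b in zip(allele_bins, alternate_bins))
    return lower / len(allele_bins)


def total_coverage_ratio(tumor_a: float, tumor_b: float,
                         normal_a: float, normal_b: float,
                         tumor_exome_cov: float, normal_exome_cov: float) -> float:
    """Combined tumor/normal coverage of both alleles, normalized by depth."""
    combined = (tumor_a + tumor_b) / (normal_a + normal_b)
    return combined / (tumor_exome_cov / normal_exome_cov)


# Hypothetical per-bin coverages: the allele is lower in 3 of 4 bins.
consistency = coverage_consistency([10, 12, 9, 30], [20, 25, 18, 28])
# Hypothetical totals: one allele lost, the other unchanged, equal library depth.
tcr = total_coverage_ratio(50, 100, 100, 100, 300, 300)
print(consistency, tcr)
```

Under these assumptions, a deletion lowers the total coverage ratio below 1, whereas an amplification of the alternate allele raises it above 1, even though both events produce an allelic imbalance.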
The training data set 300 may also include histograms 320 and 325, in which the histogram 320 shows the distributions of tumor purity for subjects with and without loss of heterozygosity and the histogram 325 shows distribution of tumor ploidy for subjects with and without loss of heterozygosity. HLA genes with loss of heterozygosity are shown in transparent color, and HLA genes without loss of heterozygosity are shown in shaded color. The histograms 320 and 325 showed that tumor purity and tumor ploidy are identical across HLA-A, HLA-B and HLA-C of a particular subject.
The training data set 300 can also include a histogram 330 that shows a distribution of loss of heterozygosity sizes corresponding to HLA genes across all subjects. Since 73% of copy number alterations causing HLA loss of heterozygosity are deletions of greater than one megabase, genomic regions flanking the genes of interest can provide useful information to supplement the within-gene data. Thus, the whole exome nature of the training data set 300 was used to generate a feature corresponding to deletion of flanking regions, which can measure deletions in the 10 kb region surrounding each HLA gene.
A bar plot 410 shows an impact of each feature in the trained DASH model for detecting loss of heterozygosity in HLA alleles. In some instances, the impact of each feature is measured based on a game theory model. The bar plot 410 revealed that all six of our features were independently contributing to the DASH model, with deletion of flanking regions and adjusted B-allele frequency impacting the outcome most significantly relative to other features (e.g., tumor ploidy).
A scatter plot 415 shows a distribution of probabilities of HLA loss of heterozygosity returned by the trained DASH model for biological samples corresponding to the test data set and manually annotated as having or not having loss of heterozygosity in HLA alleles. In the scatter plot 415, a shaded region indicates ambiguous calls by the DASH model. Since the XGBoost algorithm returns a continuous metric, the HLA loss of heterozygosity calls were divided into high and low confidence calls. In the test data set, the trained DASH model reaches 96% specificity and 89% sensitivity when considering high confidence calls (>0.8 loss of heterozygosity prediction cutoff). As noted above, the DASH model can perform at higher specificity and sensitivity levels than other conventional techniques. For comparison, the LOHHLA algorithm reaches 92% specificity and 76% sensitivity using the same test data set.
A histogram 420 shows a distribution of tumor purities of HLA genes in which ambiguous (>0.2 and <0.8) calls were made by the DASH model. As shown in the histogram 420, the majority of the borderline and incorrect calls have low tumor purity, highlighting the difficulty of accurately predicting HLA loss of heterozygosity at low tumor purity levels. When samples with tumor purity below 20% are removed from the test data set, the performance level of the DASH model increases, as shown in a precision recall curve 425.
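By way of illustration only, the division of the continuous model output into high confidence and ambiguous calls can be sketched as follows. The function name and category labels are hypothetical. The treatment of probabilities of 0.2 or less as confident "no loss" calls, and the handling of values exactly equal to a cutoff, are assumptions not stated in the description.

```python
def classify_dash_call(probability, low=0.2, high=0.8):
    """Bucket a predicted probability of HLA loss of heterozygosity.

    Assumptions: values above `high` are high confidence loss calls, values
    at or below `low` are high confidence no-loss calls, and everything in
    between (including exactly `high`) is treated as ambiguous.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability > high:
        return "high_confidence_loh"
    if probability > low:
        return "ambiguous"
    return "high_confidence_no_loh"


# Hypothetical model outputs for four HLA genes.
calls = [classify_dash_call(p) for p in (0.95, 0.5, 0.1, 0.8)]
```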
The precision recall curve 425 shows a performance level of the DASH model on a held out dataset (n=220 heterozygous genes). In the curve 425, the dotted line indicates the performance of all samples, and the solid line indicates the performance of samples with at least 20% tumor purity. When samples with tumor purity below 20% are removed from the test data set, the DASH model reached 97% specificity and 95% sensitivity at a >0.8 loss of heterozygosity prediction cutoff. Continuing with the above example, the DASH model performs better than the LOHHLA algorithm, which achieves 94% specificity and 82% sensitivity over the same test data set that excludes the samples having less than 20% tumor purity. In order to achieve an optimal balance between sensitivity and specificity, the 0.2 threshold was applied for the remainder of the analyses. Using this threshold, a very strong performance of the DASH model was observed on high purity samples (F1-Score=0.93) and poorer performance with the inclusion of low purity samples (F1-Score=0.87) on the test data set.
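By way of illustration only, the sensitivity, specificity, and F1-Score values reported above follow the standard definitions based on confusion-matrix counts, which can be sketched as follows. The function name and the example counts are hypothetical and do not correspond to the test data set.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute standard classification metrics from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("each metric needs a non-empty denominator")
    sensitivity = tp / (tp + fn)          # recall on true loss events
    specificity = tn / (tn + fp)          # recall on genes without loss
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = confusion_metrics(tp=8, fp=2, tn=18, fn=2)
```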
Finally, a bar plot 430 compares the F1-Scores of various DASH models trained on individual features to the DASH model trained on all features of the training data set (see
HLA loss of heterozygosity has been observed as occurring late in tumor progression as a resistance mechanism. Furthermore, tumor types that tend to be most responsive to immune checkpoint blockade (lung, skin) also tend to produce lower purity samples. Thus, a limit of detection analysis with a gold standard sample in both low clonality and low purity settings was used to accurately validate performance of the DASH model.
With respect to limit of detection, a tumor-normal paired lymphoblast cell line sample (NCI-H2009) can be used to assess the DASH model across varying tumor purities and clonalities. In the NCI-H2009 sample, HLA-A is homozygous while: (i) both HLA-B*51:01 and HLA-C*15:02 alleles are deleted; and (ii) HLA-B*07:02 and HLA-C*07:02 are retained. Deep sequencing can be performed on the tumor and normal cell lines at 50× coverage and 30× coverage, respectively. To simulate a realistic sequencing depth, the normal data can be downsampled to reflect 25× sequencing coverage. To create tumor data of decreasing purity, increasing proportions of normal reads can be mixed with decreasing proportions of tumor reads. The combined normal and tumor reads can be summed to an average of 35× sequencing coverage to represent the tumor sample. As used herein, sequencing coverage refers to the average number of reads that align to known reference bases. During sequencing, the sequencing coverage level can be used to determine whether variant discovery can be made with a certain degree of confidence at particular base positions. For example, a recommended sequencing coverage for whole-genome sequencing may range from 30× to 50×, depending on application and statistical model. In another example, a recommended sequencing coverage for whole-exome sequencing may be 100×.
All combinations of normal and tumor subsamples can be generated in replicates of 10 using the seqkit library. In some instances, to simulate lower subclonality, the proportion of tumor reads in the mixture is set to the product of the desired tumor purity and subclonality. The tumor purity can then be inflated to reflect the desired tumor purity. Samples without HLA loss of heterozygosity can be simulated by only including normal reads in the tumor sample and increasing the estimated tumor purity to reflect the desired range. These runs can be used to estimate specificity.
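By way of illustration only, the read proportions for an in silico mixture can be sketched as follows. The function name is hypothetical, the total read count is an arbitrary example, and the sketch only computes how many reads to draw from each source; the subsampling itself is not shown.

```python
def mixture_read_counts(total_reads, tumor_purity, subclonality=1.0):
    """Return (tumor_reads, normal_reads) for a simulated tumor sample.

    The tumor read fraction is the product of the desired tumor purity and
    the desired subclonality, as described above. Rounding to whole reads
    is an assumption of this sketch.
    """
    if not (0.0 <= tumor_purity <= 1.0 and 0.0 <= subclonality <= 1.0):
        raise ValueError("purity and subclonality must lie in [0, 1]")
    tumor_fraction = tumor_purity * subclonality
    tumor_reads = round(total_reads * tumor_fraction)
    return tumor_reads, total_reads - tumor_reads


# Hypothetical example: 50% purity with a 50% subclonal deletion.
tumor_reads, normal_reads = mixture_read_counts(1000, 0.5, 0.5)
```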
To validate allele-specific HLA loss of heterozygosity in samples, subject-specific primers and probes can be designed and tested for depletion of allele-specific DNA with digital PCR. Since each subject has a unique set of up to 6 HLA class I alleles, subject-specific primers and probes can be designed for each subject. These primers and probes can bind with high specificity to each allele of interest and discriminate against all other alleles and the rest of the genome. Due to the similarity of some homologous alleles, good primers and probes may not exist for all subjects. In some instances, primers and probes are designed for eleven homologous allele pairs with HLA loss of heterozygosity predicted by the DASH model from ten different subjects and one cell line to maximize discrimination between alleles. Furthermore, a probe targeting RNase P can also be used to serve as an internal positive control. The HLA allele and RNase P probes can be assigned different fluorescence to allow multiplexing. A negative control sample (e.g., H2O) can be used.
To assess the efficiency of the primers and probes, digital PCR can be performed in triplicate on the DNA from the normal and tumor samples (excluding subject C, which can be performed in duplicate). Three samples can be from the training dataset (B, D, K) and the remaining seven samples can be independent. To analyze the data, both the lost and retained alleles can be normalized by the control gene to account for sample input variation. The primers and probes can be deemed successful if the ratio of the HLA allele copies to the multiplexed RNase P copies is approximately 0.5 in the normal sample, because each HLA allele is expected to be haploid and RNase P is expected to be diploid. Then, for the primer designs that fit this requirement, the allele to RNase P ratio in the tumor DNA is compared to the allele to RNase P ratio in the normal DNA with a one-sided T-test to determine if there has been a significant drop in the tumor. This test is performed for both the predicted retained allele and the predicted lost allele. Allelic imbalance is determined by measuring a significant difference between the predicted lost and predicted retained alleles in the normal DNA and the tumor DNA. Of note, this validation focuses on specific sections of each gene. Thus, it is not formulated to catch small focal deletions in a small portion of the gene.
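By way of illustration only, the normalization of allele copies by the RNase P control and the check of the expected 0.5 ratio in the normal sample can be sketched as follows. The function names, the replicate values, and the tolerance of 0.1 around the expected ratio are hypothetical assumptions; the one-sided T-test described above is not reproduced here.

```python
from statistics import mean


def control_normalized_ratios(allele_copies, rnase_p_copies):
    """Divide each replicate's allele copies by its multiplexed RNase P copies."""
    if len(allele_copies) != len(rnase_p_copies):
        raise ValueError("replicate counts must match")
    return [a / c for a, c in zip(allele_copies, rnase_p_copies)]


def primer_passes(normal_ratios, expected=0.5, tolerance=0.1):
    """Check that the mean ratio in normal DNA is near the expected value.

    The tolerance is an assumed acceptance window, not a value from the text.
    """
    return abs(mean(normal_ratios) - expected) <= tolerance


# Hypothetical triplicate copy counts for one allele.
normal = control_normalized_ratios([50, 52, 48], [100, 100, 100])
tumor = control_normalized_ratios([25, 26, 24], [100, 100, 100])
passes = primer_passes(normal)
```

In this hypothetical example the tumor ratio is about half of the normal ratio, which is the kind of drop the one-sided test would then evaluate for significance.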
To assess the functional impact of HLA loss of heterozygosity on peptide presentation by MHC molecules, quantitative immunopeptidomics can be performed on two colorectal and four lung tumor-normal paired fresh frozen samples. The samples can be homogenized, normalized for protein content between the tumor and normal and the clarified homogenates can be applied to a pan-MHC-I antibody (W6/32)-linked immunoaffinity resin. In some instances, the success of immunoprecipitation from the lysates is assessed using ELISA, by comparing the MHC concentration pre- and post-IP. MHC-associated peptides can be eluted and collected. Eluted peptides from tumor and normal samples can be labeled and analyzed in a single run for each pair, in high resolution HCD mode.
The resulting raw files of all six samples can be processed together. Peptide identification can be performed using a de novo identification followed by a database search. For example, parameters for database search can be as follows: precursor mass tolerance: 10 ppm, fragment mass tolerance: 0.03 Da, protein database: UniProt sequences downloaded in April 2019, enzyme digestion: none, fixed modifications: carbamidomethylation of cysteine (+57.02 Da) and TMT10plex at all N-terminal amino acids and lysines (+229.16 Da), variable modifications: protein N-terminal acetylation (+42.0106 Da) and oxidation of methionine (+15.9949 Da). Peptides can be filtered at 1% FDR and reporter ions can be quantified. The list of quantified peptides can be further filtered to increase the quality of calls by removing peptides that do not have expected TMT N-terminal or lysine modifications, peptides with low intensity (less than 10E4 precursor ion intensity) and suspicious peptides with poly amino acids. Then, the intensities can be log 2 transformed and the data can be median normalized. Finally, a fold change can be calculated from the log 2 transformation, with values less than 0 representing a depletion of peptide in the tumor sample and values greater than 0 representing enrichment of peptide in the tumor sample.
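By way of illustration only, the log 2 transformation, median normalization, and fold change calculation can be sketched as follows. The function names and intensity values are hypothetical. The sketch assumes that median normalization subtracts each sample's median log 2 intensity and that the fold change is the tumor value minus the normal value for the same peptide.

```python
import math
from statistics import median


def median_normalized_log2(intensities):
    """Log 2 transform intensities and center them on the sample median."""
    logs = [math.log2(x) for x in intensities]
    center = median(logs)
    return [value - center for value in logs]


def log2_fold_changes(tumor_intensities, normal_intensities):
    """Per-peptide log 2 fold change (tumor minus normal) after normalization.

    Both lists must hold the same peptides in the same order. Values below
    zero indicate depletion in the tumor; values above zero, enrichment.
    """
    if len(tumor_intensities) != len(normal_intensities):
        raise ValueError("peptide lists must have equal length")
    tumor = median_normalized_log2(tumor_intensities)
    normal = median_normalized_log2(normal_intensities)
    return [t - n for t, n in zip(tumor, normal)]


# Hypothetical reporter ion intensities for three peptides.
fold_changes = log2_fold_changes([1e5, 2e5, 4e5], [2e5, 2e5, 2e5])
```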
To assess overall changes in presentation between the normal and tumor samples, the absolute values corresponding to the logarithm of the fold changes were compared amongst the samples. Subsequently, the peptide change for specific alleles was estimated. For each subject, each peptide of a peptide set was assigned to an MHC allele. If a peptide was predicted to bind to multiple alleles, it was considered ambiguous and excluded from the analysis. If the only peptides predicted to bind to an allele were also predicted to bind to more than one allele, they were included but marked (e.g., with an asterisk). The logarithm of the fold change can be visualized to assess the enrichment or depletion of peptides from particular alleles in the tumor sample. The log 2 transformed intensity values can be compared with a Wilcoxon Rank Sum Test to assess the statistical significance of any enrichment or depletion. All comparisons with tumor purity use the tumor purity as estimated.
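By way of illustration only, the assignment of peptides to alleles can be sketched as follows. The function name, peptide strings, and allele labels are hypothetical, and the binding predictions are assumed to be supplied by a separate presentation model.

```python
def assign_peptides(predicted_binders):
    """Assign peptides to alleles from a {peptide: [alleles]} mapping.

    Peptides predicted to bind exactly one allele are assigned to it.
    Peptides predicted to bind several alleles are excluded, except for
    alleles that would otherwise have no peptides; those alleles receive
    their shared peptides with a flag set to True (the "marked" case).
    """
    alleles = sorted({a for hits in predicted_binders.values() for a in hits})
    assigned = {allele: [] for allele in alleles}
    for peptide, hits in predicted_binders.items():
        if len(set(hits)) == 1:
            assigned[hits[0]].append((peptide, False))
    for allele in alleles:
        if not assigned[allele]:
            assigned[allele] = [
                (peptide, True)
                for peptide, hits in predicted_binders.items()
                if allele in hits
            ]
    return assigned


# Hypothetical predictions for four peptides and three alleles.
example = assign_peptides({
    "PEPTIDEA": ["allele_A"],
    "PEPTIDEB": ["allele_A", "allele_B"],
    "PEPTIDEC": ["allele_B", "allele_C"],
    "PEPTIDED": ["allele_C"],
})
```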
The MHC class I presentation prediction model can be implemented using a machine-learning model trained using large scale immunopeptidome datasets and benchmarked against an existing binding-prediction model (e.g., NetMHCpan 4.0) with superior performance across several metrics. The output of the trained model can be normalized by allele in a similar manner to NetMHCpan 4.0, creating a rank metric. In some instances, the percentile rank threshold used is 0.1%. All peptide-allele combinations with ranks below this threshold can be considered as peptides that are bound and presented on a cell surface.
(2) Validation Results for Allele-Specific Genomic Validation with Digital PCR
A bar plot 710 shows an allele-specific copy number of the predicted lost allele relative to RNase P, based on the allele-specific genomic validation using digital PCR. The allele-specific copy number was measured by digital PCR for cell line mixtures of varying tumor purities. To ensure the specificity of the primers, both the predicted lost and predicted retained allele copies normalized by half of the diploid RNase P copies resulted in one copy in the normal samples.
The bar plot 710 includes a dashed line, which denotes the expected value for no change in copy number. Then, the copy number of each allele in the tumor sample was compared to the normal sample (e.g., using the one-sided Student T test). Asterisks show results indicating whether a statistically significant difference was found based on the comparison with the copy number in the normal sample. One or more copies of the lost allele in the tumor sample were found as tumor purity increased above zero, confirming the allele-specific LOH event. Digital PCR sensitivity was 100% for 10% tumor purity and above, confirming the sensitivity and replicability of allele-specific digital PCR as an orthogonal method.
Bar plots 715 and 720 each show a ratio of HLA allele digital PCR copies to multiplexed RNase P digital PCR copies. In the bar plots 715 and 720, cell line data is shown on the left and subject data is shown on the right. In addition, grey bars indicate the ratio in the normal DNA and green bars indicate the ratios in the tumor DNA. The alleles predicted by the DASH model to be retained are shown on the bar plot 715 while the alleles predicted to be deleted are shown on the bar plot 720. The dashed grey lines indicate the expected ratio of 0.5 if there are no copy number alterations. Asterisks indicate samples with p-values less than 0.05 as determined by a one-sided Student T test. To ensure the specificity of the primers, it was confirmed that both the predicted lost and predicted retained allele copies normalized by RNase P copies resulted in a 0.5 ratio in the normal sample.
The ratios for 20 of the 22 primers were found to be highly specific and in close proximity to 0.5. However, the predicted retained allele from subject C and the predicted lost allele from subject K were excluded due to low specificity. Then, the ratios of each allele in the tumor sample to the normal sample were compared.
Further, as shown in the asterisks above each bar of the respective bar plots 715 and 720, a significant depletion was found in only one of the nine predicted retained alleles and a significant depletion in eight of the nine predicted lost alleles in the tumor samples. The subject without significant digital PCR depletion in the predicted lost allele (subject J) appears to have a large amplification in the digital PCR of the retained allele. This amplification can be confirmed with a standard copy number variation call in the region surrounding the HLA gene.
A scatter plot 725 shows a distribution of probabilities of HLA loss of heterozygosity returned by the DASH model with their tumor purities. The red region indicates ambiguous calls by the DASH model. The grey vertical line indicates 20% purity. In the scatter plot 725, subject D's predicted retained allele appears to have a slight reduction in the tumor. The significant allelic imbalance in the tumor suggests that this could be due to a subclonal bi-allelic deletion or simply an amplification of the RNase P control. Excluding the call from subject D, 95% of the retained and lost alleles predicted above the 0.8 threshold were identified correctly. Furthermore, several of the samples had low tumor content, confirming the accuracy of the DASH model across variable tumor purities. The subject-specific digital PCR presented here thus represents the first allele-specific genomic HLA loss of heterozygosity validation assay.
Bar plots 815 show specificity of each primer design as measured by a ratio between the allele digital PCR copies and the multiplexed RNase P digital PCR copies in the normal sample. The grey dashed line in each of the bar plots 815 indicates an expected copy number of 1. A bar plot 820 shows a ratio between a predicted lost allele (normalized by RNase P copies) and a predicted retained allele (normalized by RNase P copies) to show allelic imbalance predicted by the DASH model. A dashed grey line of the bar plot 820 shows a ratio of one, which is expected in the normal samples. Deviations below the dashed grey line suggest allelic imbalance. Referring to the bar plot 820, the resulting significant allelic imbalance likely caused a lower confidence deletion prediction.
HLA loss of heterozygosity is hypothesized to reduce the neoantigen load by eliminating surface presentation of neoantigens that would bind to specific HLA alleles. This hypothesis has been demonstrated with organoids, but it has not been shown in complex subject tumor samples. Thus, in order to provide functional evidence of reduced peptide presentation for alleles that the DASH model predicts are lost, quantitative changes in peptide presentation were measured between adjacent-normal samples without HLA loss of heterozygosity and tumor samples with HLA loss of heterozygosity.
Waterfall plots 910 show log 2 fold change from a normal sample to a tumor sample for peptides binding to each of the alleles in a subject. In the waterfall plots 910, dark color indicates peptides that are less frequent in the tumor while shaded color indicates peptides that are more frequent in the tumor. The dashed grey line represents the midpoint of the plot and the triangles indicate the crossover point for each allele. Each waterfall plot of the waterfall plots 910 indicates whether a subject HLA allele has been deleted or retained. The peptides for each allele are visualized as a motif. Statistical significance was assessed using a Wilcoxon paired rank sum test. In the waterfall plots 910, three of four deleted alleles (predicted bi-allelic deletions) had significantly fewer predicted binding peptides in the tumor sample than in the adjacent normal sample.
Box plots 915 show log 2 fold changes of peptide intensity between lost, kept, and homozygous alleles across HLA-A, HLA-B, and HLA-C alleles. Statistical significance was assessed using a two-sided Student T-test. The box plots 915 show that peptides predicted to bind to lost alleles had reduced peptide intensity in tumor samples compared to normal samples for HLA-A and -B alleles.
(b) Relationship Between Samples without HLA Loss of Heterozygosity and Samples with Predicted HLA Loss of Heterozygosity
A box plot 1015 shows distributions of log 2 fold change intensities in samples without HLA loss of heterozygosity (control) and samples with HLA loss of heterozygosity. In the samples without any predicted HLA loss of heterozygosity (M and P samples), minimal differences were found in surface peptide between the tumor and normal samples. For example, the box plot 1015 shows, for the samples without any predicted HLA loss of heterozygosity, an interquartile range of peptide log fold changes from −0.010 to 0.013, and a median peptide fold change close to zero for all alleles. In contrast, the samples with predicted HLA loss of heterozygosity (L, C, O and N samples) showed twice as much variability in peptide presentation between the tumor and normal samples, with the interquartile range of peptide log fold changes from −0.026 to 0.023.
A scatter plot 1020 shows a relationship between estimated tumor purity and a standard deviation of the log 2 fold change of peptide intensities. With respect to the scatter plot 1020, green dots depict samples without HLA loss of heterozygosity and blue dots depict samples with HLA loss of heterozygosity. The deviation of intensities between tumor and normal samples increased as the samples gained in tumor purity, with the highest purity sample showing an average deviation of 0.062 (sample L, 58% tumor purity).
(c) Immunopeptidomics Data Corresponding to Samples without Predicted HLA Loss of Heterozygosity
(d) Immunopeptidomics Data Corresponding to Samples with Predicted HLA Loss of Heterozygosity
Though the three waterfall plots 1205-1215 representing additional low tumor purity samples with predicted HLA loss of heterozygosity had a larger peptide log fold change than the control samples (e.g., the waterfall plots 1105-1110 of
HLA loss of heterozygosity prevalence data can demonstrate that a large percentage of subjects are impacted by HLA loss of heterozygosity in several tumor types. Though non-small-cell lung carcinoma is known to have a high incidence of HLA loss of heterozygosity, a large fraction of HLA loss of heterozygosity was identified in other types of cancers, including cervical cancer (44%) and head and neck squamous cell carcinoma (40%). In contrast, HLA loss of heterozygosity was observed in only 14% of subjects with melanoma, which also has a high mutational burden. Further, cervical cancer is strongly associated with human papillomavirus (HPV), which may play a role in the high frequency of HLA loss of heterozygosity. In some instances, subjects lost more than one HLA allele at a time, potentially having stronger implications on tumor evolution.
To assess the pervasiveness of HLA loss of heterozygosity as a potential immune escape mechanism, the DASH model was applied to 611 tumors across 15 tumor types. A total of 593 subjects from across 14 tumor types were considered for analysis. Each subject had a tumor sample and a normal sample that were sequenced and analyzed. A subset of these samples was used for training the DASH model. The DASH model was applied on each sample to predict the genes (HLA-A, -B and -C) that were impacted by HLA loss of heterozygosity. The frequencies of HLA loss of heterozygosity co-occurrence between multiple genes within a single subject were calculated based on a reduced cohort that only contained fully heterozygous subjects.
A bar plot 1310 shows the number of subjects with 1, 2 or 3 genes impacted by HLA loss of heterozygosity. In this example, only subjects that are fully heterozygous across HLA-A, -B and -C are shown. In the bar plot 1310, subjects with HLA loss of heterozygosity more frequently lost all three genes (70% of the subjects), compared to losing only one gene or two genes (20% and 10% of subjects, respectively).
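By way of illustration only, the tally of subjects by number of impacted genes, restricted to fully heterozygous subjects, can be sketched as follows. The function name, subject identifiers, and input layout are hypothetical.

```python
from collections import Counter

GENES = ("HLA-A", "HLA-B", "HLA-C")


def loh_gene_count_distribution(subjects):
    """Count subjects by number of HLA genes with loss of heterozygosity.

    `subjects` maps a subject identifier to {gene: (is_heterozygous, has_loh)}.
    Subjects that are not heterozygous at all three genes are skipped, as are
    subjects with no impacted gene.
    """
    distribution = Counter()
    for calls in subjects.values():
        if not all(calls[gene][0] for gene in GENES):
            continue
        genes_lost = sum(1 for gene in GENES if calls[gene][1])
        if genes_lost:
            distribution[genes_lost] += 1
    return distribution


# Hypothetical cohort of four subjects.
cohort = {
    "s1": {"HLA-A": (True, True), "HLA-B": (True, True), "HLA-C": (True, True)},
    "s2": {"HLA-A": (True, True), "HLA-B": (True, False), "HLA-C": (True, False)},
    "s3": {"HLA-A": (False, False), "HLA-B": (True, True), "HLA-C": (True, True)},
    "s4": {"HLA-A": (True, False), "HLA-B": (True, False), "HLA-C": (True, False)},
}
distribution = loh_gene_count_distribution(cohort)
```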
A box plot 1315 shows a distribution of the fraction of each genome impacted by loss of heterozygosity. Each tumor type is divided into subjects with HLA loss of heterozygosity and without HLA loss of heterozygosity. Only tumor types with at least 10 subjects impacted by HLA loss of heterozygosity are shown. Statistical analyses are performed with Mann-Whitney U tests and are Bonferroni corrected. Though high frequencies of HLA loss of heterozygosity in specific tumor types are of interest due to impairment of the antigen presentation pathway, the high frequencies alone do not necessitate an evolutionary advantage of the loss of heterozygosity event. As shown in the box plot 1315, it was found that subjects with HLA loss of heterozygosity have significantly higher estimated rates of loss of heterozygosity across their genome, suggesting that some loss of heterozygosity in the HLA region may happen by chance (pan-cancer p<2.2e-14).
To investigate whether HLA loss of heterozygosity frequencies across cancer types could occur by chance, the average estimated rate of loss of heterozygosity across the genome was compared with the frequency of HLA loss of heterozygosity in a given tumor type cohort. If loss of heterozygosity were occurring randomly in the HLA region, the rate and the frequency would be expected to be similar. In particular, if the estimated copy number of the B allele in a region is zero, the region can be considered as having loss of heterozygosity. The number of base pairs impacted by loss of heterozygosity can be totaled for each subject and divided by the total number of base pairs across the genome (3.2 billion) to obtain the fraction of the genome with loss of heterozygosity. Though the fraction may be an underestimate due to limited coverage of genomic regions without genes, the underestimation can be expected to be consistent across subjects.
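By way of illustration only, the fraction of the genome with loss of heterozygosity can be sketched as follows. The function name and segment layout are hypothetical, segment coordinates are assumed to be half-open, and the 3.2 billion base pair denominator is the value given above.

```python
TOTAL_BASE_PAIRS = 3_200_000_000  # denominator stated in the description


def fraction_with_loh(segments, total_base_pairs=TOTAL_BASE_PAIRS):
    """Fraction of base pairs in segments whose B allele copy number is zero.

    `segments` is an iterable of (start, end, b_allele_copy_number) tuples
    with half-open coordinates, so each segment spans `end - start` bases.
    """
    loh_base_pairs = sum(
        end - start
        for start, end, b_allele_copy_number in segments
        if b_allele_copy_number == 0
    )
    return loh_base_pairs / total_base_pairs


# Hypothetical copy number segments for one subject.
fraction = fraction_with_loh([
    (0, 1_600_000_000, 0),              # B allele lost
    (1_600_000_000, 3_200_000_000, 1),  # B allele retained
])
```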
A scatter plot 1320 shows a relationship between the average fraction of the genome impacted by loss of heterozygosity and the frequency of HLA loss of heterozygosity in each tumor type. The grey dashed line indicates x=y. As shown in the scatter plot 1320, almost all tumor types have a higher frequency of HLA loss of heterozygosity than genome-wide loss of heterozygosity. While this difference is small for some tumor types, it was observed that colorectal cancer, kidney renal clear cell carcinoma, non-small-cell lung carcinoma-A, pancreatic cancer and head and neck squamous cell carcinoma had substantial enrichment of HLA loss of heterozygosity. The data shown in the scatter plot 1320 suggests that HLA loss of heterozygosity may provide a greater evolutionary advantage in these tumor types than others. Alternatively, HLA may be more prone to deletion than the rest of the genome.
Bar plots 1325-1340 show differences of neoantigen expression between subjects without HLA loss of heterozygosity (green) and subjects with HLA loss of heterozygosity (blue), with the assumption that tumors with a greater ability to display neoantigens would be under higher selective pressure to incur HLA loss. The bar plot 1325 shows an average difference of neoantigen burden between the two subject categories across various types of cancers. The remaining bar plots 1330-1340 show further differences between the two subject categories. For example, the bar plot 1330 shows a statistically significant difference between the two subject categories for CD274 (PD-L1) expression (p=0.02), the bar plot 1335 shows a statistically significant difference between the two subject categories for percentage of microsatellite sites with instability (p=0.01), and the bar plot 1340 shows a statistically significant difference between the two subject categories for percentage of patients with Fusobacterium nucleatum, which is an oral bacterium with known colon cancer associations (p=0.005). Accordingly, the bar plots 1325-1340 suggest that tumors with a greater ability to display neoantigens may be more likely to incur HLA loss of heterozygosity.
Since HLA loss of heterozygosity impacts the ability of a tumor cell to present antigen on the cell surface for recognition by the immune system, it was hypothesized that tumors with a greater ability to display neoantigens would be under higher selective pressure to incur HLA loss. Considering subjects with high HLA evolutionary diversity are able to present a larger immunopeptidome and respond better to checkpoint inhibitors, experiments were conducted to determine if such subjects are more susceptible to HLA loss.
Further, a boxplot 1410 shows a distribution corresponding to a number of mutations (e.g., single-nucleotide variant, indel and fusion) across subjects with and without HLA loss of heterozygosity. Only tumor types with at least 8 subjects impacted by HLA loss of heterozygosity are shown. Statistical analyses are performed with Mann-Whitney U tests and are Bonferroni corrected. Mutational burdens (i.e., numbers of mutations) were identified using tumor-specific genomic events of at least 5% allelic fraction that were verified using transcriptomic data. All potential neoepitopes (8-, 9-, 10- and 11-mers) were created for each mutation and tested for presentation. If any 8-, 9-, 10- or 11-mers containing the mutation were predicted to bind to any of the subject-specific alleles, they were considered putative neoepitopes. The boxplot 1410 shows that high mutation rates and neoantigen burdens can present pressure for cells to lose HLA. A boxplot 1415 shows the percentage of patients with HLA LOH in each ventile of mutation burden pan-cancer. Both of the boxplots 1410 and 1415 show a “goldilocks effect” for mutation burdens, in which diseases with the lowest tumor mutational burden and highest tumor mutational burden exhibited the lowest prevalence of HLA loss of heterozygosity, whereas tumors in between exhibited the highest prevalence of HLA loss of heterozygosity.
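By way of illustration only, the enumeration of candidate 8-, 9-, 10- and 11-mers containing a mutated residue can be sketched as follows. The function name and the example sequence are hypothetical, the sketch assumes a single substituted residue at a known zero-based position in the mutant protein sequence, and the presentation prediction step is not shown.

```python
def candidate_neoepitopes(mutant_sequence, mutation_index, lengths=(8, 9, 10, 11)):
    """List (start, peptide) for every window that covers the mutated residue."""
    if not 0 <= mutation_index < len(mutant_sequence):
        raise ValueError("mutation_index is outside the sequence")
    candidates = []
    for length in lengths:
        first_start = max(0, mutation_index - length + 1)
        last_start = min(mutation_index, len(mutant_sequence) - length)
        for start in range(first_start, last_start + 1):
            candidates.append((start, mutant_sequence[start:start + length]))
    return candidates


# Hypothetical 20-residue sequence with the mutated residue at index 10.
sequence = "ACDEFGHIKLMNPQRSTVWY"
candidates = candidate_neoepitopes(sequence, 10)
```

Windows that would extend past either end of the sequence are skipped, so a mutation near a terminus yields fewer candidates than one in the interior.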
A boxplot 1420 shows a distribution of predicted neoepitopes across subjects with and without HLA loss of heterozygosity across each of various types of cancers. Statistical analyses are performed with Mann-Whitney U tests and are Bonferroni corrected. The boxplot 1420 shows pan-cancer evidence for a correlation with neoantigen burden (p=0.03). Further, a boxplot 1425 shows correlations between HLA loss of heterozygosity and CD274 expression (PD-L1) across each of various types of cancers (p=0.02), and a boxplot 1430 shows correlations between HLA loss of heterozygosity and microsatellite instability (MSI) status across each of various types of cancers (p=0.01). It was found that more neoantigens are predicted to bind to lost HLA alleles than their homologous counterparts (Wilcoxon rank sum test, p=0.01), thereby suggesting that HLA loss of heterozygosity contributes to the selective exposure of antigen to the immune system.
The allele-specific neoantigen composition changes detected in the head and neck squamous cell carcinoma cohort suggest that HLA loss of heterozygosity is altering tumor evolution in response to immune checkpoint blockade therapies. This observation corroborates that HLA sequence variability is a component of effective immune checkpoint blockade response by tumor cells. Though larger cohorts with detailed response data are needed to confirm the impact of HLA loss of heterozygosity on subject response and survival, the examples shown below suggest that accurate detection of HLA loss of heterozygosity will be a factor for checkpoint immunotherapy and cancer vaccine target selection.
Although HLA loss of heterozygosity appears to apply limited evolutionary pressure during tumor growth, immune checkpoint inhibitors serve to increase immune pressure. Thus, the impact of HLA loss of heterozygosity in response to immunotherapy was investigated. Since HLA loss of heterozygosity severely reduces the immunopeptidome by eliminating several HLA alleles, it was reasoned that HLA loss of heterozygosity should impair response to immunotherapy through reduced MHC presentation.
To identify HLA loss of heterozygosity in response to immunotherapies, an experiment was conducted to identify HLA loss of heterozygosity on a cohort of seven head and neck squamous cell carcinoma subjects who received a single dose of PD-1 inhibitor (nivolumab). The pre- and post-treatment tumor biopsies corresponding to the subjects were sequenced. The interaction between germline variability in HLA sequence and pretreatment somatic alterations to antigen presentation machinery was identified. Using the DASH model, four subjects with HLA loss of heterozygosity were found. Moreover, Beta-2 microglobulin loss of heterozygosity in one subject and somatic mutations in HLA alleles of two other subjects were discovered. The three subjects with the most germline HLA sequence diversity all suffered from HLA loss of heterozygosity, with the subject with the highest diversity also having Beta-2 microglobulin loss of heterozygosity.
Pre- and post-intervention matched normal, tumor and plasma samples were collected from a cohort of 7 subjects with head and neck squamous cell carcinoma. Following baseline sample collection, all subjects received a single dose of nivolumab, followed by definitive resection of the primary tumor mass approximately one month later when feasible, or a second biopsy where resection was impractical. Due to the resection protocol, RECIST criteria were not used to evaluate response in resected subjects. Solid tumor and matched normal samples were profiled.
For each subject with pre- and post-treatment samples, the DASH model can be used to predict occurrence of HLA loss of heterozygosity. In addition, an HLA evolutionary diversity score can be determined for detecting HLA somatic mutation and Beta-2 microglobulin loss of heterozygosity. Potential epitopes for each mutation detected pre- or post-treatment can be predicted to detect binding with all subject-specific alleles, as described above. Neoepitopes can be identified from mutations that are observed post-treatment but are not observed pre-treatment. In some instances, a neoepitope is predicted to bind to multiple HLA alleles. Some neoepitopes bound to homozygous alleles can be excluded. A paired Wilcoxon signed-rank test can be performed to assess the statistical significance of the difference between the number of novel neoepitopes predicted to bind to the lost HLA-A/-B alleles and the number predicted to bind to their retained homologous alleles.
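The counting step described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the function name `count_novel_neoepitopes`, the mutation identifiers, the peptide labels, and the input layout (a mapping from each mutation to its predicted peptide-allele binding pairs) are all hypothetical, and the binding predictions are assumed to come from a separate upstream predictor.

```python
from collections import Counter


def count_novel_neoepitopes(pre_mutations, post_mutations,
                            predicted_binders, homozygous_alleles=()):
    """Count novel post-treatment neoepitopes per HLA allele.

    predicted_binders maps a mutation id to a list of (peptide, allele)
    pairs predicted to bind (hypothetical layout). A neoepitope predicted
    to bind several alleles is counted once per allele (multi-counted).
    """
    # Novel mutations: observed post-treatment but not pre-treatment.
    novel = set(post_mutations) - set(pre_mutations)
    excluded = set(homozygous_alleles)
    counts = Counter()
    for mutation in sorted(novel):
        for _peptide, allele in predicted_binders.get(mutation, []):
            if allele in excluded:
                continue  # neoepitopes on homozygous alleles are excluded
            counts[allele] += 1
    return counts


# Hypothetical example data, for illustration only.
pre = {"mutA", "mutB"}
post = {"mutA", "mutC", "mutD"}
binders = {
    "mutA": [("PEP0", "HLA-A*24:02")],
    "mutC": [("PEP1", "HLA-A*02:01"), ("PEP1", "HLA-B*07:02")],
    "mutD": [("PEP2", "HLA-A*02:01")],
}
novel_counts = count_novel_neoepitopes(pre, post, binders)
filtered_counts = count_novel_neoepitopes(
    pre, post, binders, homozygous_alleles={"HLA-B*07:02"})
```

The per-allele counts produced this way could then be grouped into lost and retained alleles for the paired comparison.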
Bars in a bar graph 1510 indicate an HLA-I Evolutionary Divergence score for each subject in the cohort population identified in Section VII(a) herein. For HLA genes, shaded boxes indicate somatic loss of heterozygosity and shaded boxes for rows “HLA mutation” and “B2M LOH” indicate a mutation in an HLA gene or loss of heterozygosity in the Beta-2 microglobulin gene, respectively. Homozygous alleles and alleles with very few differences are noted with grey squares.
Circle plots 1515 indicate a ratio of novel post-treatment neoantigens predicted to bind to each of a subject's HLA alleles. Portions of each circle plot are shaded differently to identify whether HLA alleles are deleted or retained. The outer circle shows the ratio of neoepitopes predicted to bind to all lost and retained alleles, and the inner circle shows the breakdown by specific allele. The value inside the circle represents the number of novel neoepitopes predicted post-treatment (multi-counted if predicted to be presented by multiple alleles). For the circle plots 1515, neoepitopes presented by homozygous alleles were excluded.
As shown in the circle plots 1515, in each subject with HLA loss of heterozygosity, it was found that more new post-treatment neoantigens were predicted to bind to deleted HLA alleles than to the retained HLA alleles for the corresponding subject.
A scatter-line plot 1520 shows a paired relationship between the count of novel, post-treatment allele-specific predicted neoepitopes for retained and deleted HLA alleles. Only HLA-A and -B alleles are shown in the scatter-line plot 1520. Statistical significance is assessed using a paired Wilcoxon signed-rank test. Since sequence diversity in HLA-A and -B alleles alone may have an impact on response to immunotherapy, the number of novel post-treatment neoantigens predicted to bind to the lost HLA-A and -B alleles was compared to the number predicted to bind to their retained homologous counterparts, and a statistically significant difference was found across the cohort (p=0.027, Wilcoxon signed-rank). This consistent shift in neoantigen composition suggests that HLA loss of heterozygosity acts as an evolutionary force in resistance to immunotherapy.
Box plots 1610 show a difference in estimated tumor infiltrating CD8+ T cell quantification pre- and post-treatment for subjects with and without HLA loss of heterozygosity. Statistical significance is assessed using a Mann-Whitney U test. In the box plots 1610, a trend was observed toward increased CD8+ T cells after treatment for subjects without HLA loss of heterozygosity. In contrast, the same trend was not observed in subjects with HLA loss of heterozygosity. The difference between samples shown in the box plots 1610 suggests that a decrease in diversity of neoantigens may reduce immune infiltration.
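For illustration, a paired Wilcoxon signed-rank test of the kind referenced above can be sketched in plain Python. The sketch computes an exact two-sided p-value by enumerating all sign assignments, which is practical only for small cohorts. The function names and the example counts are hypothetical; the counts are not the cohort data and are not intended to reproduce the reported p-value.

```python
from itertools import product


def average_ranks(values):
    """Rank values in ascending order (1-based); ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test.

    Returns (W+, p), where W+ is the sum of ranks of positive differences.
    Zero differences are dropped before ranking.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    # Under the null, each rank is equally likely to carry either sign.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


# Hypothetical per-allele-pair counts of novel neoepitopes.
deleted_counts = [12, 9, 15, 7, 11, 8]
retained_counts = [5, 6, 7, 3, 4, 6]
w_plus, p_value = wilcoxon_signed_rank(deleted_counts, retained_counts)
```

With six pairs that all shift in the same direction, the smallest attainable two-sided p-value is 2/64, which shows how strongly cohort size bounds the achievable significance.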
At operation 1710, a computer system accesses a machine-learning model. The machine-learning model can be trained using a training data set that includes, for a subject of a set of subjects: (1) allele-specific features; (2) subject-specific features; and (3) whole-exome features. The allele-specific features can include, for a genomic region of an HLA allele: an adjusted B allele frequency that represents a ratio between a first B allele frequency of heterozygous alleles in the tumor sample that correspond to the genomic region and a second B allele frequency of heterozygous alleles in the genomic region and associated with one or more control samples; and a ratio between a first allele-specific coverage of the tumor sample that corresponds to the genomic region and a second allele-specific coverage of the one or more control samples that corresponds to the genomic region. In some instances, the allele-specific features correspond to a genomic region of an HLA allele that was identified as having a somatic mutation.
The subject-specific features can include an estimated tumor purity value and an estimated tumor ploidy value corresponding to a tumor sample of the subject. Tumor purity, as used herein, refers to a ratio of tumor cells to total cells in the sample. Tumor ploidy, as used herein, refers to an average copy number of the entire tumor genome. The whole-exome features can include, for the HLA allele, an indication of whether at least part of a flanking genomic region surrounding the HLA allele has been deleted. Example embodiments for training the machine-learning model can be found in Section III of the present disclosure.
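The two allele-specific ratios described above can be sketched as follows. This is a hedged illustration: the function names and read counts are hypothetical, and averaging the site-level B allele frequencies across a region is an assumption made here for simplicity, since the text does not specify how sites within a region are aggregated.

```python
from statistics import mean


def region_baf(sites):
    """Mean B allele frequency over heterozygous sites in one region.

    sites: list of (b_allele_reads, total_reads) tuples. Taking the mean
    across sites is an assumption of this sketch.
    """
    return mean(b / total for b, total in sites)


def adjusted_baf(tumor_sites, control_sites):
    """Ratio of the tumor B allele frequency to the control B allele frequency."""
    return region_baf(tumor_sites) / region_baf(control_sites)


def coverage_ratio(tumor_coverage, control_coverage):
    """Ratio of tumor to control allele-specific coverage for one region."""
    return tumor_coverage / control_coverage


# Hypothetical read counts for one genomic region of one HLA allele.
tumor_sites = [(10, 40), (12, 40)]
control_sites = [(20, 40), (22, 40)]
adj_baf = adjusted_baf(tumor_sites, control_sites)
cov_ratio = coverage_ratio(tumor_coverage=30.0, control_coverage=60.0)
```

Dividing by the control sample is intended to cancel allele-specific biases that are present in both samples, so that the remaining deviation reflects the tumor.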
Performance of the trained machine-learning model can be evaluated by using one or more validation techniques. For example, the machine-learning model can be validated using in silico cell line mixtures, subject-specific primers and probes generated using digital PCR, and/or immunopeptidomics data corresponding to the training data set. Example embodiments for validating the machine-learning model can be found in Section IV of the present disclosure.
At operation 1720, the computer system accesses sequence data corresponding to a biological sample of a particular subject. The biological sample can be a tissue sample of the particular subject that may include DNA derived from cancer cells. In some instances, the sequence data is derived from the biological sample and a reference sample that does not include cancer cells. The biological sample can include cell-free DNA, some of which can have originated from healthy cells and some from tumor cells. The sequence data can be profiled to identify various characteristics corresponding to the biological sample. For example, the characteristics may include comprehensive tumor mutation information, gene expression quantification, neoantigen characterization, HLA alleles (types and mutations), and tumor microenvironment profiling.
In some instances, the sequence data is generated by using whole genome sequencing or whole exome sequencing on the biological sample to generate a plurality of sequence reads. In some instances, HLA genotyping is performed on the plurality of sequence reads to identify one or more HLA alleles that correspond to the sequence data. Reference sequences corresponding to the identified HLA alleles can be retrieved, and the sequence reads can be aligned to the retrieved reference sequences. After alignment, allele-specific coverage for each genomic region can be determined for the identified HLA alleles corresponding to the sequence data. In some instances, the aligned sequence data can be analyzed to identify allele-specific copy number alterations from the particular HLA allele-type. Additionally or alternatively, the sequencing data can be analyzed to estimate tumor purity (alternatively referred to as tumor cellularity) and tumor ploidy. Example embodiments for generating the sequence data can be found in at least Section I of the present disclosure.
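As a rough sketch of the coverage step above, the following computes mean read depth per fixed-size bin for each HLA allele reference. The use of fixed-size bins as the genomic regions, the function name, the allele labels, and the input layout (tuples of allele name and half-open alignment interval) are assumptions for illustration; in practice the intervals would be derived from the aligned reads.

```python
def allele_specific_coverage(alignments, allele_lengths, bin_size):
    """Return {allele: [mean depth per bin]} from aligned read intervals.

    alignments: iterable of (allele, start, end) with 0-based, half-open
    coordinates on that allele's reference sequence (hypothetical layout).
    """
    coverage = {}
    for allele, length in allele_lengths.items():
        # Difference array: +1 where a read starts, -1 where it ends.
        diff = [0] * (length + 1)
        for name, start, end in alignments:
            if name != allele:
                continue
            start, end = max(start, 0), min(end, length)
            if start >= end:
                continue
            diff[start] += 1
            diff[end] -= 1
        # Prefix sum converts the difference array to per-base depth.
        depth, running = [], 0
        for i in range(length):
            running += diff[i]
            depth.append(running)
        bins = []
        for i in range(0, length, bin_size):
            window = depth[i:i + bin_size]
            bins.append(sum(window) / len(window))
        coverage[allele] = bins
    return coverage


# Hypothetical alignments to two short allele references.
example_alignments = [("A1", 0, 5), ("A1", 2, 8), ("A2", 5, 10)]
example_coverage = allele_specific_coverage(
    example_alignments, {"A1": 10, "A2": 10}, bin_size=5)
```

The same routine applied to tumor and control samples yields the paired per-region coverage values from which a coverage ratio can be formed.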
At operation 1730, the computer system generates a result corresponding to a probability of whether a loss of heterozygosity exists in an HLA allele identified in the tissue sample of the particular subject by processing the sequence data using the machine-learning model. The machine-learning model (e.g., the DASH model) uses the allele-specific data for each of the identified HLA alleles as an input to generate the result. Other types of information corresponding to the identified HLA alleles (e.g., an indication of whether at least part of a flanking genomic region surrounding the HLA allele has been deleted) can be used as additional input to the trained machine-learning model. The machine-learning model can include one or more gradient boosting algorithms to process the above features of the sequence data to generate the result.
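To make the gradient boosting step concrete, the following is a toy, from-scratch sketch that boosts single-split decision stumps on the log-loss and outputs a probability. It is not the DASH model: the feature layout (adjusted B allele frequency, coverage ratio, flanking-deletion flag), the training values, the labels, and the hyperparameters are all hypothetical, and the values are not meant to reflect real feature distributions. A production system would more likely rely on an established gradient boosting library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(rows, residuals):
    """Find the one-feature threshold split minimizing squared error."""
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [res for r, res in zip(rows, residuals) if r[f] <= thr]
            right = [res for r, res in zip(rows, residuals) if r[f] > thr]
            lval, rval = sum(left) / len(left), sum(right) / len(right)
            err = (sum((v - lval) ** 2 for v in left)
                   + sum((v - rval) ** 2 for v in right))
            if best is None or err < best[0]:
                best = (err, f, thr, lval, rval)
    if best is None:
        raise ValueError("no feature varies across the training rows")
    return best[1:]


def train_gradient_boosting(rows, labels, n_rounds=50, learning_rate=0.3):
    """Boost stumps on log-loss; labels must contain both 0 and 1."""
    pos = sum(labels) / len(labels)
    base = math.log(pos / (1 - pos))  # initial log-odds
    scores = [base] * len(rows)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log-loss with respect to the score.
        residuals = [y - sigmoid(s) for y, s in zip(labels, scores)]
        f, thr, lval, rval = fit_stump(rows, residuals)
        stumps.append((f, thr, lval, rval))
        scores = [s + learning_rate * (lval if r[f] <= thr else rval)
                  for s, r in zip(scores, rows)]
    return base, learning_rate, stumps


def predict_probability(model, row):
    base, lr, stumps = model
    score = base + sum(lr * (lval if row[f] <= thr else rval)
                       for f, thr, lval, rval in stumps)
    return sigmoid(score)


# Hypothetical rows: [adjusted BAF, coverage ratio, flanking deletion flag].
train_rows = [[0.55, 0.45, 1], [0.60, 0.50, 1], [0.50, 0.55, 0],
              [0.98, 1.00, 0], [1.02, 0.95, 0], [0.95, 1.05, 0]]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = loss of heterozygosity (toy labels)
toy_model = train_gradient_boosting(train_rows, train_labels)
p_loh = predict_probability(toy_model, [0.58, 0.48, 1])
p_no_loh = predict_probability(toy_model, [1.00, 1.00, 0])
```

Each round fits a stump to the current residuals and adds a damped correction to the running score, so the output is a probability rather than a hard call.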
In some instances, the result is used to predict a decrease in efficacy of an immune checkpoint blockade therapy being administered to the particular subject. The result can be used to predict a particular type of cancer associated with the subject, as the tumor samples with predicted HLA loss of heterozygosity can be identified as having the particular type of cancer based on their corresponding changes in peptide presentation.
At operation 1740, the computer system outputs the result. Process 1700 terminates thereafter.
A computer-readable signal medium includes a propagated data signal with computer-readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof. A computer-readable signal medium includes any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use in connection with computer system 1800.
Further, the memory 1804 includes an operating system, programs, and applications. The processor 1802 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors. For example, the computing system 1800 can execute instructions (e.g., program code) that configure the processor 1802 to perform one or more of the operations described herein. The program code includes, for example, code implementing training of the DASH model, using the DASH model, accessing the sequence data, and/or any other suitable applications that perform one or more operations described herein. The instructions could include processor-specific instructions generated by a compiler or an interpreter from code written in any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, and ActionScript.
The program code can be stored in the memory 1804 or any suitable computer-readable medium and can be executed by the processor 1802 or any other suitable processor. In some embodiments, all modules in the computer system for predicting loss of heterozygosity in HLA alleles are stored in the memory 1804. In additional or alternative embodiments, one or more of these modules from the above computer system are stored in different memory devices of different computing systems.
The memory 1804 and/or the processor 1802 can be virtualized and can be hosted within another computing system of, for example, a cloud network or a data center. I/O peripherals 1808 include user interfaces, such as a keyboard, screen (e.g., a touch screen), microphone, speaker, other input/output devices, and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals. The I/O peripherals 1808 are connected to the processor 1802 through any of the ports coupled to the interface bus 1812. The communication peripherals 1810 are configured to facilitate communication between the computer system 1800 and other computing devices over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals. For example, the computing system 1800 is able to communicate with one or more other computing devices (e.g., a computing device that is used for training and validating the DASH model, a computing device that displays outputs generated by the DASH model) via a data network using a network interface device of the communication peripherals 1810.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Indeed, the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the present disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the present disclosure.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multipurpose microprocessor-based computing systems accessing stored software that programs or configures the computing system from a general purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Certain embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied—for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain examples include, while other examples do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without author input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular example.
The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list. The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Similarly, the use of “based at least in part on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based at least in part on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of the present disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed examples. Similarly, the example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed examples.
The present application claims priority to U.S. Provisional Application No. 63/178,151, entitled “Detecting Loss Of Heterozygosity In HLA Alleles Using Machine Learning Models” filed Apr. 22, 2021, the entire contents of which are herein incorporated by reference in their entirety for all purposes.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/025752 | 4/21/2022 | WO | |

Number | Date | Country |
---|---|---|
63178151 | Apr 2021 | US |