This disclosure relates generally to predicting binding affinity between epitopes and human leukocyte antigen (HLA) peptide or protein sequences, and more specifically to computer-based predictions used to determine pan-HLA binding of viral proteins.
This application contains a Sequence Listing in electronic format. The Sequence Listing file, titled 10077-2007700_Sequence_Listing.txt, was created on May 21, 2021, and is 445 bytes in size. The information in electronic format of the Sequence Listing is incorporated herein by reference in its entirety.
Since the emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the novel coronavirus responsible for the coronavirus disease 2019 (COVID-19) global pandemic, medical researchers have focused on the rapid characterization of SARS-CoV-2 to determine possible target proteins or peptides for vaccine and therapeutic treatment development. This research is grounded in an understanding of the human immune system. At a high level, the HLA system or complex is a group of related proteins that are encoded by the major histocompatibility complex (MHC) gene complex in humans. These cell-surface proteins are responsible for the regulation of the immune system. HLAs corresponding to MHC class I (referred to herein as “HLA-I”) present peptides from inside a cell. For example, if the cell is infected by a virus, the HLA system brings fragments of the virus to the surface of the cell so that the cell can be destroyed by the immune system. HLAs corresponding to MHC class II (referred to herein as “HLA-II”) present antigens from outside of the cell to T-lymphocytes. These antigens stimulate the multiplication of T-helper cells (also called CD4+ T cells). CD4+ T cells recognize peptides presented on MHC-II molecules, which are found on antigen presenting cells. They play a major role in instigating and shaping adaptive immune responses, such as by stimulating antibody-producing B-cells to produce antibodies to that specific antigen. An epitope is the part of an antigen such as SARS-CoV-2 that is recognized by the immune system, specifically by antibodies, B cells, or T cells and is the specific piece of the antigen to which an antibody binds.
SARS-CoV-2 has a single-stranded, positive-sense, RNA genome of approximately 30 kilobases (kb), which includes open reading frames encoding nonstructural replicase polyproteins and structural proteins, namely, spike (S), envelope (E), membrane (M), and nucleocapsid (N). The positive-sense genome can act as messenger RNA and can be directly translated into viral proteins by a host cell's ribosomes.
Throughout 2020, early results from research efforts pointed to highest HLA-I/-II binding recognition from SARS-CoV-2 spike (S) and nucleocapsid (N) proteins.
Grifoni, Sidney, Zhang, Scheuermann, Peters, and Sette (bioRxiv, “Candidate targets for immune responses to 2019-Novel Coronavirus (nCoV): sequence homology- and bioinformatic-based predictions”; February 2020) observed that SARS-CoV-2 S and N proteins have the most candidate T and B cell epitopes. This research used reference “Wuhan-Hu-1” viral strain proteins and was based on conserved epitopes from SARS-CoV (the 2003 SARS virus) and SARS-CoV-2 predictions (determined using NetMHCpan 4.0) across 12 HLA-I alleles. T-cell epitopes with high sequence identity to SARS-CoV were independently identified by both methods.
Nguyen, David, Maden, Wood, Weeder, Nellore, and Thomson (medRxiv, “Human leukocyte antigen susceptibility map for SARS-CoV-2”; March 2020) observed that genetic variability across the three MHC class I genes (HLA-A, -B, and -C) may affect susceptibility to and severity of SARS-CoV-2. The authors executed an in silico analysis of viral peptide-MHC class I binding affinity across 145 HLA-A, -B, and -C genotypes for all SARS-CoV-2 peptides, and explored the potential for cross-protective immunity conferred by prior exposure to four common human coronaviruses. The analysis showed 48 highly conserved amino acid sequence spans across 34 distinct coronaviruses (ORF1ab, S, E, M, and N proteins), and 56 HLAs that had no affinity for conserved peptides. It also showed that the SARS-CoV-2 proteome is successfully sampled and presented by a diversity of HLA alleles. However, HLA-B*46:01 had the fewest predicted binding peptides for SARS-CoV-2, suggesting that individuals with this allele may be particularly vulnerable to COVID-19, as they were previously shown to be for SARS-CoV. Conversely, HLA-A*02:02, HLA-B*15:03, and HLA-C*12:03 showed the greatest capacity to present highly conserved SARS-CoV-2 peptides that are shared among common human coronaviruses, suggesting that these alleles could enable cross-protective T-cell based immunity. Global distributions of HLA types were also reported, with discussion of potential epidemiological ramifications in the setting of the COVID-19 pandemic.
Grifoni, Weiskopf, Ramirez, Mateus, Dan, Moderbacher, Rawlings, Sutherland, Premkumar, Jadi, Marrama, de Silva, Frazier, Carlin, Greenbaum, Peters, Krammer, Smith, Crotty, and Sette (“Targets of T Cell Responses to SARS-CoV-2 Coronavirus in Humans with COVID-19 Disease and Unexposed Individuals”; May 2020) used HLA-I and II predicted peptide “megapools” to identify circulating SARS-CoV-2-specific CD8+ and CD4+ T cells in ~70% and 100% of COVID-19 convalescent patients, respectively. CD4+ T cell responses to S proteins, the main target of most vaccine efforts, were robust and correlated with the magnitude of the anti-SARS-CoV-2 IgG and IgA titers. The M, S, and N proteins each accounted for 11%-27% of the total CD4+ response, with additional responses commonly targeting nsp3, nsp4, ORF3a, and ORF8, among others. For CD8+ T cells, S and M proteins were recognized, with at least eight SARS-CoV-2 ORFs targeted. Additionally, SARS-CoV-2-reactive CD4+ T cells were detected in ~40%-60% of unexposed individuals, suggesting cross-reactive T cell recognition between circulating “common cold” coronaviruses and SARS-CoV-2.
Yarmarkovich, Warrington, Farrel, and Maris (Cell Reports Medicine, “Identification of SARS-CoV-2 Vaccine Epitopes Predicted to Induce Long-Term Population-Scale Immunity”; June 2020) proposed a SARS-CoV-2 vaccine design concept based on identification of highly conserved regions of the viral genome and newly acquired adaptations, both predicted to generate epitopes presented on MHC class I and II across the vast majority of the human population. The study prioritized genomic regions that generate highly dissimilar peptides from the human proteome and are also predicted to produce B cell epitopes. The researchers proposed sixty-five 33-mer peptide sequences predicted to drive long-term immunity for most people, a subset of which could be tested using DNA or mRNA delivery strategies. These included peptides that are contained within evolutionarily divergent regions of the spike (S) protein reported to increase infectivity through increased binding to the ACE2 receptor and within a newly evolved furin cleavage site thought to increase membrane fusion.
As a backdrop to these efforts, Recurrent Neural Networks (RNNs) have been used successfully in recent years for many tasks involving sequential data where the RNN must find connections between long input and output sequences, such as for binding predictions between full peptide and HLA protein sequences. Attention mechanisms that enable improved performance in many tasks are an integral part of modern RNN networks. An attention mechanism can allow the RNN to focus on certain parts of an input sequence when predicting a certain part of an output sequence, enabling easier learning and higher quality predictions.
So far, however, current techniques have yielded limited information in terms of how HLA-I/II binding of SARS-CoV-2 proteins can vary across viral strains and world populations. Particularly, current techniques have not provided sufficient insight into the nexus between HLA-I/II clusters, global frequencies, and binding across SARS-CoV-2 variation. For example, vaccine researchers have yet to find effective techniques that minimize the chances of missing clusters of uniquely functioning HLAs in the quest for SARS-CoV-2 vaccines or therapeutic treatments. Without techniques that yield such information, it has been difficult for medical researchers to achieve the validation and implementation of vaccine or therapeutic treatment concepts that specifically target vulnerabilities of SARS-CoV-2 and engage a robust adaptive immune response in the vast majority of the world population.
Systems, methods, and articles of manufacture for determining pan-HLA binding of viral proteins are described herein. The pan-HLA binding determinations of the various embodiments may enable medical researchers to minimize the chances of missing clusters of uniquely functioning HLAs in the quest for vaccines or therapeutic treatments that are effective across the world population. Such vaccines or therapeutic treatments may be useful in the quest to mitigate the effects of viruses that spread globally, such as SARS-CoV-2.
In one embodiment, a viral protein encoded into variable-length peptides is obtained. Each of the encoded variable-length peptides may be, for example, between 8 and 15 amino acids in length. A classifier model trained to process encoded variable-length peptides is configured such that, independently per HLA, the classifier model is operable to determine at least one of (a) an average binding prediction of overlapping peptides at each position of the viral protein, (b) a maximum value of a binding prediction of overlapping peptides at each position of the viral protein, (c) a standard deviation of a binding prediction of overlapping peptides at each position of the viral protein, and (d) a combination of one or more of (a)-(c). The classifier model may be trained using encoded variable-length peptides corresponding to training proteins and encoded variable-length proteins corresponding to one or more HLA alleles in the human population. A classification engine is configured to use the classifier model to determine average binding predictions of overlapping peptides at each position of the viral protein independently for each of a plurality of test HLAs comprising HLA-I and HLA-II functional groupings, where determining the average binding predictions includes classifying a peptide as a binder when an average binding prediction corresponding to the peptide satisfies a binding prediction threshold. The binding prediction threshold may be 0.5. Peptides for which the threshold falls within a multiple of the standard deviation from the average binding prediction value may be classified as ambiguous.
In some embodiments, a plurality of test HLAs encoded into variable-length proteins may be obtained, where the plurality of test HLAs comprises HLA-I and HLA-II functional groupings. The HLA-I functional grouping may comprise HLA-I protein sequences, and the HLA-II functional grouping may comprise HLA-II alpha chain and beta chain sequences. The encoded variable-length peptides corresponding to the viral protein and the variable-length proteins corresponding to the plurality of test HLAs may be processed using the classifier model such that, independently per test HLA, the classifier model is operable to determine an average binding prediction of overlapping peptides at each position of the viral protein. Independently per test HLA, average binding predictions may be mapped in aggregate to locations along the test viral protein such that peptide-HLA interaction is indicated; nearest max locations may be determined for the average binding predictions using a sliding window having a fixed length; top max regions may be determined by selecting the nearest max locations having average binding predictions within a top percentage of values; peptides classified as binders that overlap the top max regions may be selected; and a pan-HLA max region may be determined, where the determining may include setting unselected locations to zero, calculating a mean along an HLA axis of the average binding prediction, and selecting pan-HLA maxima within a top percentage of values based on the mean. The selected peptides classified as binders may be filtered independently for each of the HLA-I and HLA-II functional groupings to identify candidate peptides that overlap the top max regions based on an aggregate of the pan-HLA max regions, and one or more of the candidate peptides may be included in an mRNA-based vaccine or therapeutic treatment for a patient. 
At least one of the candidate peptides may be selected for inclusion in the mRNA-based vaccine or therapeutic treatment for the patient based on HLA allele frequencies in worldwide populations, and the plurality of test HLAs may correspond to HLA allele frequencies in worldwide populations.
In some embodiments, the viral protein may comprise a SARS-CoV-2 protein variant such as a SARS-CoV-2 nucleocapsid (N) protein variant, spike (S) protein variant, membrane (M) protein variant, or envelope (E) protein variant and the mRNA-based vaccine or therapeutic treatment may be capable of treating a patient having SARS-CoV-2.
In some embodiments, the mapping of peptide-HLA interaction may include indicating locations signifying co-occurrences of peptide attention and HLA attention.
In some embodiments, the fixed length of the sliding window may be based on a dominant peptide length of the variable-length peptides, e.g., a dominant peptide length equal to 9-mers or 15-mers.
In some embodiments, the top max regions may be determined by selecting the nearest max locations having average binding predictions within a top 10% of values or within a top 25% of values, and the pan-HLA max region may be determined by selecting pan-HLA maxima within a top 25% of values.
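By way of non-limiting illustration, the region-selection steps described above (nearest max locations, top max regions, and a pan-HLA max region) might be sketched in Python as follows. The function names are hypothetical. Reading a “nearest max location” as a position that holds the maximum value of the sliding window centered on it, and reading “top percentage of values” as a rank-based cutoff among the candidate locations, are assumptions made only for this sketch. The toy scores are invented, and a window of 3 is used only to keep the example small.

```python
import math

def nearest_max_locations(scores, window):
    """Positions whose score is the maximum of the window centered on them (assumed reading)."""
    half = window // 2
    locations = []
    for i, value in enumerate(scores):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        if value > 0 and value == max(scores[lo:hi]):
            locations.append(i)
    return locations

def top_fraction(positions, scores, fraction):
    """Keep the positions whose score ranks within the top `fraction` of the candidates."""
    if not positions:
        return []
    ranked = sorted(positions, key=lambda i: scores[i], reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return sorted(ranked[:keep])

def pan_hla_max_regions(per_hla_scores, window=9, top_per_hla=0.25, top_pan=0.25):
    """Return (pan-HLA maxima, per-position mean along the HLA axis)."""
    masked = []
    for scores in per_hla_scores.values():
        nearest = nearest_max_locations(scores, window)
        selected = set(top_fraction(nearest, scores, top_per_hla))
        # Unselected locations are set to zero before averaging across HLAs.
        masked.append([s if i in selected else 0.0 for i, s in enumerate(scores)])
    pan_mean = [sum(column) / len(masked) for column in zip(*masked)]
    candidates = [i for i, v in enumerate(pan_mean) if v > 0]
    return top_fraction(candidates, pan_mean, top_pan), pan_mean

# Invented per-position average binding predictions for two hypothetical HLAs.
toy_scores = {
    "HLA_1": [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.2, 0.4, 0.2, 0.1, 0.1, 0.1],
    "HLA_2": [0.1, 0.3, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.6, 0.2, 0.1],
}
regions, pan_mean = pan_hla_max_regions(toy_scores, window=3)
print(regions)  # [2]
```

In this toy input both hypothetical HLAs peak at position 2, so that position survives the per-HLA selection and the pan-HLA selection. In practice the window length might instead follow the dominant peptide length (e.g., 9 or 15), as noted above.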
Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following specification, along with the accompanying drawings in which like numerals represent like components.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
While the invention is described with reference to the above drawings, the drawings are intended to be illustrative, and other embodiments are consistent with the spirit, and within the scope, of the invention.
The various embodiments will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific examples of practicing the embodiments. This specification may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that this specification will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Among other things, this specification may be embodied as methods or devices. Accordingly, any of the various embodiments herein may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. The following specification is, therefore, not to be taken in a limiting sense.
Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise:
The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.
As used herein, the term “or” is an inclusive “or” operator and is equivalent to the term “and/or,” unless the context clearly dictates otherwise.
The term “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise.
As used herein, and unless the context dictates otherwise, the term “coupled to” is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms “coupled to” and “coupled with” are used synonymously. Within the context of a networked environment where two or more components or devices are able to exchange data, the terms “coupled to” and “coupled with” are also used to mean “communicatively coupled with”, possibly via one or more intermediary devices.
In addition, throughout the specification, the meaning of “a”, “an”, and “the” includes plural references, and the meaning of “in” includes “in” and “on”.
Although some of the various embodiments presented herein constitute a single combination of inventive elements, it should be appreciated that the inventive subject matter is considered to include all possible combinations of the disclosed elements. As such, if one embodiment comprises elements A, B, and C, and another embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly discussed herein. Further, the transitional term “comprising” means to have as parts or members, or to be those parts or members. As used herein, the transitional term “comprising” is inclusive or open-ended and does not exclude additional, unrecited elements or method steps.
Throughout the following discussion, numerous references will be made regarding servers, services, interfaces, engines, modules, clients, peers, portals, platforms, or other systems formed from computing devices. It should be appreciated that the use of such terms is deemed to represent one or more computing devices having at least one processor (e.g., ASIC, FPGA, DSP, x86, ARM, ColdFire, GPU, multi-core processors, etc.) configured to execute software instructions stored on a computer readable tangible, non-transitory medium (e.g., hard drive, solid state drive, RAM, flash, ROM, etc.). For example, a server can include one or more computers operating as a web server, database server, or other type of computer server in a manner to fulfill described roles, responsibilities, or functions. One should further appreciate the disclosed computer-based algorithms, processes, methods, or other types of instruction sets can be embodied as a computer program product comprising a non-transitory, tangible computer readable medium storing the instructions that cause a processor to execute the disclosed steps. The various servers, systems, databases, or interfaces can exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods. Data exchanges can be conducted over a packet-switched network, a circuit-switched network, the Internet, LAN, WAN, VPN, or other type of network.
As used in the description herein and throughout the claims that follow, when a system, engine, server, device, module, or other computing element is described as being configured to perform or execute functions on data in a memory, the meaning of “configured to” or “programmed to” is defined as one or more processors or cores of the computing element being programmed by a set of software instructions stored in the memory of the computing element to execute the set of functions on target data or data objects stored in the memory.
It should be noted that any language directed to a computer should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, modules, or other types of computing devices operating individually or collectively. One should appreciate the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, FPGA, PLA, solid state drive, RAM, flash, ROM, etc.). The software instructions configure or program the computing device to provide the roles, responsibilities, or other functionality as discussed below with respect to the disclosed apparatus. Further, the disclosed technologies can be embodied as a computer program product that includes a non-transitory computer readable medium storing the software instructions that causes a processor to execute the disclosed steps associated with implementations of computer-based algorithms, processes, methods, or other instructions. In some embodiments, the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods. Data exchanges among devices can be conducted over a packet-switched network, the Internet, LAN, WAN, VPN, or other type of packet switched network; a circuit switched network; cell switched network; or other type of network.
The focus of the disclosed inventive subject matter is to enable construction or configuration of a computing device to operate on vast quantities of digital data, beyond the capabilities of a human for purposes including determining pan-HLA binding of viral proteins.
One should appreciate that the disclosed techniques provide many advantageous technical effects including improving the scope, accuracy, compactness, efficiency, and speed of determining pan-HLA binding of viral proteins. It should also be appreciated that the following specification is not intended as an extensive overview, and as such, concepts may be simplified in the interests of clarity and brevity.
In addition to the terms above, the following technical terms are used throughout the specification and claims.
The human leukocyte antigen (HLA) system or complex is a group of related proteins that are encoded by the major histocompatibility complex (MHC) gene complex in humans. These cell-surface proteins are responsible for the regulation of the immune system.
An epitope, also known as antigenic determinant, is the part of an antigen that is recognized by the immune system, specifically by antibodies, B cells, or T cells. For example, the epitope is the specific piece of the antigen to which an antibody binds.
An allele, also called allelomorph, is any one of two or more genes that may occur alternatively at a given site (locus) on a chromosome. Alleles may occur in pairs, or there may be multiple alleles affecting the expression (phenotype) of a particular trait.
CD8+ cytotoxic T cells are a subtype of T cells and the main effectors of cell-mediated adaptive immune responses. They kill aberrant cells, such as cancer cells, infected cells (particularly with viruses), or cells that are damaged in another way.
CD4+ T cells recognize peptides presented on MHC class II molecules, which are found on antigen presenting cells. They play a significant role in instigating and shaping adaptive immune responses.
Peptides are short strings of amino acids, typically comprising 2-50 amino acids. Amino acids are also the building blocks of proteins, but proteins contain more of them. Peptides may be easier for the body to absorb than proteins because they are smaller and more broken down than proteins.
HLAs corresponding to MHC class I (A, B, and C), all of which are the HLA-I group, present peptides from inside the cell. For example, if the cell is infected by a virus, the HLA system brings fragments of the virus to the surface of the cell so that the cell can be destroyed by the immune system.
HLAs corresponding to MHC class II (DP, DM, DO, DQ, and DR) present antigens from outside of the cell to T-lymphocytes. These antigens stimulate the multiplication of T-helper cells (CD4+ T cells), which in turn stimulate antibody-producing B-cells to produce antibodies specific to that antigen. Self-antigens are suppressed by regulatory T cells.
The various embodiments provide for a classifier model to be trained to determine pan-HLA binding of viral proteins based on a limited set of training HLA data, e.g., HLA data pertaining to only one HLA allele or a subset of HLA alleles. Once the classifier model is trained, a classification engine configured to use the classifier model, as described herein, can overcome problems encountered in the development of widely applicable vaccines or therapeutic treatments for viruses when limited or no binding data is available for many HLA alleles present across the worldwide human population. Thus, the limited information currently available on how HLA-I/II binding of SARS-CoV-2 proteins can vary across viral strains and world populations can be addressed by the various techniques described herein to minimize the chances of missing clusters of uniquely functioning HLAs in the quest for SARS-CoV-2 vaccines or treatments.
CD8 T-cell 108 is a cytotoxic T-cell that expresses the CD8 glycoprotein at its surface. Cytotoxic T-cells (also known as TC cells, CTLs, T-killer cells, killer T-cells) destroy virus-infected cells and tumor cells. These cells recognize virus-infected or tumor cell targets by binding to fragments of non-self proteins (peptide antigens) that are between 6 and 20 amino acids in length (though generally they are 8 to 15 amino acids in length) and presented by major histocompatibility complex (MHC) class I molecules, such as MHC class I molecule 110. MHC class I molecules are present on the surface of all nucleated cells in humans. Their function is to display intracellular peptide antigens, e.g., peptide 112, to cytotoxic T-cells, thereby triggering an immediate response from the immune system against the peptide antigen displayed. An understanding of what kinds of peptides bind well with what kinds of MHC class I molecules (i.e., which peptides are best for activating a cytotoxic T-cell response) is critical for current immunology research, particularly since across the worldwide human population each HLA allele of an MHC compound has different properties. The embodiments herein improve the operation of neural network-based MHC-peptide binding affinity prediction models by allowing for a determination of pan-HLA bindings across viral proteins, such as SARS-CoV-2 variants.
HLA protein sequences are encoded for input into a neural network model following the exact same procedure as peptide encodings.
Using encoded variable-length peptide sequences 1 to N 304, 306, and 308 combined with encoded variable-length protein sequences 1 to N 312, 314, and 316 corresponding to one or more HLA alleles 318 in the human population, neural network-based classifier model 310 is trained to predict pan-allele MHC-peptide binding affinity for HLA I and HLA II alpha and beta chain sequences. Classifier model 310 may comprise one or more recurrent neural networks configured, as described in further detail below, to determine, independently per HLA, an average binding prediction 320 of overlapping peptides at each position of the viral protein. In some embodiments, classifier model 310 may also be configured to determine a maximum value of a binding prediction of overlapping peptides at each position of the viral protein, a standard deviation of a binding prediction of overlapping peptides at each position of the viral protein, and/or a combination of one or more of the above. The classifier is trained to predict peptide binding to HLA molecules based on an extensive database of empirical binding and non-binding peptide measurements for a large collection of HLAs.
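As a non-limiting sketch of the per-position quantities described above, the following Python code tiles a protein into overlapping peptides of 8 to 15 residues and, for a single HLA, collects the average, maximum, and standard deviation of the scores of all peptides covering each position. Treating a peptide as contributing to every position it covers is an assumption of this sketch. The names `overlapping_peptides`, `per_position_stats`, and `toy_predict` are hypothetical, `toy_predict` is a stand-in for the trained classifier model, and the example sequence is an arbitrary string rather than a real viral protein.

```python
from statistics import mean, pstdev

def overlapping_peptides(protein, min_len=8, max_len=15):
    """Yield (start, peptide) for every window of length min_len..max_len."""
    for length in range(min_len, max_len + 1):
        for start in range(len(protein) - length + 1):
            yield start, protein[start:start + length]

def per_position_stats(protein, predict, min_len=8, max_len=15):
    """Aggregate peptide-level scores into per-position mean, max, and std for one HLA."""
    covering = [[] for _ in protein]
    for start, peptide in overlapping_peptides(protein, min_len, max_len):
        score = predict(peptide)
        for pos in range(start, start + len(peptide)):
            covering[pos].append(score)
    stats = []
    for scores in covering:
        if scores:
            stats.append({"mean": mean(scores), "max": max(scores), "std": pstdev(scores)})
        else:
            # Position not covered by any peptide (protein shorter than min_len).
            stats.append({"mean": 0.0, "max": 0.0, "std": 0.0})
    return stats

def toy_predict(peptide):
    """Hypothetical stand-in for the trained model: fraction of leucine residues."""
    return peptide.count("L") / len(peptide)

example_protein = "ACDEFGHIKLMNPQRSTVWYACDEFG"  # arbitrary illustrative sequence
stats = per_position_stats(example_protein, toy_predict)
print(len(stats), stats[9])
```

Repeating the call once per HLA, with a predictor conditioned on that HLA's sequence, would give the independent per-HLA profiles referred to above.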
Once the training is completed, the trained classifier model 322 can be configured to receive encoded variable-length peptide sequences 324 corresponding to a test viral protein and encoded variable-length protein sequences 326 corresponding to a plurality of HLAs in the human population. In an embodiment, a classification engine 328 may be configured to use the trained classifier model 322 to determine average binding predictions 330 of overlapping peptides at each position of the viral protein independently for each of a plurality of test HLAs comprising HLA-I and HLA-II functional groupings, where determining the average binding predictions includes classifying a peptide as a binder when an average binding prediction corresponding to the peptide satisfies a binding prediction threshold. The binding prediction threshold may be 0.5. Peptides for which the threshold falls within a multiple of the standard deviation from the average binding prediction value may be classified as ambiguous.
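The thresholding rule described above may be illustrated, without limitation, by the short Python sketch below. It assumes that several prediction values are available for one peptide-HLA pair (for example, from an ensemble of trained networks), that “satisfies” means greater than or equal to the threshold, and that the multiple of the standard deviation is a tunable parameter `k`. The function name and the default of `k=1.0` are assumptions of this sketch.

```python
from statistics import mean, pstdev

def classify_peptide(scores, threshold=0.5, k=1.0):
    """Label one peptide-HLA pair as 'binder', 'non-binder', or 'ambiguous'."""
    avg = mean(scores)
    spread = pstdev(scores)
    # Ambiguous when the threshold lies within k standard deviations of the average.
    if abs(avg - threshold) < k * spread:
        return "ambiguous"
    return "binder" if avg >= threshold else "non-binder"

print(classify_peptide([0.90, 0.85, 0.95]))  # binder
print(classify_peptide([0.30, 0.70, 0.60]))  # ambiguous
print(classify_peptide([0.10, 0.15, 0.05]))  # non-binder
```

A larger `k` widens the ambiguous band, trading fewer confident calls for fewer borderline misclassifications.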
For each of the variable-length peptides 406 and variable-length proteins 414 encoded by the GRU peptide encoder 404 and GRU HLA-I allele encoder 412, respectively, a fixed-length vector 416 is generated for input to one or more fully connected layers 418. The one or more fully connected layers 418, comprising a plurality of hidden neurons (nodes), may follow a hierarchical structure (not shown) to perform a classification on the features extracted from fixed length vector 416. As such, the fully connected layers 418 are configured to receive input fixed-length vector 416 and generate output values 420, which represent average binding predictions of overlapping peptides at each position of the viral protein. In exemplary embodiments, the fully connected layers 418 are further configured to label uncertain binding predictions as being ambiguous. Classification of binding predictions as ambiguous may be achieved by identifying when the binding classification threshold is within a multiple of the standard deviation of the mean prediction value from an ensemble of trained neural networks.
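To illustrate how recurrent encoders can turn variable-length inputs into a fixed-length vector, the following non-limiting Python sketch implements a minimal gated recurrent unit with random, untrained weights and concatenates the final hidden states of a peptide encoder and an HLA encoder. The one-hot residue encoding, the hidden size of 8, the omission of bias terms and attention, and all names (`TinyGRU`, `fixed_length_vector`) are assumptions of this sketch, and the input strings are arbitrary.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

class TinyGRU:
    """Minimal GRU with random, untrained weights (illustrative only)."""

    def __init__(self, input_size, hidden_size, seed):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # One input matrix and one recurrent matrix per gate:
        # update (z), reset (r), and candidate state (n).
        self.W = {g: mat(hidden_size, input_size) for g in "zrn"}
        self.U = {g: mat(hidden_size, hidden_size) for g in "zrn"}

    def encode(self, sequence):
        """Return the final hidden state, whose length does not depend on len(sequence)."""
        h = [0.0] * self.hidden_size
        for residue in sequence:
            x = one_hot(residue)
            z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
            r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
            rh = [ri * hi for ri, hi in zip(r, h)]
            n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
            h = [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]
        return h

peptide_encoder = TinyGRU(len(AMINO_ACIDS), 8, seed=0)
hla_encoder = TinyGRU(len(AMINO_ACIDS), 8, seed=1)

def fixed_length_vector(peptide, hla_sequence):
    """Concatenate both encodings into one vector for the fully connected layers."""
    return peptide_encoder.encode(peptide) + hla_encoder.encode(hla_sequence)

v_short = fixed_length_vector("ACDEFGHI", "MKLVTRAWQ")
v_long = fixed_length_vector("ACDEFGHIKLMNPQR", "MKLVTRAWQGHSTYEEDNPKA")
print(len(v_short), len(v_long))  # 16 16
```

Both calls yield a 16-element vector even though the inputs differ in length, which is the property that lets the downstream fully connected layers accept any peptide-HLA pair.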
In an embodiment, a classification engine may be configured to train a classifier model comprising neural network architecture 400 using encoded variable-length peptides corresponding to the training proteins and encoded variable-length proteins corresponding to each of the plurality of training HLAs. For example, an output value 420 may be compared to a known labeled value (e.g., a known MHC-peptide binding affinity or binary binding value corresponding to the input encoded peptide sequence) to determine a loss or error factor (e.g., using a loss function) that can be used to determine parameter updates (e.g., apportion some derivative of the loss to incrementally adjust each of the weight values) within the fully connected layers 418 and improve the rate of learning and final accuracy of neural network architecture 400.
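The compare-to-label and parameter-update loop described above can be sketched, in a non-limiting way, for a single sigmoid output unit. The binary cross-entropy loss, plain gradient descent, the learning rate, and the toy feature values are all assumptions of this sketch. The specification states only that a loss function is used to determine parameter updates.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(prediction, label, eps=1e-12):
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    return -(label * math.log(prediction + eps) + (1 - label) * math.log(1 - prediction + eps))

def sgd_step(weights, bias, features, label, learning_rate=0.1):
    """One gradient-descent update of a single sigmoid unit; returns new parameters and the loss."""
    prediction = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
    error = prediction - label  # derivative of the loss with respect to the pre-activation
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias, bce_loss(prediction, label)

weights, bias = [0.0, 0.0, 0.0], 0.0
features, label = [0.2, -0.4, 0.7], 1  # invented feature vector with a "binder" label
losses = []
for _ in range(3):
    weights, bias, loss = sgd_step(weights, bias, features, label)
    losses.append(loss)
print(losses)
```

Each recorded loss is computed before its update, so a decreasing sequence indicates that the updates move the prediction toward the labeled value.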
As described above, the classifier model comprising neural network architecture 400 may be trained to predict MHC-peptide binding by converting each raw training peptide sequence into a set of training peptide sequences that includes each possible front-padded and back-padded iteration of the training peptide sequence when the training peptide sequence is padded to be equal in length to a fixed-length input of a neural network.
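By way of example only, the padding scheme described above might be sketched as follows, where every split of the required padding between the front and the back of the peptide produces one training sequence. The pad symbol `-` and the function name `padded_variants` are assumptions of this sketch.

```python
def padded_variants(peptide, fixed_len, pad="-"):
    """Return every front/back padded version of `peptide` with total length `fixed_len`."""
    extra = fixed_len - len(peptide)
    if extra < 0:
        raise ValueError("peptide is longer than the fixed input length")
    return [pad * front + peptide + pad * (extra - front) for front in range(extra + 1)]

print(padded_variants("ABC", 5))  # ['ABC--', '-ABC-', '--ABC']
```

A peptide of length n padded to a fixed length L therefore yields L - n + 1 training sequences, and a peptide already of length L yields only itself.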
While the neural network architecture illustrated in
For each of the variable-length peptides 506 and variable-length proteins 510 and 514 encoded by the GRU peptide encoder 504, GRU HLA-II alpha chain encoder 508, and GRU HLA-II beta chain encoder 512, respectively, a fixed-length vector 516 is generated for input to one or more fully connected layers 518. The one or more fully connected layers 518, comprising a plurality of hidden neurons (nodes), may follow a hierarchical structure (not shown) to perform a classification on the features extracted from fixed length vector 516. As such, the fully connected layers 518 are configured to receive input fixed-length vector 516 and generate output values 520, which represent binding predictions of individual peptides that may be combined to obtain average binding predictions of overlapping peptides at each position of the viral protein. In exemplary embodiments, an ensemble of multiple trained neural networks with different random seeds, with the same or varying architectures, may be applied for each prediction and the variance of their prediction values can be used to label uncertain binding predictions as being ambiguous.
In an embodiment, a classification engine may be configured to train a classifier model comprising neural network architecture 500 using encoded variable-length peptides corresponding to the training viral protein and encoded variable-length proteins corresponding to each of the plurality of training HLAs. For example, an output value 520 may be compared to a known labeled value (e.g., a known MHC-peptide binding affinity value corresponding to the input encoded peptide sequence) to determine a loss or error factor (e.g., using a loss function) that can be used to determine parameter updates (e.g., apportion some derivative of the loss to incrementally adjust each of the weight values) within the fully connected layers 518 and improve the rate of learning and final accuracy of neural network architecture 500.
While the neural network architecture illustrated in
Training engine 610 may then configure and train neural network-based classifier model 310, using encoded variable-length peptide sequences 1 to N 304, 306, and 308 and encoded variable-length protein sequences 1 to N 312, 314, and 316 corresponding to one or more HLA alleles 318 in the human population, to predict pan-allele MHC-peptide binding affinity for HLA-I and HLA-II alpha and beta chain sequences. For example, training engine 610 may configure classifier model 310 to determine, independently per HLA, an average binding prediction 320 of overlapping peptides at each position of the viral protein. In some embodiments, training engine 610 may also configure classifier model 310 to determine a maximum value of a binding prediction of overlapping peptides at each position of the viral protein, a standard deviation of a binding prediction of overlapping peptides at each position of the viral protein, and/or a combination of one or more of the above.
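The per-position aggregation of overlapping peptide predictions for a single HLA can be sketched as follows. The input layout of (start, length, score) tuples with 0-based start positions, the use of the population standard deviation, and the value of zero for positions that no peptide covers are all assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def per_position_stats(protein_length: int,
                       peptide_scores: List[Tuple[int, int, float]]
                       ) -> List[Tuple[float, float, float]]:
    """Aggregate peptide-level binding predictions to protein positions.

    Returns one (average, maximum, standard deviation) tuple per position,
    computed over every peptide that overlaps that position.
    """
    covering: List[List[float]] = [[] for _ in range(protein_length)]
    for start, length, score in peptide_scores:
        for pos in range(start, min(start + length, protein_length)):
            covering[pos].append(score)
    stats = []
    for scores in covering:
        if scores:
            stats.append((mean(scores), max(scores), pstdev(scores)))
        else:
            # Assumption: positions covered by no peptide are reported as zeros.
            stats.append((0.0, 0.0, 0.0))
    return stats


# Two overlapping 3-mers on a protein of length 4.
stats = per_position_stats(4, [(0, 3, 0.2), (1, 3, 0.8)])
```

Positions 1 and 2 are covered by both peptides in this toy input, so their average is 0.5, their maximum is 0.8, and their standard deviation is 0.3.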
Training engine 610 may also configure prediction engine 620 to receive encoded variable-length peptide sequences 324 corresponding to a test viral protein and encoded variable-length protein sequences 326 corresponding to a plurality of HLAs in the human population and use the trained classifier model 322 to determine pan-HLA binding of encoded variable-length peptides corresponding to a viral protein. In an embodiment, prediction engine 620 may configure classification engine 328 to use the trained classifier model 322 to determine average binding predictions 330 of overlapping peptides at each position of the viral protein independently for each of a plurality of test HLAs comprising HLA-I and HLA-II functional groupings, where determining the average binding predictions includes classifying a peptide as a binder when an average binding prediction corresponding to the peptide satisfies a binding value threshold. The average binding predictions 330 of overlapping peptides may be stored in one or both of persistent storage device 630 and main memory device 640.
However, it should be noted that the elements in
At step 704, a classifier model trained to process encoded variable-length peptides is configured such that, independently per HLA, the classifier model is operable to determine at least one of (a) an average binding prediction of overlapping peptides at each position of a viral protein, (b) a maximum value of a binding prediction of overlapping peptides at each position of a viral protein, (c) a standard deviation of a binding prediction of overlapping peptides at each position of a viral protein, and (d) a combination of one or more of (a)-(c). For example, the classifier model may be trained to make pan-HLA binding predictions using encoded variable-length peptides corresponding to a training viral protein and encoded variable-length proteins corresponding to one or more HLA alleles in the human population. Notably, the classifier model as disclosed herein may be trained to make pan-HLA binding predictions for HLA alleles for which little or no binding information is known using encoded variable-length proteins corresponding to one or more of a limited subset of HLA alleles in the human population.
At step 706, a classification engine is configured to use the classifier model to determine average binding predictions of overlapping peptides at each position of the viral protein independently for each of a plurality of test HLAs comprising HLA-I and HLA-II functional groupings, where the determining includes classifying a peptide as a binder when an average binding prediction corresponding to the peptide satisfies a binding prediction value threshold.
At step 804, the encoded variable-length peptides corresponding to the viral protein and the variable-length proteins corresponding to the plurality of test HLAs are processed using the classifier model such that, independently per test HLA, the classifier model is operable to determine an average binding prediction of overlapping peptides at each position of the viral protein.
In an embodiment, the operations of steps 806-814 are performed independently per test HLA. At step 806, average binding predictions are mapped in aggregate to locations along the test viral protein such that peptide-HLA interaction is indicated. In some use cases, the mapping of peptide-HLA interaction may further include indicating locations signifying co-occurrences of peptide attention and HLA attention.
Returning to step 808, nearest max locations are determined for the average binding predictions using a sliding window having a fixed length. For example, the fixed length of the sliding window may be based on a dominant peptide length of the variable-length peptides, e.g., 9-mers or 15-mers.
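A minimal sketch of a sliding-window maximum search is given below. It assumes that a nearest max location is a position whose average binding prediction equals the largest value inside a window of the fixed length centered on that position, and that positions with a value of zero are never reported. This reading, and the function name `nearest_max_locations`, are assumptions made for illustration rather than the definition used in this disclosure.

```python
from typing import List


def nearest_max_locations(values: List[float], window: int) -> List[int]:
    """Return positions that hold the maximum of their surrounding window.

    `values` holds the average binding prediction at each protein position
    for one HLA. `window` is the fixed window length, e.g. 9 or 15.
    """
    half = window // 2
    locations = []
    for i, value in enumerate(values):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        # Skip zeros so that flat, unpredicted stretches are not reported.
        if value > 0 and value == max(values[lo:hi]):
            locations.append(i)
    return locations


peaks = nearest_max_locations([0.1, 0.5, 0.2, 0.1, 0.9, 0.3], window=3)  # [1, 4]
```

With a window of length 3, positions 1 and 4 are the only ones that dominate their immediate neighbors in this toy profile.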
At step 810, top max regions are determined by selecting the nearest max locations having average binding predictions within a top percentage of values. For example, the top max regions may be determined by selecting the nearest max locations having average binding predictions within a top 10% of values, within a top 25% of values, or another selected top percentage of values. In
Returning to step 812, peptides classified as binders that overlap the top max regions are selected, and a pan-HLA max region is determined at step 814, where the determining includes setting unselected locations to zero, calculating a mean along an HLA axis of the average binding prediction, and selecting pan-HLA maxima within a top percentage of values based on the mean. For example, the pan-HLA max region may be determined by selecting pan-HLA maxima within a top 25% of values.
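The pan-HLA max region determination can be sketched as follows, assuming the per-HLA average binding predictions are held as one row per HLA and one column per protein position, and that the selected locations for each HLA are given as a set of column indices. Interpreting "within a top percentage of values" as the highest-ranked fraction of strictly positive means is an assumption, as are the function names.

```python
import math
from typing import List, Set, Tuple


def top_fraction_indices(values: List[float], fraction: float) -> List[int]:
    """Indices of the largest `fraction` of the strictly positive values."""
    positive = [i for i, v in enumerate(values) if v > 0]
    keep = math.ceil(fraction * len(positive))
    ranked = sorted(positive, key=lambda i: values[i], reverse=True)
    return sorted(ranked[:keep])


def pan_hla_max_region(avg_by_hla: List[List[float]],
                       selected_by_hla: List[Set[int]],
                       fraction: float = 0.25) -> Tuple[List[float], List[int]]:
    """Zero unselected locations, average along the HLA axis, keep the top fraction."""
    n_pos = len(avg_by_hla[0])
    # Set every location that was not selected for a given HLA to zero.
    masked = [[row[p] if p in selected else 0.0 for p in range(n_pos)]
              for row, selected in zip(avg_by_hla, selected_by_hla)]
    # Mean along the HLA axis: one value per protein position.
    mean_over_hla = [sum(row[p] for row in masked) / len(masked)
                     for p in range(n_pos)]
    return mean_over_hla, top_fraction_indices(mean_over_hla, fraction)


means, region = pan_hla_max_region(
    [[0.2, 0.8, 0.4, 0.6], [0.1, 0.6, 0.9, 0.2]],  # two HLAs, four positions
    [{1, 3}, {1, 2}],                               # selected locations per HLA
)
```

In this toy input, position 1 is selected for both HLAs and has the highest mean, so it is the only position retained at a top 25% cutoff.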
Returning to step 816, the selected peptides classified as binders are filtered independently for each of the HLA-I and HLA-II functional groupings to identify candidate peptides that overlap the top max regions based on an aggregate of the pan-HLA max regions, and, at step 818, one or more of the candidate peptides are included in an mRNA-based vaccine or therapeutic treatment for a patient. For example, the mRNA-based vaccine or therapeutic treatment may be capable of treating a patient having SARS-CoV-2, and at least one of the candidate peptides may be selected for inclusion in the mRNA-based vaccine or therapeutic treatment for the patient based on HLA allele frequencies in worldwide populations.
In an embodiment, the operations of steps 1808-1816 are performed independently per test HLA. At step 1808, average binding predictions are mapped in aggregate to locations along the test viral protein such that peptide-HLA interaction is indicated. At step 1810, nearest max locations are determined for the average binding predictions using a sliding window having a fixed length. At step 1812, top max regions are determined by selecting the nearest max locations having average binding predictions within a top percentage of values, and peptides classified as binders that overlap the top max regions are selected at step 1814. At step 1816, a pan-HLA max region is determined, where the determining includes setting unselected locations to zero, calculating a mean along an HLA axis of the average binding prediction, and selecting pan-HLA maxima within a top percentage of values based on the mean.
The selected peptides classified as binders are filtered independently for each of the HLA-I and HLA-II functional groupings to identify candidate peptides that overlap the top max regions based on an aggregate of the pan-HLA max regions at step 1818. At step 1820, an mRNA-based vaccine or therapeutic treatment comprising one or more of the candidate peptides is administered to a patient identified as having SARS-CoV-2. For example, at least one of the candidate peptides may be selected for inclusion in the mRNA-based vaccine or therapeutic treatment for the patient based on HLA allele frequencies in worldwide populations, and the plurality of test HLAs may correspond to HLA allele frequencies in worldwide populations. In some embodiments, the test viral protein may comprise a SARS-CoV-2 protein variant such as a SARS-CoV-2 nucleocapsid (N) protein variant or spike (S) protein variant, and the mRNA-based vaccine or therapeutic treatment may be capable of treating a patient having SARS-CoV-2.
A high-level block diagram of an exemplary client-server relationship that may be used to implement systems, apparatus and methods described herein is illustrated in
For example, client 1910, in accordance with the various embodiments described above, may obtain a viral protein encoded into variable-length peptides, and a plurality of HLAs encoded into variable-length proteins, where the plurality of HLAs may comprise HLA-I and HLA-II functional groupings.
Server 1920 may configure a classifier model trained to process encoded variable-length peptides such that, independently per HLA, the classifier model is operable to determine at least one of (a) an average binding prediction of overlapping peptides at each position of the viral protein, (b) a maximum value of a binding prediction of overlapping peptides at each position of the viral protein, (c) a standard deviation of a binding prediction of overlapping peptides at each position of the viral protein, and (d) a combination of one or more of (a)-(c); and configure a classification engine to use the classifier model to determine average binding predictions of overlapping peptides at each position of the viral protein independently for each of a plurality of test HLAs comprising HLA-I and HLA-II functional groupings, where the determining includes classifying a peptide as a binder when an average binding prediction corresponding to the peptide satisfies a binding value threshold.
Server 1920 may further obtain a plurality of test HLAs encoded into variable-length proteins, where the plurality of test HLAs comprises HLA-I and HLA-II functional groupings, and process the encoded variable-length peptides corresponding to the viral protein and the variable-length proteins corresponding to the plurality of test HLAs using the classifier model such that, independently per test HLA, the classifier model is operable to determine an average binding prediction of overlapping peptides at each position of the viral protein.
Independently per test HLA, server 1920 may map average binding predictions in aggregate to locations along the test viral protein such that peptide-HLA interaction is indicated; determine nearest max locations for the average binding predictions using a sliding window having a fixed length; determine top max regions by selecting the nearest max locations having average binding predictions within a top percentage of values; select peptides classified as binders that overlap the top max regions; and determine a pan-HLA max region, where the determining includes setting unselected locations to zero, calculating a mean along an HLA axis of the average binding prediction, and selecting pan-HLA maxima within a top percentage of values based on the mean.
Independently for each of the HLA-I and HLA-II functional groupings, server 1920 may filter the selected peptides classified as binders to identify candidate peptides that overlap the top max regions based on an aggregate of the pan-HLA max regions, where one or more of the candidate peptides may be included in an mRNA-based vaccine or therapeutic treatment for a patient.
One skilled in the art will appreciate that the exemplary client-server relationship illustrated in
Systems, apparatus, and methods described herein may be implemented using a computer program product tangibly embodied in an information carrier, e.g., in a non-transitory machine-readable storage device, for execution by a programmable processor; and the method steps described herein, including one or more of the steps of
A high-level block diagram of an exemplary apparatus that may be used to implement systems, apparatus and methods described herein is illustrated in
Processor 2010 may include both general and special purpose microprocessors and may be the sole processor or one of multiple processors of apparatus 2000. Processor 2010 may comprise one or more central processing units (CPUs), and one or more graphics processing units (GPUs), which, for example, may work separately from and/or multi-task with one or more CPUs to accelerate processing, e.g., for various image processing applications described herein. Processor 2010, persistent storage device 2020, and/or main memory device 2030 may include, be supplemented by, or incorporated in, one or more application-specific integrated circuits (ASICs) and/or one or more field programmable gate arrays (FPGAs).
Persistent storage device 2020 and main memory device 2030 each comprise a tangible non-transitory computer readable storage medium. Persistent storage device 2020 and main memory device 2030 may each include high-speed random access memory, such as dynamic random access memory (DRAM), static random access memory (SRAM), double data rate synchronous dynamic random access memory (DDR RAM), or other random access solid state memory devices, and may include non-volatile memory, such as one or more magnetic disk storage devices such as internal hard disks and removable disks, magneto-optical disk storage devices, optical disk storage devices, flash memory devices, semiconductor memory devices, such as erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM), digital versatile disc read-only memory (DVD-ROM) disks, or other non-volatile solid state storage devices.
Input/output devices 2090 may include peripherals, such as a printer, scanner, display screen, etc. For example, input/output devices 2090 may include a display device such as a cathode ray tube (CRT), plasma or liquid crystal display (LCD) monitor for displaying information (e.g., a DNA accessibility prediction result) to a user, a keyboard, and a pointing device such as a mouse or a trackball by which the user can provide input to apparatus 2000.
Any or all of the systems and apparatuses discussed herein, including training engine 610 and prediction engine 620, may be implemented by, and/or incorporated in, an apparatus such as apparatus 2000. Further, apparatus 2000 may utilize one or more neural networks or other deep-learning techniques to implement training engine 610 and prediction engine 620 or other systems or apparatuses discussed herein.
One skilled in the art will recognize that an implementation of an actual computer or computer system may have other structures and may contain other components as well, and that
The foregoing specification is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the specification, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention.
This application claims priority to U.S. Provisional Patent Application Ser. No. 63/148,609, filed Feb. 12, 2021, titled “HLA CLUSTERS, GLOBAL FREQUENCIES, AND BINDING ACROSS SARS-CoV-2 VARIATION”, and U.S. Provisional Patent Application Ser. No. 63/195,660, filed Jun. 1, 2021, titled “HLA CLUSTERS, GLOBAL FREQUENCIES, AND BINDING ACROSS SARS-CoV-2 VARIATION”. The contents of both applications are hereby incorporated by reference in their entirety for all purposes.
Number | Date | Country
---|---|---
63195660 | Jun 2021 | US
63148609 | Feb 2021 | US