The present teachings relate to vaccines, and more particularly to a risk viral strain identification strategy that can be utilized to develop influenza vaccines that exhibit both high yield and antigenicity matching the most prevalent wild influenza strains.
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Vaccination is the primary strategy for preventing influenza infections. Influenza vaccines are produced using vaccine ‘seeds,’ which are specific influenza virus strains selected as a starting material for mass vaccine production. Ideal influenza vaccine seeds possess genetic stability, high growth rates in production systems, and strong antigenic similarity to the most prevalent circulating influenza strains, also called “antigenic matching.” To achieve a high antigenic match, vaccine design can look to specific influenza virus proteins. Hemagglutinin (HA) and neuraminidase (NA) are two key proteins on the surface of influenza viruses that significantly affect a virus's ability to infect host cells and spread. However, HA and NA can undergo antigenic drift or shift, enabling viruses to evade host immunity elicited by natural infections and/or vaccinations.
Thus, annual updates of vaccine composition are necessitated to match vaccine seed virus antigenicity with that of circulating viruses. This is a time-consuming process that requires global collaborative efforts coordinated through the World Health Organization (WHO) Global Influenza Surveillance and Response System. Timely selection of effective high-yield vaccine seeds is critical for seasonal influenza vaccine manufacturing. In the 2003-2004 influenza season, the predicted vaccine strain A/Fujian/411/2002-like virus recommended by WHO was unable to grow to sufficient titers in eggs, putting great pressure on virus production. During the 2009 H1N1 pandemic, vaccine supply was delayed due to the poor yield of the vaccine virus, and a global vaccine campaign was not initiated until a high-yield vaccine strain was available after the second pandemic wave.
Extant strategies for finding vaccine seeds that exhibit high stability, high yield, and high antigenic matching have significant shortcomings. The conventional strategy to achieve high yield, for example, often involves additional passages in eggs or cells, and genetic approaches rely on reassortment with a donor strain that exhibits high yield traits in eggs or cells. Both approaches may take up to 6 months, and both have limitations. Egg or cell adaptation can result in undesired antigenic changes due to additional mutations in HA and/or NA. Genetic modification strategies do not always lead to substantial improvements in yield. Therefore, selecting naturally circulating influenza vaccine strains with high-yield phenotypes directly from clinical samples, without requiring additional engineering, would be ideal and could potentially accelerate the vaccine strain selection process for timely vaccine production.
Over the past few years, several computational models have been developed to identify influenza antigenic variants using genomic sequences. However, none of these models can be used to directly identify antigenically matched, high-yield viruses based on genetic sequences.
Based on the foregoing, there is a need for a new approach to overcome challenges in influenza vaccine strain selection.
Described herein is a machine learning algorithm, Machine-learning Assisted Influenza VaccinE Strain Selection framework (MAIVeSS), that enables streamlined selection of naturally circulating, antigenically matched, and high-yield influenza vaccine strains directly from clinical samples by using molecular signatures of antigenicity and virus yield in the hemagglutinin (HA) of influenza A virus. Using publicly available sequences, MAIVeSS predicted potential seed viruses with antigenicity matching that of the 2009 H1N1 viruses (A(H1N1)pdm09) in circulation. Wet-lab experiments confirmed that these seed viruses grew in high yield in both cells and eggs. MAIVeSS can potentially reduce the influenza high-yield seed vaccine selection time from months to just a few days and thus facilitate timely supply of seasonal vaccines.
In one particular aspect, the present disclosure is directed to a method for identifying one or more preferred viral strains for vaccine development, said method comprising: sequencing hemagglutinin present in one or more circulating viral strains to generate one or more circulating hemagglutinin sequences; sequencing hemagglutinin present in each of one or more candidate viral strains to generate one or more candidate hemagglutinin sequences; providing an input to a machine learning algorithm, wherein the input comprises the one or more circulating hemagglutinin sequences and the one or more candidate hemagglutinin sequences; using the machine learning algorithm to predict one or more desired phenotypes selected from the group consisting of: antigenic difference values between each of the one or more circulating viral strains and each of the one or more candidate viral strains; and an egg yield value, a cell yield value, and a combined yield value for each of the one or more candidate viral strains; identifying one or more preferred viral strains from among the one or more candidate viral strains by identifying each of the one or more candidate viral strains that are predicted to have one or more desired phenotypes selected from the group consisting of: low antigenic difference values; and a high egg yield value, cell yield value, or combined yield value.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Corresponding reference numerals will be used throughout the several figures of the drawings.
The following detailed description illustrates the claimed invention by way of example and not by way of limitation. This description will clearly enable one skilled in the art to make and use the claimed invention, and describes several embodiments, adaptations, variations, alternatives and uses of the claimed invention, including what is presently believed is the best mode of carrying out the claimed invention. Additionally, it is to be understood that the claimed invention is not limited in its applications to the details of construction and the arrangements of components set forth in the following description or illustrated in the drawings. The claimed invention is capable of other embodiments and of being practiced or being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.
The methods for identifying preferred viral strains for influenza vaccine development disclosed herein require the development of a machine-learning framework referred to herein as MAIVeSS. Thus, the following disclosure will provide methods of construction of the described machine-learning framework. Following this, methods of use of the MAIVeSS framework for identifying vaccine seed virus candidates will be provided.
The exemplary mutant library 111 comprises 822 plasmids with each plasmid carrying between one and seven random mutations within or near the HA receptor binding site (RBS). However, one of ordinary skill can envision a different size of the mutant library 111 as well as a different number of random mutations without deviating from the principle of the described MAIVeSS framework. Thus, in various exemplary embodiments, the mutant library 111 can comprise more than 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, or 10,000 genetic materials such as plasmids, and each genetic material can carry 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100 or more mutations within or near the HA RBS.
In the phenotyping step 120, the mutant library 111 is subjected to desired phenotype analyses such as antigenic analyses 121, yield analyses 122, and glycan profiling 123. The exemplary antigenic analyses 121 provide information on the extent of antigenic relationships between genetic materials in the mutant library 111 as well as the level of antigenic drift. The antigenic analyses 121 first require that viruses are generated from corresponding genetic material in the mutant library 111. A total of 189 corresponding mutant viruses are generated from the 822 plasmids in the mutant library 111 via reverse genetics. After the viruses are sequenced, the antigenic analyses 121 of these viruses are conducted by performing hemagglutination inhibition (HAI) assays using ferret antisera. By comparing antigenic relationships of the viruses generated from the mutant library 111 against known wild influenza strains, for example, one can identify which viruses have the most closely matched antigenicity to the wild influenza strains. One of ordinary skill in the art can envision variations of the antigenic analyses 121 that are considered to be within the scope of the present description. Thus, in various exemplary embodiments, one could generate up to 10 viruses, more than 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 150, 200, 300, 400 or more viruses. In various exemplary embodiments, the antigenic analyses 121 could make use of other animal antisera and/or assays that target other proteins relevant to antigenicity, such as neuraminidase.
The exemplary yield analyses 122 are performed in order to determine how substitutions present in viruses generated from the mutant library 111 affect the yields of those viruses in both cells and eggs when compared to parent wild type virus. The yield analyses 122 provide information on ideal vaccine seed virus candidates from the perspective of ease of production, with high-yield production in both cells and eggs being ideal.
The exemplary glycan profiling 123 is performed in order to acquire more data potentially relevant to viral yield, which can correlate with glycan substructure binding properties. The glycan profiling 123 is thus conducted by analyzing the receptor binding properties of the mutant viruses using glycan microarrays comprising 75 glycoforms. This is followed by a matrix of 28 glycan substructure features to group glycans based on their internal and terminal substructures. The glycan profiling 123 also comprises biolayer interferometry assays (BLI) which are used for determining virus receptor binding affinities. Clearly, the above example of the glycan profiling 123 cites specific values that one of ordinary skill in the art will recognize are significantly alterable without departing from the principle of the present description. Thus, in various exemplary embodiments, one can perform glycan profiling using up to 10, more than 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 300, 500, 1000, or more glycoforms, and by studying 10, more than 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 300, 500, 1000, or more glycan substructure features. Additionally, this description contemplates that the glycan profiling 123 could utilize other forms of analysis including, for example, instrumental analyses such as surface plasmon resonance (SPR), Biolayer Interferometry (BLI) and computational methods such as molecular dynamics simulations.
Once the phenotyping step 120 is complete and data related to the desired phenotypes such as antigenicity, yield, and glycan binding properties of mutant viruses generated from the mutant library 111 have been collected, the feature selection step 130 takes place. The feature selection step 130 considers which features are to be mathematically integrated into a MAIVeSS predictive model. Thus, the feature selection step 130 comprises selection of exemplary antigenicity features 132, yield features 133, and receptor binding features 134, all of which are selected by a sparse learning model 131. The sparse learning model 131 enables identification, from all collected genetic data, of the key genetic features that contribute to the phenotypes observed in the phenotyping step 120. Briefly, the sparse learning model 131 uses a linear regression loss function with regularization, which permits determination of the most relevant genetic features associated with a given phenotype. The sparse learning model 131 takes into account genetic distance matrices among proteins or glycan sequences, phenotypic differences, and sample numbers. Specific embodiments of the sparse learning model 131 used in selecting antigenicity features 132, yield features 133 and receptor binding features 134 are described in detail in the Examples below.
The exemplary antigenicity features 132 comprise two sets of features, which are amino acid mutations and N-glycosylation sites. In various exemplary embodiments, these two sets of features are further narrowed so that the antigenicity features 132 specifically comprise residues predicted to be on the surface of HA.
The exemplary yield features 133 are selected to identify mutations associated with virus yields in cells as well as in eggs. The yield features 133 utilize two sets of features: amino acid substitutions and N-glycosylation sites. The sparse learning model 131, in various embodiments, is also used to identify synergistic amino acid substitutions associated with virus yield in eggs or cells.
The exemplary receptor binding features 134 are selected with the goal of determining the substructures associated with yield traits in cells and eggs. The receptor binding features 134 comprise terminal substructures, internal substructures, and base substructures which are linked to the glycan microarray.
Once the sparse learning model 131 has been used for feature selection, a predictive model 141 is generated. Essentially, the features selected via the sparse learning model 131 form the basis for the predictive model 141. The predictive model 141 is developed to estimate the antigenic distance between two viruses based on their genetic sequences. Thus, in various exemplary embodiments, HA protein sequences can be used as input data to the predictive model 141. Output from the predictive model 141 includes quantified antigenic distance data 142 and yield difference data 143. Output from the predictive model 141 is calculated on the basis of a genetic distance vector between two viruses, a predicted antigenic distance between the two viruses, a global weight representing an average of weights across different tasks, weights from each individual task, and a weighting parameter μ.
The predictive model 141, when used to predict antigenic distance data 142, is generally defined according to Equation 1 below:
ŷ = x(μwglobal + (1−μ)wlocal)   (Equation 1)
Output antigenic distance data 142 from the predictive model 141 as calculated according to Equation 1 considers a genetic distance vector between two viruses (x), a predicted antigenic distance between the two viruses (ŷ), a global weight representing an average of weights across different tasks (wglobal), weights from each individual task (wlocal), and a weighting parameter μ.
The predictive model 141, when used to predict yield difference data 143, also incorporates a scoring function that is generally defined according to Equation 2 below:
Output yield difference data 143 from the predictive model 141 as calculated incorporating Equation 2 considers w and z as weight and feature matrices, respectively, from the sparse learning model 131. Specific examples of the use and incorporation of Equation 2 into the predictive model 141 are described in detail in the Examples below.
The MAIVeSS method can be used to predict preferred viral strains for vaccine development by focusing analysis on key amino acid residues it has identified to be related to antigenicity and yield. That is, the antigenicity features 132, yield features 133 and receptor binding features 134 selected by the sparse learning model 131 can be used by the predictive model 141 to predict preferred viral strains via output quantified antigenic distance data 142 and yield difference data 143.
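By way of illustration only, the selection logic described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Candidate` record, the `select_preferred` function, the strain names, and both cutoff values are hypothetical and are not taken from the present disclosure; the predicted values would in practice come from the predictive model 141.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    antigenic_distance: float  # predicted distance to the circulating strain (lower is better)
    yield_difference: float    # predicted yield relative to a reference (higher is better)


def select_preferred(candidates, max_distance, min_yield_difference):
    """Keep candidates with a low predicted antigenic distance and a high predicted yield.

    Both cutoffs are hypothetical parameters chosen by the user.
    """
    kept = [
        c for c in candidates
        if c.antigenic_distance <= max_distance
        and c.yield_difference >= min_yield_difference
    ]
    # Rank by antigenic match first, then by yield (largest first).
    return sorted(kept, key=lambda c: (c.antigenic_distance, -c.yield_difference))


# Hypothetical predictions for three made-up candidate strains.
candidates = [
    Candidate("strain_A", antigenic_distance=0.8, yield_difference=1.2),
    Candidate("strain_B", antigenic_distance=3.1, yield_difference=2.0),
    Candidate("strain_C", antigenic_distance=0.5, yield_difference=0.4),
]
preferred = select_preferred(candidates, max_distance=2.0, min_yield_difference=0.0)
print([c.name for c in preferred])
```

Sorting on antigenic distance before yield is only one possible design choice; a combined score could be substituted without changing the overall flow.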
Hence, in various exemplary embodiments, one can use the amino acid residues associated with antigenicity in Table 1 below to assess antigenicity of vaccine candidate viruses and thus predict which viruses will have acceptable or optimal antigen matching with the most prevalent wild strains of a virus. In Table 1, the following abbreviations are used to further describe residues: antibody binding site (ABS); receptor binding site (RBS); N-linked glycosylation (Gly); Calcium binding site 1 (Ca1); Calcium binding site 2 (Ca2); Site A (Sa); and Site B (Sb). Bootstrap values shown in Table 1 were derived from 100 independent experiments, each with 80% of the training data.
Global weights (wglobal) as listed in Table 1 were learned from an application of the sparse learning model 131 called MTL-GGSL, further detailed in the Examples below, and absolute local weights for each individual task are provided in U.S. provisional patent application Ser. No. 63/578,043, which is hereby incorporated by reference in its entirety.
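The bootstrap values described for Table 1 can be illustrated with a short Python sketch. It assumes that each of the 100 runs draws 80% of the samples without replacement and that a bootstrap value is the number of runs in which a feature is selected; both points are assumptions of this sketch. The least-squares selector and the toy data are hypothetical stand-ins for the sparse learning model 131 and the experimental data.

```python
import numpy as np


def bootstrap_selection_counts(X, y, select, n_runs=100, fraction=0.8, seed=0):
    """Count, per feature, how many subsampled runs select that feature."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_sub = int(round(fraction * n_samples))
    counts = np.zeros(n_features, dtype=int)
    for _ in range(n_runs):
        # Assumption: subsample 80% of the rows without replacement.
        idx = rng.choice(n_samples, size=n_sub, replace=False)
        counts += select(X[idx], y[idx]).astype(int)
    return counts


def least_squares_selector(X, y, min_abs_weight=0.1):
    """Hypothetical stand-in selector: ordinary least squares plus a magnitude cutoff."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.abs(w) > min_abs_weight


# Toy data: six binary features, of which only features 1 and 3 affect the phenotype.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(60, 6)).astype(float)
w_true = np.array([0.0, 2.0, 0.0, -1.0, 0.0, 0.0])
y = X @ w_true

counts = bootstrap_selection_counts(X, y, least_squares_selector)
selected = np.flatnonzero(counts >= 80)  # keep features selected in at least 80 of 100 runs
print(counts, selected)
```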
In various exemplary embodiments, one can use the amino acid residues associated with yield in Table 2 below to predict which viruses will have acceptable or optimal yields for the purpose of vaccine production.
Table 2 uses the following abbreviations for amino acids: nonpolar (NP), which includes amino acids valine (V), leucine (L), isoleucine (I), methionine (M), cysteine (C), phenylalanine (F), tryptophan (W) and tyrosine (Y); small polar (SP), which includes amino acids glycine (G), alanine (A) and proline (P); and polar/charged (P), which includes amino acids serine (S), threonine (T), asparagine (N), glutamine (Q), histidine (H), aspartic acid (D), glutamic acid (E), lysine (K) and arginine (R). Substitutions listed in Table 2 that are associated with HYboth or LYboth are bolded. A ‘—’ in a cell of Table 2 denotes that it is unknown how the corresponding residue will affect yield.
The following examples comprise descriptions of exemplary embodiments of the herein disclosed method of analysis. These examples are not intended to be limiting or to define the scope of the present disclosure.
Machine learning models have been shown to be effective in identifying antigenicity associated features in protein sequences from different subtypes of influenza A viruses. Machine learning models were developed to identify the specific sequence features in HA proteins that determine three important phenotypes: antigenicity, yield in cells and eggs, and receptor binding. To achieve this, the models were trained on large datasets of HA protein sequences and associated phenotype information. A quantitative function was developed that allows us to measure the distances between sequences based on their phenotypic characteristics. Ultimate goals for these machine learning models are to identify: 1) mutations in the HA RBS that affect virus antigenicity; 2) mutations in the HA RBS that increase or decrease virus yields in cells and/or eggs; and 3) specific glycan substructures (glycan motifs) on the surface of cells or eggs that are associated with increased yields of influenza virus. By achieving these goals, the hope is to gain a better understanding of the molecular determinants of these important viral phenotypes and to identify potential targets for the development of improved influenza vaccines.
The problem of identifying genetic features associated with influenza virus phenotypes was approached using a sparse learning model. Mathematically, this model involves a linear regression loss function with regularization, which allows us to determine the most relevant genetic features associated with a given phenotype. The sparse learning model combines a least squares loss with a regularized term and takes into account genetic distance matrices among HA proteins or glycan sequences (denoted as X), phenotypic differences (denoted as y), and sample numbers (denoted as N). This approach enables us to identify the key genetic features that contribute to different influenza virus phenotypes, such as antigenicity, yield, and receptor-binding.
The objective of the sparse learning model is to solve: min_w L(X, y, w)+λR(w), where L(X, y, w) is the loss function, λ is a pre-defined regularization parameter, R(w) denotes the regularization term, and w denotes the numerical weights of individual features (either a single residue or a group of neighboring residues). Absolute values of the weights indicate the impact of each mutation of a specific feature on phenotypes (i.e., antigenic, yield, and receptor-binding properties). The larger the absolute weight, the greater the impact.
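As a non-limiting illustration of this kind of objective, the following Python sketch minimizes a least squares loss with an L1 regularization term by proximal gradient descent. The choice of R(w) as the L1 norm, the solver, the 1/(2N) scaling of the loss, and the toy data are all assumptions of this sketch; they are not the specific models of the present disclosure, which are described elsewhere in these Examples.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_fit(X, y, lam, n_iter=2000):
    """Minimize (1 / (2N)) * ||X w - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    # Step size 1/L, where L is the Lipschitz constant of the gradient of the smooth loss.
    step = n_samples / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n_samples
        w = soft_threshold(w - step * grad, step * lam)
    return w


# Toy data: ten binary features, of which only features 1 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 10)).astype(float)
w_true = np.zeros(10)
w_true[[1, 4]] = [1.5, -2.0]
y = X @ w_true

w_hat = sparse_fit(X, y, lam=0.05)
print(np.round(w_hat, 2))
```

The features with the largest absolute weights in `w_hat` are the ones the sketch treats as most relevant, mirroring the interpretation of weights given above.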
Based on the features learned from sparse learning, a predictive model was developed to assess antigenic or yield properties given HA sequences. Specifically, ŷ=xw, where ŷ is the predicted phenotypic distance (either antigenicity or yield) between the two viruses; x is the feature distance vector; and w is the weight vector for those features, which can be associated with either antigenicity or yield.
Multi-task learning group-guided sparse learning (MTL-GGSL) model. To address the challenges associated with integrating serological data generated from different platforms (e.g., turkey and guinea pig red blood cells), a Multi-Task Learning (MTL) approach was utilized with Group Graphical Sparse Learning (GGSL) to analyze antigenicity. This approach allowed us to consider both N-linked glycosylation and amino acid features when analyzing the data. MTL allows us to learn multiple related tasks (i.e., analyzing antigenicity from different serological platforms) simultaneously, while GGSL considers the dependencies between different groups of features to improve the accuracy of the analysis. By utilizing MTL-GGSL, one is able to overcome the challenges associated with integrating data from different platforms and provide a more comprehensive analysis of antigenicity.
One advantage of using the group LASSO regularization in MTL-GGSL for antigenicity analyses is that it encourages multiple predictors from related tasks to share a subset of features. This is in contrast to the LASSO regularization, which may lead to sparse solutions where only a few features are selected for each task independently. A previous study has shown that incorporating information on N-linked glycosylation can improve the performance of sparse learning models in predicting antigenic properties of influenza viruses.
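The contrast between the two regularizers can be seen in their proximal (shrinkage) operators, sketched below in Python. This covers only the group LASSO ingredient and is not the full MTL-GGSL objective; the weight matrix `W` (rows are features, columns are tasks) and the threshold are made-up values for illustration.

```python
import numpy as np


def prox_l1(W, t):
    """LASSO shrinkage: each entry is thresholded independently."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)


def prox_l21(W, t):
    """Group LASSO shrinkage: each row (one feature across all tasks) is kept or dropped together."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return W * scale


# Hypothetical weights for three features (rows) across three tasks (columns).
W = np.array([
    [0.90, 0.10, 0.80],
    [0.20, 0.10, 0.10],
    [0.05, 1.20, 0.00],
])

print(prox_l1(W, 0.3))   # feature 0 is dropped for task 1 only
print(prox_l21(W, 0.3))  # feature 0 is kept for every task; feature 1 is dropped for every task
```

In this toy case the row-wise operator removes feature 1 from all tasks at once while retaining feature 0 in all tasks, which is the feature-sharing behavior described above.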
By adopting MTL-GGSL, one is able to integrate information on both glycosylation and amino acid sequences from serological data generated using different platforms, which can further enhance the accuracy of the predictive models described herein for influenza antigenicity.
Specifically, the following was defined:
and the model is formulated as:
where λ1, λ2, and λ3 are regularization parameters, j is the subscript for feature, p is the total number of features, Gl denotes the feature group, q is the number of feature groups, αl=√(ml) is the weight of feature group Gl; Wj denotes the weights for the j-th feature among different tasks, and WGl,t is the weight for feature group Gl of the t-th task. The Alternating Direction Method of Multipliers (ADMM) was employed to solve the optimization problem.
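For readers unfamiliar with ADMM, the following Python sketch applies it to a plain single-task LASSO problem. It is an assumption-laden illustration of the splitting idea only: the objective, the penalty parameter `rho`, and the toy data are hypothetical, and the sketch does not implement the MTL-GGSL objective with its three regularization terms.

```python
import numpy as np


def admm_lasso(X, y, lam, rho=1.0, n_iter=500):
    """Minimize 0.5 * ||X w - y||^2 + lam * ||z||_1 subject to w = z, using ADMM."""
    n_features = X.shape[1]
    z = np.zeros(n_features)
    u = np.zeros(n_features)  # scaled dual variable
    # The linear system of the w-update does not change, so its pieces are precomputed.
    A = X.T @ X + rho * np.eye(n_features)
    Xty = X.T @ y
    for _ in range(n_iter):
        w = np.linalg.solve(A, Xty + rho * (z - u))               # smooth least squares step
        v = w + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)   # soft-thresholding step
        u = u + w - z                                             # dual update
    return z


# With an identity design matrix the LASSO solution is the soft-thresholded response.
X = np.eye(3)
y = np.array([3.0, -0.5, 1.2])
w_hat = admm_lasso(X, y, lam=1.0)
print(np.round(w_hat, 4))
```

ADMM is attractive for composite penalties because each regularization term can be given its own simple update of this kind.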
To consider synergistic effects of multiple features on the phenotypes, GHSM was adopted. The GHSM model aims to minimize:
GHSM model solves the following objective:
Above, λ and α are two regularization parameters controlling the sparsity and the decay in the coefficients for interactions of different orders, Z<i
K is a vector of length
with w<i
In this Example, eight individual tasks were used, each corresponding to an individual HAI dataset, including those for seasonal H1N1 viruses (1977-2009), 2009 H1N1 viruses (2009-2020), swine H1N1 viruses, and mutants generated above. In each task, the low-rank matrix completion algorithm was used to minimize data noise and the challenges derived from low reactors and missing values in the HAI datasets, and antigenic cartography was then used for antigenic distance calculation. Two groups of features (i.e., amino acid mutations and N-glycosylation sites) were used in the model to quantify influenza antigenic distances. 327 residue features and 6 N-glycosylation site features were defined. GETAREA software was used to predict whether or not residues were on HA's surface. The A(H1N1)pdm09 three-dimensional HA structure (Protein Data Bank [PDB] identifier [ID] 3LZG) was used as the template. A total of 138 residues were predicted to be located at the HA protein's surface (
In this Example, the yields of 189 mutants in both cells and eggs were compared with those of the parent wild-type virus. To analyze the data, two groups of features were utilized: amino acid substitutions and N-glycosylation sites. Furthermore, the GHSM approach was employed to identify synergistic amino acid substitutions associated with virus yield in eggs or cells.
In this Example, a glycan microarray with 75 glycoforms was used, which were grouped based on their internal and terminal substructures and linkers into a matrix of 28 glycan substructure features. The Multi-Task Learning with Group Graphical Sparse Learning (MTL-GGSL) approach was then used to determine the substructures associated with yield traits in cells and eggs. In the model, three groups of features were employed, including terminal substructures (n=17), internal substructures (n=8), and base substructures linked to the array (n=3).
In order to ensure the robustness of these analyses, these results were compared with three other sparse models: the L1-norm regularized method (LASSO), the L2-norm regularized method (RIDGE), and the sparse group LASSO method (SGL). Results were also compared with two additional methods that combine two norms: the L1- and L2-norm regularized method and the L1- and L∞-norm Composite Absolute Penalties method (iCAP).
To investigate the effect of amino acid substitutions on both yield and glycan binding phenotype, a grouping method for amino acids was employed. Each amino acid was assigned to one of three groups based on its biophysical properties: nonpolar (V, L, I, M, C, F, W, and Y), small polar (G, A, and P), and polar/charged (S, T, N, Q, H, D, E, K, and R). Each HA protein sequence was encoded into a vector by comparison with a wild-type sequence: if a mutation occurred in residue j (e.g., nonpolar to small polar), the j-th element of the vector was encoded to 1; otherwise, it was encoded to 0. To evaluate the directionality of amino acid substitutions on both yield and glycan binding phenotype, three different sparse models (LASSO, RIDGE, and SGL) were used, and parameter optimization and bootstrap analyses were performed. In brief, all features meeting a bootstrap value cutoff of 80 from 100 independent runs were selected.
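The grouping and encoding step can be sketched in Python as follows. The sketch adopts one possible reading of the encoding, namely that position j is set to 1 only when the substitution moves the residue from one biophysical group to another; that reading, the function name `encode_group_changes`, and the four-residue toy sequences are assumptions made for illustration.

```python
# Three biophysical groups, labeled as in Table 2: NP, SP, and P (polar/charged).
GROUPS = {
    **{aa: "NP" for aa in "VLIMCFWY"},
    **{aa: "SP" for aa in "GAP"},
    **{aa: "P" for aa in "STNQHDEKR"},
}


def encode_group_changes(wild_type, mutant):
    """Return a 0/1 vector marking positions whose residue changed biophysical group.

    Assumption: a substitution within the same group is encoded as 0.
    """
    if len(wild_type) != len(mutant):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        1 if GROUPS[wt] != GROUPS[mt] else 0
        for wt, mt in zip(wild_type, mutant)
    ]


# Toy aligned fragments: V->A crosses groups (NP to SP); S->T stays within polar/charged.
print(encode_group_changes("VGSK", "AGTK"))
```

A directional variant could instead emit one feature per ordered pair of groups (for example, NP to SP), which would preserve the direction of each substitution.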
In this Example, a predictive model was developed to estimate the antigenic distance between two viruses based on their genetic sequences. The model was defined as follows: ŷ=x(μwglobal+(1−μ)wlocal), where x is the genetic distance vector between the two viruses, ŷ is the predicted antigenic distance between them, wglobal is the global weight representing the average of weights across different tasks, wlocal indicates the weights from each individual task, and μ is set to 0.4 to balance the global and local weights.
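The model defined above can be evaluated with a few lines of Python. The function name and the three-feature weight vectors below are made-up values for illustration; only the formula and the default μ of 0.4 come from this Example.

```python
import numpy as np


def predict_antigenic_distance(x, w_global, w_local, mu=0.4):
    """Compute y_hat = x . (mu * w_global + (1 - mu) * w_local)."""
    x = np.asarray(x, dtype=float)
    w_global = np.asarray(w_global, dtype=float)
    w_local = np.asarray(w_local, dtype=float)
    combined = mu * w_global + (1.0 - mu) * w_local
    return float(x @ combined)


# Hypothetical example: two viruses differ at features 0 and 2.
x = [1, 0, 1]
w_global = [0.5, 0.2, 1.0]
w_local = [1.0, 0.0, 0.5]
print(predict_antigenic_distance(x, w_global, w_local))
```

With these toy numbers the combined weights are 0.8, 0.08, and 0.7, so the predicted distance is 0.8 + 0.7 = 1.5.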
In addition, a scoring function was proposed to measure yield differences between two viruses based on their amino acid sequences. The scoring function is defined as follows:
Here, w and z were the weight and feature matrices used in the GHSM approach mentioned above. The detailed prediction results for both the antigenic distance and yield differences are presented in U.S. provisional patent application Ser. No. 63/578,043, referenced herein.
The human embryonic kidney (293T) cells and Madin-Darby canine kidney (MDCK) CCL-34 cells were obtained from the American Type Culture Collection (Manassas, VA). The cells were maintained in Dulbecco's Modified Eagle Medium (GIBCO/BRL, Grand Island, NY) supplemented with 5% fetal bovine serum (Atlanta Biologicals, Lawrenceville, GA) and penicillin-streptomycin (Invitrogen, Carlsbad, CA) at 37° C. with 5% CO2. The HA gene of CA/04 was cloned into the vector pHW2000 and used as a template to construct the mutant library. The viruses generated by reverse genetics were propagated in MDCK cells and cultured at 37° C. with 5% CO2 in Opti-MEM medium (GIBCO/BRL, Grand Island, NY) supplemented with 1 μg/ml of TPCK (N-tosyl-L-phenylalanine chloromethyl ketone)-Trypsin (Sigma-Aldrich, St. Louis, MO) and penicillin-streptomycin (Invitrogen, Carlsbad, CA). The virus titers were determined by TCID50 in MDCK cells.
Serologic data for H1N1 viruses were collected from data described elsewhere including 2,030 HAI titers generated between 153 viruses and 97 serum samples. A total of 3,080 non-identical 2009 H1N1 protein sequences from 2019 to 2020 were obtained from GISAID (https://gisaid.org).
The mutant plasmid library with random mutations in the HA RBS was generated using the epPCR strategy, as previously described. Four primers were used to generate the HA-pHW2000 RBS mutant library: 1) 130loop_F: 5′-TCA TGG CCC AAT CAT GAC TCG AAC-3′ (SEQ ID NO:1); 2) 190helix_F: 5′-TGG GGC ATT CAC CAT CCA TCT ACT-3′ (SEQ ID NO:2); 3) 190helix_R: 5′-AAC ATA TGT ATC TGC ATT CTG ATA-3′ (SEQ ID NO:3); and 4) 220loop_R: 5′-TAG TGT CCA GTA ATA GTT CAT TCT-3′ (SEQ ID NO:4). The epPCR product (2 μl) was transformed into XL1-Blue Supercompetent Cells (Agilent Technologies, Santa Clara, CA). The transformed cells were directly inoculated onto LB (Luria Bertani) agar plates, and the clones were propagated in 5 ml of LB media. The clones generated from the RBS mutant library were confirmed by Sanger sequencing using the sequencing primer 5′-GAA CGT GTT ACC CAG GAG ATT-3′ (SEQ ID NO:5). Mutant viruses were rescued by plasmid-based reverse genetics with the NA genes from CA/04 and six internal genes from PR8, as described elsewhere.
To compare the phenotypes of the predicted vaccine candidates, a wild-type reassortant virus (rg-WT) was also generated with wild-type HA and NA genes from CA/04 and six internal genes from PR8 using reverse genetics.
To validate the antigenic and high-yield properties of the viruses predicted by the computational model described herein, the HA and NA genes for four potential vaccine candidates were synthesized from epidemic strains (Gene Universal Inc., Newark, DE), and reassortant viruses were then generated with the HA and NA from each of these testing epidemic strains and the six internal genes from PR8 using reverse genetics: A/Saint-Petersburg/R1157/2016(H1N1)(HA,NA)×PR8 (rgSP/16), A/Chongqing-Yuzhong/SWL1453/2017(H1N1)(HA,NA)×PR8 (rgCQ/17), A/Brunei/25/2019(H1N1)(HA,NA)×PR8 (rgBRU/19), and A/Malaysia/33075487/2020(H1N1)(HA,NA)×PR8 (rgMAS/20).
To evaluate the effect of mutations on viral yield, cell culture assays and embryonated egg assays were performed. For the cell culture assays, MDCK cells were inoculated with each influenza virus at a multiplicity of infection of 0.001 TCID50 and incubated at 37° C. with 5% CO2 for 1 hour. After incubation, the inocula were removed, and the cells were washed twice with phosphate-buffered saline (PBS). Then, the cells were incubated with Opti-MEM I (GIBCO, Grand Island, NY) containing TPCK-trypsin (1 μg/ml) at 37° C. with 5% CO2. After 48 hours, 200 μl of supernatants were collected, aliquoted, and stored at −80° C. until use. For the embryonated egg assays, 9-day-old specific pathogen-free chicken eggs were inoculated with 200 TCID50 of each virus and incubated at 37° C. for 72 hours, and allantoic fluid was collected. The viral titers in the samples from both the MDCK cells and the embryonated eggs were determined using TCID50 assays in MDCK cells.
Viruses for the glycan microarray analysis were purified as previously described. Briefly, viruses were purified from the cell supernatant or allantoic fluid by low-speed clarification (2,482×g, 30 min, 4° C.) to remove debris and then followed by ultracentrifugation through a cushion of 30%-60% sucrose in a 70 Ti Rotor (Beckman Coulter, Fullerton, CA) (100,000×g, 3 h, 4° C.). The virus pellet was re-suspended in 100 μl of PBS and stored at −80° C. until use.
To identify unique substructures bound by specific sets of mutants, a glycan microarray with 75 glycoforms was printed on N-hydroxysuccinimide (NHS)-derivatized slides as described previously.
The 75 glycans were selected to represent four different glycan categories, including N-glycans, Asn-linked N-glycans, Gangliosides, Thr-linked O-mannosyl glycans (
Haemagglutination and HAI assays.
Haemagglutination and HAI assays were performed by using 0.5% turkey erythrocytes as described by the WHO Global Influenza Surveillance Network Manual for the Laboratory Diagnosis and Virological Surveillance of Influenza. Turkey erythrocytes were obtained from Lampire Biological Products (Everett, PA). The turkey erythrocytes were washed three times with 1×PBS (pH 7.2) before use and then diluted to 0.5% in 1×PBS (pH 7.2).
The virus receptor binding affinities were determined by BLI with an Octet RED instrument (Pall ForteBio, Menlo Park, CA). Five biotinylated glycan analogs, Neu5Acα2-3Galβ1-4GlcNAcβ-PAA-biotin (3′SLN), Neu5Acα2-6Galβ1-4GlcNAcβ-PAA-biotin (6′SLN) (Lectinity Holdings, Moscow, Russia), Neu5Acα2-3Galβ1-4(Fucα1-3)GlcNAcβ-PAA-biotin (sLeX), Neu5Gcα2-3Galβ1-4GlcNAcβ-PAA-biotin (3′SLN(Gc)), and Neu5Gcα2-3Galβ1-4(Fucα1-3)GlcNAcβ-PAA-biotin (sLeX(Gc)), were used. Among them, sLeX, 3′SLN(Gc), and sLeX(Gc) were synthesized. The glycans were preloaded onto streptavidin-coated biosensors at up to 0.3 μg/ml for 3 minutes in 1×kinetic buffer (Pall ForteBio, Menlo Park, CA). Each test virus was diluted to a final concentration of 100 pM with 1×kinetic buffer containing 10 μM oseltamivir carboxylate (American Radiolabeled Chemicals, St. Louis, MO) and zanamivir (Sigma-Aldrich, St. Louis, MO) to prevent cleavage of the receptor analogs by the viral NA proteins. Association was measured for 30 minutes at 25° C. Responses were normalized by the highest value obtained during the experiment, and binding curves were fitted by using the binding-saturation method in GraphPad Prism 8. The normalized response curves report the fractional saturation (f) of the sensor surface as described elsewhere. The RSL0.5 values were calculated to determine the binding affinity between a virus and glycan analog pair, using the binding-saturation method in GraphPad Prism 8 software. Higher RSL0.5 values indicate weaker binding affinity between the virus and glycan analog.
The three-dimensional structure of HA protein was modeled based on the crystal structure of CA/04 HA in complex with 6′SLN (PDB ID #3UBN) and 3′SLN (PDB ID #3UBQ). Coot was first used to introduce the desired mutation to the three subunits of a HA trimer. The mutated coordinates were subsequently refined by energy minimization using Phenix (https://phenix-online.org). Structure figures were made using Pymol (The PyMOL Molecular Graphics System, Version 1 0.3, Schrödinger, LLC).
Serological data for exemplary vaccine candidates generated using the methods described herein are provided in Table 3 below. The HAI assays were performed in triplicate using 0.5% turkey red blood cells. The homologous HAI titers are highlighted in bold. Ferret antisera were produced by infecting influenza seronegative ferrets (see details in Supplementary Information) or obtained from BEI Resources or International Reagent Resource.
320.00
640.00
1280.00
1280.00
2560.00
This Example aimed to develop MAIVeSS to learn genetic features associated with three key biological properties for influenza viruses: antigenicity, yield and receptor-binding (
Using the features learned, MAIVeSS scored vaccine candidates using a query HA protein sequence based on two properties: (1) antigenic properties related to the prototype vaccine antigen, and (2) yield properties in eggs and/or cells (HYcell, high-yield in cells; HYegg, high-yield in eggs; HYboth, high-yield in both cells and eggs). High-yield is defined as a >10-fold increase in TCID50/mL compared to the wild-type (WT) in the same substrate. By leveraging these predictive models, MAIVeSS can rapidly identify influenza vaccine candidates that are both antigenically matched and high-yield based on genome sequences obtained during surveillance.
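The high-yield definition above can be illustrated with a minimal sketch. The function names `is_high_yield` and `yield_label` are hypothetical and are not part of MAIVeSS; the sketch only applies the stated >10-fold criterion to measured TCID50/mL titers in each substrate and is not the predictive model itself.

```python
def is_high_yield(titer, wt_titer, fold=10.0):
    """True if the titer exceeds the wild-type titer by more than `fold`.

    Both titers are TCID50/mL values measured in the same substrate.
    """
    if titer <= 0 or wt_titer <= 0:
        raise ValueError("titers must be positive")
    return titer / wt_titer > fold


def yield_label(cell_titer, cell_wt, egg_titer, egg_wt):
    """Combine the per-substrate calls into HYboth, HYcell, HYegg, or 'none'.

    Hypothetical helper: the label names follow the text, the logic is
    only the >10-fold rule applied to each substrate independently.
    """
    hy_cell = is_high_yield(cell_titer, cell_wt)
    hy_egg = is_high_yield(egg_titer, egg_wt)
    if hy_cell and hy_egg:
        return "HYboth"
    if hy_cell:
        return "HYcell"
    if hy_egg:
        return "HYegg"
    return "none"
```

A strain whose titer is exactly 10-fold above WT is not labeled high-yield in this sketch, because the definition above uses a strict inequality.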
The effectiveness of the machine learning models described herein using A(H1N1)pdm09 viruses was studied as an exemplary application, but the same principles can be readily applied to other subtypes of influenza viruses.
Development of an A(H1N1)pdm09 Mutant Library for Machine Learning
To enhance the reliability of feature selection for high-yield viruses, a random mutant virus library that targets the HA receptor binding site (RBS) of A/California/04/2009(H1N1) (CA/04) was established. All the mutants were subjected to antigenic analyses via hemagglutination inhibition (HAI) assays, yield analyses in both MDCK cells and embryonated chicken eggs, and receptor-binding profiling through glycan microarrays. The phenotypic data collected were then used as training and testing data in MAIVeSS to identify the molecular features associated with antigenicity and yield and to establish predictive models.
A total of 822 plasmids were generated, each carrying one to seven random mutations within or near the HA RBS (residues 119-241, H1 numbering; 126-244, H3 numbering). Using these mutant plasmids, corresponding mutant viruses were then generated via reverse genetics. Rescued mutant viruses bear the NA gene from CA/04 and the remaining 6 gene segments from A/Puerto Rico/8/1934(H1N1) (PR8). After three passages, a total of 189 mutant viruses bearing unique amino acid substitutions with different biochemical properties were generated (
To determine the antigenic properties of the mutant viruses generated, HAI assays were performed using ferret antisera. Out of 189 mutant viruses, only 5 mutants had significant changes in their antigenic properties, showing a ≥4-fold reduction in their HAI titers compared to WT. These 5 antigenically distinct mutants had at least one substitution in the HA RBS, with other substitutions mostly present within or close to the Ca1, Ca2, Sa, or Sb antigenic sites. Of note, the ferret sera generated against WT CA/04 were unable to neutralize the triple mutant D131E-S193T-A198S. The serological data of the 189 mutants were then integrated with archived public data for seasonal H1N1 (1977-2009) and 2009 H1N1 viruses (2009-2016), and MAIVeSS was applied to identify residues associated with antigenicity. Results showed that 30 residues were associated with the antigenicity of H1N1 IAV (
The substitutions were then assessed for how they affected virus yield in both cells and eggs by measuring the TCID50 titers for each mutant. 14 HYcell mutants were identified that showed at least a 10-fold increase in virus yield compared to WT, as well as 29 LYcell mutants that showed at least a 10-fold decrease (
The highest yield was observed in the N159D-K166I mutant, with a yield of 1.52×10⁷ TCID50/mL, which was about 100-fold higher than WT. Additionally, 33 HYegg mutants and 19 LYegg mutants were identified when compared to WT. The D131E-S193T-A198S, N159D-K166I, and I169F-D225G mutants had the highest titers in eggs, approximately 800-fold higher than WT. Of note, these three mutants also exhibited high-yield traits in cells and were thus designated as HYboth. MAIVeSS was utilized to identify substitutions at 38 residues that were associated with virus yield. The majority of these residues were located on the surface of the HA trimer and in close proximity to the RBS pocket but distant from the pocket center (
Diversified glycan binding facilitates virus replication in cells and eggs. To investigate if high-yield trait correlates with glycan substructure binding properties, the receptor-binding properties of the 189 mutant viruses were analyzed using glycan microarrays comprising 75 glycoforms (
A matrix of 28 glycan substructure features was further used to group the glycans based on their internal and terminal substructures as well as their linkers (
By employing biolayer interferometry analyses (BLI) for glycan binding profiling, the broadened binding specificity of the HYboth mutant D131E-S193T-A198S was confirmed. Specifically, it was demonstrated that this mutant not only binds to 6′SLN, but also to 3′SLN and sLeX (
MAIVeSS was used to identify the glycan substructures associated with yield traits in cells and eggs and the amino acid substitutions associated with binding preference to these glycan substructures. Analysis revealed several glycan terminal substructures that were significantly associated with high-yield traits, including 6′SLN, 3′SLN, sLeX, and Neu5Gcα2-6Galβ1-4GlcNAc. Additionally, it was found that certain internal substructures, such as core lactose, GlcNAcβ1-2, and Galα1-4Galβ1-4GlcNAc, had a significant impact on glycan binding.
A subset of antigenically matched A(H1N1)pdm09 epidemic viruses were high-yield in both cells and eggs. MAIVeSS was used to assess both yield and antigenic properties for A(H1N1)pdm09 (2009-2020, n=11,424) in eggs and cells, in comparison to WT CA/04. Using the antigenic distance matrix generated by MAIVeSS, a sequence-based antigenic cartography was used, which revealed two antigenic clusters, CA/09 and WI/19 (
Using MAIVeSS as the prediction tool, a total of 155 virus variants were identified as potential high-yield strains in eggs, 433 in cells, and 761 for both. Among those high-yield strains for both eggs and cells, 294 were CA/09-like viruses (38.6%), while 467 were WI/19-like viruses (61.4%). These high-yield strains were not geographically clustered and were scattered sporadically across the phylogenetic trees, without clear association with any particular lineages (FIG. 3B). However, the number of HYboth strains increased significantly after the emergence of WI/19-like variants (
Multiple amino acid substitutions associated with yield properties were observed in these HYboth strains, but there were no consistent patterns observed across influenza seasons. However, after the 2018-2019 influenza season, viruses with K133aN, N159K/D/S, K166Q, S206T, and/or K214R were more likely to be high-yield strains (
To further validate the model, HA and NA genes for 4 predicted vaccine candidates were synthesized, 4 reassortant viruses (i.e., rgSP/16, rgCQ/17, rgBRU/19, and rgMAS/20) were subsequently generated with PR8 as the backbone, and their antigenic and yield phenotypes were determined. Antigenically, 2 viruses were shown as CA/04-like and the other 2 as WI/19-like (
Taken together, the findings indicate that the high-yield trait of A(H1N1)pdm09 viruses was sporadically distributed across different antigenic clusters and has become more prevalent since 2018. These experimental results confirm MAIVeSS's ability to identify antigenic matches and high-yield vaccine strains for A(H1N1)pdm09 viruses.
Diversifying influenza virus glycan binding profile facilitates the acquisition of high-yield properties. It was hypothesized that A(H1N1)pdm09 acquired high-yield properties by binding to additional sialylated glycan receptors, particularly SA2-3Gal, or by increasing their glycan binding avidity to SA2-6Gal. To test this, BLI was conducted for 6 H1N1 variants, including low-yield MI/15 and high-yield WI/19, as well as 4 high-yield vaccine candidates predicted by MAIVeSS. Results showed that 3 vaccine candidates, rgSP/16, rgCQ/17, and rgMAS/20, bound to both 3′SLN and 6′SLN, whereas MI/15, WI/19, and one vaccine candidate, rgBRU/19, bound only to 6′SLN. Furthermore, it was found that rgBRU/19 had a higher binding avidity to 6′SLN than MI/15 (
Among WI/19 and the 4 high-yield vaccine candidates, only about half of the residues linked to yield traits were conserved (
The effect of N159K, K166Q, and S206T on glycan binding affinity was investigated by conducting structural modeling based on the crystal structure of CA/04 HA complexed with 6′SLN and 3′SLN (
In summary, diversity at the HA RBS of A(H1N1)pdm09 can enhance virus yields in both cell and egg substrates by increasing sialylated glycan binding avidity or diversifying virus binding to different sialylated glycan receptors.
In this Example, MAIVeSS, a machine learning-based framework that can accurately predict both antigenicity and yield phenotypes from HA protein sequences, was developed. The training dataset consisted of a library of 189 mutant viruses generated by epPCR-based reverse genetics targeting residues 126-244 (H3 numbering). It was observed that acquisition of N159K, a key marker for antigenic drift according to the model described herein, led to changes in antigenicity from CA/09 to WI/19, consistent with published reports, and facilitated acquisition of the high-yield trait in a significant proportion of A(H1N1)pdm09 epidemic strains during recent influenza seasons. While the model described here focuses on HA, it is important to note that antigenic drift of neuraminidase (NA) has also been well-documented in H1N1 and H3N2 influenza viruses.
As such, ongoing efforts aim to expand MAIVeSS prediction capacity to include both HA and NA proteins. The glycan profiling analysis conducted on 43 high-yield mutants suggested that diversifying glycan binding profiles could enhance virus replication in both eggs and cells. Specifically, increased binding avidity to SA2-6Gal results in higher virus yield in cells, while broadening glycan binding capabilities to SA2-3Gal or sLeX improves virus yield in eggs. Studies indicate that a small subset of A(H1N1)pdm09 epidemic viruses naturally prefer both SA2-6Gal and SA2-3Gal, allowing them to replicate efficiently in both cells and eggs without adaptation. On the other hand, some high-yield strains (e.g., WI/19) were found to have no significant changes in binding preference to either 3′SLN or 6′SLN (
Both SA2-6Gal and SA2-3Gal receptors are expressed in MDCK cells and embryonated chicken eggs. However, SA2-3Gal receptors are predominantly expressed in eggs, while MDCK cells contain similar amounts of SA2-6Gal and SA2-3Gal. In addition to SA2-3Gal and SA2-6Gal, neutral glycans such as high-mannose glycans and glycans terminated with Gal and GalNAc are also commonly found in eggs. Mass spectrometry analyses showed that some glycans in eggs are fucosylated. CA/04, the prototype A(H1N1)pdm09 virus, which showed poor replication in both MDCK cells and eggs, had a strong binding preference for SA2-6Gal and did not bind to SA2-3Gal. In humans, there is no selection pressure for either cell-based or egg-based replication efficiency. Thus, the findings suggested that ad hoc substitutions at the HA RBS across A(H1N1)pdm09 strains likely enabled a subset of these variants to expand their binding preference from SA2-6Gal to both SA2-6Gal and SA2-3Gal, resulting in the acquisition of a high-yield trait. This Example demonstrates that it is possible to select naturally circulating strains as vaccine candidates without the need for further engineering.
In summary, the data from the proof-of-concept experiments in this Example confirmed that MAIVeSS enables rapid selection of antigenically matching and high-yield influenza strains directly from clinical isolates as potential seed viruses to accelerate vaccine production and facilitate timely supply of seasonal vaccines.
To confirm the binding avidities observed in the glycan microarray, biolayer interferometry was used to analyze the binding of the HYboth mutant D131E-S193T-A198S to five representative glycan analogs: Neu5Acα2-3Galβ1-4GlcNAcβ (3′SLN), Neu5Acα2-6Galβ1-4GlcNAcβ (6′SLN), Neu5Acα2-3Galβ1-4[Fucα1-3]GlcNAcβ (sLeX), Neu5Gcα2-3Galβ1-4GlcNAcβ (3′SLN(Gc)), and Neu5Gcα2-3Galβ1-4[Fucα1-3]GlcNAcβ (sLeX(Gc)). The HYboth mutant had broadened binding avidities from 6′SLN to 3′SLN and sLeX, whereas WT CA/04 did not bind to 3′SLN and sLeX. The mutant had a 1.61-fold lower binding avidity to 6′SLN than to 3′SLN, and it did not bind to 3′SLN(Gc) or sLeX(Gc), similar to WT CA/04. These results were consistent with those obtained from the glycan microarray.
Viral RNA was isolated from 200 μl of the sample using a 5X MagMAX™ Pathogen RNA/DNA kit (Thermo Fisher Scientific, Pittsburgh, PA) according to the manufacturer's instructions, and a total of 80 μl RNA was obtained. cDNA synthesis was carried out using SuperScript III Reverse Transcriptase (Invitrogen, Grand Island, NY) with 10 μl of the isolated RNA and the influenza virus-specific primer Uni12 (5′-AGCAAAAGCAGG-3′; SEQ ID NO:6), with a total reaction volume of 25 μl. The HA segment of mutants was amplified using the Phusion High-Fidelity PCR Kit (Thermo Fisher Scientific, Pittsburgh, PA) and the primers CA/04_HA_F (5′-ATGAAGGCAATACTAGTAGTTCTGC-3′; SEQ ID NO:7) and CA/04_HA_R (5′-TTAAATACATATTCTACACTGTAGAGACC-3′; SEQ ID NO:8). The PCR products (50 μl) were purified using the GeneJET PCR Purification kit (Thermo Fisher Scientific, Pittsburgh, PA) as per the manufacturer's instructions. The HA sequences of the mutants were confirmed by Sanger sequencing at Eurofins (Louisville, KY) or the University of Missouri DNA core.
Ferret antisera were produced in male or female ferrets aged 6 to 8 weeks, which were confirmed to be seronegative for CA/04, A/Switzerland/9715293/2013 (H3N2), and A/Hong Kong/4801/2014(H3N2). Each ferret was intranasally inoculated with 10⁶ TCID50 of either the wild-type virus or a mutant virus to be tested. Ferret sera were collected 21 days after inoculation and used for antigenic phenotyping through serological assays.
To ensure the robustness of the analyses, the performance of the sparse learning model was compared with three other commonly used sparse models: LASSO, RIDGE, and SGL. Additionally, the model was compared with two other sparse learning methods, the L1- and L2-norm regularized method and the L1- and L∞-norm Composite Absolute Penalties method (iCAP). LASSO uses L1-norm regularization, RIDGE uses L2-norm regularization, SGL uses group Lasso regularization, the L1- and L2-norm regularized method combines L1-norm and L2-norm regularization, and iCAP combines L1-norm and L∞-norm regularization. The performance of these models was evaluated based on various metrics, such as accuracy, root mean square error (RMSE), and predictive power. Briefly, the LASSO regression seeks to minimize $\|y - Xw\|_2^2 + \lambda_1 \|w\|_1$; the RIDGE regression seeks to minimize $\|y - Xw\|_2^2 + \lambda_1 \|w\|_2^2$; the SGL regression seeks to minimize $\|y - Xw\|_2^2 + \lambda_1 \sum_{l} \|w_l\|_2 + \lambda_2 \|w\|_1$, where $w_l$ denotes the weights of the $l$-th feature group; and the L1- and L2-norm regularized method seeks to minimize: ∥y−XW∥2+λ1 Σ∥[∥WG
The PIMA (Protein-Protein Interactions in Macromolecular Analysis) method was utilized to incorporate the biochemical properties of amino acids. PIMA assigns the 20 amino acids into nine groups and assigns a different numerical code for different mutations. Substitutions between different pairs of residues are given an inclusive weight between 0 and 5. The weights assigned to each feature in the learning results indicate the significance of the feature, with greater weight indicating higher significance.
To investigate the impact of amino acid substitutions on growth phenotype, a three-group method was utilized for assigning amino acids to different groups based on their biophysical properties. Specifically, each amino acid was classified into one of three groups: nonpolar (including V, L, I, M, C, F, W, and Y), small nonpolar (including G, A, and P), and polar/charged (including S, T, N, Q, H, D, E, K, and R). Using this classification, if a mutation occurred between two groups at a given residue j (e.g., nonpolar to small nonpolar), the j-th element of the feature vector xi was set to 1; otherwise, it was set to 0. This approach allows one to evaluate the directionality of amino acid substitutions on the growth phenotype.
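The three-group encoding can be sketched as follows. This is a minimal illustration, assuming the reference and query HA sequences are already aligned, of equal length, and free of gaps or ambiguous residues; the names `GROUPS`, `AA_TO_GROUP`, and `group_change_features` are hypothetical.

```python
# Three biophysical groups as listed in the text (20 amino acids in total).
GROUPS = {
    "nonpolar": "VLIMCFWY",
    "small_nonpolar": "GAP",
    "polar_charged": "STNQHDEKR",
}

# Reverse lookup: one-letter amino acid code -> group name.
AA_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def group_change_features(ref_seq, query_seq):
    """Return the binary feature vector x_i for one query sequence.

    Element j is 1 when residue j changed to an amino acid in a different
    group relative to the reference, and 0 otherwise (including
    substitutions that stay within the same group).
    """
    if len(ref_seq) != len(query_seq):
        raise ValueError("sequences must be aligned to the same length")
    return [
        int(AA_TO_GROUP[r] != AA_TO_GROUP[q])
        for r, q in zip(ref_seq, query_seq)
    ]
```

For instance, a G-to-S change crosses from the small nonpolar group to the polar/charged group and is encoded as 1, whereas a V-to-I change stays within the nonpolar group and is encoded as 0.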
The regularization parameters of the sparse learning model were optimized using root mean square error (RMSE). The choice of regularization methods (LASSO, RIDGE, SGL, L1- and L2-norm, or iCAP) and the scoring method were also based on RMSE, determined through 10-fold cross-validation. In each fold, 90% of the data were used for training and 10% for testing, and the model's performance was evaluated based on the RMSE, with smaller values indicating better performance. RMSE was defined as: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$, where $y_i$ and $\hat{y}_i$ are the observed and predicted values for the $i$-th test sample and $n$ is the number of test samples.
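In its conventional form, RMSE is the square root of the mean squared difference between observed and predicted values. A minimal sketch of selecting a regularization parameter by 10-fold cross-validated RMSE is given below. It is illustrative only: a closed-form ridge regression is swapped in for the sparse learning model, and the helper names `rmse`, `fit_ridge`, `cv_rmse`, and `select_lambda` are assumptions rather than part of MAIVeSS.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_ridge(X, y, lam):
    """Closed-form ridge weights; a stand-in for the sparse model."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def cv_rmse(X, y, lam, k=10, seed=0):
    """Mean RMSE over k folds (about 90% train, 10% test when k=10)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_ridge(X[train], y[train], lam)
        scores.append(rmse(y[test], X[test] @ w))
    return float(np.mean(scores))


def select_lambda(X, y, grid):
    """Pick the candidate regularization value with the smallest CV RMSE."""
    return min(grid, key=lambda lam: cv_rmse(X, y, lam))
```

The same loop could compare different regularization methods by replacing `fit_ridge` with another fitting routine and keeping the fold assignment fixed through `seed`.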
To evaluate the performance of the method, a comparison based on accuracy was performed. Specifically, a threshold of 4-fold (2 units of antigenic distance) was considered to determine if two viruses were antigenically distinct and exhibited antigenic drift. Using this threshold, classification tasks were defined to measure the prediction accuracy. The accuracy metric describes the proportion of correctly predicted results among the total number of samples.
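The accuracy calculation can be sketched as below. The sketch assumes that one unit of antigenic distance corresponds to a 2-fold difference in HAI titer (so that 4-fold equals 2 units, as stated above) and that a pair at exactly the threshold counts as antigenically distinct; the function names are hypothetical.

```python
import math


def antigenic_units(homologous_titer, heterologous_titer):
    """Antigenic distance in log2 units of HAI titer fold difference.

    Assumption: distance = log2(homologous / heterologous), so a 4-fold
    reduction in titer gives 2 units.
    """
    return math.log2(homologous_titer / heterologous_titer)


def is_distinct(distance_units, threshold=2.0):
    """True if the pair is called antigenically distinct (>= threshold)."""
    return distance_units >= threshold


def accuracy(measured, predicted, threshold=2.0):
    """Proportion of virus pairs whose predicted class matches the measured one."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(
        is_distinct(m, threshold) == is_distinct(p, threshold)
        for m, p in zip(measured, predicted)
    )
    return correct / len(measured)
```

In this formulation a prediction of 2.5 units against a measurement of 2.0 units counts as correct, since both fall on the same side of the threshold even though the numerical distances differ.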
To evaluate the reliability of the selected features by MTL-GGSL, 100 independent experiments were conducted with 80% of the training data in each experiment. Only features with a bootstrap value of at least 80 across multiple tasks were retained, resulting in a set of unique features that were chosen as the final features learned by MTL-GGSL.
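The feature-stability procedure can be sketched as follows. This is a simplified, single-task illustration: MTL-GGSL itself is not reproduced, and a hypothetical stand-in selector (`top_k_by_correlation`) that keeps the k features most strongly associated with the response is used in its place. Sampling 80% of the rows without replacement in each run is also an assumption.

```python
import numpy as np


def top_k_by_correlation(X, y, k):
    """Stand-in selector: indices of the k features with largest |covariance| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc)
    return set(np.argsort(score)[-k:].tolist())


def stable_features(X, y, k, n_runs=100, frac=0.8, min_count=80, seed=0):
    """Features selected in at least `min_count` of `n_runs` subsampled runs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = int(round(frac * n))
    counts = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_runs):
        # Draw 80% of the training rows for this run (no replacement).
        rows = rng.choice(n, size=m, replace=False)
        for j in top_k_by_correlation(X[rows], y[rows], k):
            counts[j] += 1
    return [j for j in range(X.shape[1]) if counts[j] >= min_count]
```

With 100 runs, the count for each feature plays the role of the bootstrap value, so the default `min_count=80` mirrors the retention threshold described above.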
Over the past few years, several computational models have been developed to identify influenza antigenic variants using genomic sequences. These models include sparse learning, bivariate correlation analysis, Bayesian models, naïve Bayes classifiers, random forests, regression models, decision tree algorithms, and convolutional neural network models. Among these models, sparse learning has proven to be efficient and generalizable in identifying the association between residues and antigenicity of multiple subtypes of IAVs, including H1N1, H3N2, and H5N1. Additionally, generalized hierarchical sparse models have been used to identify the synergistic effects of multiple amino acid substitutions on antigenic changes. To overcome the challenges in data integration, multi-task machine learning was developed, which assigns datasets to individual tasks and considers the relationship between different tasks. In another study, group Least Absolute Shrinkage and Selection Operator (LASSO) was developed to accommodate multiple types of features and explore the relationships between different feature groups. Although these models have proven effective in identifying antigenic variants, none of them have considered virus yield. Therefore, they cannot be used to directly identify antigenically matched, high-yield viruses that can be produced readily based on genetic sequences.
In view of the above, it will be seen that the several objects and advantages of the present disclosure have been achieved and other advantageous results have been obtained.
As various changes could be made in the above constructions without departing from the scope of the invention, it is intended that all matter contained in the above description or shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
The present application claims priority benefits from U.S. provisional patent application Ser. No. 63/578,043 filed Aug. 22, 2023, which is hereby incorporated by reference in its entirety.
This invention was made with government support under A1147640, A1114433, and A1116744 awarded by the National Institutes of Health. The government has certain rights in the invention.
Number | Date | Country
---|---|---
63578043 | Aug 2023 | US