The disclosure is generally directed to a method of assessing viral strains such as influenza, and in particular, predicting emergent strains and identifying possible pandemic strains.
This application contains a Sequence Listing which has been submitted electronically in XML format. The Sequence Listing XML is incorporated herein by reference. Said XML file, created on Jan. 13, 2025, is named UCT-01601_SL.xml and is 18,269 bytes in size.
Animal influenza viruses emerging into humans have triggered devastating pandemics in the past. Yet the ability to evaluate the pandemic potential of individual strains that do not yet circulate in humans remains limited.
Accordingly, there is a need for systems and methods to computationally learn how new variants emerge, shaped by evolutionary constraints using only observed genomic sequences of key viral proteins.
According to certain aspects of the present disclosure, systems and methods are disclosed for assessing viral strains, including predicting emergent strains and identifying possible pandemic strains.
In one embodiment, a method comprises reading a genetic sequence of a first strain of a virus; identifying a plurality of residue indices in the genetic sequence; for each of the plurality of indices, assigning a predictor, the predictor configured to predict a residue for its assigned index based upon a residue of at least one other index, the predictors thereby forming a network; determining, based on the network of predictors, a probability of transition of the first strain to a second strain.
In some embodiments, the method further comprises determining, based on the probability of transition, that the second strain is above a dominant strain threshold; and providing a forecast of the dominant strain for a subsequent season.
In some embodiments, the dominant strain is a strain with a maximized probability of simultaneously arising from a set of currently circulating strains.
In some embodiments, the first strain is an animal strain of the virus and the second strain is a human strain of the virus.
In some embodiments, the method further comprises determining a pandemic potential of the first strain based upon the probability of transition of the first strain to a second strain.
In some embodiments, the method further comprises determining a divergence of each residue index of the plurality of residue indices; averaging the divergence over the plurality of residue indices in the genetic sequence; determining, based on the average of the divergence, a distance metric corresponding to the genetic sequence.
In some embodiments, the method further comprises based on the probability of transition, determining a risk score of the second strain, wherein the risk score indicates a risk of transmission to a human.
In some embodiments, the risk score is an average log likelihood of the probability of transition.
In some embodiments, the method further comprises reading a library of strains of the virus; determining a distance value between the second strain and each of the library of strains; and determining, from the distance values, a high-risk subset of the library.
In some embodiments, the virus is an influenza.
In some embodiments, the sequence is a genomic sequence.
In some embodiments, each of the residue indices corresponds to one or more surface proteins.
In some embodiments, each predictor comprises a conditional inference tree, a regression tree, and/or a decision tree.
In some embodiments, determining the probability of transition comprises: inferring a variation of a mutational probability of the sequence and potential residue replacements between the plurality of residue indices; and determining a probability of a spontaneous transition based on the inferred variation and the potential residue replacement.
In some embodiments, the probability of transition is a conditional distribution.
In some embodiments, the second strain comprises a human strain of a same subtype of the first strain, wherein the first strain is associated with a first season and the second strain is associated with a second season subsequent to the first season.
In some embodiments, the method further comprises searching a plurality of seasons prior to the first season for one or more historical strains of the same subtype.
In some embodiments, the method further comprises recomputing the probability of transition for each season.
In an alternative embodiment, a system comprises a datastore; a computing node comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of the computing node to cause the processor to perform a method comprising: reading a genetic sequence of a first strain of a virus from the datastore; identifying a plurality of residue indices in the genetic sequence; for each of the plurality of indices, assigning a predictor, the predictor configured to predict a residue for its assigned index based upon a residue of at least one other index, the predictors thereby forming a network; determining, based on the network of predictors, a probability of transition of the first strain to a second strain.
In an alternative embodiment, a computer program product for determining a probability of transition of a virus comprises a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising: reading a genetic sequence of a first strain of a virus; identifying a plurality of residue indices in the genetic sequence; for each of the plurality of indices, assigning a predictor, the predictor configured to predict a residue for its assigned index based upon a residue of at least one other index, the predictors thereby forming a network; and determining, based on the network of predictors, a probability of transition of the first strain to a second strain.
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate various exemplary embodiments and together with the description, serve to explain the principles of the disclosed embodiments.
Reference will now be made in detail to the exemplary embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
The systems, devices, and methods disclosed herein are described in detail by way of examples and with reference to the figures. The examples discussed herein are examples only and are provided to assist in the explanation of the apparatuses, devices, systems, and methods described herein. None of the features or components shown in the drawings or discussed below should be taken as mandatory for any specific implementation of any of these devices, systems, or methods unless specifically designated as mandatory.
Also, for any methods described, regardless of whether the method is described in conjunction with a flow diagram, it should be understood that unless otherwise specified or required by context, any explicit or implicit ordering of steps performed in the execution of a method does not imply that those steps must be performed in the order presented but instead may be performed in a different order or in parallel.
As used herein, the term “exemplary” is used in the sense of “example,” rather than “ideal.” Moreover, the terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of one or more of the referenced items.
Influenza viruses constantly evolve, sufficiently altering surface protein structures to evade the prevailing host immunity, and cause the recurring seasonal epidemic. These periodic infection peaks claim a quarter to half a million lives globally, and the current response hinges on annually inoculating the human population with a reformulated vaccine. Among numerous factors that hinder optimal design of the flu shot, failing to correctly predict the future dominant strain dramatically reduces vaccine effectiveness. Despite recent advances, such predictions remain imperfect. In addition to the seasonal epidemic, influenza strains spilling over into humans from animal reservoirs have triggered pandemics at least four times (1918 Spanish flu/H1N1, 1957 Asian flu/H2N2, 1968 Hong Kong flu/H3N2, 2009 swine flu/H1N1) in the past 100 years. With the memory of the sudden SARS-CoV-2 emergence fresh in minds, a looming question is whether such events can be preempted and mitigated in the future. Influenza A, partly on account of its segmented genome and its wide prevalence in common animal hosts, can easily incorporate genes from multiple strains and (re)emerge as novel human pathogens, thus harboring a high pandemic potential.
One possible approach to mitigating such risk is to identify animal strains that do not yet circulate in humans, but are likely to spill-over and quickly achieve human-to-human (HH) transmission capability. While global surveillance efforts collect wild specimens from diverse hosts and geo-locations annually, the ability to objectively, reliably and scalably risk-rank individual strains remains limited, despite some recent progress.
The Centers for Disease Control and Prevention's (CDC) current solution to this problem is the Influenza Risk Assessment Tool (IRAT). Subject matter experts (SMEs) score strains based on the number of human infections, infection and transmission in laboratory animals, receptor binding characteristics, population immunity, genomic analysis, antigenic relatedness, global prevalence, pathogenesis, and treatment options, which are averaged to obtain two scores (between 1 and 10) that estimate 1) the emergence risk and 2) the potential public health impact on sustained transmission. IRAT scores are potentially subjective, and depend on multiple experimental assays, possibly taking weeks to compile for a single strain. This results in a scalability bottleneck, particularly with thousands of strains being sequenced annually.
Here, a pattern recognition algorithm that automatically parses out emergent evolutionary constraints operating on Influenza A viruses in the wild is introduced, to provide a less heuristic, theory-backed, scalable solution to emergence prediction. This approach is centered around numerically estimating the probability Pr(x→y) of a strain x spontaneously giving rise to y. It is shown that this capability is key to preempting strains which are expected to be in future circulation, and in particular to 1) reliably forecasting dominant strains of seasonal epidemics, and 2) approximating IRAT scores of non-human strains without experimental assays or SME scoring.
In some embodiments, 405,778 Haemagglutinin (HA) and Neuraminidase (NA) sequences from public databases are analyzed. The likelihood of specific future mutations is estimated, as well as the numerical odds of specific descendants arising via natural processes. After validating a model to forecast the dominant strain(s) for seasonal flu, with Emergenet-based forecasts significantly outperforming WHO recommendations consistently over the past two decades for H1N1/H3N2 subtypes, individually in the Northern/Southern hemispheres (average match-improvement 52.83% over two decades, 153.32% over the last decade, and 159.70% over the pre-COVID-19 five-year period for H1N1 HA), the pandemic potential of animal strains not yet known to transmit in humans is assessed. While the state-of-the-art Influenza Risk Assessment Tool (IRAT) from the CDC to assess such risk includes time-consuming experimental assays, embodiments of the present disclosure include calculations taking ~6 sec/strain, strongly correlating with published IRAT scores (correlation: 0.707, p-value: 0.00024). This six-orders-of-magnitude speedup is necessary to exploit current surveillance capacity, and analyze thousands of strains collected annually. Considering 6,066 wild Influenza A animal viruses sequenced post-2020, risky strains of diverse subtypes, hosts and geo-locations are identified, with six having estimated emergence scores >6.5. Such scalable risk-ranking can enable preemptive pandemic mitigation, including the targeted inoculation of animal hosts before the first human infection, and outline new public health measures that are potentially effective notwithstanding possible vaccine hesitancy in humans.
To uncover relevant evolutionary constraints, variations (point substitutions and indels) in the residue sequences of HA and NA, the key proteins implicated in cellular entry and exit respectively, were analyzed. By representing these constraints within a predictive framework, the Emergenet (Enet), the odds of a specific mutation arising in the future were estimated, and consequently the probability of a specific strain spontaneously evolving into another (
Such explicit calculations are difficult without first inferring the variation of mutational probabilities and the potential residue replacements from one positional index to the next along the protein sequence. The many well-known classical DNA substitution models and standard phylogeny inference tools, which assume constant species-wise mutational characteristics, are not applicable here. Similarly, newer algorithms such as FluLeap, which identifies host tropism from sequence data, or methods that estimate species-level risk, do not allow for strain-specific assessment.
The dependencies uncovered are shaped by a functional necessity of conserving/augmenting fitness. Strains must be sufficiently common to be recorded, implying that the sequences from public databases used for training have high replicative fitness. Lacking kinetic proofreading, Influenza A integrates faulty nucleotides at a relatively high rate (10^−3 to 10^−4) during replication. However, few variations are actually viable, leading to emergent dependencies between such mutations. Furthermore, these fitness constraints are not time-invariant. The background strain distribution, and selection pressure from the evolution of cytotoxic T lymphocyte epitopes in humans can change quickly. With a sufficient number of unique samples to train on for each flu season, the Emergenet (recomputed for each time-period) is expected to automatically factor in the evolving host immunity, and the current background environment.
Structurally, an Emergenet comprises an interdependent collection of local predictors, each aiming to predict the residue at a particular index using as features the residues at other indices (
An Emergenet comprises almost as many such position-specific predictors as the length of the sequence. These individual predictors are implemented as conditional inference trees11, in which nodal splits have a minimum pre-specified significance in differentiating the child nodes. Thus, each predictor yields an estimated conditional residue distribution at each index. The set of residues acting as features in each predictor are automatically identified, e.g., in the fragment of the H1N1 HA Emergenet (2020-2021,
In a first application, predicting future dominant strains, the H1N1 and H3N2 HA and NA sequences from Influenza A strains in the public NCBI and GISAID databases recorded between 2000 and 2022 were analyzed (393,189 in total, Supplementary Table 1). Emergenets are separately constructed for H1N1 and H3N2 subtypes, and for each flu season using HA sequences, yielding 84 models in total for predicting seasonal dominance. Using only sequence data is advantageous since deeper antigenic characterization tends to be substantially lower-throughput than genome sequencing. However, deep mutational scanning (DMS) assays have been shown to improve seasonal prediction 6. Despite being limited to only genotypic information (and subtypes), the approach distills emergent fitness-preserving constraints that outperform reported DMS-augmented strategies.
Inference of the Emergenet predictors is the first step, which then induces an intrinsic distance metric between strains. The E-distance (i.e., Emergenet distance) (Eq. (6) in Methods) is defined as the square-root of the Jensen-Shannon (JS) divergence of the conditional residue distributions, averaged over the sequence. Unlike the classical approach of measuring the number of edits between sequences, the E-distance is informed by the Emergenet-inferred dependencies, and adapts to the specific subtype, allele frequencies, and environmental variations. Central to the approach is the theoretical result (Theorem 1 in Methods) that the E-distance approximates the log-likelihood of spontaneous change, i.e., log Pr(x→y). With E-distance, the notion of “loci” is less important, as the distances between entire sequences are computed. This is a more general approach, since the use of loci ignores epistatic effects and cross-talk between non-collocated mutations. However, a SHAP analysis has been added to identify which loci contribute most to the distance calculation, and it emerges that the RBD sites are indeed driving the distance differences between strains in the context of seasonal strain calculations. Note that despite general correlation between E-distance and edit-distance, the E-distance between fixed strains can change if only the background environment changes (Supplementary Table 2, 3). Thus, a new model is learned from current data, and predictions are made only for the near future. In in-silico experiments, it was found that while random mutations to genomic sequences produce rapidly diverging sets, Emergenet-constrained replacements produce sequences that are verifiably meaningful (In-silico Corroboration of Emergenet's Capability To Capture Biologically Meaningful Structure, Methods and
It is true that a priori, the space of jumps from one sequence x to another sequence y is vast, and learning the exact model of jump from one strain to another is infeasible from the available data. However, the set of rules or constraints that best describe the data allows for the dramatic reduction of the set of future possibilities in a probabilistic sense. More precisely, each positional index in the HA primary structure is viewed individually as a target variable, with all other indices used as features to obtain a classification model, resulting in ~566 conditional inference trees, as described above. Using conditional inference trees for the classification models guarantees that only statistically significant splits are allowed in the inferred decision tree. When no such splits are found, no classification model is returned for that index. Finally, this forest of trees collectively comprises one Emergenet model, which induces the E-distance metric proven (Theorem 1 in Methods) to scale with the log-likelihood of the transition or jump probability. Determining the numerical odds of a spontaneous jump Pr(x→y) (
A dominant strain for an upcoming season may be identified as one which maximizes the joint probability of simultaneously arising from each (or most) of the currently circulating strains (
where x*t+δ is a predicted dominant strain at time t+δ, Ht is the set of currently circulating human strains at time t observed over the past year, θ[t] is the E-distance informed by the inferred Emergenet using sequences in Ht, wy is the estimated probability of strain y being generated by the Emergenet, and A is a constant dependent on the sequence length and significance threshold used. The first term gets the solution close to the centroid of the current strain distribution (in the E-distance metric, and not the standard edit distance), and the second term relates to how common the genomic patterns are amongst recent human strains.
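The selection rule described above can be sketched in code. The displayed equation is not reproduced in this text, so the objective below is an assumed form built only from the two terms described: a sum of E-distances from each circulating strain to the candidate, minus |Ht|·A times the log of the candidate's estimated generation probability. The function names, the toy Hamming-fraction distance, and the toy probabilities are all hypothetical stand-ins, not the learned Emergenet quantities.

```python
import math
from typing import Callable, Sequence


def predict_dominant(
    candidates: Sequence[str],
    circulating: Sequence[str],
    e_distance: Callable[[str, str], float],
    membership: Callable[[str], float],
    a_const: float,
) -> str:
    """Return the candidate y minimizing
    sum_x e_distance(x, y) - |H_t| * A * ln(membership(y)).
    This objective is an assumed form consistent with the text."""
    n = len(circulating)

    def objective(y: str) -> float:
        # First term: closeness to the centroid of the circulating strains.
        closeness = sum(e_distance(x, y) for x in circulating)
        # Second term: how common the candidate's patterns are.
        w = max(membership(y), 1e-300)  # guard against log(0)
        return closeness - n * a_const * math.log(w)

    return min(candidates, key=objective)


def toy_distance(x: str, y: str) -> float:
    """Hamming fraction; a placeholder for the learned E-distance."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


# Hypothetical circulating strains and generation probabilities.
circulating = ["AAAA", "AAAT", "AATT"]
toy_weights = {"AAAA": 0.2, "AAAT": 0.5, "AATT": 0.3, "TTTT": 0.01}
best = predict_dominant(
    list(toy_weights), circulating, toy_distance, toy_weights.get, a_const=0.1
)
```

In this toy setting the candidate that is both central and common is selected; with a learned E-distance the centroid is taken in that metric rather than in edit distance.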
Prediction of the future dominant strain as a close match to a historical strain allows out-of-sample validation against past World Health Organization (WHO) recommendations for the flu shot, which is reformulated in February and September of each year for the northern and southern hemispheres, respectively, based on a cocktail of historical strains determined via global surveillance. For each year of the past two decades, a separate Emergenet is constructed for the southern and northern flu seasons using HA strains available from the previous season (up to February for the north, and September for the south). For seasons with >3000 strains available, 3000 strains were randomly sampled to provide an accurate representation of the strain population. As shown in Supplementary Table 8 and
To validate and compare recommendations, a representative sample of strains is needed in the next-year population. For each season, the MeanShift algorithm is again used to cluster the strain population using an edit distance matrix for both HA and NA segments separately. Each cluster Pit is represented by a dominant strain {circumflex over (x)}dom-t, defined as the strain closest to the centroid in the edit distance metric (L(x,y)) (Eq. (2)). The error is computed as the weighted average minimum edit distance between the two recommendations and the dominant strains, where the weight of the dominant strain {circumflex over (x)}dom-t is the proportion of its cluster to the population, i.e., wit=|Pit|/|Ht|.
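The error measure described above can be illustrated with a short sketch. It assumes one reading of the text: for each cluster's dominant strain, take the minimum edit distance to any of the recommended strains, then average these minima with weights equal to each cluster's share of the population. The clustering step itself is omitted, and the sequences and cluster sizes are hypothetical.

```python
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def recommendation_error(
    recommendations: Sequence[str],
    clusters: Sequence[Tuple[str, int]],
) -> float:
    """Weighted average of the minimum edit distance from each cluster's
    dominant strain to the recommendations.

    clusters holds (dominant_strain, cluster_size) pairs; the weight of a
    cluster is its size divided by the total population size."""
    total = sum(size for _, size in clusters)
    return sum(
        (size / total) * min(edit_distance(r, dom) for r in recommendations)
        for dom, size in clusters
    )


# Hypothetical toy data: two clusters and two recommended strains.
toy_clusters = [("MKTII", 30), ("MKAIL", 10)]
toy_recs = ["MKTII", "MKTIL"]
toy_error = recommendation_error(toy_recs, toy_clusters)
```

A lower value means the recommendations sit closer, in edit distance, to the strains that actually dominated the following season.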
The Emergenet-informed forecasts outperform WHO/CDC recommended flu vaccine compositions consistently over the past two decades, for both H1N1 and H3N2 subtypes, individually in the northern and the southern hemispheres (which have distinct recommendations). For H1N1 HA, the Emergenet recommendation outperforms WHO by 52.83% on average over the last two decades, by 153.32% on average in the last decade, and by 159.70% in the period 2015-2019 (5 years pre-COVID-19). The gains for H1N1 NA over the same time periods are 34.48%, 86.46%, and 97.34% respectively. For H3N2 HA, the Emergenet recommendation outperforms WHO by 70.08% on average over the last two decades, by 45.86% on average in the last decade, and by 53.06% in the period 2015-2019. The gains for H3N2 NA over the same time periods are 52.13%, 55.44%, and 55.88%, respectively (Extended Data Table 2). Detailed predictions, along with historical strains closest to the observed dominant one are tabulated in Extended Data Tables 3 through 6. Visually,
The performance of the recommendation was also analyzed against two random strains selected from the last year's population. The two random strains from each season are random “predictions,” choosing from strains circulating in the past one year leading up to vaccine selection, and computing the errors using the same weighted average minimum edit distance method as used for the recommendations. This is repeated 20 times for each season and the upper bound of the 95% confidence interval is taken to account for the large variance in error as the worst case. If E is the set of 20 errors, this is
For H1N1 HA, the Emergenet recommendation outperforms the random recommendation by 34.67% on average over the last two decades and by 68.07% on average in the last decade. The gains for H1N1 NA over the same time periods are 23.23% and 29.84%, respectively. For H3N2 HA, the Emergenet recommendation outperforms the random recommendation by 66.82% on average over the last two decades and by 67.95% on average in the last decade. The gains for H3N2 NA over the same time periods are 36.92% and 34.05%, respectively (Extended Data Table 7).
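The random baseline above summarizes 20 repeated errors by the upper bound of a 95% confidence interval. The expression used for that bound is not reproduced in this text; the sketch below assumes a normal approximation (mean plus 1.96 standard errors), which is one common choice and not necessarily the one used. The function name and the error values are hypothetical.

```python
import math
import statistics
from typing import Sequence


def ci95_upper(errors: Sequence[float]) -> float:
    """Upper bound of a 95% confidence interval for the mean error,
    assuming a normal approximation (an assumption, see the lead-in)."""
    n = len(errors)
    mean = statistics.fmean(errors)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(errors) / math.sqrt(n)
    return mean + 1.96 * sem


# Hypothetical errors from 20 random two-strain "recommendations".
toy_errors = [1.0] * 10 + [3.0] * 10
toy_upper = ci95_upper(toy_errors)
```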
Comparing the Emergenet inferred strain (ENT) against the one recommended by the WHO, it is found that the residues that only the Emergenet recommendation matches correctly with dominant strain (DOM), while the WHO recommendation fails, are largely localized within the RBD, with >57% occurring within the RBD on average (
Some embodiments, however, have the ability to estimate the pandemic potential of novel animal strains, via a time-varying E-risk score ρt(x) for a strain x not yet found to circulate in human hosts. It is shown that (Measure of Pandemic Potential, Methods):
scales as the average log-likelihood of Pr(x→y) where y is any human strain of a similar subtype to x, and θ[t] is the E-distance informed by the Emergenet computed from recent human strains Ht at time t of the same subtype as x, observed over the past year. As before, the Emergenet inference makes it possible to estimate ρt(x) explicitly.
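The displayed expression for the E-risk score is not reproduced in this text. One form consistent with the description (an average over the recent human strains Ht of the E-distance from x, with the sign chosen so that the score scales with the average log-likelihood, which decreases as distance grows) is given below as an assumed reconstruction rather than the original equation.

```latex
\rho_t(x) \;\triangleq\; -\frac{1}{|H^t|} \sum_{y \in H^t} \theta^{[t]}(x, y)
```

Under this assumed form, an animal strain that is close in E-distance to many recently circulating human strains of the same subtype receives a higher (less negative) score.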
To validate the score against CDC-estimated IRAT emergence scores, Emergenet models were constructed for HA and NA sequences using subtype-specific human strains, typically collected within the year prior to the assessment date, e.g., the assessment date for A/swine/Shandong/1207/2016 is 06/2020, and human H1N1 strains collected between Jan. 7, 2019 and Jun. 30, 2020 were used for the Emergenet inference. For subtypes with fewer recorded human strains (H1N2, H7N7), all subtype-specific human strains collected up to the assessment date were considered to infer the Emergenet. For subtypes with very few or no recorded human strains even without a lower date bound (H5N2, H5N6, H5N8, H7N8, H9N2, H10N8), the Emergenet was constructed using all human strains that match the HA subtype, e.g., H5Nx for H5N2, H5N6, and H5N8. This addresses the general concern that Emergenet may not be able to assess the threat posed by viruses that have yet to be detected in sufficient numbers; the strains for which this method was used (marked by ** in Supplementary Table 1) fall along the fit line in
Considering IRAT emergence scores of 22 strains published by the CDC, strong out-of-sample support (correlation of 0.707, p-value <0.00024,
To map the Emergenet distances to more recognizable IRAT scores, a general linear model (GLM) was trained from the HA/NA-based E-risk values (Multivariate Regression to Identify Map from E-distance to Estimated IRAT scores, Methods and Supplementary Table 4). Since the CDC-estimated IRAT impact scores are strongly correlated with their IRAT emergence scores (correlation of 0.8015), a separate GLM was also trained to estimate the impact score from the E-risk values (Supplementary Table 5). Finally, the IRAT scores were estimated for all 6,066 Influenza A strains sequenced globally between January 2020 and September 2022. HA and NA Emergenet models were trained for each subtype using recent sequence data, the E-risk was computed using the same method as described above, and the strains posing maximal risk were identified (
Subtypes of the risky strains are overwhelmingly H1N1, followed by H3N2, with a small number of H7N9 and H9N2. Five maximally risky strains with emergence score >6.58 are identified to be: A/swine/Missouri/A02524711/2020 (H1N1), A/Camel/Inner Mongolia/XL/2020 (H7N9), A/swine/Indiana/A02524710/2020 (H3N2), A/swine/North Carolina/A02479173/2020 (H1N1), and A/swine/Tennessee/A02524414/2022 (H1N1). Additionally, A/mink/China/chick embryo/2020 (H9N2), with a lower estimated emergence score (6.26) is also important, as the most risky H9N2 strain in the analysis. The HA sequences are compared along with two dominant human strains in 2021-2022 season (
Swine are known to be efficient mixing vessels, and hence unsurprisingly host a large fraction of the risky strains (>80% over 6.0, to over 50% over 6.5). Also, as expected, most of these swine strains are of H1N1 subtype, with the other subtypes having emerged into humans more recently. The finding that an H7N9 strain poses substantial risk is likewise not surprising: HH transmission has been suspected in Asian-lineage H7N9 strains, which are rated by IRAT as having the greatest potential to cause a pandemic. The finding of the most risky H9N2 strain in a mink is also unsurprising, in light of these hosts having been recently suggested as efficient mixing vessels to breed human-compatible strains. Thus, qualitatively the results are well aligned with current expectations; nevertheless, the ability to quantitatively rank specific strains which pose maximal risk is a crucial new capability enabling proactive pandemic mitigation efforts.
While numerous tools exist for ad hoc quantification of genomic similarity, higher similarity between strains in these frameworks is not sufficient to imply a high likelihood of a jump. The Emergenet algorithm is the first of its kind to learn an appropriate biologically meaningful comparison metric from data, without assuming any model of DNA or amino acid substitution, or a genealogical tree a priori. While the effect of the environment and selection cannot be inferred from a single sequence, an entire database of observed strains, processed through the right lens, can parse out useful predictive models of these complex interactions. These results are aligned with recent studies demonstrating effective predictability of future mutations for different organisms.
The E-distance calculation is currently limited to analogous sequences (such as point variations of the same protein from different viral subtypes), and the Emergenet inference requires a sufficient diversity of observed strains. A multi-variate regression analysis indicates that the most important factor for the approach to succeed is the diversity of the sequence dataset (General linear model for evaluating effect of data diversity on Emergenet performance, Methods and Supplementary Table 6), which would exclude applicability to completely novel pathogens with no related human variants, and ones that evolve very slowly. Nevertheless, the tools reported here can improve effectiveness of the annual flu shot, and perhaps allow for the development of preemptive vaccines to target risky animal strains before the first human infection in the next pandemic. Apart from outlining new precision public health measures to avert pandemics, such strategies might also help to non-controversially counter the impact of vaccine hesitancy which has interfered with optimal pandemic response in recent times.
It is not assumed that the mutational variations at the individual indices of a genomic sequence are independent (See
Consider a set of random variables X={Xi} with i∈{1, . . . , N}, each taking value from the respective sets Σi. Here, each Xi is the random variable modeling the “outcome,” i.e., the amino acid (AA) residue at the ith index of the protein sequence. A sample x∈Π_{i=1}^{N} Σi is an ordered N-tuple, which is a specific strain in this context, consisting of a realization of each of the variables Xi, with the ith entry xi being the realization of random variable Xi.
The notation x−i and xi,σ is used to denote:
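The displayed definitions are not reproduced in this text. Under the usual convention for such notation, they would read as follows; this is an assumed reconstruction, and in particular the placement of the indices i,σ as a superscript is a guess.

```latex
x_{-i} \;\triangleq\; (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N),
\qquad
x^{i,\sigma} \;\triangleq\; (x_1, \ldots, x_{i-1}, \sigma, x_{i+1}, \ldots, x_N),
\quad \sigma \in \Sigma_i
```

That is, x−i is the sequence x with its ith entry removed, and xi,σ is the sequence x with its ith entry replaced by the residue σ.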
Also, 𝒟(S) denotes the set of probability measures on a set S, e.g., 𝒟(Σi) is the set of distributions on Σi. It is noted that X defines a random field over the index set {1, . . . , N}.
Definition 1 (Emergenet). For a random field X={Xi} indexed by i∈{1, . . . , N}, the Emergenet is defined to be the set of predictors ϕ={ϕi}, i.e., the following applies:
where for a sequence x, ϕi(x−i) estimates the distribution of Xi on the set Σi.
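The displayed mapping in Definition 1 is not reproduced in this text. A form consistent with the surrounding description, in which each predictor takes the residues at all other indices and returns a distribution over the residues at its own index, is given below as an assumed reconstruction; the symbol for the set of distributions on Σi is likewise an assumption.

```latex
\phi_i \;:\; \prod_{j \neq i} \Sigma_j \;\longrightarrow\; \mathcal{D}(\Sigma_i),
\qquad i \in \{1, \ldots, N\}
```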
Conditional inference trees are used as models for predictors, although more general models are possible.
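The structure of Definition 1 can be illustrated with a small sketch. It does not implement conditional inference trees: as a deliberately simplified stand-in, each index gets a depth-one predictor that conditions on the single other index giving the largest entropy reduction, with no significance test, and returns a smoothed residue distribution. All function names, the smoothing constant, and the toy sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (bits) of a Counter of positive counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def fit_predictor(seqs, i, alphabet, alpha=1.0):
    """Depth-one stand-in for the predictor at index i (not a real
    conditional inference tree; see the lead-in)."""
    n = len(seqs[0])
    marginal = Counter(s[i] for s in seqs)
    best_j, best_h = None, entropy(marginal)
    for j in range(n):
        if j == i:
            continue
        groups = defaultdict(Counter)
        for s in seqs:
            groups[s[j]][s[i]] += 1
        # Conditional entropy of index i given index j.
        h = sum(sum(g.values()) / len(seqs) * entropy(g) for g in groups.values())
        if h < best_h - 1e-12:
            best_j, best_h = j, h
    tables = defaultdict(Counter)
    for s in seqs:
        key = s[best_j] if best_j is not None else None
        tables[key][s[i]] += 1

    def predict(x):
        """Estimated distribution of the residue at index i given x."""
        key = x[best_j] if best_j is not None else None
        counts = tables.get(key, marginal)  # unseen context: use marginal
        total = sum(counts.values()) + alpha * len(alphabet)
        return {a: (counts.get(a, 0) + alpha) / total for a in alphabet}

    return predict


def fit_emergenet(seqs, alphabet):
    """One predictor per index; together they play the role of {phi_i}."""
    return [fit_predictor(seqs, i, alphabet) for i in range(len(seqs[0]))]


# Hypothetical toy alignment in which index 0 and index 1 co-vary.
toy_net = fit_emergenet(["AC", "AC", "GT", "GT"], "ACGT")
toy_dist = toy_net[0]("AC")
```

In the toy alignment, the predictor for index 0 learns that a C at index 1 makes A the most likely residue, which is the kind of cross-index dependency the network of predictors is meant to capture.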
The mathematical form of the metric is not arbitrary; JS divergence is a symmetricised version of the more common KL divergence between distributions, and among different possibilities, the E-distance is the simplest metric such that the likelihood of a spontaneous jump (See Eq. (9) in Methods) is provably bounded above and below by simple exponential functions of the E-distance.
Definition 2 (E-distance: adaptive biologically meaningful dissimilarity between sequences). Given two sequences x, y∈Π_{i=1}^{N} Σi, such that x, y are drawn from the populations P, Q inducing the Emergenets ϕP, ϕQ, respectively, a pseudo-metric θ(x,y) is defined as follows:
where J(⋅,⋅) is the Jensen-Shannon divergence and Ei indicates expectation over the indices.
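The displayed formula in Definition 2 is not reproduced in this text. A form consistent with the description elsewhere in this disclosure (the square root of the Jensen-Shannon divergence between the conditional residue distributions, averaged over the indices) is given below as an assumed reconstruction, writing the Jensen-Shannon divergence as a blackboard J.

```latex
\theta(x, y) \;\triangleq\; \mathbf{E}_i \left[ \mathbb{J}^{\frac{1}{2}}\!\left( \phi_i^{P}(x_{-i}),\; \phi_i^{Q}(y_{-i}) \right) \right]
```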
The square root in the definition arises naturally from the bounds that can be proven, and is dictated by the form of Pinsker's inequality, ensuring that the sum of the lengths of successive path fragments equals the length of the path.
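A minimal numerical sketch of the E-distance follows. It assumes the per-index conditional distributions for each sequence are already available as probability vectors over a shared residue ordering, and it takes the mean over indices of the square root of the Jensen-Shannon divergence. The base-2 logarithm and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetrized KL against the mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def e_distance(
    dists_x: Sequence[Sequence[float]],
    dists_y: Sequence[Sequence[float]],
) -> float:
    """Mean over indices of sqrt(JS divergence) between the per-index
    distributions estimated for x and for y."""
    roots = [
        math.sqrt(max(js_divergence(p, q), 0.0))  # clamp tiny negatives
        for p, q in zip(dists_x, dists_y)
    ]
    return sum(roots) / len(roots)


# Hypothetical two-index example: index 0 fully disagrees, index 1 agrees.
toy_theta = e_distance([[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]])
```

With base-2 logarithms each per-index term lies in [0, 1], so the averaged distance does as well.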
For modeling to be reliable, a quantitative test of how well the Emergenet represents the data is needed. Here, an explicit membership test is formulated to ascertain if individual samples may indeed be generated by the Emergenet with sufficiently high probability.
Definition 3 (Membership probability of a sequence). Given a population P inducing the Emergenet ϕP and a sequence x, one can compute the membership probability of x:
where xj is the jth entry in x, and is thus an element in the set Σj. Since a main concern is the case where Σj is a finite set, ϕjP(x−j)|xj is the entry in the probability mass function corresponding to the element of Σj which appears at the jth entry in sequence x.
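As a non-limiting sketch, the membership probability can be computed as shown below. The displayed equation of Definition 3 is not reproduced in this text; the example assumes that it is the product, over all indices j, of the probability that the jth predictor assigns to the residue actually observed at index j. The function name and the dictionary representation of each predicted distribution are hypothetical.

```python
import math


def membership_probability(sequence, predicted_dists):
    """Product over indices of the probability assigned to the observed residue.

    predicted_dists[j] is a dict (residue -> probability) standing for the
    distribution predicted at index j given the rest of the sequence.
    """
    if len(sequence) != len(predicted_dists):
        raise ValueError("one predicted distribution is needed per index")
    log_prob = 0.0
    for residue, dist in zip(sequence, predicted_dists):
        p = dist.get(residue, 0.0)
        if p == 0.0:
            return 0.0  # an impossible residue makes the whole product zero
        log_prob += math.log(p)  # sum logs to limit underflow on long sequences
    return math.exp(log_prob)
```

For example, a two-residue sequence "AG" with predicted probabilities 0.9 for A at the first index and 0.5 for G at the second index has membership probability 0.45 under this assumption.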
This calculation can be carried out for a sequence x known to be in the population P as well, which allows us to define the membership degree ωxP.
Definition 4 (Membership degree). Let X be a random field representing a population P, i.e., X=x is a randomly drawn sequence from P. Then the membership degree ωP is a function of the random variable X:
Note that ωP takes values in the unit interval [0,1], and the probability that x is a member of the population P is ωP(X=x), denoted as ωxP, or ωx if P is clear from context.
Since ωP(X) is a random variable, it is now possible to identify sets of sequences that better represent the population P, as well as sequences that lie on its fringe. Whether a particular sequence is unlikely to be from the population P can then be evaluated at a pre-specified significance level.
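One possible reading of this membership test, offered only as a sketch, is an empirical quantile test: the membership degrees of sequences known to be in P are sorted, and a new sequence is flagged as a non-member when its membership degree falls below the value at the chosen significance level. The quantile rule, the function names, and the default level of 0.05 are assumptions made for this example and are not taken from the disclosure.

```python
def fringe_threshold(population_omegas, alpha):
    """Empirical alpha-quantile of the population's membership degrees."""
    if not population_omegas:
        raise ValueError("at least one membership degree is required")
    ranked = sorted(population_omegas)
    k = min(int(alpha * len(ranked)), len(ranked) - 1)
    return ranked[k]


def is_plausible_member(omega_x, population_omegas, alpha=0.05):
    """True if omega_x is not in the lowest alpha fraction of the population."""
    return omega_x >= fringe_threshold(population_omegas, alpha)
```

Under this reading, sequences whose membership degree lies near the bottom of the empirical distribution are treated as being on the fringe of the population.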
The Emergenet framework allows for the rigorous computation of bounds on the probability of a spontaneous change of one strain to another, brought about by chance mutations. While any sequence of mutations is equally likely, the “fitness” of the resultant strain, or the probability that it will even result in a viable strain, is not. Thus, the necessity of preserving function dictates that not all random changes are viable, and the probability of observing some trajectories through the sequence space is far greater than that of others. The Emergenet framework allows for the exploration of these constrained dynamics, as revealed by a sufficiently large set of genomic sequences.
The mathematical intuition relating E-distance to the log-likelihood of spontaneous change is similar to quantifying the odds of a rare biased outcome when tossing a fair coin. While for an unbiased coin, the odds of roughly 50% heads are overwhelmingly likely, large deviations do happen rarely, and it turns out that the probability of such rare deviations can be explicitly quantified with existing statistical theory. Generalizing to non-uniform conditional distributions inferred by the Emergenet, the likelihood of a spontaneous transition by random chance may also be similarly bounded.
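The coin-tossing intuition can be made concrete with a short numerical sketch. For n tosses of a fair coin, the exact probability of observing at least k heads is compared with the standard Chernoff bound exp(−n·D(k/n ‖ 1/2)), where D is the Kullback-Leibler divergence between Bernoulli distributions. This is a textbook large-deviation result used here for illustration only; it is not the bound of Theorem 1, and the function names are hypothetical.

```python
import math


def binomial_upper_tail(n, k):
    """Exact probability of at least k heads in n tosses of a fair coin."""
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n


def kl_bernoulli(q, p):
    """KL divergence D(q || p) between Bernoulli(q) and Bernoulli(p), natural log."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def chernoff_upper_bound(n, k):
    """exp(-n * D(k/n || 1/2)); bounds the upper tail when k/n >= 1/2."""
    return math.exp(-n * kl_bernoulli(k / n, 0.5))


if __name__ == "__main__":
    n = 100
    for k in (50, 60, 70, 80):
        print(k, binomial_upper_tail(n, k), chernoff_upper_bound(n, k))
```

The tail probability decays exponentially in the divergence between the observed frequency and the fair-coin distribution, which is the same mechanism by which a divergence-based distance can bound the likelihood of a rare transition.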
It is shown in Theorem 1 in the supplementary text (Eq. (9)) that at a significance level α, with a sequence length N, the probability of spontaneous jump of sequence x from population P to sequence y in population Q, Pr(x→y), is bounded by
where ωyQ is the membership probability of strain y in the target population, N is the sequence length, and α is the statistical significance level.
Analyzing the distribution of sequences observed to circulate in the human population at the present time allows for the forecast of dominant strain(s) in the next flu season as follows:
Let x*t+δ be a dominant strain in the upcoming flu season at time t+δ, where Ht is the set of observed strains presently in circulation in the human population (at time t). It is assumed that the Emergenet is constructed using the sequences in the set Ht, and remains unchanged up to t+δ. Since this set is a function of time, the inferred Emergenet also changes with time, and the induced E-distance is denoted as θ[t].
From the RHS bound established in Theorem 1 (See Eq. (9) above) in the supplementary text, the following applies:
where
where N is the sequence length and α is the significance level. Since minimizing the LHS maximizes the lower bound on the probability of the observed strains simultaneously giving rise to xt+δ, a dominant strain x*t+δ may be estimated as a solution to the optimization problem:
In Eq. (13), the assumption is made that Ht is a single cluster of strains. In practice, this may not be the case; it is possible for several distinct clusters to arise in each season. The strains can be clustered by using a clustering method (such as MeanShift) on the E-distance matrix computed between strains in Ht such that there are n disjoint clusters H1t, H2t, . . . Hnt, with ∪i=1nHit=Ht. Thus, multiple predictions can be made per season, by replacing Ht in Eq. (13) with Hit for i=1, 2, . . . n. In some embodiments, the predictions yielded from the largest two clusters are taken.
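The selection step can be sketched as follows, by way of example only. Eq. (13) is not reproduced in this text; the sketch assumes an objective consisting of the summed distance from every circulating strain to a candidate, optionally reduced by a membership term weighted by a coefficient. The membership term, the parameter `a_coef`, the function names, and the use of Hamming distance as a stand-in for the E-distance are all assumptions made for this illustration.

```python
def hamming(x, y):
    """Stand-in distance: number of differing positions (not the E-distance)."""
    return sum(a != b for a, b in zip(x, y))


def forecast_dominant(candidates, circulating, distance,
                      log_membership=None, a_coef=0.0):
    """Return the candidate minimizing the assumed objective.

    objective(y) = sum over circulating x of distance(x, y)
                   - len(circulating) * a_coef * log_membership(y)
    The second term is applied only when log_membership is supplied.
    """
    def objective(y):
        total = sum(distance(x, y) for x in circulating)
        if log_membership is not None:
            total -= len(circulating) * a_coef * log_membership(y)
        return total

    return min(candidates, key=objective)


if __name__ == "__main__":
    circulating = ["AAAA", "AAAT", "AATT"]
    print(forecast_dominant(circulating + ["TTTT"], circulating, hamming))
```

When the circulating strains form several clusters, the same function could be called once per cluster, passing that cluster as `circulating`, to obtain one prediction per cluster.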
Measure of Pandemic Potential
The potential of an animal strain xαt to spill over and become HH capable as a human strain xht+δ is measured via the proposed E-risk, defined as follows:
where, as before, Ht is the set of human strains observed recently (taken as strains collected within the past year), and θ[t] is the E-distance induced by the Emergenet computed from the sequences in Ht.
The intuition here is that a lower bound of ρ(xαt) scales as the average log-likelihood of xαt giving rise to a human strain in circulation at time t. Since the strains in Ht are already HH capable, an animal strain with a high average likelihood of producing a similar strain has a high potential of becoming an HH-capable novel variant, which is a necessary condition for a pandemic strain. To establish the lower bound, from Theorem 1 (See Eq. (9) above) in the supplementary text:
A ln Πy∈H
By noting that A and C are not functions of xαt, it can be concluded that a lower bound of the proposed risk measure ρ(⋅) scales with the average log-likelihood of producing strains close to a circulating human strain at the current time.
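As a hedged sketch of how such a risk measure might be used, the example below ranks animal strains by their average distance to the recently observed human strains. The defining equation of the E-risk is not reproduced in this text, so its sign and scaling are not asserted here; the sketch only assumes that the measure is built from the average E-distance to the strains in Ht, so that a smaller average distance corresponds to a higher risk. The function names and the Hamming stand-in for the E-distance are hypothetical.

```python
def hamming(x, y):
    """Stand-in distance: number of differing positions (not the E-distance)."""
    return sum(a != b for a, b in zip(x, y))


def average_distance(strain, human_strains, distance):
    """Average distance from one strain to a set of human strains."""
    if not human_strains:
        raise ValueError("at least one human strain is required")
    return sum(distance(strain, y) for y in human_strains) / len(human_strains)


def rank_by_proximity(animal_strains, human_strains, distance):
    """Animal strains ordered from closest to farthest on average."""
    return sorted(
        animal_strains,
        key=lambda s: average_distance(s, human_strains, distance),
    )
```

Strains at the head of the returned list are those assumed, under this reading, to be most likely to give rise to variants resembling strains already circulating in humans.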
Theorem 1 (Probability bound). Given a sequence x of length N that transitions to a strain y∈Q, the following bounds are found at significance level α.
where ωyQ is the membership probability of strain y in the target population Q (see Def. 3), and θ(x,y) is the E-distance between x, y (see Def. 2).
Proof. Using Sanov's theorem on large deviations, it can be concluded that the probability of spontaneous jump from strain x∈P to strain y∈Q with the possibility P≠Q, is given by:
Writing the factors on the right hand side as:
it is noted that ϕiP(x−i),ϕiQ(y−i) are distributions on the same index i, and hence:
Using a standard refinement of Pinsker's inequality, and the relationship of Jensen-Shannon divergence with total variation:
where a0 is the smallest non-zero probability value of generating the entry at any index. This parameter is related to the statistical significance of the bounds. First, the lower bound can be formulated as:
Similarly, the upper bound may be derived as:
Eqs. (23) and (24) can be combined to conclude:
Now, interpreting a0 as the probability of generating an unlikely event below the desired threshold (i.e., a “failure”), it can be noted that the probability of generating at least one such event is given by 1−(1−a0)^N. Hence, if α is the pre-specified significance level, for N>>1:
Hence, it is concluded that at significance level ≥ α, the bounds are:
This bound can be rewritten in terms of the log-likelihood of the spontaneous jump and constants independent of the initial sequence x as:
where the constants are given by:
The results of simulated mutational perturbations are compared to sequences from the databases (for which Emergenets are constructed), and NCBI BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) is then used to identify whether the perturbed sequences match existing sequences in the databases.
The key factors that contribute to modeling a set of strains well within the Emergenet framework were investigated. A multivariate regression was carried out with data diversity, the complexity of the inferred Emergenet, and the edit distance of the WHO recommendation from the dominant strain as independent variables (See Supplementary Table 6 for definitions). Here, data diversity is defined as the number of clusters in the input set of sequences, such that any two sequences five or fewer mutations apart are in the same cluster. Emergenet complexity is measured by the number of decision nodes in the component decision trees of the recursive forest.
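The data diversity measure defined above can be sketched as a count of connected components, since the requirement that any two sequences five or fewer mutations apart share a cluster implies that clusters are closed under chaining. The sketch below is illustrative only; it assumes aligned sequences of equal length and counts mutations as differing positions (Hamming distance), and the function names are hypothetical.

```python
def hamming(x, y):
    """Number of differing positions between two aligned sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(x, y))


def data_diversity(sequences, max_mutations=5):
    """Number of clusters when sequences within max_mutations are linked."""
    parent = list(range(len(sequences)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            if hamming(sequences[i], sequences[j]) <= max_mutations:
                parent[find(i)] = find(j)  # merge the two clusters

    return len({find(i) for i in range(len(sequences))})
```

The pairwise comparison is quadratic in the number of sequences, which is adequate for a sketch but would likely need an indexed or approximate search for very large sequence sets.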
Several plausible structures of the regression equation are selected, and in each case it is concluded that data diversity has the most important and statistically significant contribution (Supplementary Table 6).
Multivariate Regression to Identify Map from E-Distance to Estimated IRAT Scores
Separate General Linear Models (GLMs) are trained to estimate IRAT scores (emergence and impact) from the average E-distance of a sequence of interest from a set of human strains, considering HA and NA sequences separately, using the CDC-computed IRAT scores as the dependent variables. The geometric mean of the HA- and NA-based E-distances is also included as a potential explanatory variable. A standard Gaussian model family with identity link function is used to keep the model that maps E-distances to the IRAT scores as simple as possible (see Supplementary Table 4).
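For a Gaussian family with identity link, the maximum-likelihood coefficients coincide with those of ordinary least squares, so the mapping can be sketched without a statistics library. The example below is a minimal illustration under stated assumptions: the feature set (HA distance, NA distance, and their geometric mean), the function names, and any numbers passed to it are hypothetical, and no actual IRAT scores or fitted coefficients from the disclosure are reproduced.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def features(ha_dist, na_dist):
    """HA distance, NA distance, and their geometric mean."""
    return (ha_dist, na_dist, math.sqrt(ha_dist * na_dist))


def fit_ols(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Linear predictor; with an identity link this is the estimated score."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))
```

Solving the normal equations directly is chosen here only to keep the sketch dependency-free; a numerically more robust factorization would normally be preferred for ill-conditioned designs.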
Working open-source software (requiring Python 3.x) is publicly available at https://pypi.org/project/emergenet/. All inferred Emergenet models are available at https://doi.org/10.5281/zenodo.7387861.
Two public sequence databases are used: 1) the National Center for Biotechnology Information (NCBI) virus database and 2) the GISAID database. The former is a community portal for viral sequence data, aiming to increase the usability of data archived in various NCBI repositories. GISAID has a more restricted user agreement, and use of GISAID data in an analysis requires acknowledgment of the contributions of both the submitting and the originating laboratories (corresponding acknowledgment tables are included as supplementary information). A total of 405,778 sequences were used in the analysis (see Supplementary Table 1).
Referring now to
In computing node 10 there is a computer system/server 12, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system/server 12 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.
Computer system/server 12 may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system/server 12 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
As shown in
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, Peripheral Component Interconnect (PCI) bus, Peripheral Component Interconnect Express (PCIe), and Advanced Microcontroller Bus Architecture (AMBA).
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12, and it includes both volatile and non-volatile media, removable and non-removable media.
System memory 28 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.
Program/utility 40, having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments as described herein.
Computer system/server 12 may also communicate with one or more external devices 14 such as a keyboard, a pointing device, a display 24, etc.; one or more devices that enable a user to interact with computer system/server 12; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 22. Still yet, computer system/server 12 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server 12. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
The present disclosure may be embodied as a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
[Supplementary tables not reproduced: data missing or illegible when filed.]
This application claims the benefit of priority to U.S. Provisional Application No. 63/535,650, filed Aug. 31, 2023, which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63535650 | Aug 2023 | US