The present disclosure generally relates to bioinformatics and transcriptomics, and in particular, to a system and associated method for gene state inference through Bayesian methods.
Quantitative measurements of RNA dynamics in populations of fixed cells and individual living cells have consistently revealed complex distributions of RNA counts and behavior across space, time, and individual cells, even for clonal cell populations. Broadly, the term “gene expression variability” is invoked to explain these ubiquitous, variable, and complex RNA expression distributions. One of single-cell biology's driving goals is understanding the molecular origin and downstream consequences of gene expression variability. For example, recent work has demonstrated that rare cells, identifiable only by transient fluctuations in RNA content compared to clonal sister cells, can drive drug-resistant cancer or maintain progenitor cells driving development. While experimental methods can identify rare cells, robustly determining gene transcriptional states and their connectivities from discrete RNA counts across cells remains an open problem.
It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Corresponding reference characters indicate corresponding elements among the view of the drawings. The headings used in the figures do not limit the scope of the claims.
A system outlined herein for gene network inference includes a processor in communication with a memory, the memory including instructions executable by the processor to: access observation data associated with a gene network including a quantity of RNA for a plurality of cells observed across a plurality of time points; sample a set of probability values associated with observing the observation data for values of each respective parameter of a plurality of parameters through a measurement model (the plurality of parameters including: a set of discrete parameters including one or more gene states observable within the observation data and a success probability of each respective gene state of the one or more gene states being active; and a set of continuous parameters including kinetic rates associated with transitions between the one or more gene states based on a local geometry of the measurement model across the plurality of time points) and jointly infer, based on the set of probability values and the observation data, a set of most probable values of the plurality of parameters.
The memory can include instructions executable by the processor to apply a Gibbs sampling scheme to iteratively sample probability values associated with values of each respective model parameter of the gene network from the measurement model. Further, the memory can include instructions executable by the processor to: sample the success probability for a gene state of the one or more gene states using an adaptive Metropolis-Hastings sampling scheme encapsulated within the Gibbs sampling scheme; iteratively sample probability values associated with the set of continuous parameters using a Hamiltonian Monte Carlo sampling scheme encapsulated within the Gibbs sampling scheme; and apply a Parallel Tempering scheme nested within the Gibbs sampling scheme for iteratively sampling probability values associated with values of each respective model parameter of the gene network from the measurement model.
The measurement model can incorporate a nonparametric Bayesian formulation of a Chemical Master Equation that characterizes probabilistic temporal evolution of the gene network. Further, the nonparametric Bayesian formulation of the Chemical Master Equation can incorporate a load vector that dynamically represents activity or inactivity of the one or more gene states observable within the observation data. The measurement model can also include a joint posterior probability distribution expressive of a probability of observing the observation data given values of the plurality of parameters of the measurement model. The posterior probability distribution can be constructed based on a set of prior probability distributions associated with each respective parameter of the plurality of parameters of the measurement model. Each gene state of the one or more gene states can respectively be expressed as an element of a load vector, wherein a prior probability distribution associated with the load vector includes a Beta-Bernoulli process prior probability distribution and where a value of the element of the load vector representing activity or inactivity of the associated gene state.
In a further aspect, a method for gene network inference includes: accessing observation data associated with a gene network including a quantity of RNA for a plurality of cells observed across a plurality of time points; sampling a set of probability values associated with observing the observation data for values of each respective parameter of a plurality of parameters through a measurement model (where the plurality of parameters include a set of discrete parameters including one or more gene states observable within the observation data and a success probability of each respective gene state of the one or more gene states being active and a set of continuous parameters including kinetic rates associated with transitions between the one or more gene states based on a local geometry of the measurement model across the plurality of time points); and jointly inferring, based on the set of probability values and the observation data, a set of most probable values of the plurality of parameters.
The method can further include: applying a Gibbs sampling scheme to iteratively sample probability values associated with values of each respective model parameter of the gene network from the measurement model; applying a Parallel Tempering scheme nested within the Gibbs sampling scheme for iteratively sampling probability values associated with values of each respective model parameter of the gene network from the measurement model; sampling the success probability for a gene state of the one or more gene states using an adaptive Metropolis-Hastings sampling scheme encapsulated within the Gibbs sampling scheme; and iteratively sampling probability values associated with the set of continuous parameters using a Hamiltonian Monte Carlo sampling scheme encapsulated within the Gibbs sampling scheme.
The measurement model can incorporate a nonparametric Bayesian formulation of a Chemical Master Equation that characterizes probabilistic temporal evolution of the gene network. Further, the nonparametric Bayesian formulation of the Chemical Master Equation can incorporate a load vector that dynamically represents activity or inactivity of the one or more gene states observable within the observation data. The measurement model can further include a joint posterior probability distribution expressive of a probability of observing the observation data given values of the plurality of parameters of the measurement model, where each gene state of the one or more gene states being respectively expressed as an element of a load vector, and where a prior probability distribution associated with the load vector includes a Beta-Bernoulli process prior probability distribution.
Accessing information on what drives a biological process often involves interrupting the process under observation and collecting snapshot data reporting back on the underlying network to be inferred. As measurement snapshot data themselves are stochastic, they necessarily motivate the development of probabilistic inference schemes to deduce the underlying reaction networks such as, for example, gene states and their connectivities from RNA counts. In such networks, nodes represent gene states, and edges represent biochemical reaction rates linking states. Simultaneously estimating reaction networks' constituent parameters and quantifying their uncertainties from snapshot data remains a challenging task.
Attempts to infer states and their connectivities under these conditions face three immediate challenges: 1) inherently stochastic measurements introduce uncertainty over which reaction network models may be warranted by the data; and 2) the number of allowed nodes, their connecting edges, and values for rates may be unknown; and 3) rates parameterizing these networks may be separated by multiple orders of magnitude. While Bayesian nonparametric methods rigorously propagate measurement uncertainty into uncertainty over states and their connectivities alongside parameters, resolving challenge (1), appropriately sampling these posteriors given items (2) and (3) remains an open question. The present disclosure outlines systems and methods that apply a hybrid Bayesian Markov Chain Monte Carlo (MCMC) sampler, directly addressing challenges (2) and (3) leveraging three methods: Hamiltonian Monte Carlo (HMC) which leverages local posterior geometries in inference to explore the parameter space effectively; Adaptive Metropolis Hastings (AMH) which learns correlations between plausible parameter sets to efficiently propose probable models; and Parallel Tempering which takes into account multiple models simultaneously with tempered information content. The methods are applied to data from single molecule Fluorescence in-situ Hybridization (FISH), a popular snapshot method probing transcriptional networks to illustrate the identified challenges and how the methods outlined herein address them.
Gene expression models (e.g., gene states and their connectivities), key toward understanding a cell's regulatory response, underlie experimental observations of single cell transcriptional dynamics. While RNA expression data encodes information on gene expression models, existing computational frameworks do not perform simultaneous Bayesian inference of gene expression models and parameters from such data. Rather, gene networks (composed of gene states, their connectivities, and associated parameters including kinetic parameters) are currently deduced by pre-specifying gene state numbers and connectivity prior to learning associated rate parameters. As such, the correctness of gene networks cannot be independently assessed which can lead to strong biases. By contrast, the present disclosure outlines systems and methods that enable simultaneously and self-consistently learning full distributions over gene states, state connectivities, and associated rate parameters from single molecule level RNA counts. Notably, the methods propagate noise originating from fluctuating RNA counts over gene networks warranted by the data by treating gene networks themselves as random variables, achieved by operating within a Bayesian nonparametric paradigm. The method is demonstrated on the lacZ pathway in Escherichia coli cells, the STL1 pathway in Saccharomyces cerevisiae yeast cells, and verify its robustness on synthetic data.
Toward unraveling gene expression models, the present disclosure considers experimental methods providing discrete RNA counts, focusing on single-molecule RNA Fluorescence in situ Hybridization (smFISH). In particular, smFISH provides snapshot data (referred to herein as “observation data”, see
To help highlight the major challenges facing computational inference of gene networks from snapshot smFISH data, consider a simple gene network, shown at the top left corner of
In a further simplification, some of these methods altogether ignore the intrinsic stochasticity of RNA counts in favor of mass action formulations, which predict the temporal evolution of the mean number of RNA molecules per cell. Mass action formulations are fundamentally insufficient for smFISH data, as RNA copies may be present at low numbers, rendering information on their copy number fluctuations vital toward extracting kinetic parameters. Even methods which resolve this issue using the forward Kolmogorov equation (chemical master equation or CME), which gives the probabilistic temporal evolution of single-cell RNA counts, estimate the number of gene states without rigorous Bayesian uncertainty propagation.
Many previous methods, particularly those advertised as less costly than Bayesian inference, fall into the broad category of Maximum Likelihood Estimation (MLE) methods. Some of these, in particular Adaptive Differential Evolution, may agree with Bayesian methods of the kind explored further herein. However, any algorithm of choice in obtaining MLEs would return a unique model, even if the data does not warrant such strong conclusions.
Furthermore, any attempt to determine which unique gene expression model is warranted by the data, would require integrating the (necessarily approximated) CME for many different sets of parameters in MLE estimation, with the added cost of estimating parameters across models separately. Therefore not only are methods to pick out unique models and their MLE parameters costly, but they also do not propagate inherent uncertainty from stochastic RNA counts into uncertainty over models. What is more, they do not leverage the computational benefit that may be derived from exploring models in parameter estimation. Thus, moving beyond MLE, the present disclosure is motivated by methods that simultaneously and self-consistently learn gene expression models and parameters from data.
Based on these observations, the present disclosure outlines a Bayesian framework evaluated using Markov Chain Monte Carlo (MCMC) to simultaneously derive and sample from a probability over gene states, their associated parameters (and thus gene state connectivity), given discrete RNA counts taken from smFISH snapshot data, The resulting Bayesian method outlined herein propagates inherent noise into model (e.g., gene state number) estimation.
This can be achieved within the Bayesian paradigm, which allows sampling from posterior probability distributions over gene networks. In order to construct these posteriors, prior probability distributions over gene state numbers are required, necessitating the use of Bayesian Nonparametrics (BNPs) which overcomes the problem of “overfitting” in model determination by placing priors over infinite candidate models. As with “normal” (parametric) Bayesian methods, a likelihood incorproates the data in a way that requires an approximate solution of the CME. With a likelihood and nonparametric priors at hand, the systems and methods outlined herein are shown to simultaneously and self-consistently estimate model structures (number of gene states) alongside associated rate parameters (and thus, gene state connectivity) between states, as warranted by observed stochastic RNA counts per cell.
Applicability of the methods are demonstrated by testing on synthetic data followed by two very different experimental systems: the IacZ pathway in Escherichia coli (E. coli) cells and the STL1 pathway in Saccharomyces cerevisiae (S. cerevisiae) yeast cells.
Using this information, the goal is to infer the gene expression model, i.e., infer both gene states and their associated rate parameters (and thus gene state connectivity).
In each gene state, indexed σl, the gene transcribes RNA copies at rate βl. All RNA degrade stochastically according to overall rate γ. A gene can transition stochastically, say from state σl to σl′, at rate kσ
where σl=σ denotes the initial gene state.
In order to infer θ within the Bayesian paradigm, a likelihood, P(
with Pθτ≡(Pθτ(σl1), . . . , Pθτ(σl, M), Pθτ(σ2, 1), . . . , Pθτ(σ2, M), . . . , Pθτ(σN1), . . . , Pθτ(σLM))T satisfying the CME, {dot over (P)}θτ=A·Pθτ, where A is a generator matrix, whose dependency on θ is outlined in more detail in Section 5.2.
In order to construct a posterior using the likelihood, prior probability distributions (priors) are required over all model parameters. The priors over quantities in θ can be selected for computational convenience and are detailed in Section 5.4.
This section expands upon the nonparametric prior used on gene states.
A nonparametric formulation must theoretically consider an infinite number of gene states and allow the data to winnow down these infinite possibilities to those warranted by the data. This is similar to regular (parametric) Bayesian methods which typically assume broad priors over parameters and eventually allow the data, incorporated through the likelihood, to sharpen parameter estimates (i.e., sharpen the posterior).
As a matter of computational convenience, the Beta-Bernoulli process is selected as a formal prior on the existence or non-existence of these states.
Put simply, an infinite number of intermediate binary (Bernoulli) indicator variables, bl, termed loads, are introduced. Each respective load equals 1 when an associated gene state σl is deemed necessary by the data, or 0 otherwise. To make computation feasible, a so-called weak limit L is introduced that sets an upper bound on the number of possible gene states. A load vector that collects loads {bl}l=1: L is referred to collectively as
The Beta-Bernoulli process prior on the loads reads:
where ql are hyperparameters describing the success probability of load bl being “active” or equal to 1, and ζ is a hyperhyperparameter. Given this prior, one can learn from the data which gene states are warranted (e.g., by determining which elements of the load vector can be assigned values of “1”).
Given the likelihood and all priors, an explicit form can now be constructed for the posterior probability distribution, P(
Importantly, the ability to efficiently explore this posterior, especially given the added difficulty of inferring states, enables escape from traps (local maxima) that have impacted the assessment of parameters of other methods (Section 6.2.2).
With this fact in mind, the system 100 can use an overall Gibbs sampling scheme to construct the Markov Chain of the MCMC sampling scheme. Within this Gibbs sampling scheme, the system can sample the initial condition, σ*, and loads,
To address problem 1), all parameters are sampled with Parallel Tempering (PT) in order to better explore the discrete parameters. Within the PT scheme, continuous parameters are proposed using Hamiltonian Monte Carlo (HMC) sampling, solving problem 2). Used in conjunction for the first time, these sampling schemes permit the inference of gene states and their associated parameters on reasonable time scales, avoiding local maxima mentioned earlier in Section 6.2.1.
Device 200 comprises one or more network interfaces 210 (e.g., wired, wireless, PLC, etc.), at least one processor 220, and a memory 240 interconnected by a system bus 250, as well as a power supply 260 (e.g., battery, plug-in, etc.). Device 200 may also include or otherwise communicate with a display device 230 which can communicate the results (identified parameters and gene states) to a user based on the smFISH observation data.
Network interface(s) 210 include the mechanical, electrical, and signaling circuitry for communicating data over the communication links coupled to a communication network. Network interfaces 210 are configured to transmit and/or receive data using a variety of different communication protocols. As illustrated, the box representing network interfaces 210 is shown for simplicity, and it is appreciated that such interfaces may represent different types of network connections such as wireless and wired (physical) connections. Network interfaces 210 are shown separately from power supply 260, however it is appreciated that the interfaces that support PLC protocols may communicate through power supply 260 and/or may be an integral component coupled to power supply 260.
Memory 240 includes a plurality of storage locations that are addressable by processor 220 and network interfaces 210 for storing software programs and data structures associated with the embodiments described herein. In some embodiments, device 200 may have limited memory or no memory (e.g., no memory for storage other than for programs/processes operating on the device and associated caches). Memory 240 can include instructions executable by the processor 220 that, when executed by the processor 220, cause the processor 220 to implement aspects of the various systems and methods outlined herein.
Processor 220 comprises hardware elements or logic adapted to execute the software programs (e.g., instructions) and manipulate data structures 245. An operating system 242, portions of which are typically resident in memory 240 and executed by the processor, functionally organizes device 200 by, inter alia, invoking operations in support of software processes and/or services executing on the device. These software processes and/or services may include gene network inference processes/services 290, which can include aspects of methods and/or implementations of various modules described herein. Note that while gene network inference processes/services 290 is illustrated in centralized memory 240, alternative embodiments provide for the process to be operated within the network interfaces 210, such as a component of a MAC layer, and/or as part of a distributed computing network environment.
It will be apparent to those skilled in the art that other processor and memory types, including various computer-readable media, may be used to store and execute program instructions pertaining to the techniques described herein. Also, while the description illustrates various processes, it is expressly contemplated that various processes may be embodied as modules or engines configured to operate in accordance with the techniques herein (e.g., according to the functionality of a similar process). In this context, the term module and engine may be interchangeable. In general, the term module or engine refers to model or an organization of interrelated software components/functions. Further, while the gene network inference processes/services 290 is shown as a standalone process, those skilled in the art will appreciate that this process may be executed as a routine or module within other processes.
A method 300 outlined herein and shown in
Referring to
Step 304 of method 300 includes sampling a plurality of probability values from a joint probability distribution associated with observing the observation data for values of each respective parameter of a plurality of parameters through a measurement model. The plurality of parameters include: a set of discrete parameters including one or more gene states observable within the observation data and a success probability of each respective gene state of the one or more gene states being active; and a set of continuous parameters including kinetic rates associated with transitions between the one or more gene states based on a local geometry of the measurement model across the plurality of time points. The measurement model can incorporate a nonparametric Bayesian formulation of a Chemical Master Equation that characterizes probabilistic temporal evolution of the gene network. The nonparametric Bayesian formulation of the Chemical Master Equation can incorporate a load vector that dynamically represents activity or inactivity of the one or more gene states observable within the observation data.
Step 304 can be achieved through step 306, which can include applying a Gibbs sampling scheme to iteratively sample probability values associated with values of each respective model parameter of the gene network from the measurement model.
Referring to
Steps 310 and 312 can be nested within the Parallel Tempering scheme of step 308, which is nested within each iteration of the Gibbs sampling scheme of step 306. Step 310 applies to discrete parameters (gene states, number of gene states, etc.) and includes sampling the success probability for a gene state of the one or more gene states using an adaptive Metropolis-Hastings sampling scheme encapsulated within the Gibbs sampling scheme. Step 312 applies to continuous parameters (kinetic rates) and includes iteratively sampling probability values associated with the set of continuous parameters using a Hamiltonian Monte Carlo sampling scheme encapsulated within the Gibbs sampling scheme.
Referring back to
This section outlines and describes the computational structure including formulation of the measurement model and construction of the inference strategy. The method operates on observation data in the form of single cell RNA count data, with mt
To infer these kinetic parameters, a gene state model is presented. The defining characteristic of the model, the number of gene states, ultimately dictates the number of kinetic parameters involved in the underlying dynamics.
Below, the present disclosure first describes a simpler analog of the method for deriving kinetic parameter approximations when the number of gene states is held at a fixed quantity, i.e., a parametric framework. Then, the present disclosure expands the inference framework to the nonparametric paradigm to infer both the number of gene states and kinetic parameters.
Working within a Bayesian paradigm, the given single cell RNA counts must first be connected to the model parameters. For ease of explanation, the present disclosure starts by describing a parametric formulation, for which the number of gene states is provisionally fixed to two. For the two-state model, each gene state can be labeled using σl for l=1, 2. Now, all associated kinetic parameters can be described which include the initial gene state σ*, either σ1 or σ2; the transition rates, kσ
Armed with θ, the kinetic parameters are related to the single cell RNA counts
Here, A is a square block matrix which depends on θ and whose size reflects the number of gene states.
To describe the structure of A, the present disclosure starts with the nonzero elements associated with the nth row, arbitrarily chosen, for which n<M and is therefore associated to the derivative dPθt(σ1, n)/dt. The nonzero elements include:
Note that, as the number of RNA is unbounded, the infinite set of CME ordinary differential equations (ODEs) can be truncated by a preset maximum (M) RNA count within a given dataset.
Given the generator matrix's form, A, the CME's structure is defined for which the solution, Pθt(σ1, m), satisfies:
Provided the above, the likelihood is constructed, P({dot over (m)}|θ), which is a measure of how probable the data, m is given a set of parameters θ. Since the single cell RNA counts are independent and identically distributed, a likelihood of the following form is built:
Likelihoods for the one state and three state models follow a similar structure and are described in detail below.
Moving into the nonparametric regime, the likelihood must be adjusted dynamically based upon the number of gene states introduced into the gene state model during model inference. Therefore, intermediate binary indicator variables are introduced, bl, termed as loads, which correspond to each possible gene state and can be either 1 or 0. The loads affect the configuration of A, and thus the CME's solution, by multiplying all rates associated to their respective gene state. Specifically, the production rate for gene state n, βl, is multiplied by load bl. Similarly, the transition rate from gene state l to gene state l′, kσ
Finally, the degradation rate is multiplied by each respective load, since if a gene state is inactive, the gene will never occupy that state, and therefore degradation will not occur in that state. This change is demonstrated using the same nonzero elements of A as were outlined for the parametric two state model,
where L is chosen as a weak limit imposed on the maximum possible number of gene states for computational feasibility. In this case, this row corresponds to the derivative dPb,θt (σ1, n)/dt which is as described for the two state model but now incorporates the new dependencies on the load variables, grouped as b=(b1, b2, . . . , bL), as shown for A (n, 1:L*(M+1)).
In this way, each load determines whether or not its corresponding gene state, and thus the associated rates, is warranted by the data. The model parameters encompassed in θ grow to include all possible transition rates and production rates.
In the Bayesian paradigm, now that the likelihood is defined, priors must be placed over all parameters to be inferred. Choices for priors over the kinetic parameters as well as the initial condition gene state are relatively straightforward and are described in Section 5.4. To place priors over the loads, bn, the system can employ Beta-Bernoulli process priors:
where qn re hyperparameters which describe the success probability of load bn being “active” or equal to 1, and ζ is a hyperhyperparameter. Given this setup, it is possible to determine which gene states are deemed necessary by the data.
Given the likelihood and all priors, the fully joint posterior probability distribution can be defined for both parametric, P(θ|
The main bottleneck in the model is the high computational expense associated with numerically approximating CME solutions. In particular, as the size of the CME is given by the structure of the generator matrix, A, the computational cost grows for data sets associated with high RNA counts (highly expressed genes) and gene state models incorporating higher numbers of gene states.
Loads are defined as:
{tilde over (β)}l=log10(βl) and a similar convention is kept for all other rates. Finally, the set of parameters are collected in {tilde over (θ)}=(
Let Pt, σ,
where M is the maximum number of RNA per cell considered (and fixed at some high albeit reasonable number for computational efficiency). Then the master equation for the BNP model, in matrix form, reads:
with the matrix,
provided that kσ
Initial condition used is:
In some examples, the system can employ an overall Gibbs sampling scheme to construct the Markov Chain. The Gibbs scheme is a method for updating a full set of parameters which completely parameterize a model. For convenience, the present disclosure denotes this complete set of parameters as Λ. In each step of the Gibbs scheme, the system obtains a new set of parameters, Λ, from a set of old parameters ΛOld by proposing multiple disjoint subsets θiprop of Λ for i=1: I, which together, would completely parameterize the model. For each i in sequence, the system applying the Gibbs scheme provisionally updates the full set of parameters to Λprop, using the values which make up θiprop. Then, the system compares the posterior probability of Λprop (which now contains the parameters from θiprop in addition to the most recent remaining parameters in Λ) is compared to that of Λ. If the comparison favors Λprop, then the update Λ=Λprop is made, otherwise Λ remains unchanged. In this way, once all I subsets have been assessed and either accepted or rejected, the system can obtain new estimates for all parameters in Λ.
Within the Gibbs sampling scheme, the system samples the initial conditions σ* and
The system can use direct sampling within a Gibbs sampling scheme to sample the initial condition, σ* and bl for l=1, . . . , L, jointly from the marginal posterior distribution:
The system can sample initial condition gene state and the loads through posteriors of all possible initial condition gene states with all possible load combinations:
constructing the Categorical distribution and sampling the initial condition gene state:
The conditional probabilities can be written as:
The logarithm of the conditional probability can be written as:
The system can employ a mixture of MH and Hamiltonian Monte Carlo (HMC) sampling schemes to sample the reaction rates. Both cases involve sampling from the marginal posterior distribution:
The system can employ a Metropolis-Hastings (MH) sampling scheme to sample the initial success probability vector,
The system can take proposals within the MH sampling scheme directly from the prior distribution, giving an acceptance ratio of:
In log form:
The system can employ direct sampling within a Gibbs sampling scheme to sample the initial condition, σ* from the marginal posterior distribution:
The system can sample the initial condition gene state through posteriors of all possible gene states:
constructing the Categorical distribution and sampling the initial condition gene state:
The conditional probabilities can be written as:
The logarithm of the conditional probability is:
The system can apply direct sampling within a Gibbs sampling scheme to sample the loads b from the marginal posterior distribution:
To sample the loads using direct sampling within the Gibbs sampling scheme, the system samples the loads simultaneously through posteriors of all configurations of loads:
constructing the Categorical distribution and sampling the configuration of loads:
However, since the Categorical distribution has 2L arguments, the computational cost can be lessened by sampling a smaller set of loads. A random set of loads (
The logarithm of the conditional probability is:
The system can propose new samples using a multivariate normal distribution demonstrated below for the parameter {tilde over (β)}1 and {tilde over (γ)},
where Σ[2
In log form this becomes:
The sampling method used here is the same as in the linear space, however the present disclosure defines the Hamiltonian systems based upon the new system. Therefore, {tilde over (q)}={tilde over (θ)}, and the Hamiltonian function is:
and T(P) is the same as in the linear space.
In order to advance a half step forward using H1({tilde over (q)}, p), the present disclosure now considers the change of variables {tilde over (q)}=log10(q) to find:
is the same as Vq(q) above.
In order to advance the full step forward using H2({tilde over (q)}), use L{tilde over (q)}({tilde over (q)}), demonstrated for the rates {tilde over (β)}l and {tilde over (γ)},
To approximate, a second order and symplectic implicit midpoint rule integration can be applied using custom MATLAB code:
Solving for Pnew, and demonstrating with the parameter {tilde over (β)}l:
Plugging Eq. (2) into Eq. (1) and following these algebraic steps,
the following can be arrived at:
Due to the high number of model parameters, the present disclosure analyzes the method's sensitivity to changes in rates by varying the maximum production rate (β1). If all other rates are maintained, changing β1 corresponds to changing the expected number of RNA transcripts present in the cells. By varying β1 from β1=0.3 s−1 to β1=1 s−1 (see
In general, it is difficult to predict whether the amount of data provided to the method is sufficient to infer all rates simultaneously. To wit,
The examples above assume the availability of snapshot smFISH data containing RNA counts per cell, mt
Here, the present disclosure shows ability of the methods to perform model selection and parameter inference for gene networks using both experimental and synthetic data. In BNPs, the model structure (which, in this case, corresponds to a number of gene states) is treated as a parameter, and can thus be inferred alongside all other parameters. The remaining parameters of interest are: the production rates, βl, for each gene state; the transition rates between various gene states, kσ
First, the present disclosure demonstrates robustness of the method on synthetic data which replicates dynamics similar to experimental data sets, but covers a broader range of scenarios than are available experimentally. Subsequently, results are shown for experiments on the lacZ pathway in E. coli cells and the STL1 pathway in S. cerevisiae yeast cells.
In line with the current literature on gene expression, the method is evaluated on three different models consisting of one, two, and three gene states. In the two and three state models, one state has a production rate of zero.
Perhaps counter-intuitively for the simplest of gene networks, which should be easiest to infer, the posterior probability distributions over gene states are broader. The reason is subtle: models that estimate a greater number of gene states can approximate a one state model (but not vice versa) by having nearly identical production rates for each of the gene states. However, since the production rates of all possible gene states are not perfectly identical, the methods disclosed herein still favor the true number of states.
For a detailed overview of the remaining robustness analysis of the method, refer to Section 5.7.
Synthetic data used in Section 6.1 was generated using computer simulation based on Gillespie's Stochastic Simulation Algorithm. Details of the model are outlined in Section 5.1, with inference of all parameters detailed in Section 5.1.
smFISH RNA count data from E. coli and S. cerevisiae cells are analyzed. For simplicity here, the RNA molecules' degradation rate is calibrated using previously-established results.
6.2.1 E. coli
This section demonstrates simultaneous gene state number and parameter inference for the lacZ pathway in E. coli cells, grown in slow-growth media. Results are shown in
The states posited in the other work agree with what is learned directly from the data. In terms of parameters, general agreement is also found in the lowest production rate. However, disagreement of 70%, 40%, and 43% is found in parameters kσ
A trace of log likelihood of the present method surpassing that of another work can be seen in
Interestingly, despite the significant difference between e and 0′, RNA count histograms across time points appear qualitatively similar; see
However, in order to directly compare likelihoods, the present method can be restricted to infer parameters by imposing by hand a two state gene expression model, with one production rate fixed at zero (
This again suggests the presence of local maxima that may lead to incorrect parameter value estimates even when assuming a simpler model with fewer states. Once more, this result underscores the need for simultaneous optimization methods such as the methods outlined herein.
6.2.2 S. cerevisiae
for STL1 transcription. See
Inferring the most probable regulatory network structure for a given set of observed snapshot RNA expression data presents unique challenges that have stood in the way of accurately identifying the number and connectivity of biophysical reactions and constituent parameters, whether separately or simultaneously. The present approach achieves model and parameter inference in a self-consistent and simultaneous fashion and improves upon limitations of other approaches, including 1) the assumption of steady-state dynamics and 2) separation of model selection of gene state numbers and parameter inference.
Effectiveness of the system applying the methods outlined herein are using both experimental and simulated snapshot RNA expression data. For E. coli in fast-growth media, the present system determined (
Several additional extensions can be made to the system and associated methods outlined herein. First, with minimal modification, the present approach may utilize spatial information contained in snapshot RNA expression data quantified by smFISH to determine, for instance, transport rates of RNA from nucleus to cytoplasm. Indeed, the additional constraint of RNA transport from the nucleus to cytoplasm has previously improved parameter identification. Second, modifications of the measurement model within the present system may allow for time-varying rates of transcription, gene state transitions, and RNA degradation. Finally, as the density of gene species increases using highly multiplexed smFISH methods, the flexible network connectivity of the present approach may allow regulatory models to explore the most-likely regulatory networks for co-varying gene expression.
The above generalizations will introduce additional complexity to likelihood computation, which is already the costliest inference step. The additional complexity is directly due to the increase in the state number and complexity of the connectivity map. Both extensions would alter the generator matrix A (see Section 5.2), making it either larger for number of states or denser for connectivity. In the case where the generator matrix remains sparse, the CME solution's time cost scales roughly linearly with A's size, and FSP based Krylov subspace methods may be more optimal than the CME solution method used here. How the computational cost of A's CME scales with density is more complex than the number of states. Above a certain density, the recently proposed Quantized Tensor Train method may be more efficient, as the FSP based Krylov subspace approach uses incremental time stepping rather than jumping immediately to the times desired for analysis. Alternatively, there have been promising attempts to solve ODEs using neural networks. In addition to facilitating the difficulties arising due to dense CME generator matrices, neural network approaches may further enable parameter inference for non-Markovian models of gene transcription.
It may also be possible to deduce gene networks from direct image gene expression dynamics in real-time within living cells. However, such approaches obtain real-time kinetics of a limited number of molecules at the expense of higher data density. What is more, genetic manipulation limits the accessible insight to local molecular and biophysical interactions.
The desire to understand downstream consequences of gene expression, for example, through predictive modeling, directly motivates the use of snapshot RNA expression data especially for higher data density. While removing temporal correlations in snapshot RNA expression data immediately hinders the ability to obtain direct insight into gene regulatory dynamics, the knowledge gaps may be filled by increasing the density of time points, RNA species, cells, and stimuli conditions in snapshot RNA expression data. Indeed, the present disclosure shows how to maximize the information deduced from snapshot RNA expression data and obtain probabilities over gene regulatory network structures and constituent rates. This is achieved by reframing the gene regulatory network identification problem within the Bayesian nonparametric paradigm and developing the requisite tools for inference over gene states, connectivity, and parameters. The probabilistic output of the approach introduced may now allow us to learn networks reflecting the confidence that any given snapshot RNA expression data set supports.
It should be understood from the foregoing that, while particular embodiments have been illustrated and described, various modifications can be made thereto without departing from the spirit and scope of the invention as will be apparent to those skilled in the art. Such changes and modifications are within the scope and teachings of this invention as defined in the claims appended hereto.
This is a U.S. Non-Provisional Patent Application that claims benefit to U.S. Provisional Patent Application Ser. No. 63/483,578 filed 7 Feb. 2023, which is herein incorporated by reference in its entirety.
This invention was made with government support under RO1 GM130745 and RO1 GM134426 awarded by the National Institutes of Health. The government has certain rights in the invention.
| Number | Date | Country | |
|---|---|---|---|
| 63483578 | Feb 2023 | US |