The present disclosure relates generally to improvements in computer science having applications in any industry that can benefit from the study of genes, phenotypes, and/or DNA/RNA. More particularly, but not exclusively, the present disclosure relates to genomic-word-framework analysis of genomic methylation data.
The background description provided herein gives context for the present disclosure. Work of the presently named inventors, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art.
DNA sequences carry not only information for how to build proteins, but also the regulatory information for living organisms to survive and reproduce, which involves but is not limited to epigenetic information that controls chromatin behavior.
Many diseases and the presence of many phenotypes are not well understood. This is because genes' phenotypic penetrance and expressivity vary due to the different combinations of modifying alleles that are present in one genetic background versus another.
Thus, there exists a need in the art for an apparatus which addresses the potential for superpositioning of more than one structured language throughout the nucleotide sequence of a genome.
The following objects, features, advantages, aspects, and/or embodiments, are not exhaustive and do not limit the overall disclosure. No single embodiment need provide each and every object, feature, or advantage. Any of the objects, features, advantages, aspects, and/or embodiments disclosed herein can be integrated with one another, either in full or in part.
It is a primary object, feature, and/or advantage of the present disclosure to improve on or overcome the deficiencies in the art.
It is a further object, feature, and/or advantage of the present disclosure to provide an extension for a general purpose programming language or a statistical programming language. In an embodiment, said extension comprises algorithm(s) that analyze(s) methylation signals on stretches of DNA sequences. In an embodiment, the DNA sequences are characterized by (i) methylation information and (ii) physicochemical information around each methylated cytosine. In an embodiment, the algorithm(s) include one or more functions that can estimate a distance matrix on a set of selected regions of said DNA sequences; perform a hierarchical cluster analysis on the set of selected regions; group the set of selected regions into a specified number of clusters; and align multiple DNA sequences from the clusters into methylation motifs.
In some embodiments, the extension is written in the R statistical language.
It is a further object, feature, and/or advantage of the present disclosure to reduce computation load. For example, differentially methylated genes (DMGs) identified with methylation analysis can be integrated into gene networks to identify network hubs via protein-protein interaction (PPI) network analysis and weighted correlation network analysis. Weighted correlation network analysis based on prior knowledge of differentially methylated positions (DMPs) and DMGs has no known precedent.
It is a further object, feature, and/or advantage of the present disclosure to provide a computerized heuristic. In some embodiments, the computerized heuristic comprises a high order DNA base interdependence with respect to methylated cytosines; and a base distribution that is statistically nonrandom.
According to some aspects of the present disclosure, the heuristic can comprise (1) a statistic (sum, mean, or density, etc.) of an information divergence (ID) estimated for each gene carrying at least one DMP on it (on gene-body or on promoter region); (2) principal component analysis (PCA) wherein the first k components, those carrying 1% or more of the whole sample variance, are considered in the downstream analysis; (3) computation of a correlation matrix carrying the pairwise correlations between genes, each gene being represented as a vector of PCs; (4) analysis of the correlation matrix as a network; and (5) the contribution of each gene to the discrimination of phenotypes, evaluated in terms of the fraction of the whole sample variance carried by the gene.
According to some additional aspects of the present disclosure, the ID is selected from the group consisting of: Hellinger divergence/distance, J divergence, total variation distance, etc.
According to some additional aspects of the present disclosure, the PCA can be applied with a pcaLDA function described in the '986 Patent. As a result, genes are represented as k-dimensional vectors of PCs, where the square of each coordinate carries the vector contribution (in terms of variance) to the discrimination of the treatment group from the control group. A correlation matrix can be mathematically equivalent (in terms of information) to a weighted correlation network (WCN).
According to some additional aspects of the present disclosure, the WCN is analyzed as was done for the network, which can be a PPI network. New knowledge retrieved from the WCN is derived from the raw methylation data and does not depend on prior beliefs or biological knowledge about the genes present in the network. Results from the WCN and the PPI network are compared to identify consistent relationships and epigenetic gene contributions to the phenotypes.
According to some additional aspects of the present disclosure, the heuristic further comprises a magnitude computed as the Euclidean norm of the gene represented as a vector of k PCs.
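The heuristic steps and the magnitude described above can be sketched in base R. This is a minimal illustration on simulated values and is not the pcaLDA function of the '986 Patent. The matrix `id_stat`, its dimensions, and the choice to represent each gene by its PC loadings scaled by the component standard deviations are assumptions made only for this sketch.

```r
# Toy input (assumption): rows are samples, columns are genes; each value
# stands in for a per-gene statistic of an information divergence.
set.seed(1)
id_stat <- matrix(rexp(6 * 5), nrow = 6,
                  dimnames = list(paste0("s", 1:6), paste0("gene", 1:5)))

# PCA; keep the components carrying at least 1% of the total variance.
pca <- prcomp(id_stat, center = TRUE, scale. = TRUE)
var_frac <- pca$sdev^2 / sum(pca$sdev^2)
keep <- which(var_frac >= 0.01)

# Each gene as a k-dimensional vector: loadings scaled by component sdev,
# so squared coordinates are variance contributions (assumed convention).
gene_vec <- sweep(pca$rotation[, keep, drop = FALSE], 2, pca$sdev[keep], "*")

# Pairwise gene correlation matrix computed from the PC vectors.
gene_cor <- cor(t(gene_vec))

# Magnitude of each gene: Euclidean norm of its PC vector.
gene_norm <- sqrt(rowSums(gene_vec^2))

# Fraction of the retained variance carried by each gene.
contribution <- gene_norm^2 / sum(gene_norm^2)
```

The correlation matrix `gene_cor` can then be read as the adjacency weights of a weighted correlation network, with `contribution` ranking genes by the variance they carry.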
It is still yet a further object, feature, and/or advantage of the present disclosure to selectively build motif libraries. For example, methylation motifs can be identified in all DMGs, which provide the raw material to build motif libraries. These libraries can then serve as the fundamental dataset needed to build predictive models with applications in plant science and biomedical research.
Genomic-word-frameworks and the genomic methylation data disclosed herein can be used in a wide variety of applications. For example, such GWF-based model predictions can be used for identifying and treating patients of autism, cancer, and other diseases that benefit from early diagnostics. Said models could also help provide further understanding in discovering causes for (1) phenotypes that are not at present well-understood and (2) multifactorial diseases seemingly caused by both genetic and environmental factors, such as diabetes and alcoholism.
The visual representation of genomic-word-framework analyses can be automatically and intuitively configured so as to quickly convey meaning to those interpreting same. Therefore, at least one embodiment disclosed herein can comprise a distinct aesthetic appearance. Ornamental aspects included in such an embodiment can help further a person's understanding of the potential relationship genomic methylation data has to applications within the physical world (e.g. phenotype).
Methods can be practiced which facilitate use, manufacture, assembly, maintenance, and repair of libraries of DNA methylation motifs which accomplish some or all of the previously stated objectives.
It is a further object, feature, and/or advantage of the present disclosure to provide methods for analyzing methylation signals on stretches of DNA sequences. In some embodiments, the method comprises analyzing a hierarchical cluster on regions of the DNA sequences; grouping a set of selected regions hierarchically into a specified number of clusters; aligning potential DNA sequence motifs from said clusters; and applying digital signal processing to the encoded methylation and physicochemical signals.
The creation and maintenance of libraries can further be incorporated into automated, heuristic analysis processes which constantly refine and improve DNA base interdependence with respect to methylated cytosines until they achieve a base distribution that is statistically nonrandom.
These and/or other objects, features, advantages, aspects, and/or embodiments will become apparent to those skilled in the art after reviewing the following brief and detailed descriptions of the drawings. Furthermore, the present disclosure encompasses aspects and/or embodiments not expressly disclosed but which can be understood from a reading of the present disclosure, including at least: (a) combinations of disclosed aspects and/or embodiments and/or (b) reasonable modifications not shown or described.
Several embodiments in which the present disclosure can be practiced are illustrated and described in detail, wherein like reference characters represent like components throughout the several views. The drawings are presented for exemplary purposes and may not be to scale unless otherwise indicated.
An artisan of ordinary skill in the art need not view, within isolated figure(s), the near infinite number of distinct permutations of features described in the following detailed description to facilitate an understanding of the present disclosure.
The present disclosure is not to be limited to that described herein. Mechanical, electrical, chemical, procedural, and/or other changes can be made without departing from the spirit and scope of the present disclosure. No features shown or described are essential to permit basic operation of the present disclosure unless otherwise indicated.
Genomic-word-framework (GWF) analysis of DNA methylation involves analysis of methylation motifs and digital signal processing. GWFs are stretches of DNA sequence covering differentially methylated positions (DMPs). GWFs are however not to be confused with the concept of a DNA sequence motif. A word-framework (WF) can include one or more motifs. That is, a ‘sentence of WFs’ is also a GWF. DMPs can be identified by methylation analysis with an extension in a statistical programming language, such as the R package described by the inventors of the present application in U.S. Pat. No. 10,913,986. The analysis permits the identification of DNA sequence methylation motifs found in genes with potential epigenetic regulatory functionalities, including those induced by environmental changes or disease.
One potential embodiment of an analytical heuristic described herein has been implemented in an R package named GenomicWordFramework. More particularly, GenomicWordFramework is a utility package to identify the potential genomic word framework (GWF) regions of a hypothetical language. The GWFs facilitate further analysis of DNA sequences that carry genomic signals with the application of digital signal processing (DSP) and machine-learning (ML) tools from other R packages. GenomicWordFramework includes several functions to accomplish reading and data transformation to a suitable form for application of different statistical approaches to data analysis, like clustering algorithms and statistical tests.
GenomicWordFramework can utilize prior identification of DMPs with the R packages specific to methylation analyses, such as those described in U.S. Pat. No. 10,913,986. The use of methylation analyses and GWF analyses described herein can therefore form a pipeline that implements a signal detection and a machine-learning approach permitting filtering of signal from noise at a high rate.
The discriminatory power of GenomicWordFramework was first tested in a small data set of the msh1 mutant system in Arabidopsis thaliana (biological model) and was later applied to a published study of DNA methylation analysis of placental tissues of typically developing and autistic children. Genomic-word-framework analysis is a proven concept compatible with a near limitless number of differentially methylated network-hubs. The differentially methylated network-hubs, previously identified with methylation and network analyses in the autism study, relate to biological processes that include: the nervous system, nervous system development, synapse, neuron projection, central nervous system disease, axon guidance, neurogenesis, ion/cation binding, ion/cation transmembrane transport, and voltage-gated channels.
Results indicate that GWF based heuristics can identify DNA sequences of methylation motifs with high order DNA base interdependence with respect to methylated cytosines and a base distribution that is statistically nonrandom. These findings set the basis for further model prediction in patients, not only for autism but also other diseases like cancer that would benefit from early diagnostics.
In other words, GWF analyses are able to identify sets/clusters of synonymous methylation word-frameworks within genes that undergo targeted methylation changes and participate in gene networks that are involved in biological processes relevant to the system under study. GWF further lays the groundwork for the creation of libraries of DNA methylation motifs intended for patient diagnostics and prognostics. GWF analyses therefore substantially increase the utility and value of identification of DMPs and differentially methylated genes (DMGs) with methylation analysis.
GenomicWordFramework is an extension written in a statistical programming language. More particularly, GenomicWordFramework is an R package that is designed for analysis of methylation signals on stretches of DNA sequences that are characterized by not only methylation information, but also physicochemical information around each methylated cytosine (plus adenine in the case of animals and bacteria). The package's functions transform the methylation data into a format suitable for further DSP analyses, including analyses outside R.
GWFs are identified in a methylome in two possible ways: (1) applying the algorithm described in Sanchez et al., “Information Thermodynamics of Cytosine DNA Methylation”, published Mar. 10, 2016, which is herein incorporated by reference in its entirety; and (2) direct extraction of DNA sequence stretches covering specified numbers of DNA bases upstream and downstream of DMPs. The GWFs obtained with the first approach can be further analyzed with digital signal processing (DSP) tools. GWFs obtained with the second approach can be clustered into groups of aligned motifs and further tested to evaluate departure of each of multiple sequence alignments (MSA) from random Monte Carlo simulated MSAs. Statistically significant motif regions are extended for further applications of DSP analyses.
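The second approach, direct extraction of sequence stretches around DMPs, can be illustrated with a short base R sketch. The function name `extract_windows`, its arguments, and the toy sequence are hypothetical and are not part of the GenomicWordFramework package.

```r
# Extract a window of `up` bases upstream and `down` bases downstream of each
# DMP position, clipped to the bounds of the sequence.
extract_windows <- function(dna, dmp_pos, up = 3, down = 3) {
  start <- pmax(dmp_pos - up, 1)
  end <- pmin(dmp_pos + down, nchar(dna))
  substring(dna, start, end)
}

dna <- "ACGTTGCACGGATCCGTA"          # toy positive-strand sequence
windows <- extract_windows(dna, dmp_pos = c(2, 9), up = 3, down = 3)
windows
```

Windows near a sequence boundary come out shorter than `up + down + 1` bases; a real pipeline would need to decide whether to pad or discard them before alignment.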
All information needed to search for a DNA motif, DNA sequence and modified nucleotide base can be stored in a singular binary string using a triplet representation of each base taking into account the number of hydrogen bonds in the Watson-Crick base pair, the chemical type of the DNA base (pyrimidine and purine), and the base modification status. By convention, the DNA sequence is referred to the positive strand. For example, the following encoding of DNA bases can be used to study the methylation signal:
Each base will comprise a binary string:
Each base will comprise a complex number (a+ib), an element of the cyclic group formed by the 8th roots of unity: $e^{2\pi i k/8}$,
where k=0, . . . , 7.
More than one complex encoding is possible. Encoding based on a group structure (here an Abelian group) can be preferred for DSP analysis. This is a group structure defined on the set of DNA bases extended with methylated cytosine and adenine as members of the alphabet, as described further below. Because analysis of a complex signal within the GWF R package can be difficult, complex encoding(s) and signal(s) can later be exported to other languages like C++, Python, or MatLab.
Specific details for the implementation of this encoding are provided in the documentation of a function that encodes a previously detected binary signal of 0s and 1s from a DNA sequence into a numerical code defined by the user. Given two objects, one carrying the signal and the other one carrying the DNA sequence, such a function can perform the encoding set out by the user. The binary encoding of the methylation signals permits the incorporation of the physicochemical information in the DSP analyses of DNA sequence motifs.
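A complex encoding of this kind can be sketched in base R without the package. The sketch assumes the extended alphabet is ordered as Cm, C, Am, T, C−m, A, A−m, G (the order used for the multiplicative group in this disclosure) and that the k-th symbol maps to the k-th of the 8th roots of unity. The helper names `to_symbols` and `encode_bases` are hypothetical and do not reproduce the package's own encoding function.

```r
# Extended alphabet in the assumed order, mapped to the 8th roots of unity.
alphabet <- c("Cm", "C", "Am", "T", "C-m", "A", "A-m", "G")
roots <- exp(2i * pi * (0:7) / 8)
names(roots) <- alphabet

# Combine positive-strand bases with a 0/1 methylation signal into symbols;
# only positive-strand C and A methylation is handled in this sketch.
to_symbols <- function(bases, methylated) {
  ifelse(methylated == 1 & bases %in% c("C", "A"), paste0(bases, "m"), bases)
}

# Look up the complex code of each symbol.
encode_bases <- function(symbols) unname(roots[symbols])

bases <- c("A", "C", "G", "T", "C")
meth  <- c(0, 1, 0, 0, 0)            # second base is a methylated cytosine
z <- encode_bases(to_symbols(bases, meth))
z
```

Every encoded value has modulus one, so the signal lives on the unit circle and can be passed directly to Fourier or wavelet routines that accept complex input.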
Small regions of usually 7-30 bp spanning DMPs on at least three samples were identified and considered as DNA methylation motif candidates in 67 genes. A distance matrix can be estimated on the set of selected regions using a function from the R package that computes the distance matrix between the aligned sequences from each multiple sequence alignment (MSA). Next, a hierarchical cluster analysis on the set of selected regions (using the previously estimated distance matrix) can be accomplished with a function that utilizes a matrix of a selected information divergence to group the selected regions into 100 clusters. It is to be appreciated that Hellinger divergence is only one of the possible information divergences that can be estimated and applied here. For example, J-divergence is more appropriate for applications intended to extract new knowledge in terms of the information thermodynamics of epigenomic phenomena. J-divergence is the symmetric version of relative entropy.
An unweighted pair group method with arithmetic mean (“UPGMA”) approach was applied as the agglomeration algorithm. In one embodiment, clusters containing fewer than ten (10) regions are discarded. DNA multiple sequence alignment on each cluster of sequences can be accomplished with the MUltiple Sequence Comparison by Log-Expectation (MUSCLE) algorithm implemented in an R package for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.
A further partitioning of the set of motifs can be applied for downstream analysis. There is a wide spectrum of clustering algorithms that can be applied. For example, fast k-medoids clustering can be applied and implemented using algorithms of distance-based k-medoids clustering, such as simple and fast k-medoids, ranked k-medoids, and increasing number of clusters in k-medoids. In one embodiment, said algorithms are those included in the kmed R package.
The cluster results can be plotted in a marked barplot or PCA biplot. The final partition into clusters depends on the clustering algorithm applied and its corresponding parameter settings, including the type of metric applied to compute the distance matrix required by the clustering algorithm. Methylation motifs are objective DNA sequence features, and the applied clustering algorithm is only a supporting tool that leads to motif identification.
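The distance-matrix, UPGMA, and cluster-filtering steps can be sketched in base R. The toy matrix `P` (each row a discrete probability distribution standing in for one region), the three clusters, and the minimum cluster size of two are assumptions chosen to keep the example small; the embodiment above uses 100 clusters and a minimum of ten regions.

```r
# Toy input (assumption): one probability distribution per region.
P <- rbind(
  c(0.70, 0.10, 0.10, 0.10),
  c(0.65, 0.15, 0.10, 0.10),
  c(0.60, 0.20, 0.10, 0.10),
  c(0.10, 0.10, 0.10, 0.70),
  c(0.10, 0.10, 0.20, 0.60),
  c(0.25, 0.25, 0.25, 0.25)
)
rownames(P) <- paste0("region", 1:6)

# Hellinger distance: Euclidean distance between square-root vectors / sqrt(2).
hellinger <- dist(sqrt(P)) / sqrt(2)

# UPGMA corresponds to average-linkage agglomeration in hclust().
hc <- hclust(hellinger, method = "average")
cl <- cutree(hc, k = 3)

# Discard clusters with fewer than `min_size` regions.
min_size <- 2
sizes <- table(cl)
big <- as.integer(names(sizes)[sizes >= min_size])
kept_regions <- names(cl)[cl %in% big]
kept_regions
```

Swapping in another information divergence only changes how `hellinger` is computed; the clustering and filtering steps stay the same as long as the result is a `dist` object.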
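The motif score $s_{jk}$ of the aligned sequences j and k can be defined in an intuitive way: as the base-2 logarithm of the number of matched bases found in the alignment. Formally:
$$s_{jk} = \log_2\left(\sum_{i=1}^{N} \delta_i^{jk}\right) \qquad (1)$$
where
$$\delta_i^{jk} = \begin{cases} 1 & \text{if the bases at position } i \text{ match} \\ 0 & \text{otherwise} \end{cases}$$
for every base position i on sequences j and k. Then, the maximum motif score is: $\max\{s_{jk}\} = \log_2 N$. Next, the motif score in an MSA is defined as:
$$S = \frac{2}{M(M-1)} \sum_{j<k} s_{jk} \qquad (2)$$
For an MSA with M sequences of length N each, the number of pairwise comparisons is:
$$\frac{M(M-1)}{2}$$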
As a result, for a fixed value of the motif size, the perfect MSA of DNA sequence motifs will have the maximum score: Max{S}=log2N. In other words, in this modeling, the maximum amount of information carried by an MSA is: Imax=log2N, and the amount of information (the uncertainty change) carried by a MSA is given by the expression:
The same result is obtained if the letter frequencies in the MSA and Shannon entropies, before (perfect alignment) and after, are estimated instead of the matches. Then, alignment information is computed as:
A Monte Carlo test (MCT) of how a given DNA multiple sequence alignment differs from randomly generated MSAs was implemented in an R function included in the GenomicWordFramework R package. To accomplish the MCT, it is assumed:
The parameter vector α=(α1, . . . , αk) for a specific Dirichlet distribution is estimated from the whole set of identified DNA motif candidates. Given a matrix of DNA methylation motifs with N columns corresponding to the whole set of identified motif candidates, the frequency of each DNA base in each column is tallied, resulting in a four-dimensional vector of counts for each column in the data set, where each vector coordinate carries the absolute frequency of one of the four DNA bases in a given alignment column. The resulting N×4 matrix of counts is the raw count data used in the parameter estimation of the Dirichlet distribution, applying a function from the R package that estimates a family of continuous multivariate probability distributions: a multivariate generalization of the Beta distribution. Next, random DNA MSA sequences are generated according to the estimated Dirichlet distribution with a probability density function (PDF) or cumulative distribution function (CDF) from the R package. That is, random DNA MSAs were generated by sampling from the estimated Dirichlet distribution.
Monte Carlo p-Value
For MSAs of fixed length N, the log2 N is a constant, so for the purposes of MCT it is sufficient to consider the score statistic given in Eq. 2 and to evaluate how much an observed aligned motif differs statistically from Monte Carlo simulated aligned sequences. The Monte Carlo p-value is estimated as:
Where S0 stands for the alignment score of the MSA to be tested, Si is the alignment score for the ith Monte Carlo simulated MSA, and
It is important to notice that the raw observed frequencies from a small matrix of motifs are often poor approximations to the distribution of DNA bases among all motifs that the model is supposed to represent. However, for a typical analysis of 50 or more genes, the matrix of motifs needed for the Dirichlet distribution model estimation would, in general, carry thousands of motifs.
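The scoring and Monte Carlo steps described above can be sketched in base R under several stated assumptions: the MSA score is taken as the mean of the pairwise scores over all pairs of sequences, a pair with no matching position scores zero, the Dirichlet parameter vector `alpha` is supplied directly rather than estimated, and the p-value uses the common estimator (1 + number of simulated scores at least as large as the observed score) / (number of simulations + 1). All function names are hypothetical and do not reproduce the package's implementation.

```r
# Pairwise score: log2 of the number of matched positions (0 if none match).
pair_score <- function(a, b) {
  matches <- sum(a == b)
  if (matches == 0) 0 else log2(matches)
}

# MSA score: mean pairwise score; `msa` is an M x N character matrix.
msa_score <- function(msa) {
  pairs <- combn(nrow(msa), 2)
  mean(apply(pairs, 2, function(p) pair_score(msa[p[1], ], msa[p[2], ])))
}

# One draw from Dirichlet(alpha) via normalized gamma variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# Random MSA: per column, draw base probabilities, then sample M bases.
random_msa <- function(M, N, alpha) {
  sapply(seq_len(N), function(j) {
    sample(c("A", "C", "G", "T"), M, replace = TRUE, prob = rdirichlet1(alpha))
  })
}

# Monte Carlo p-value for an observed alignment.
mc_pvalue <- function(msa, alpha, n_sim = 999) {
  s0 <- msa_score(msa)
  s <- replicate(n_sim, msa_score(random_msa(nrow(msa), ncol(msa), alpha)))
  (1 + sum(s >= s0)) / (n_sim + 1)
}

set.seed(123)
# Toy alignment: 5 identical sequences of length 8 (a perfect MSA).
perfect <- matrix(rep(c("A", "C", "G", "T", "A", "C", "G", "T"), each = 5), nrow = 5)
s_obs <- msa_score(perfect)
p_val <- mc_pvalue(perfect, alpha = c(1, 1, 1, 1), n_sim = 199)
```

For the perfect toy alignment the observed score equals log2 of the alignment length, and simulated alignments almost never reach it, so the estimated p-value sits at or near its minimum of 1/(n_sim + 1).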
The binary-encoded methylation signal is raw data for digital signal processing (DSP) tools. There is a huge number of possible applications of DSP tools. The GenomicWordFramework R package can include some of these applications, such as the wavelet spectrogram via wavelet transform, as well as the traditional Fourier power spectrum and spectrogram.
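As a small illustration of the Fourier power spectrum applied to a binary signal, the base R sketch below uses a synthetic 0/1 signal with a period of four positions. The signal and variable names are assumptions for this example and are not package data or package functions.

```r
# Synthetic binary signal (assumption): pattern 1,1,0,0 repeated 8 times.
signal <- rep(c(1, 1, 0, 0), times = 8)
n <- length(signal)

# Power spectrum of the mean-centered signal via the fast Fourier transform.
spec <- Mod(fft(signal - mean(signal)))^2 / n

# Frequencies in cycles per position; keep the non-redundant half (k = 1..n/2).
freq <- (seq_len(n) - 1) / n
half <- 2:(n / 2 + 1)

# Dominant frequency and the corresponding period in positions.
peak_freq <- freq[half][which.max(spec[half])]
peak_period <- 1 / peak_freq
```

The dominant frequency of 0.25 cycles per position recovers the four-position periodicity built into the toy signal.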
The Multiplicative Group of DNA Extended Alphabet with Methylated Bases
Let $\mathcal{B}$ = {Cm, C, Am, T, C−m, A, A−m, G} be the ordered set of DNA bases plus the methylated adenine (Am) and cytosine (Cm) in the positive strand, and A−m and C−m in the negative strand. Let us define on $\mathcal{B}$ a multiplicative group ($\mathcal{B}$, ×) with multiplication operation ‘×’, where Cm is the multiplication unit and the unmethylated complementary DNA bases are algebraically complementary as well, i.e.: C×G=Cm and A×T=Cm. These algebraic-biophysical constraints hold if base C is the group generator, i.e., the generator condition C^8=Cm and the condition C×G=Cm imply C^7=G. Since C^3×C^5=C^8, setting C^3=T implies C^5=A. Preserving the order of bases in the set, we have C×C=C^2=Am, C^4=C−m, C^6=A−m.
The group ($\mathcal{B}$, ×) defined above is an Abelian cyclic group isomorphic to the cyclic group formed by the 8th roots of unity: $e^{2\pi i k/8}$,
where k=0, . . . , 7. Although we can accomplish the symbolic algebraic operations on ($\mathcal{B}$, ×), for the sake of concrete applications in computational biology and in bioinformatics, it is convenient to operate with the cyclic group defined on the set
The elements of this group, written in the order set by the bijective mapping
where i is the imaginary unit defined in the set of complex numbers.
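The group operation and its correspondence with the 8th roots of unity can be checked with a short base R sketch. It takes Cm as the identity element, which is what the generator condition C^8=Cm implies, and maps the k-th power of C to the k-th root of unity; this particular choice of isomorphism, and the function names `base_mult` and `to_root`, are assumptions for illustration.

```r
# Ordered alphabet: position k (0-based) corresponds to the power C^k.
alphabet <- c("Cm", "C", "Am", "T", "C-m", "A", "A-m", "G")

# Multiplication in the cyclic group: add exponents modulo 8.
base_mult <- function(x, y) {
  k <- (match(x, alphabet) - 1 + match(y, alphabet) - 1) %% 8
  alphabet[k + 1]
}

# Assumed isomorphism onto the 8th roots of unity.
to_root <- function(x) exp(2i * pi * (match(x, alphabet) - 1) / 8)

cg <- base_mult("C", "G")    # complementary bases multiply to the identity
at <- base_mult("A", "T")
cc <- base_mult("C", "C")

# Homomorphism check over all ordered pairs of symbols.
grid <- expand.grid(x = alphabet, y = alphabet, stringsAsFactors = FALSE)
max_err <- max(Mod(to_root(base_mult(grid$x, grid$y)) -
                   to_root(grid$x) * to_root(grid$y)))
```

Because multiplication reduces to addition of exponents modulo 8, symbolic operations on the extended alphabet and arithmetic on the complex codes give the same results up to floating-point error.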
Complex Encoding of Methylation Signal with GWF R Package
library(GenomicWordFramework)
We will use the methylation signal from three gene regions of Arabidopsis thaliana:
data(at_signal, at_gene_seq, package = "GenomicWordFramework")
Each base will comprise a complex number (a+ib), an element of the cyclic group formed by the 8th roots of unity: $e^{2\pi i k/8}$,
where k=0, . . . , 7.
In the GWF R package, the encoding of the methylation signal is accomplished with the function signalEncoding. In the current case, the generator of the group used is:
to easily set the encoding:
The signal ready for exporting can be retrieved with function getEncoding:
The whole matrix of encoded signal can be retrieved:
The matrix carrying the signal can be exported to Python or MatLab.
The whole exportable numerical matrix (showing two rows and only the first 20 columns):
The following non-limiting numbered embodiments also form part of the present disclosure:
For concrete application on raw genomic data, the concepts, algorithms, and formulas must be implemented in some computation language. While it is to be appreciated a wide variety of computation languages could be employed, this example chooses the R statistical language and relies on the R package named GenomicWordFramework. The steps and results obtained in this example come from the application of the heuristic to a concrete (and small) experimental data set. Results for a larger data set (in humans) are presented in a later example.
The goal of the package is to derive objects that can be useful for further applications of DSP and ML tools available in other R packages. However, the application of some basic DSP tools is provided as well.
Signal Analysis of an Arabidopsis thaliana Experimental Dataset
An example with empirical methylation signal data is illustrated using a dataset included with the package. The experimental dataset carries the methylation levels from Arabidopsis Columbia-0 ecotype (Col-0) and the msh1 mutant (dwarf phenotype). The methylation data, derived from a previous application of methylation analyses, are included as a dataset with the package. The DMP data set can be loaded from the package:
To Retrieve a Genomic Signal from its DNA Sequence
Here, the coordinates of Arabidopsis annotated genes are taken from the GTF file downloaded from ftp://ftp.ensemblgenomes.org/pub/plants/release-49/gtf/arabidopsis_thaliana/.
The coordinates of genes are used to get the corresponding DNA sequences carrying the signal. For the sake of brevity, the DNA sequences for this example are provided with the package.
Next, the binary signal on the gene regions is retrieved with function getSignalAtRegions. The methylation signal is provided for the three genes in five different Col-0 samples and five msh1 mutants (dwarf) samples. Only DMPs are included.
In different samples, methylation can be located at different positions, so complete.region=TRUE is set to retrieve information from the entire regions and zero_signal=TRUE is set to permit positions with no signal.
This dataset is included with the R package:
data(at_signal)
Additional results supporting our heuristic for identification of methylation motifs in the Arabidopsis thaliana methylome are provided below:
msh1 epi-lines comprise a distinct nongenetic state based on phenotype. Manipulation of the msh1 mutant leads to four distinct phenotypes, with states 1 and 2 characterized by slowed growth, delayed flowering and persistent stress response, and states 3 and 4 producing enhanced growth vigor and greater seed set over wild type (WT) (
Four F3 populations of cross-derived epi-lines were followed. Epi 8 and Epi 24 were sibling lines from one WT×msh1 cross event, and Epi 10 and Epi 19 were sibling lines from a second WT×msh1 cross. All four F3 epi-lines showed uniform phenotypes within each population, but significant variation between the four populations (
Enhanced reproductive growth occurred in the three epi-lines, Epi 8, 10, and 19, while Epi 8 showed early flowering (
The epi-line phenotype receded back to wild type by the fifth or sixth (S5, S6) generation. Over these sequential generations, sporadic incidence of reversion to a condition resembling memory (state 2) phenotype (
Plant features of the four msh1-derived states in Arabidopsis were recapitulated in tomato (
The msh1 states 1 to 4 comprise discrete epigenetic phases by whole-genome methylome analysis. Significant changes in DNA methylation were detected in the four Arabidopsis epi-lines (F3), with gene-associated changes predominantly in CG context (
To estimate the relationship of genic differential methylation with changes in plant phenotype, a high-resolution methylome analysis was used. The procedure incorporates signal detection and machine learning for discrimination of high-probability, treatment-associated methylation changes within gene regions. Hierarchical clustering of methylome data from individual plants from all four states in Arabidopsis used methylation level changes (computed as Hellinger divergence) at differentially methylated positions (DMPs) in gene regions. The result showed clustering of individual plants (biological replicates) from the same population, with plants from different states separated to 6 branches distinct from WT control groups (
Similar results emerged in hierarchical cluster analyses of tomato genic methylome data. Tomato msh1 mutant (state 1), HEG (state 3) and epi-line datasets (state 4) formed three branches, consistent with distinctive gene methylation effects (
Methylation data support epi-line revertants as more closely related to msh1 memory state 2. Epi-line (state 4) revertant samples in Arabidopsis were included in methylome analyses to test for evidence of methylation repatterning in the revertant versus non-revertant full-sib samples. Principal component with linear discriminant analysis (PCA-LDA) of these datasets using Hellinger divergence produced distinct clustering of non-revertant from revertant individuals deriving from the same progeny population. The epigenomic features distinguishing revertant and non-revertant full-sib progeny presumably arose spontaneously. Revertant methylome datasets showed a closer relationship with data from msh1 memory state 2 (
Epi-line methylome datasets are altered in growth-related gene networks. Differentially methylated genes (DMGs) for each epi-line population were identified by applying generalized linear regression analysis (GLM) to test significance of the difference between group DMP counts (WT vs. epi-lines) in genes. This analysis identified 3204, 2860, 3208 and 2797 DMGs in Epi 8, Epi 10, Epi 19 and Epi 24, respectively. To investigate coincidence of DMG functional relationships in each population, a gene network-based enrichment analysis was conducted. DMGs from the different epi-lines shared enrichment for specific functional networks. For example, the full sibs Epi 8 and Epi 24 shared common enriched networks for response to auxin, response to red or far-red light, response to nutrient levels, photoperiodism, detection of abiotic stimulus, response to stress, and catabolic process (
Methylome data previously reported for the Arabidopsis F1 heterotic cross of ecotypes C24 and Ler was analyzed using the methylome analysis methods that were applied to the msh1 datasets. The enriched networks emerging from F1 hybrid (C24×Ler, and Ler×C24) data showed remarkable conformity to what was identified in msh1-derived epi-line data, emphasizing pathways for response to red or far-red light, response to auxin, regulation of growth, cellular response to auxin stimulus and auxin-activated signaling pathways (
Examination of gene expression changes in Arabidopsis epi-line populations involved sampling three different tissues for RNAseq analysis. Epi 8 and Epi 10 were contrasted with wild type for differential gene expression in leaf tissues, by similar sampling to that done for methylome studies, and Epi 8 and Epi 24 were additionally compared by analysis of floral stem and root tissues. From leaf tissues, 1884 differentially expressed genes (DEGs) were identified in Epi 8 and 992 in Epi 10 relative to wild type. Epi 8 and Epi 24 analysis revealed 1991 DEGs from floral stem in Epi 8 and 1650 for Epi24 and, from root, 1133 DEGs in Epi 8 and 1111 in Epi 24 relative to wild type.
Network enrichment analysis of derived DEG datasets showed shared pathways altered in response to msh1 effects in leaf, floral stem and root tissues. The most enriched of these involved abiotic and biotic stress responses, with circadian rhythm- and phytohormone response-related networks also significantly enriched in the three tissues. Regulation of transcription was prominent specifically in floral stem, where MSH1 accumulates (
Differential methylation and expression analysis identified central gene hubs for the epi-line state. Investigating the interaction of DMG and DEG datasets in epi-lines involved inputting DMG and DEG data to Cytoscape to construct protein-protein interaction (PPI) maps, followed by K-means cluster analysis to identify putative core networks carrying central gene hubs. The K-means clustering algorithm uses betweenness centrality, closeness centrality, average shortest path length, clustering coefficient, degree, and eccentricity as parameters, allowing the identification of clusters that contain the most centralized nodes (proteins) in the PPI network.
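The clustering of network nodes by centrality can be illustrated with a minimal Python sketch. It is not the Cytoscape workflow described above: the toy graph, the function names, the use of only two features (normalized degree and closeness centrality), and the deterministic seeding of the two centroids are all assumptions made for illustration.

```python
from collections import deque

def closeness(adj, source):
    """Closeness centrality of `source`, computed by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def kmeans(points, centroids, iterations=50):
    """Plain K-means; returns the centroid index assigned to each point."""
    labels = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        labels = []
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            idx = d.index(min(d))
            labels.append(idx)
            groups[idx].append(p)
        # Recompute each centroid as the mean of its members.
        updated = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return labels

# Hypothetical toy PPI network: node "A" is a hub, "H" is peripheral.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"),
         ("B", "C"), ("F", "G"), ("G", "H")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

nodes = sorted(adj)
max_degree = max(len(adj[n]) for n in nodes)
features = {n: (len(adj[n]) / max_degree, closeness(adj, n)) for n in nodes}

# Seed one centroid at the highest-degree node and one at the lowest.
seed_hub = max(nodes, key=lambda n: features[n][0])
seed_periphery = min(nodes, key=lambda n: features[n][0])
labels = kmeans([features[n] for n in nodes],
                [features[seed_hub], features[seed_periphery]])
hub_cluster = {n for n, lab in zip(nodes, labels) if lab == 0}
print(hub_cluster)
```

In a fuller treatment the remaining centrality measures named above (betweenness, eccentricity, clustering coefficient, average shortest path length) would be added as further feature columns.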
In Arabidopsis Epi 8, a total of 3647 unique loci from DMGs and DEGs were used in the analysis to yield a PPI network formed by 430 genes. Functional enrichment analysis of these putative hub genes with the STRING database functional enrichment tool revealed a PPI network of 153 hub genes and associated functional networks (
Analysis of soybean epi-line data involved assignment of the Arabidopsis ortholog to each identified soybean gene with BLASTP and obtaining a soybean epi-line PPI network of 109 core hub genes and their functional networks (
To assess the relationship of PPI network output for epi-lines to graft-derived HEG, the core hub PPI network was constructed for the tomato HEG (state 3) phenotype. The network contained ribosome biogenesis, developmental process, and chromatin organization (
Epi-state comparisons in Arabidopsis reveal conserved msh1 epigenome targets within biologically meaningful gene networks. To investigate the relationship of genic methylation repatterning among the four distinct msh1-derived states, DMG overlap was assessed.
Components of the RdDM pathway were shown to be necessary for induction of msh1 state 1, transition from state 1 to state 2, and generation of state 3 following grafting. Methylome datasets that contrasted msh1 versus dcl2/dcl3/dcl4/msh1 quadruple mutant (state 1) or graft progeny from Col-0/Col-0msh1 versus Col-0/Col-0dcl2/dcl3/dcl4/msh1 grafts (state 3) served to catalog DMGs as RdDM (sRNA)-dependent. These subtractive datasets confirmed that 674 (77%) of the 871 core DMG loci were predicted to be DCL2,DCL3,DCL4-dependent by obtaining the overlap between the 871 DMG core dataset and the msh1 vs dcl2/dcl3/dcl4/msh1 dataset.
Whereas 871 core DMGs were shared among the four msh1 states, the methylation changes discovered within these 871 loci also served to discriminate the four states.
Transposable elements and sRNAs associate with candidate RdDM target loci among the 871 msh1 DMGs. The RdDM pathway is known to actively target transposable element (TE) sequences, prompting investigation of TE association with msh1-responsive loci. Looking at the 871 DMGs common between msh1 states, association of these loci was detected with TEs and sRNA (20-24 nt) clusters that was higher than genome-wide levels (61% DMGs within 2 kb of TE vs. 47% genome-wide; 77% DMGs within 2 kb of an sRNA cluster vs. 46% genome-wide). This enrichment for TE and sRNA cluster proximity increases further when the dcl2,dcl3,dcl4-sensitive DMGs (65% DMGs within 2 kb of TE; 81% DMGs within 2 kb of an sRNA cluster) were subset. Yet, when focused on the 67 hub DMGs derived by k-means clustering, only sRNA cluster enrichment was seen (49% of DMGs within 2 kb of TE; 72% DMGs within 2 kb of an sRNA cluster).
Some possible evidence was found of association between TE family and DMG proximity. Comparing the number of members in TE families between observed and expected revealed significant overrepresentation in L1 and Gypsy families and underrepresentation in the Helitron family, both in the 871 DMGs common between msh1 states and 674 DMGs sensitive to dcl2,dcl3,dcl4. Further investigation is needed to reveal any biological significance of these associations. Based on the various analyses, four criteria served to classify RdDM target loci in our study (
Detailed methylation analysis of selected RdDM target genes in the four different nongenetic states reveals sequence motifs encompassing dcl2/3/4-sensitive DMPs. Annotations of the 67 candidate hub loci supported their relevance to phenotype effects observed in the four msh1 states. In addition to gene networks for altered gene expression and chromatin behavior, major overlapping networks appeared to reflect the observed transition between stress response and growth (
Differential methylation analysis within the seven loci revealed evidence of state-specific repatterning (Tables 2 and 3). Changes in methylation at each locus were associated with identifiable sequence motifs that spanned approximately 14 nucleotides. Two sample composite motifs are shown in
Identified DMPs did not show an obvious pattern of exon, intron or junction localization, and each gene contained multiple motif sites. Evaluation of cluster motifs that encompass DMPs within the 67 msh1 core hub loci revealed, in many cases, evidence of high-order dependencies. Multiple sequence alignment (MSA) of a given DNA motif can reveal a dependence relationship between two nucleotides located at different positions within the motif, reflected in their frequencies of simultaneous occurrence. First-order dependence refers to adjacent nucleotides, typically found in CG methylation context, second-order to nucleotides spaced two nucleotides apart, and high-order to nucleotides with intervening distance of more than two nucleotides. The relationships derive from the study of Markov dependence in DNA sequences, the basis for application of hidden Markov modeling of motif findings.
For the motifs identified, individual consensus nucleotides were evident at variable distance from the target cytosine, which is nucleotide 7 (on the plus or minus strand) within each motif. For example, the motif from cluster 65 showed invariant T at position 14 and a consensus A at position 12, while the motif from cluster 66 showed invariant G at position 14 and an AG pair consensus at positions 2 and 3, respectively (
Small regions (14 bp) encompassing DMPs in at least three samples were identified and considered as DNA methylation motif candidates in the 67 identified msh1 core hub genes. A distance matrix was estimated on the set of selected regions using function dist.dna from the ape R package (version 5.5). Hierarchical cluster analysis on the set of selected regions (using the previously estimated distance matrix) was accomplished with function hclust from the stats R package (version 4.1.1), and the regions were grouped into 100 clusters. The UPGMA approach was applied as the agglomeration algorithm. Clusters with fewer than 10 regions were discarded. A DNA multiple sequence alignment on each cluster of sequences was accomplished with the MUSCLE algorithm implemented in the Bioconductor R package muscle (version 3.14). The motifs presented in
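The distance-matrix and UPGMA clustering steps described above can be sketched in a few lines of Python. This is an illustration only and not the ape/stats implementation: it assumes the raw proportion of mismatched sites (p-distance) as the distance measure, which may differ from the substitution model used by dist.dna, and the function names and example sequences are hypothetical.

```python
def p_distance(a, b):
    """Proportion of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma_clusters(seqs, n_clusters):
    """Average-linkage (UPGMA) agglomeration, stopped at n_clusters groups."""
    clusters = [[i] for i in range(len(seqs))]

    def average_distance(c1, c2):
        # UPGMA: mean of all pairwise distances between members of c1 and c2.
        total = sum(p_distance(seqs[i], seqs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, a, b = min((average_distance(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical 14-bp regions forming two obvious groups.
regions = [
    "ACGTACCGTTAGCA", "ACGTACCGTTAGCT", "ACGTACCGATAGCA",
    "TTGACGCAAGGTCC", "TTGACGCAAGGTCG", "TTGACGCTAGGTCC",
]
groups = upgma_clusters(regions, n_clusters=2)
print(groups)
```

Stopping the merges once the requested number of groups remains plays the role of cutting the hierarchical tree into a specified number of clusters; a size filter (for example, discarding groups with fewer than 10 members) would then be applied to the result.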
A rapid way to discover DNA sequence features associated with methylation is to start from the location of methylation sites across the samples. This approach relies on evidence that methylation takes place at specific, nonrandom sites (a sort of DNA sequence motif).
These data are included in the package:
Function signal peaks provides a potential DNA sequence motif with coordinates centered on the methylation peak, covering 3 bp upstream and 4 bp downstream of the signal. The methylation signal was required to be present in at least three samples (cutpoint=3L).
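As a minimal sketch of this selection step, assuming each sample is represented simply as a list of methylated positions, candidate windows can be derived as follows. The function name, argument names, and data are hypothetical and do not reproduce the package function.

```python
from collections import Counter

def signal_peak_windows(sample_sites, cutpoint=3, upstream=3, downstream=4):
    """Return (start, end) windows around positions methylated in at least
    `cutpoint` samples, extending `upstream` bp before and `downstream` bp
    after each position."""
    # Count each position once per sample.
    counts = Counter(pos for sites in sample_sites for pos in set(sites))
    return [(pos - upstream, pos + downstream)
            for pos in sorted(counts) if counts[pos] >= cutpoint]

# Hypothetical methylated positions observed in four samples.
samples = [[10, 25, 40], [10, 40, 55], [10, 25], [40, 70]]
windows = signal_peak_windows(samples)
print(windows)
```

Positions 10 and 40 are each seen in three samples and are kept; position 25 is seen in only two samples and is dropped.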
DNA multiple sequence alignment MUSCLE and hierarchical clustering are applied to identify clusters of DNA sequence motifs (
The cluster motifs that encompass DMPs on gene AT1G50030 (TOR) revealed evidence of high-order dependencies. That is to say, in a multiple sequence alignment (MSA) of a given DNA motif, a base Y at a given site k depends on base X at the preceding site j if high frequencies of bases X and Y are simultaneously observed in the MSA. In particular, if k−j=1, the dependence is first order, as typically found in CG methylation context; if k−j=2, the dependence is second order; and when k−j>2, the dependence is high order.
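The dependence classification above can be made concrete with a short Python sketch that reports the most frequent base pair at two alignment columns and labels the order of the dependence by k−j. The function names and the toy alignment are hypothetical, and simple co-occurrence frequency is used here as a stand-in for a formal Markov dependence test.

```python
from collections import Counter

def dependence_order(j, k):
    """Label the dependence between sites j < k by their separation."""
    gap = k - j
    if gap < 1:
        raise ValueError("site k must follow site j")
    if gap == 1:
        return "first"
    if gap == 2:
        return "second"
    return "high"

def top_pair(msa, j, k):
    """Most frequent (X, Y) base pair at 1-based columns j and k, with its
    frequency of simultaneous occurrence across the aligned sequences."""
    pairs = Counter((seq[j - 1], seq[k - 1]) for seq in msa)
    (x, y), n = pairs.most_common(1)[0]
    return x, y, n / len(msa)

# Hypothetical toy alignment of four 6-bp sequences.
msa = ["ACGTTA", "ACGATA", "TCGCTA", "GCGGTT"]
print(top_pair(msa, 2, 3), dependence_order(2, 3))
print(top_pair(msa, 2, 6), dependence_order(2, 6))
```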
For these motifs, individual consensus nucleotides were evident at variable distance from the target cytosine, which is nucleotide 7 (on the ‘+’ or the ‘−’ strand) within each motif. For example, the motif from cluster 1 (
To obtain the DNA sequence motifs by gene:
DSP can be applied to the previously obtained signals. Next, power spectra from signals from chromosome 2 and 3 are computed.
A function can be applied that encodes a previously detected binary signal of 0s and 1s from a DNA sequence into a numerical code defined by the user. Possible encodings include binary numbers, real numbers, and complex numbers; the encoding of methylated DNA sequence using complex numbers is also supported with GWFs. The basic idea is to encode the physicochemical properties of DNA bases. In this scenario, by applying different DSP tools, the encoded signal can be searched for periodicities and correlations that target the superposition of methylation and physicochemical signals. Currently, support in R for DSP analysis of complex signals is limited. However, the methylation signal can be encoded with GWF and then exported to, e.g., MATLAB or Python, and the DSP analysis accomplished there.
Given two objects, one carrying the signal and the other carrying the DNA sequence, the function performs the encoding set out by the user. The function can be used to re-code the previously detected binary signal of 0s and 1s from a DNA sequence into a numerical code defined by the user. In particular, it is feasible to incorporate information on the physicochemical properties of neighboring DNA bases: the number of hydrogen bonds and the base chemical type. This can be the default used by this function:
So, each base gives rise to a binary string:
Next, the signal is partitioned into intervals of non-overlapping windows of 90 bit (30 bp):
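As a minimal sketch of the encoding and windowing idea, assume a hypothetical 3-bit code per base: one bit for the number of hydrogen bonds (1 for G or C, which pair with three bonds; 0 for A or T, which pair with two), one bit for chemical type (1 for purine, 0 for pyrimidine), and one bit for methylation status. This code, the function names, and the example sequence are assumptions for illustration and need not match the package default.

```python
# Hypothetical 2-bit physicochemical code: (hydrogen-bond bit, purine bit).
BASE_BITS = {"A": "01", "C": "10", "G": "11", "T": "00"}

def encode(sequence, methylated_positions):
    """Encode each base as 3 bits: physicochemical code plus a methylation
    flag. `methylated_positions` holds 0-based indices of methylated bases."""
    return "".join(
        BASE_BITS[base] + ("1" if i in methylated_positions else "0")
        for i, base in enumerate(sequence)
    )

def partition(bits, width=90):
    """Split a bit string into non-overlapping windows of `width` bits,
    dropping any incomplete trailing window."""
    return [bits[i:i + width] for i in range(0, len(bits) - width + 1, width)]

signal = encode("ACGT", methylated_positions={1})
print(signal)                      # 3 bits per base
print(partition(signal, width=6))  # two windows of 2 bases each
```

With 3 bits per base, a 90-bit window corresponds to 30 bp, which is the window size stated above.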
There are 592 potential DNA sequence motifs:
The overlaps can be searched between the motifs and the 30-bp wide sliding windows.
That is, 20 genomic word frameworks from 592 are fully within 30-bp regions on gene AT1G50030. However, only 3 significant motifs are covered by this region in the dwarf sample ‘dw2’.
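The containment test implied by "fully within" can be sketched as a simple interval comparison. The coordinates below are hypothetical, intervals are treated as closed (start and end both included), and the function name is an assumption.

```python
def fully_within(motifs, windows):
    """Return (motif, window) pairs where the motif lies entirely inside
    the window. All intervals are closed (start, end) tuples."""
    return [(m, w) for m in motifs for w in windows
            if w[0] <= m[0] and m[1] <= w[1]]

# Hypothetical 14-bp motifs and two consecutive 30-bp windows.
motifs = [(5, 18), (25, 38), (40, 53)]
windows_30bp = [(1, 30), (31, 60)]
hits = fully_within(motifs, windows_30bp)
print(hits)
```

The motif at 25-38 straddles the boundary between the two windows and is therefore not reported.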
The three motifs are embedded at the beginning of the signal region under scrutiny.
The sequences from cluster 12 & 9 (
The power spectrum of the binary signal can be obtained from the SignalMatrix-class objects using function plot power_spectral (
The peak at ⅓ (0.33) indicates the regions under scrutiny are protein coding regions (
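The origin of a spectral peak at frequency ⅓ can be illustrated with a plain discrete Fourier transform in Python. The sketch assumes an idealized binary signal that repeats every three positions, standing in for the codon periodicity of a coding region; it does not reproduce the package's power spectrum function, and the names are hypothetical.

```python
import cmath

def power_spectrum(signal):
    """Power of the mean-centered signal at frequencies k/n, k = 0..n//2,
    computed with a direct (non-FFT) discrete Fourier transform."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

# Idealized 90-bit signal with a period of three positions.
bits = [1, 0, 0] * 30
spectrum = power_spectrum(bits)
peak_index = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_frequency = peak_index / len(bits)
print(peak_frequency)
```

Subtracting the mean removes the zero-frequency component so that the periodic component dominates the spectrum.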
A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. In our case, the axis ‘time’ will be represented by bit or by single DNA base positions. The spectrogram can be obtained from any encoded signal region using function ‘spectrogram’ from the R package phonTools.
To prepare the methylation signal as a numerical signal:
Next, the spectrogram for the same gene from the control (
Methylation signal breaks down the power spectrum energy around the periodicity at about the ⅓ frequency (dashed line in
In this case, the methylation effect lies around bin 6-10, and the energy power is shifted down the ⅓ frequency (
The wavelet coefficients yield information on the correlation between the wavelet (at a certain scale) and the data array (at a particular location). A larger positive amplitude implies a higher positive correlation, while a large negative amplitude implies a high negative correlation.
Wavelet Power Spectrum provides a useful way to determine the distribution of energy within the data array. By looking in the plot for regions within the Wavelet Power Spectrum (WPS) of large power, one can determine which features of the signal are important and which can be ignored. Here, the term “energy” is not arbitrary but is borrowed from applications in human-built communication systems. The level of energy represented in the WPS is proportional to the energy dissipated in the transmission of a binary signal of the same size to a given receiver through a human-built communication grid.
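The relationship between wavelet coefficients and wavelet power can be sketched with the simplest wavelet, the Haar wavelet, at a single scale. The choice of Haar is an assumption made for brevity and is not necessarily the wavelet used in the analysis described here; the function names and data are hypothetical.

```python
def haar_wavelet(scale):
    """Unit-energy Haar wavelet of even length `scale`: a positive half
    followed by a negative half."""
    half = scale // 2
    norm = (2 * half) ** 0.5
    return [1 / norm] * half + [-1 / norm] * half

def wavelet_coefficients(data, scale):
    """Correlation of the wavelet with the data at each location."""
    w = haar_wavelet(scale)
    return [sum(d * v for d, v in zip(data[i:i + len(w)], w))
            for i in range(len(data) - len(w) + 1)]

# Hypothetical signal: a block of ones flanked by zeros.
data = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
coeffs = wavelet_coefficients(data, scale=4)
power = [c * c for c in coeffs]   # wavelet power at this scale
print(coeffs)
print(power)
```

The coefficient is most negative where the signal steps up under the wavelet (location 2) and most positive where it steps down (location 6); both locations carry equal power, while flat stretches carry none.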
Moreover, with the advance of information theory and its application to biomolecular processes, it is well known that to accomplish a single methylation change, every methyltransferase/demethylase must dissipate a minimal energy to process the information associated with the change. This energy is determined by Landauer's principle, according to which a molecular machine must dissipate a minimum energy of ε = kBT ln 2 (about 3×10⁻²¹ joules per bit at room temperature) at each step in the genetic logic operations, including proofreading.
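The quoted magnitude can be checked with a short calculation. The sketch below assumes room temperature to be 298.15 K; the function name is hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in joules per kelvin (SI value)

def landauer_limit(temperature_kelvin):
    """Minimum energy in joules dissipated per bit at the given temperature."""
    return BOLTZMANN * temperature_kelvin * math.log(2)

energy = landauer_limit(298.15)
print(f"{energy:.3e} J per bit")
```

The result is close to 2.85×10⁻²¹ J per bit, consistent with the rounded figure of about 3×10⁻²¹ J.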
Wavelet Power Spectral analysis of the previously estimated at_signal_diff dataset (
Some methylation motifs carry the same methylation status in the same regions from both groups, Col-0 and msh1 Dwarf. This is the case of regions: AT1G50030.3 (
In the case of AT1G50030.1 (
In the same way the correlogram can be computed based on WPS (
In addition to the expected correlation break at region 55-60 bit, the correlation between energy spectrum at regions 30-35 bit and 55-60 bit is lost (
The analysis of the methylation signal was accomplished on 81 genes associated with human neural system development.
The whole set of DMGs derives from 751 DMGs, which were selected according to their contribution to patient classification into two groups: “typical” and “autism”. Concretely, each of the 751 DMGs contributes more than 1% of the total variance to the main principal component from a PCA. These DMGs were analyzed with the STRING Cytoscape App, and the main sub-network of hubs was identified by applying a K-means clustering approach using network centrality indicators as variables. All the DMGs were obtained with methylation analyses.
From the foregoing, it can be seen that the present disclosure accomplishes at least all of the stated objectives.
Unless defined otherwise, all technical and scientific terms used above have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the present disclosure pertain.
The terms “a,” “an,” and “the” include both singular and plural referents.
The term “or” is synonymous with “and/or” and means any one member or combination of members of a particular list.
The terms “invention”, “present invention”, “disclosure”, or “present disclosure” are not intended to refer to any single embodiment of the particular invention but encompass all possible embodiments as described in the specification and the claims.
The term “about” as used herein refers to slight variations in numerical quantities with respect to any quantifiable variable. Inadvertent error can occur, for example, through use of typical measuring techniques or equipment or from differences in the manufacture, source, or purity of components.
The term “substantially” refers to a great or significant extent. “Substantially” can thus refer to a plurality, majority, and/or a supermajority of said quantifiable variable, given proper context.
The term “generally” encompasses both “about” and “substantially.”
The term “configured” describes structure capable of performing a task or adopting a particular configuration. The term “configured” can be used interchangeably with other similar phrases, such as constructed, arranged, adapted, manufactured, and the like.
Terms characterizing sequential order, a position, and/or an orientation are not limiting and are only referenced according to the views presented.
“Phenotype” refers to the set of observable characteristics of an individual resulting from the interaction of its genotype with the environment.
“Epigenetic” relates to or arises from nongenetic influences on gene expression.
An “R package” is an extension to the R statistical programming language. R packages contain code, data, and documentation in a standardized collection format that can be installed by users of R, typically via a centralized software repository such as CRAN.
“Expressivity” is the degree to which a phenotype is expressed by individuals having a particular genotype.
The “scope” of the present invention is defined by the appended claims, along with the full scope of equivalents to which such claims are entitled. The scope of the invention is further qualified as including any possible modification to any of the aspects and/or embodiments disclosed herein which would result in other embodiments, combinations, subcombinations, or the like that would be obvious to those skilled in the art.
This application claims priority under 35 U.S.C. § 119 to provisional patent application U.S. Ser. No. 63/323,690, filed Mar. 25, 2022. The provisional patent application is herein incorporated by reference in its entirety, including without limitation, the specification, claims, and abstract, as well as any figures, tables, appendices, or drawings thereof.
This invention was made with government support under Grant No. GM134056 awarded by the National Institutes of Health. The Government has certain rights in the invention.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2023/064913 | 3/24/2023 | WO |
| Number | Date | Country | |
|---|---|---|---|
| 63323690 | Mar 2022 | US |