The instant application contains a Sequence Listing which has been submitted in XML format and is hereby incorporated by reference in its entirety. Said XML copy, created on Aug. 8, 2022, is named “058636.00536.xml”, and is 3,084 bytes in size.
Being able to exploit genomic data to predict organismal outcomes in response to changes in nutrition, toxin and pathogen exposure could inform crop improvement, disease prognosis, epidemiology, and public health. To this end, machine learning methods have been developed and applied to infer phenotypes from genomic and epigenetic features associated with such conditions using changes in mRNA/protein expression levels, single nucleotide polymorphisms, chromatin modifications, and more. Despite the compelling motivation and cumulative efforts, accurately predicting complex phenotypic traits from genome-scale information remains both a promise and a challenge. Several factors contribute to these challenges. First, in contrast to the increasing availability of omics data, collection of high-quality phenotypic data from a genetically diverse population that adequately represents the phenotypic diversity space has become a major limiting factor1. In addition, phenotypic data is often collected from experiments that are distinct from those used to acquire the functional genomics data. To overcome these limitations, phenotyping efforts should be expanded and performed on the same materials that are the source of genetic/genomic information2. Furthermore, the explosion of omics data means that the features (e.g. numbers of genes) collected from a single experiment inevitably outnumber the phenotype space (e.g. sample size), leading to problems in data sparsity, multicollinearity, multiple testing, and overfitting3. This can be counteracted by increasing sample size, or by dimension reduction or feature selection methods such as Principal Component Analysis (PCA), Least Absolute Shrinkage and Selection Operator (LASSO) regularization, Canonical Correlation Analysis (CCA), and so forth4. Additionally, cross-species approaches have been adopted in a machine learning context to improve the performance of model-to-human knowledge translation5.
Thus, there is an ongoing and unmet need to provide improved methods for analyzing genomic data to predict organismal outcomes in response to environmental changes, and use the results from the analysis to identify and modify genes to improve plant function. The present disclosure is pertinent to these needs.
The present disclosure addresses a number of previous challenges in identifying and modifying genes to improve plant function by using an evolutionarily informed machine learning approach that exploits genetic diversity both within and across species. We employ transcriptome data of nitrogen response genes to predict nitrogen use efficiency (NUE), an agronomic outcome critical for worldwide food safety and sustainability2,6. Nitrogen (N)—the main limiting macronutrient for plant growth—is supplemented in agricultural systems through application of N fertilizer. For major row crops such as maize (Zea mays), less than 40% of supplied N is taken up by the plants, while more than 60% of soil N is lost to the atmosphere or water bodies through multiple processes such as denitrification, ammonia volatilization, leaching etc7. Balancing the need to further increase crop yields, while also mitigating the environmental impacts associated with N fertilizer, is a challenge for sustainable agriculture. Considering the polygenic nature of NUE that involves the integration of developmental, physiological, and metabolic processes2, machine learning was applied as a strategy to tackle the mechanisms underlying this complex trait. To this end, we collected transcriptomic and phenotypic NUE data from two species—maize (a crop) and Arabidopsis (a model)—each of which included a panel of genotypes with diverse genetic background and NUE variation. We used genes whose response to N-treatments (N-DEGs) was conserved within and across species as a dimension reduction approach for machine learning. As maize and Arabidopsis are highly divergent phylogenetically, these evolutionarily conserved N-response genes should represent essential/core functions contributing to NUE. 
We show that models constructed using these evolutionarily conserved N-DEGs significantly improved the prediction of NUE traits from gene expression values, compared to an equal number of top ranked N-DEGs or randomly selected expressed genes. The inclusion of the model species Arabidopsis enabled us to validate the predictions functionally using mutants. This evidence showed that the genes whose expression levels are important in predicting NUE in the machine learning models are more than just markers, and are functionally required for the trait. Moreover, we show that the described evolutionarily informed machine learning pipeline is transferable to other species and traits in plants and animals. Specifically, application of the described method to other matched transcriptome and phenotype datasets related to drought in field grown rice or disease in mouse models resulted in enhanced prediction accuracies of the learned models. As such, the described evolutionarily informed machine learning pipeline has the potential to identify genes of importance for complex phenotypes of interest across biology, agriculture, or medicine.
A result of the described analysis identified maize genes that can be modulated to improve plant function. In particular, the present disclosure shows that expression of certain identified genes can positively affect nitrogen utilization and increase plant biomass, including but not necessarily limited to maize grain mass. As such, the disclosure provides for inhibiting the expression and/or function of one or a combination of transcription factors (TFs) described herein. In embodiments, the expression and/or function of hb75, alone or in combination with another described TF, such as nf-ya3, is provided for use in improving plant function.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Every numerical range given throughout this specification includes its upper and lower values, as well as every narrower numerical range that falls within it, as if such narrower numerical ranges were all expressly written herein.
As used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by the use of the antecedent “about” it will be understood that the particular value forms another embodiment. The term “about” in relation to a numerical value is optional and means, for example, +/−10%.
This disclosure includes every amino acid sequence described herein and all nucleotide sequences encoding the amino acid sequence. Polynucleotide and amino acid sequences having from 80-99% similarity, inclusive, and including all numbers and ranges of numbers there between, with the sequences provided here are included in the invention. All of the amino acid sequences described herein can include amino acid substitutions, such as conservative substitutions, that do not adversely affect the function of the protein that comprises the amino acid sequences. The disclosure includes all polynucleotide and amino acid sequences described herein, and every polynucleotide sequence referred to herein includes its complementary DNA sequence, and also includes the RNA equivalents thereof to the extent an RNA sequence is not given. Any sequence referred to by a database entry is incorporated herein by reference as the sequence exists in the database as of the effective filing date of this application or patent, including but not limited to database entries that are signified by an alphanumeric indicator that starts with “Zm.”
The disclosure includes all described methods of analyzing transcriptome data to predict a phenotype described herein, all machine learning approaches described herein that are used for analysis of gene expression changes using Nitrogen (N)-treatment that influences expression of N responsive genes (N-DEGs), and extensions of those approaches to different genes, their protein products, and interspecies comparisons of transcriptome analysis and predictions of the influence of transcription factors on any phenotype. In a non-limiting embodiment, the disclosure includes the process as depicted in
In embodiments, based at least in part on the described analysis, the present disclosure provides compositions and methods for modifying plants and/or plant cells. The compositions and methods relate to altering expression of one or a combination of the TFs. Altering the expression can result in any change in the plant described herein. In embodiments, practicing a method of the disclosure results in an increase in N uptake, increased biomass, such as increased grain biomass, an increased harvest index, an increased Total nitrogen utilization (NUtE), an increased total Grain NUtE, or a combination thereof. Non-limiting demonstrations of these effects are summarized in
Notwithstanding the foregoing description, the TFs of the present disclosure include any TF that is referenced in the description (including tables) or in the figures. Overexpression and underexpression of any one or combination of the described genes is included in the disclosure. Overexpression of a particular gene can be accomplished by any method known in the art. For example, a plant cell may be transformed with a nucleic acid vector comprising the coding sequences of the desired gene operably linked to a promoter active in a plant cell such that the desired gene is expressed at levels higher than normal (i.e., levels found in a control/nontransgenic plant). The promoters can be constitutively active in all or some plant tissues or can be inducible. The under-expression of a desired gene can be accomplished by any method known in the art. For example, a gene may be knocked out, or mutated such that lower than normal levels of the gene product are produced in the transgenic cells or plant. For example, such mutations include frame-shift mutations or mutations resulting in a stop codon in the wild-type coding sequence, thus preventing expression of the gene product. Another exemplary mutation is the removal of the transcribed sequences from the plant genome, for example, by homologous recombination. Another method for under-expressing a gene is transgenically introducing an insertion or deletion into the transcribed sequence or an insertion or deletion upstream or downstream of the transcribed sequence such that expression of the gene product is decreased as compared to wild-type or appropriate control. Additionally, microRNA (native or artificial) can be used to target a particular encoding mRNA for degradation, thus reducing the level of the expressed gene product in the transgenic plant cell. Another method for underexpression of a gene of interest is using clustered regularly interspaced short palindromic repeats (CRISPR) gene inactivation.
A variety of suitable CRISPR systems for use in plants can be used, and include but are not necessarily limited to Cas3, Cas9, and Cas13 based systems, all of which are known in the art and can be adapted for the described purposes, such as by using a suitable CRISPR enzyme and guide RNA to target the described gene(s) and/or their regulatory elements, such as promoters.
The sequence of the protein encoded by maize nf-ya3 is:
The sequence of the protein encoded by maize hb75 is:
Those skilled in the art will recognize how to identify and modify DNA sequences that encode the described proteins based on the genetic code.
The described compositions and methods can be used for any type of plant, such as monocots, dicots, gymnosperms, or plant cells. The term “plant cell” as used herein refers to protoplasts, gamete producing cells, and includes cells which regenerate into whole plants. Plant cells include but are not necessarily limited to cells obtained from or found in: seeds, suspension cultures, embryos, meristematic regions, callus tissue, leaves, roots, shoots, gametophytes, sporophytes, pollen, and microspores. Plant cells can also be understood to include modified cells, such as protoplasts, obtained from the aforementioned tissues. In non-limiting embodiments, the method is used for any species of woody, ornamental, decorative, crop, cereal, fruit, or vegetable plant. The method can be used on intact plants, isolated plant parts, and plant cells. In embodiments, the method is used with a seed, a suspension culture, an embryo, a meristematic plant region, callus tissue, a leaf, a root, a shoot, a gametophyte, a sporophyte, pollen, a microspore, or a protoplast. In embodiments, the plant or plant cells that are modified according to the disclosure are any member of the following genera/group: Artemisia, Acorus, Aegilops, Allium, Amborella, Antirrhinum, Apium, Arabidopsis, Arachis, Beta, Betula, Brassica, Cannabis, Capsicum, Ceratopteris, Citrus, Coffea, Cryptomeria, Cycas, Descurainia, Eschscholzia, Eucalyptus, Glycine, Gossypium, Hedyotis, Helianthus, Hordeum, Ipomoea, Lactuca, Linum, Liriodendron, Lotus, Oryza, Lupinus, Lycopersicon, Medicago, Mesembryanthemum, Nicotiana, Nuphar, Pennisetum, Persea, Phaseolus, Physcomitrella, Picea, Pinus, Poncirus, Populus, Prunus, Robinia, Rosa, Saccharum, Schedonorus, Secale, Sesamum, Solanum, Sorghum, Stevia, Thellungiella, Theobroma, Triphysaria, Triticum, Vitis, Zea, or Zinnia. In non-limiting embodiments, the modified plant or plant cells are from one or more so-called “elite” varieties of maize.
The disclosure includes seeds produced by any modified plant herein, and progeny of the plants and seeds. Articles of manufacture comprising the seeds and a container that contains the seeds are also provided. In embodiments, the articles of manufacture comprise kits.
The following Examples are intended to illustrate but not limit the disclosure.
We analyzed whether the prediction power of machine learning models could be enhanced by exploiting the genetic diversity of gene responses and phenotypes both within and across species. In non-limiting embodiments, we tested whether using N-DEGs conserved both within and across species as a biologically-principled means of dimension reduction could enhance identification of genes of importance to predicting NUE phenotypes from gene expression data across a model (Arabidopsis) and crop (maize) plant. This model-to-crop machine learning approach enables more rapid validation of conserved features of importance to NUE in the crop using the model species.
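As a non-limiting illustration of the cross-species dimension reduction described above, the following minimal sketch keeps only those N-DEGs whose homologs are also N-DEGs in the other species. The conservation rule used here (at least one N-responsive homolog), the function name, and the gene identifiers are assumptions made for illustration and are not taken from the disclosure.

```python
from typing import Dict, Set, Tuple


def conserved_degs(
    degs_a: Set[str],
    degs_b: Set[str],
    homologs_a_to_b: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (conserved genes in species A, conserved genes in species B).

    Assumed rule: a gene in A is kept if it is an N-DEG in A and at least
    one of its homologs in B is also an N-DEG in B.
    """
    kept_a: Set[str] = set()
    kept_b: Set[str] = set()
    for gene_a in degs_a:
        responsive_homologs = homologs_a_to_b.get(gene_a, set()) & degs_b
        if responsive_homologs:
            kept_a.add(gene_a)
            kept_b |= responsive_homologs
    return kept_a, kept_b


# Toy, made-up identifiers (not real gene IDs).
arabidopsis_degs = {"At_g1", "At_g2", "At_g3"}
maize_degs = {"Zm_g1", "Zm_g4"}
homolog_map = {
    "At_g1": {"Zm_g1", "Zm_g2"},
    "At_g2": {"Zm_g3"},
    "At_g3": {"Zm_g4"},
}

at_kept, zm_kept = conserved_degs(arabidopsis_degs, maize_degs, homolog_map)
print(sorted(at_kept), sorted(zm_kept))
```

The two returned sets would then serve as the reduced gene-feature lists for each species; how homologs are assigned is outside the scope of this sketch.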
Within each species, we selected a set of genotypes that exhibit a broad spectrum of phenotypic variation in NUE. The data included 18 Arabidopsis accessions that were previously identified for their NUE diversity8 which originated from a nested collection of 265 accessions found in a wide range of habitats differing notably in soil nutrient richness9. The 23 maize genotypes analyzed in this disclosure correspond to 12 maize inbred lines and their 11 corresponding hybrids with B73. We selected these 12 maize inbred lines to represent the phenotypic diversity for NUE traits that we measured among a population of 318 field-grown maize inbreds (
To test whether genome-wide responses to N-treatments evolutionarily conserved across the model and crop could be a biologically principled approach to enhance the model performance of predicting NUE, we constructed a three-step machine learning pipeline (
In the described phenotypic analysis, we quantified nitrogen use efficiency (NUE) as the efficiency of converting supplied N to biomass/grain yield. For Arabidopsis, NUE was calculated as the efficiency with which each plant converted supplied N into shoot biomass (NUE=Above ground dry weight/Applied N). This measure of NUE is achieved by providing each plant with a trackable/contained amount of N in pots in a lab setting, as a proxy for the field agricultural setting2. Indeed, we found the Arabidopsis accessions previously selected for NUE diversity8 present a broad range of NUE variation in our own experiments, as evidenced by the coefficient of variation (CV=0.58) (
For field-grown maize, we used Total NUtE, (stover biomass+grain biomass)/(stover N content +grain N content), as the target trait (
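By way of a hedged, non-limiting example, the two efficiency measures defined above and the coefficient of variation can be computed as follows. The function names and the toy values are illustrative assumptions, and the use of the sample standard deviation for the CV is also an assumption, since the disclosure does not state which estimator was used.

```python
from statistics import mean, stdev
from typing import Sequence


def nue(above_ground_dry_weight: float, applied_n: float) -> float:
    """NUE = above ground dry weight / applied N (Arabidopsis definition)."""
    return above_ground_dry_weight / applied_n


def total_nute(
    stover_biomass: float,
    grain_biomass: float,
    stover_n: float,
    grain_n: float,
) -> float:
    """Total NUtE = (stover + grain biomass) / (stover + grain N content)."""
    return (stover_biomass + grain_biomass) / (stover_n + grain_n)


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return stdev(values) / mean(values)


# Toy values in arbitrary but consistent units.
print(nue(2.0, 0.5))
print(total_nute(100.0, 80.0, 1.0, 2.0))
print(coefficient_of_variation([1.0, 2.0, 3.0]))
```

Units must be consistent between numerator and denominator for the ratios to be comparable across genotypes.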
ANOVA results revealed that 55% of the total NUtE variation in this maize experiment was attributed to genetic effects (
Evolutionarily Conserved Transcriptome Response to N-Treatment Used for Feature Reduction in Machine Learning
Feature reduction is an essential pre-processing step in machine learning, as too many irrelevant features may interfere with prediction performance3. Given the fact that the N level is a significant factor explaining NUE variation in both Arabidopsis and maize (
The resulting conserved N-DEGs from Arabidopsis (n=610) were used as gene features in the machine learning model (
We then analyzed whether the expression levels of N-DEGs conserved across model and crop species could enhance identification of NUE phenotypes—compared to non-selected genes—using machine learning algorithms. This data-driven hypothesis is supported by the fact that: i) the expression levels of N-DEGs have been used as biomarkers of N status across maize genotypes19, and ii) the described phenotypic data shows that N level is a significant factor explaining the NUE variation in both maize and Arabidopsis (
Evolutionarily Conserved N-Responsive Genes have Enhanced Predictive Power in Machine Learning
For each species, we used the gene expression values (N-DEGs) as features (also referred to as gene features) to predict NUE traits through XGBoost regression models. XGBoost13 is an implementation of the gradient boosting algorithm20 that combines multiple weak learners, i.e. shallow trees, into a strong one (
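To make the idea of combining weak learners concrete, the following is a minimal sketch of gradient boosting for squared-error regression using depth-one trees (stumps). It is a simplified stand-in written in plain Python and is not XGBoost itself: it omits regularization, second-order gradients, and subsampling, and all names and toy data are assumptions made for illustration. Counting how often each feature is chosen for a split gives one simple notion of feature importance.

```python
from typing import List, Optional, Tuple

# (feature index, threshold, value if <= threshold, value if > threshold)
Stump = Tuple[int, float, float, float]
Model = Tuple[float, float, List[Stump]]  # (base value, learning rate, stumps)


def fit_stump(X: List[List[float]], residuals: List[float]) -> Optional[Stump]:
    """Find the single split that minimises squared error on the residuals."""
    best_sse, best = float("inf"), None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if sse < best_sse:
                best_sse, best = sse, (j, thr, lm, rm)
    return best


def boost(X: List[List[float]], y: List[float],
          n_rounds: int = 50, learning_rate: float = 0.1) -> Model:
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no feature has two distinct values
            break
        j, thr, lm, rm = stump
        stumps.append(stump)
        pred = [p + learning_rate * (lm if row[j] <= thr else rm)
                for p, row in zip(pred, X)]
    return base, learning_rate, stumps


def predict(model: Model, row: List[float]) -> float:
    base, lr, stumps = model
    return base + sum(lr * (lm if row[j] <= thr else rm) for j, thr, lm, rm in stumps)


def split_counts(model: Model, n_features: int) -> List[int]:
    """Number of times each feature was used for a split."""
    counts = [0] * n_features
    for j, _thr, _lm, _rm in model[2]:
        counts[j] += 1
    return counts


# Toy data: the target depends on feature 0 only; feature 1 is noise.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 6.0], [6.0, 2.0]]
y = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
model = boost(X, y)
print(split_counts(model, n_features=2), predict(model, [1.0, 5.0]))
```

In this toy case every split lands on the informative feature, which is the behavior that a feature importance score is meant to surface.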
For maize, using the N-DEGs (n=248) conserved with their Arabidopsis homologs, resulted in a mean Pearson's correlation coefficient r of 0.79 for the XGBoost models predicting NUE across 16 maize lines (
The described analysis showed that the overall predictive performance of learned models that used the evolutionarily conserved maize N-DEGs is significantly better than that obtained using the same number of top-ranked N-DEGs with the lowest P-value (0.68, Mann-Whitney U test P-value=1.06E-3), or ones randomly selected from total expressed genes (0.62, Mann-Whitney U test, P-value=1.5E-5) (Table 1). In addition, comparison of the feature importance score, an XGBoost13 output which reveals the influence of each feature (gene) in the predicted value (NUE)13, with the P-value in DEG analysis, uncovered only a weak correlation (Spearman's rank correlation coefficient rho=0.19,
In parallel, we used the Arabidopsis N-DEGs (n=610) whose N-response is conserved with their maize homologs, as the features to predict NUE in the same XGBoost machine learning pipeline (
The described results from both maize and Arabidopsis data show that using the evolutionarily conserved N-responsive differentially expressed genes significantly improved performance of the machine learning models predicting NUE, and that this improvement is not due to a simple numerical reduction in the gene features (Table 1). Furthermore, the weak correlation between the XGBoost-based feature importance ranking and the edgeR-based P-value ranking (
To further test whether our pipeline can be applied to predict additional traits from transcriptome data, we used the same conserved N-DEGs (
We also applied the described evolutionarily informed machine learning pipeline to two additional matched transcriptome and phenotype datasets related to drought in field grown rice and disease response in mouse models.
The rice data comprises matched transcriptomic and phenotypic information collected from 220 rice genotypes subjected to drought treatment in field experiments23. The 220 rice genotypes consist of two major subspecies, Indica and Japonica, which diverged ~440,000 years ago, with the genotypic and phenotypic diversity of domesticated rice. From this large dataset, we retained 57 rice genotypes that had no missing data in the trait measurement. From this set of 57 rice genotypes, we randomly selected 20 genotypes to define drought-responsive DEGs and used these DEGs as gene features for predicting fecundity in the 37 “left-out” rice genotypes. We repeated this process 10 times and the mean Pearson's r was 0.62. The model performance was consistent across the evolutionarily distant Japonica and Indica rice sub-species (
The mouse dataset comes from a highly genetically diverse Collaborative Cross (CC) population that comprises 90% of the genetic diversity across the entire laboratory Mus musculus genome24. The dataset we selected comprises matched transcriptome and disease outcome after influenza virus infection of 11 genotypes from the CC mouse population study24. We used DEGs (mock vs. infected) identified across the 11 mouse CC population genotypes to predict the disease outcome (asymptomatic vs. symptomatic) and found the mean Pearson's r to be 0.98. The models built using cross-genotype DEGs outperformed the model using the same number of random expressed genes (Mann-Whitney U test, P-value=3.3E-3).
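The two statistics used throughout these Examples to score and compare models can be sketched as follows. This is a minimal plain-Python illustration: the correlation values are made up, and the two-sided P-value uses a normal approximation without tie or continuity correction, which is an assumption (the disclosure does not specify how the Mann-Whitney P-values were computed) and is rough for small samples.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation between predicted and observed trait values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U for sample a: pairs (a_i, b_j) with a_i > b_j; ties count one half."""
    u = 0.0
    for ai in a:
        for bj in b:
            if ai > bj:
                u += 1.0
            elif ai == bj:
                u += 0.5
    return u


def two_sided_p_normal_approx(u: float, n1: int, n2: int) -> float:
    """Two-sided P-value from the normal approximation to U (no corrections)."""
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up per-model correlations for two feature sets (illustration only).
conserved_r = [0.80, 0.78, 0.82, 0.79, 0.81]
random_r = [0.60, 0.65, 0.58, 0.63, 0.61]

u_stat = mann_whitney_u(conserved_r, random_r)
p_value = two_sided_p_normal_approx(u_stat, len(conserved_r), len(random_r))
print(u_stat, p_value)
```

Each entry in the two lists would correspond to the Pearson's r of one learned model, so the test asks whether one feature set yields systematically higher correlations than the other.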
Overall, the results for the matched transcriptome and phenotype datasets for the rice and mouse models provide two use-case studies of the evolutionarily informed machine learning pipeline applied to external data sets for traits in both plants and animals. They also show that transcript-based prediction can be achieved using a smaller population (20 and 11 genotypes in the case of rice and mice, respectively), compared with the hundreds of lines needed for GWAS and eQTL studies25.
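The repeated hold-out design used for the rice data (20 genotypes to define DEGs, 37 left out, 10 repeats) can be sketched as below. Only the splitting step is shown; DEG calling and model fitting are omitted. The genotype labels, function name, and random seed are placeholders assumed for illustration.

```python
import random
from typing import List, Tuple


def repeated_holdout(
    genotypes: List[str], n_train: int, n_repeats: int, seed: int = 0
) -> List[Tuple[List[str], List[str]]]:
    """Return n_repeats random (training, left-out) partitions of the genotypes."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(n_repeats):
        train = rng.sample(genotypes, n_train)
        train_set = set(train)
        left_out = [g for g in genotypes if g not in train_set]
        splits.append((train, left_out))
    return splits


# Placeholder labels standing in for the 57 retained rice genotypes.
genotypes = [f"rice_{i:02d}" for i in range(57)]
splits = repeated_holdout(genotypes, n_train=20, n_repeats=10)

for train, left_out in splits[:2]:
    print(len(train), len(left_out))
```

Because the genotypes used to define the DEGs never overlap with the left-out genotypes in a given repeat, the reported correlation reflects prediction on genotypes the feature selection step has not seen.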
The Examples above established the robustness of the evolutionarily informed machine learning models in predicting trait outcomes based on conserved gene responses within and across species. Next, we experimentally validated gene features that are most influential in our predictive models. To this end, we used the feature importance score, an XGBoost13 output which reveals the influence of each feature (gene) in the predicted value (NUE). We reasoned that if models built for multiple genotypes selected a common set of gene features, this would indicate that those gene features are robust to genotype in predicting NUE. In maize, over 81% (202/248) of the XGBoost “important gene features” for predicting NUE were shared by models built for 16 genotypes, and 91% (245/248) were shared by 10 or more maize genotypes. Similarly, for Arabidopsis 42% (257/610) of the “important features” for predicting NUE were shared by models built for 18 Arabidopsis accessions, and 85% (519/610) were shared by 10 or more Arabidopsis accessions. These results are not only consistent with the polygenic nature of the NUE trait, but also reveal that there is a core set of influential N-DEGs whose expression levels can accurately predict NUE phenotypes for both species.
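The sharing analysis described above amounts to counting, for each gene feature, how many genotype-specific models treat it as important. A minimal sketch follows; the criterion that makes a feature "important" within a single model is not specified here and is assumed to have been applied upstream, and the genotype and gene labels are invented for illustration.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def sharing_summary(
    important_by_genotype: Dict[str, Set[str]], min_models: int
) -> Tuple[Set[str], Set[str]]:
    """Return (features shared by all models, features shared by >= min_models)."""
    counts = Counter(
        gene for features in important_by_genotype.values() for gene in set(features)
    )
    n_models = len(important_by_genotype)
    shared_by_all = {g for g, c in counts.items() if c == n_models}
    shared_by_min = {g for g, c in counts.items() if c >= min_models}
    return shared_by_all, shared_by_min


# Toy example with three genotype-specific models (made-up labels).
important = {
    "genotype_1": {"geneA", "geneB", "geneC"},
    "genotype_2": {"geneA", "geneB"},
    "genotype_3": {"geneA", "geneC", "geneD"},
}
shared_all, shared_two_plus = sharing_summary(important, min_models=2)
print(sorted(shared_all), sorted(shared_two_plus))
```

Dividing the size of each returned set by the total number of gene features gives proportions of the kind reported above.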
In maize, the top-ranked “important gene features” in predicting NUE outcomes include the transcription factors (NLP, MYB, WRKY), members of N-uptake/assimilation pathway (ammonium transporter, asparagine synthetase), and genes involved in photosynthesis and amino acid metabolism (
Further, we reasoned TFs controlling the levels of expression of multiple XGBoost important features for predicting NUE would be candidates for functional validation for their role in NUE in planta. To this end, we identified TFs predicted to regulate these XGBoost gene features of importance to NUE by constructing gene regulatory networks (GRNs) using GENIE3, which adopts the random forest machine learning algorithm and was the best performer in the DREAM4 and −5 Network Inference Challenge14.
To construct GRNs controlling NUE for each species, we first identified the N-responsive TFs in maize (545 TFs) and Arabidopsis (184 TFs) by intersecting the N-DEGs in this disclosure with the TFs for each species using published databases30-32. Next, we used our N-response TFs in GENIE3 as the “regulatory genes” (GENIE3 term) whose influence on the evolutionarily conserved “target genes” in maize (248 gene features) or Arabidopsis (610 gene features) was weighted on a 0 to 1 scale, where 0=non-influential and 1=strongly influential. We kept the top 1% of the TF-target edges to construct the NUE regulatory network and calculated the number of TF-target edges (connectivity) for each TF as a measure to evaluate their influence within the GRN.
Next, we integrated our GRN analysis with the XGBoost results to select candidate TFs that regulate genes of importance to NUE phenotype for functional validation of their role in NUE (Table 2). The selection and prioritization of TFs was based on one or more of the following criteria: i) XGBoost-based importance score, ii) GENIE3-based TF connectivity in the NUE GRN, iii) curated knowledge from the literature, and iv) the availability of multiple mutant alleles. In Arabidopsis, the top TFs in the XGBoost-based importance ranking listed in Table 2 include NF-YA6 (AT3G14020), DIV1 (AT5G58900), UNE12 (AT4G02590), NLP5 (AT1G76350), and TCP2 (AT4G18390). The other two Arabidopsis TFs prioritized for in planta validation studies, WRKY38 (AT5G22570) and WRKY50 (AT5G26170) (Table 2), were selected based on their high connectivity in the GENIE3-based GRN. For maize, we selected two candidate TFs (Zm00001d006293 nlp17, Zm00001d012544 myb74) for in planta validation studies that are hubs in the GENIE3-based GRN. Since no maize mutants were available for these genes, we took advantage of our cross-species approach by validating the function of their Arabidopsis homologs (AT1G76350 NLP5, AT5G06100 MYB33) in NUE. With the goal of cross-species validation, we also selected the maize homolog (Zm00001d006835, nfya3) of the top-ranked Arabidopsis NF-YA6 (AT3G14020) for validation in NUE (Table 2). This choice took into consideration the fact that NF-Y transcription factors are enriched in Arabidopsis XGBoost gene features and in the maize GRN. Moreover, this selection was supported by previous studies which showed that overexpressing a member of the NF-YA family in wheat significantly increased N uptake and grain yield under different levels of N supply33. To discern the function of maize NF-Y homologs in NUE, we characterized the nfya3-1::UfMu mutation with a Uniform Mu transposon insertion (mu1003041)34 that does not produce a detectable full-length transcript.
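The edge filtering and connectivity steps described above can be sketched as follows, starting from a list of already-computed TF-target weights. This sketch does not run GENIE3; the edge list, the weights, and the function names are invented for illustration, and rounding the 1% cutoff down to a whole number of edges (with a minimum of one) is an assumption.

```python
from collections import Counter
from typing import List, Tuple

Edge = Tuple[str, str, float]  # (TF, target gene, weight on a 0 to 1 scale)


def top_edges(edges: List[Edge], percent: int = 1) -> List[Edge]:
    """Keep the top `percent` of edges ranked by weight (at least one edge)."""
    ranked = sorted(edges, key=lambda e: e[2], reverse=True)
    n_keep = max(1, (len(ranked) * percent) // 100)
    return ranked[:n_keep]


def tf_connectivity(edges: List[Edge]) -> Counter:
    """Number of retained TF-target edges per TF."""
    return Counter(tf for tf, _target, _weight in edges)


# Made-up weights: 2 TFs x 100 targets = 200 candidate edges.
edges = [("TF_A", f"gene_{i}", 0.9 - 0.001 * i) for i in range(100)]
edges += [("TF_B", f"gene_{i}", 0.5 - 0.001 * i) for i in range(100)]

kept = top_edges(edges, percent=1)
connectivity = tf_connectivity(kept)
print(len(kept), connectivity.most_common())
```

TFs with the highest counts in the retained network are the hubs that would be prioritized by connectivity.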
Our results on the eight Arabidopsis TFs selected for in planta validation studies were classified into two groups based on our NUE phenotypic results (
Taken together, the described evolutionarily informed machine learning predictions of genes of importance to NUE and validation results for TF mutants for both Arabidopsis and maize demonstrate that: i) using evolutionarily conserved gene responses significantly enhances the ability of the XGBoost machine learning models to predict NUE outcome across genotypes and species (plants and animals), and ii) the XGBoost-based importance scores and GENIE3-based connectivity are informative in selecting functionally important features, including TFs, that control a complex physiological trait in crops, NUE, which has important implications for sustainable agriculture.
It will be recognized from the foregoing Examples that the disclosure describes a new genome-to-phenome analysis, namely, predicting phenotypic outcomes from genome-wide expression data. We show that exploiting evolutionarily conserved gene expression datasets, within and across species, enhanced the machine learning model performance in predicting NUE phenotypes in a model (Arabidopsis) and a crop (maize), and also as applied to published matched transcriptome/phenotype datasets from another crop (rice) and model animal (mouse).
Our evolutionarily informed three-step machine learning pipeline (
The implementation of machine learning in predicting phenotypes has advanced in the past few years. However, the available datasets do not always: 1) exploit the genetic diversity of the organism(s) and 2) measure the phenotypes using the same samples from which the transcriptome response was captured. The present disclosure advances the field on both points, as we utilized a panel of genotypes with diverse genetic backgrounds and measured phenotypes from the same batch of plants from which the transcriptome was captured. We integrated genetic diversity, machine learning, and cross-species approaches to identify genes of importance to an agronomically important trait, NUE. NUE, the trait we selected for study, poses the challenges of an underlying polygenic nature and difficulty in collecting high quality phenotypic data36. To this end, we designed a sufficiently large experimental space of N-treatments across a set of ~20 genotypes spanning NUE phenotypes in a model and crop species. The described results represent the largest matched phenotypic and transcriptomic datasets from both a model and a crop species. This dataset includes a large NUE phenotypic dataset resource of 318 maize genotypes for the plant community, and for 18 Arabidopsis accessions. We analyzed the genetic diversity in 18 Arabidopsis accessions and 23 maize genotypes selected for broad phenotypic variation in NUE and scored them for both transcriptomic and physiological responses in the same samples. Importantly, the selected maize genotypes represent the range of NUE diversity observed among a comprehensive collection of germplasm adapted to the U.S. Corn Belt, as confirmed empirically (
To extend this analysis beyond NUE, we applied our evolutionarily informed machine learning approach to other agricultural traits (e.g. drought resistance) in another major crop, using published transcriptome and phenotype datasets of genetically diverse rice subspecies (Indica and Japonica)23. In our application to animals, we exploited the growing awareness that host genetic variation has a major impact on pathogen susceptibility. To this end, we used matched transcriptome and phenotype data from a highly genetically diverse Collaborative Cross (CC) population that comprises 90% of the genetic diversity across the entire laboratory Mus musculus genome24. Models that we built using cross-genotype DEGs from both of these studies of genetically diverse lines in plants (rice) and animals (mice) significantly outperformed the models using the same number of random expressed genes. Importantly, in these two additional case studies, and in our proof-of-principle example, our evolutionarily informed analysis of matched transcriptome and phenome data allowed us to use a considerably smaller sample size compared to those needed for GWAS or eQTL studies25.
By providing accurate prediction, the predictive models reveal novel gene features for further investigation of causality37. We demonstrate this principle using a reverse genetics approach to validate the function of eight transcription factors important to predicting NUE outcomes (Table 2). Notably, our two-way cross-species validation strategy enabled us to verify the function of genes involved in NUE for i) two maize candidate genes using mutants in their Arabidopsis homologs and ii) one Arabidopsis candidate TF via analysis of a mutant in its maize homolog grown in the field (Table 2,
The learned model performance is more robust to maize genotype, compared with the models learned in Arabidopsis (
The disclosure reveals that genes affecting NUE are involved in an array of processes (Table 2), including nutrient response and uptake (DIV1 [40] and NLP5 [19, 41]), anther and pollen development (NF-YA6 [42] and MYB33 [43]), juvenile-to-adult transition (MYB33 [44]), microRNA-mediated growth and responses (NF-YA [45], MYB33 [44], TCP2 [46]), immune response (NF-YA6 [42], UNE12 [47], WRKY38 [48], and WRKY50 [49]), and photomorphogenesis (TCP2 [50] and Zm00001d006835 [51]). These results not only provide additional evidence that NUE is a polygenic trait intertwined with diverse signaling pathways, but also reveal a novel role for these genes in regulating NUE. Notably, members of three transcription factor families, NF-Y, NLP, and WRKY, are enriched among the gene features of the XGBoost models and/or the regulators of the GENIE3-based GRN.
Our results identified nine Arabidopsis NF-Y genes and one maize NF-Y gene as features in the XGBoost models, as well as 12 Arabidopsis and 14 maize NF-Y genes as potential regulators in the GENIE3 NUE GRN. Moreover, we validated the function in NUE of NF-YA6, a top gene in the Arabidopsis XGBoost model, using mutants in Arabidopsis NF-YA6 (AT3G14020), as well as its maize homolog nfya3 (
We identified six Arabidopsis and two maize NLP genes as features in the XGBoost models to predict NUE, as well as five Arabidopsis and 14 maize NLP genes as potential regulators in the GENIE3 NUE GRN. Further, using mutants, we validated the role of NLP5, a top gene feature in the maize XGBoost model and maize NUE GRN, as a negative regulator of NUE specifically under low-N conditions (
We identified six Arabidopsis and six maize WRKY genes as features in the XGBoost models, as well as 24 Arabidopsis and 11 maize WRKY genes as regulators in the GENIE3 NUE GRN. Among them, WRKY38 and WRKY50 are the top-ranked TF hubs in the Arabidopsis NUE GRN. Our functional analysis using Arabidopsis mutants validated a role of WRKY38 and WRKY50 in mediating NUE (
The disclosure demonstrates that integrating genetic diversity, cross-species transcriptome analysis, and machine learning enhances predictive modeling of genes affecting NUE. The results from reverse genetic analysis further show that the genes predictive of NUE are not only biomarkers but are also functionally important in determining plant performance in response to environmental nutrition. The pipeline described herein could complement current approaches to identifying important genes in a multigenic trait. Our validation of the evolutionarily informed strategy for feature reduction across genetically diverse crop and animal datasets supports its potential to inform any system that seeks to uncover important genes controlling a complex phenotype in biology, agriculture, or medicine.
This Example describes the materials and methods used to produce the described results.
All Arabidopsis seeds used in this disclosure were obtained from ABRC. The 18 Arabidopsis accessions are Akita, B1-1, Bur-0, Col-0, Ct-1, Edi-0, Ge-0, Kn-0, Mh-1, Mr-0, Mt-0, N13, Oy-0, Sakata, Shandara, St-0, Stw-0, and Tsu-0, as previously studied for NUE8. The T-DNA mutants are all in the Col-0 background. The mutant lines63 are myb33-1 (SALK_056201), myb33-2 (SALK_065473), tcp2-2 (SALK_060818), une12-1 (SAILseq_711_E09.1), nlp5-1 (SALK_055211), nlp5-2 (SALK_063304), nfya6-1 (SALK_005942), nfya6-2 (SAIL_159_E03), wrky38-1 (WiscDsLox489-492C21), wrky38-3 (SAIL_749_B02), wrky50-1 (SAIL_115_C10), div1-1 (SALK_056735), and div1-2 (SALK_084867C). The mutants were genotyped to confirm homozygosity. The expression level of the inserted gene in the homozygous mutants was below the detection limit of real-time PCR (
For growth experiments, the Arabidopsis seeds were germinated on ½ MS with MES Buffer and Vitamins (RPI cat M70800) plates for 7-10 days under a 16 h light/8 h dark photoperiod. The seedlings were then transferred to pre-washed nutrient-poor matrix vermiculite under an 8 h light (120 μmol/m2/s)/16 h dark diurnal cycle, at temperatures of 22 and 20° C., respectively, and 40% humidity. We kept one plant per pot and carried out the entire experiment using Arasystem (https://www.arasystem.com/). To track the N supply for each plant, we treated each plant with the same amount of low N (LN, 2 mM KNO3) (Sigma cat P6083) or high N (HN, 10 mM KNO3) medium (Caisson Labs cat. no. MSP10) using a syringe and recorded the volume. The potassium concentration was maintained by supplementing the LN medium with KCl (Sigma cat P9333). On 40 and 42 DAS, the treatment was enriched with 10% atom excess 15N for 15N influx analysis. To minimize the variation due to pot location in the growth chambers, the HN row was located adjacent to the LN row, and the flats were shuffled three times weekly. We repeated these experiments three times consecutively to obtain biological replicates for phenotypic and transcriptomic samples. For each of the 18 Arabidopsis accessions, mature leaves were harvested for transcriptome analysis and the above-ground tissues for physiological traits at 43 DAS. The dried tissues were ground and analyzed for total nitrogen using a PDZ Europa ANCA-GSL elemental analyzer interfaced to a PDZ Europa 20-20 isotope ratio mass spectrometer at the UC Davis Stable Isotope Facility.
Seeds for all maize inbreds used in this disclosure were originally obtained from the USDA-ARS North Central Plant Introduction Station in Ames, Iowa, except for the inbreds derived from the Illinois Selection Experiment and FR1064 as described in Uribelarrea et al22. Inbred lines were subsequently increased by controlled self-pollination, and hybrid seed was produced by controlled crosses. We grew the maize plants in N-managed field plots in Urbana, Ill. between May and September in 2014-2016. The soil type is a Drummer silty clay loam, pH 6.2, that received either 200 kg/ha fertilizer N or no exogenously applied N when the plants reached the V3 growth stage. Subsequent soil testing and measures of plant N recovery estimated that approximately 60 kg N/ha was made available from the soil alone. The N fertilizer was applied as granular ammonium sulfate banded adjacent to plants at the soil surface. Plants were grown in a split-plot design where individuals in each main plot (2 rows 5.3 m long, 76 cm row spacing) were paired in adjacent rows of N-replete or N-depleted condition to a final density of 49,000 plants per hectare for inbreds and 77,000 plants per hectare for hybrids. Genotypes within main plots were arranged by relative maturity to minimize its impact on NUE traits. Plots were maintained weed-free by a pre-plant application of herbicide (atrazine+metolachlor) followed by hand weeding as needed.
Maize phenotyping was performed at the R6 growth stage, when plants have reached physiological maturity but may not yet have fully senesced. Five plants from each plot were cut at ground level, the ears were removed, and a fresh weight was obtained on the entire remaining plant material (stover, comprising mostly stalk by weight, followed by leaves, tassels, and husks). The stover was then shredded in a Vermeer wood chipper, a subsample was collected into a tared cloth bag, and the subsample fresh weight was recorded. Stover samples were oven-dried for at least three days at 65° C., and the subsample dry weight was used to estimate stover biomass. The dried stover was further ground in a Wiley mill to pass through a 2 mm screen, and approximately 100 mg was used to estimate total nitrogen concentration by combustion analysis with a Fisons EA-1108 N elemental analyzer. Grain samples were dried for approximately one week at 37° C., after which grain was shelled from the cobs and the cob weight was recorded. The moisture content and N concentration within each 5-plant grain sample were estimated using near-infrared reflectance spectroscopy on a Perten DA7200 analyzer, using a custom calibration built with samples possessing a broad range of variation in composition and color. The nitrogen concentration calibration was established using data from total combustion analysis of grain samples as described above for stover.
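The stover biomass estimate described above can be illustrated with a short calculation. The following is a minimal sketch, assuming the conventional approach of scaling the whole-sample fresh weight by the dry-matter fraction of the subsample; the function names and the example numbers are hypothetical and are not taken from the disclosure.

```python
def estimate_stover_dry_biomass(total_fresh_g, subsample_fresh_g, subsample_dry_g):
    """Estimate whole-sample dry biomass from a dried subsample.

    Assumption: the subsample has the same moisture content as the
    whole shredded stover sample.
    """
    if subsample_fresh_g <= 0:
        raise ValueError("subsample fresh weight must be positive")
    dry_fraction = subsample_dry_g / subsample_fresh_g
    return total_fresh_g * dry_fraction


def stover_nitrogen_content_g(dry_biomass_g, n_concentration_pct):
    """Convert an N concentration (percent of dry weight) to grams of N."""
    return dry_biomass_g * n_concentration_pct / 100.0


# Hypothetical 5-plant sample: 1500 g fresh stover, 300 g fresh subsample
# that weighs 90 g after oven drying, and 0.8% N by combustion analysis.
biomass = estimate_stover_dry_biomass(1500.0, 300.0, 90.0)
nitrogen = stover_nitrogen_content_g(biomass, 0.8)
```

With these illustrative inputs the dry-matter fraction is 0.3, so the sketch yields roughly 450 g of dry stover containing roughly 3.6 g of N.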
The nfya3-1::Mu loss-of-function allele was generated by the UniformMu insertion mu1003041::Mu in the 5′ untranslated region of the annotated gene model Zm00001d006835. The UFMu-00332 seed stock was obtained from the Maize Genetics Cooperation Stock Center and genotyped64 to identify individuals homozygous for the nfya3-1::Mu mutant allele, which were then self-pollinated. The expression level of the nfya3 gene in the homozygous mutants was below the detection limit of real-time PCR (CT>45) (
For each of three Arabidopsis RNA replicates, we harvested mature leaves from two pre-bolting plants on 43 DAS between 9 and 11 AM, flash-froze them in liquid nitrogen, and stored them at −80° C. We isolated RNA using Direct-zol RNA Kits following the manufacturer's instructions (Zymo Research). RNA quality was assessed on an Agilent TapeStation using RNA ScreenTape (Agilent cat 5067-5576). All 108 stranded RNA-seq libraries were made using the NEBNext® Ultra™ II Directional RNA Library Prep Kit for Illumina® (NEB cat E7768) and assessed using the DNA high sensitivity D1000 ScreenTape system (Agilent cat 5067-5584). The RNA-seq libraries were sequenced using Illumina HiSeq 2500 v4 with 1×75 bp single-end read chemistry at the GenCore Facility at New York University Center for Genomics and Systems Biology.
For each of three maize RNA replicates, we collected leaf tissue two inches from the base of leaf 13 subtending the top ear at the R1 stage between 9 and 11 AM, flash-froze it in liquid nitrogen, and stored it at −80° C. We extracted RNA from frozen leaf tissue using a CTAB-chloroform method. Genomic DNA was removed using DNase I (NEB cat M0303). RNA-seq libraries were prepared using a TruSeq Stranded mRNAseq Sample Prep kit (Illumina cat RS-122-2101) according to the protocol provided. Single-end 150 bp reads were generated using the Illumina HiSeq 4000 at the Roy J Carver Biotechnology Center at the University of Illinois at Urbana-Champaign.
All RNA-seq raw reads were processed using the same pipeline to remove optical duplicates (Clumpify 37.24) and adapters (BBDuk 37.24) [65]. The trimmed reads were aligned to the latest genome available in 2018, TAIR10 [66] for Arabidopsis and Zm-B73-REFERENCE-GRAMENE-4.0 [12] for maize, using BBMap (37.24). The mapped reads were assigned to genes by featureCounts (1.5.1) [67] using the latest annotation available in 2018: Araport11 [68] for Arabidopsis and AGPv4.32 [12] for maize. The parameters and software versions for the above steps are available in GEO accession GSE152249. We identified N-DEGs in the training data set (n-1 genotypes) and repeated the analysis n times (n=number of genotypes in each species). In each round of analysis, we first filtered out the lowly expressed genes (CPM>1 in fewer than 10 samples) and normalized the data using upper-quartile normalization (EDASeq 2.18.0) [69] and replicate samples (RUVSeq 1.18.0) [70]. Subsequently, we used edgeR (3.26.8) [17] to detect genes differentially expressed in high vs low N conditions across genotypes (FDR <0.05). Lastly, we intersected the n lists of DEGs and retained only those occurring in all n lists as a common set of N-DEGs. These analyses resulted in 2,123 Arabidopsis N-DEGs and 6,914 maize N-DEGs (
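The expression filter and the intersection of the n leave-one-genotype-out DEG lists can be sketched as follows. This is a simplified illustration in plain Python, not the edgeR/EDASeq/RUVSeq workflow used to produce the described results: CPM is computed here from raw library sizes without normalization factors, and the function names and toy inputs are hypothetical.

```python
def filter_expressed_genes(counts, min_cpm=1.0, min_samples=10):
    """Keep genes with CPM > min_cpm in at least min_samples samples.

    counts maps gene ID -> list of raw read counts, one per sample.
    Simplification: CPM uses the raw column sums as library sizes.
    """
    genes = list(counts)
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    kept = []
    for g in genes:
        n_pass = sum(
            1
            for j in range(n_samples)
            if lib_sizes[j] > 0 and counts[g][j] / lib_sizes[j] * 1e6 > min_cpm
        )
        if n_pass >= min_samples:
            kept.append(g)
    return kept


def common_degs(deg_lists):
    """Return the genes present in every one of the n DEG lists."""
    sets = [set(degs) for degs in deg_lists]
    return set.intersection(*sets) if sets else set()


# Toy input: three samples, so the sample threshold is lowered to 2.
toy_counts = {"g1": [10, 20, 30], "g2": [0, 0, 1], "g3": [990, 980, 969]}
expressed = filter_expressed_genes(toy_counts, min_cpm=1.0, min_samples=2)

# One DEG list per round, each computed with a different genotype held out.
shared = common_degs([["a", "b", "c"], ["b", "c", "d"], ["b", "c"]])
```

Requiring a gene to appear in all n lists is the strictest possible consensus rule; it keeps the feature set independent of whichever genotype happens to be held out.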
We held out a test genotype before the DEG stage, and only the training genotypes (n-1 genotypes) were used in the DEG analysis and XGBoost models. The held-out test genotypes were then used to validate model performance. This round-robin approach (
To rule out the possibility that using the intersected DEGs (e.g. within species) would lead to overly optimistic XGBoost results, we further compared the XGBoost performance using the intersected DEGs (
We used a tree model with gradient boosting, the R implementation of XGBoost [13], to train and test the models. For each species, we split the data into training (n-1 genotypes) and testing (left-out genotype) sets. We used five-fold internal cross-validation to select the optimized hyperparameters. We tuned “nrounds” (number of trees), “colsample_bytree” (the proportion of features for constructing each tree), “subsample” (the portion of training data samples for training each additional tree), and “eta” (shrinkage of feature weights to make the boosting process more conservative and prevent overfitting) in an XGBoost regression model. Subsequently, we made predictions on each left-out genotype, assessed the model accuracy by calculating the Pearson's correlation coefficient r between the predicted and actual values [71], and reported the r from 100 iterations.
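The leave-one-genotype-out split and the Pearson's r scoring can be sketched in a few lines. The example below is illustrative only: a one-feature least-squares line stands in for the tuned XGBoost regression model, the internal five-fold cross-validation and the 100 iterations are omitted, and all names and toy data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def leave_one_genotype_out(samples):
    """Yield (held_out, train, test); samples are (genotype, features, phenotype)."""
    for held_out in sorted({s[0] for s in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def fit_stand_in_model(train):
    """Stand-in for XGBoost: least-squares line on the first feature only."""
    xs = [f[0] for _, f, _ in train]
    ys = [p for _, _, p in train]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda features: intercept + slope * features[0]


def evaluate(samples, fit):
    """Return {held-out genotype: r between predicted and actual values}."""
    scores = {}
    for held_out, train, test in leave_one_genotype_out(samples):
        predict = fit(train)
        predicted = [predict(f) for _, f, _ in test]
        actual = [p for _, _, p in test]
        scores[held_out] = pearson_r(predicted, actual)
    return scores


# Toy data: phenotype = 2 * expression + 1 for three genotypes.
toy = [(g, [x], 2.0 * x + 1.0)
       for g, xs in {"A": [1, 2, 3], "B": [2, 4, 6], "C": [1, 3, 5]}.items()
       for x in xs]
scores = evaluate(toy, fit_stand_in_model)
```

Because the held-out genotype never contributes to fitting, each r reflects how well the model generalizes to an unseen genetic background rather than how well it fits the training samples.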
We used two parallel procedures to select candidate genes for functional validation. First, we used the XGBoost-generated feature importance score, which indicates how useful each feature was in the construction of the model. We summed the scores on a gene-by-gene basis from 18 models for Arabidopsis and 16 models for maize and generated a ranked list. Second, we used the Random Forest-based algorithm GENIE3 to infer the transcription factors regulating the gene features. We used the N-responsive TFs (184 Arabidopsis TFs and 545 maize TFs) as the regulators and the gene features (610 Arabidopsis genes and 248 maize genes) as the targets, and kept the default parameters. We constructed the NUE regulatory network using the top 1% of the edges and ranked the TFs based on their connectivity (number of edges).
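Both ranking procedures reduce to simple aggregation once the per-model importance scores and the weighted edge list are available. The sketch below assumes those inputs have already been exported from XGBoost and GENIE3; the data structures, function names, and toy values are hypothetical.

```python
from collections import Counter, defaultdict


def rank_genes_by_summed_importance(per_model_importance):
    """Sum importance scores gene by gene across models, highest first.

    per_model_importance is a list of {gene: score} dicts, one per model.
    Ties are broken alphabetically by gene ID.
    """
    totals = defaultdict(float)
    for importance in per_model_importance:
        for gene, score in importance.items():
            totals[gene] += score
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def rank_tfs_by_connectivity(edges, top_fraction=0.01):
    """Keep the highest-weight fraction of edges and count edges per TF.

    edges is a list of (tf, target, weight) tuples.
    """
    ranked = sorted(edges, key=lambda edge: edge[2], reverse=True)
    n_keep = max(1, int(len(ranked) * top_fraction))
    degree = Counter(tf for tf, _, _ in ranked[:n_keep])
    return degree.most_common()


# Toy inputs: two models and four edges (top 50% kept for illustration).
gene_ranking = rank_genes_by_summed_importance(
    [{"g1": 0.5, "g2": 0.25}, {"g1": 0.25, "g3": 0.5}]
)
tf_ranking = rank_tfs_by_connectivity(
    [("TF1", "a", 0.9), ("TF2", "a", 0.1), ("TF1", "b", 0.8), ("TF3", "c", 0.2)],
    top_fraction=0.5,
)
```

Summing across models favors genes that are repeatedly informative regardless of which genotype was held out, while the connectivity count highlights TFs that act as hubs in the thresholded network.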
References—This reference listing is not an indication that any particular reference is material to patentability.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
This application claims priority to U.S. provisional application No. 63/232,060, filed Aug. 11, 2021, the entire disclosure of which is incorporated herein by reference.
This invention was made with government support under Grant Number IOS-568 1339362, awarded by the National Science Foundation, and Grant Number 1013620, awarded by the United States Department of Agriculture. The government has certain rights in the invention.
| Number | Date | Country |
|---|---|---|
| 63232060 | Aug 2021 | US |