The present invention relates to methods for determining phenotypic genomic estimated breeding values. The present invention also relates to a method for selecting a genotype for producing an improved plant in a selected environment, as well as a method for producing an improved organism.
Efficient and consistent crop production is a world-wide challenge. The field of terrestrial agriculture is relied upon to produce vast supplies of the world's food and medicinal products and textiles. Management of the economics, logistics and sheer scale of agricultural output is a considerable undertaking. However, the world's human and animal population continues to grow and therewith demand for agricultural products, against the constant challenges faced by farmers in the production itself. These challenges include for example the inherent susceptibility of crops to climatic conditions, and many other abiotic and biotic stresses, such as invertebrate pests and microbe and viral crop infections. While there is no one solution to all of these issues, there are significant gains to be achieved from improvements in any one area, one of which in particular is the susceptibility of crops to climatic conditions.
Neither genetic modification nor migrating a genotype or cultivar from one location to another will necessarily result in an improved crop or even one which gives suitable economic production, let alone consistently from season to season. The performance of a crop is a result of the interaction of its genotypes with the environment at that location. Certain genotypes will interact with a given environment differently and one may out-perform or under-perform as compared to another, but rarely is there a genotype that performs equally well in all environments. It is nonetheless a goal of plant breeders to develop high-yielding cultivars with low genotype×environment interaction (GEI) in the hope of achieving stable cultivar performance across environments.
Traditional attempts to identify high-yielding cultivars across environments are simply through trial and error. It is immediately apparent that this is an extremely long, laborious and inefficient process; it involves planting crops comprised of different genotypes in different locations and diligently monitoring the performance indicators and environmental conditions year upon year.
As such, there are significant potential advantages to be gained from circumventing this process and identifying economically productive—if not the most likely productive—combinations. Even reliable calculations that can be rapidly obtained in comparison, say to just exclude those combinations of poorest performance, would still give a significant advance.
One method which assists includes the visualisation of multi-environment trials by using a GGE biplot (Yan 2000). In this analysis, the environment main effect is removed, while the genotype main effect, as well as the GEI effect, are integrated after singular value decomposition analysis (Yan and Kang 2002). The first two or three principal components (PC) of the GGE analysis, which explain the largest proportion of the genotype plus GEI variations, are usually plotted with the environmental coordinates in a single biplot. Such biplots can be useful to infer the stability of different genotypes and to inform plant breeders about superior cultivars for different mega-environments (Yan 2001).
However, this method has its limitations. Since GGE biplots depend only on two or three PCs, a considerable proportion of the genotype and GEI variation is ignored (Gauch et al. 2008). This issue is especially critical for datasets with many heterogeneous environments in which the first few PCs only explain a small proportion of the total variation (Yang et al. 2009). These PCs can also be biased if a specific mega-environment is under-represented in the dataset. Moreover, because GGE biplots utilise only phenotypic data, new crosses cannot be compared to previous biplots, and genomic components affecting traits cannot be elucidated.
Other calculation methods have been developed in recent times to this end. For example, standard genomic best linear unbiased prediction (GBLUP) uses genomic relationships to estimate the genetic merit of an individual based on a genomic relationship matrix estimated from DNA markers. The matrix defines covariance between individuals based on observed similarity at the genomic level, though it is used mostly in livestock production. Attempts to apply genomic selection (GS) statistical models to plant calculative methods have also been made. Current GS statistical models exploit genetic correlation among different environments to model GEI and to produce more accurate genomic estimated breeding values (GEBVs) (Burgueño et al. 2012; López-Cruz et al. 2015; Cuevas et al. 2016, 2017); other models consider environmental covariates to improve the calculation accuracy for multi-environment trials (Jarquin et al. 2014; He et al. 2019). However, drawbacks remain with these methods: they are inaccurate and computationally inefficient, they typically require positively correlated environments for best implementation, and they are unable to calculate values for new environments not included in the reference population.
There exists a need to overcome, or at least alleviate, one or more of the difficulties or deficiencies associated with the prior art.
In one aspect of the present invention there is provided a method for determining phenotypic genomic estimated breeding values (pGEBVs), wherein the method comprises the steps of:
By “genetic data” is meant information relating to the DNA and/or RNA nucleotide sequence of an organism, preferably DNA. Preferably, the genetic data includes information relating at least to one or more taxonomic markers, being a region of DNA which allows for genetic distinction of genotypes by the presence of polymorphisms when two or more are compared.
Preferably, the genetic data includes a whole or substantially whole genome, which may be at least about 80%, 85%, 90%, 95%, 98%, 99% or 100% of a complete DNA sequence.
By “environment” is meant a set of conditions in which an organism may live. For example, in the context of a plant or animal, it may include climate (e.g. air quality, humidity, temperature, wind), soil, geographic location, light exposure, feed availability, water availability, biotic (e.g. pests and diseases which may be insects and pathogen infection) and abiotic stress (e.g. water or nutrient deficit) conditions as appropriate, that may impact on plant or animal behaviour. By “environmental data” is meant information relating to the environment in which an organism lives. The data may be qualitative and/or quantitative. In a preferred embodiment, in the context of a plant, the environmental data includes at least watering conditions, in terms of amount and/or means, e.g. rainwater or irrigated.
By a “phenotype” is meant an observable characteristic generally resulting from the interaction of a genotype and an environment. A phenotype encompasses a “trait” which refers to an associated underlying physiological or biochemical characteristic. A phenotype of a genotype may be distinguishable from that of another genotype. By “phenotypic data” is meant information relating to one or more phenotypes of a genotype. The data may be qualitative and/or quantitative. In preferred embodiments, the phenotypic data, in the context of a plant, relates to a growth condition, and preferably yield.
By a “genotype” is meant a putative identifier assigned to an organism within a species to distinguish it from others of that species. Genotypes are often assigned based on an analysis of the genetic makeup of an organism, and generally in terms of that genetic makeup being capable of contributing to the expression of a phenotype which may be distinguishable, and/or based on a breeding or other genetic manipulation method upon observation of a distinguishable phenotype, as compared to others. For clarity, a genotype encompasses a haplotype which is an identifier assigned to an organism based on the makeup of a heritable genetic subregion (e.g. a gene, loci, group thereof); again that is generally capable of giving rise to the expression of a phenotype which may be distinguishable as compared to others. In the context of a plant, “cultivar” is synonymous with “variety” and is a plant or collection thereof comprising a single genotype or a group of selected genotypes.
By a “population of organism genotypes” is meant a number sufficient to allow their comparison in the method described herein. The population will generally be of a size to allow statistically meaningful analyses and may be tens to many hundreds or many thousands, generally limited in size by the obtainable genetic, phenotypic, and environmental data. In preferred embodiments, the population is at least 200 genotypes. Also in preferred embodiments, the population is across a mega-environment; the genetic, phenotypic, and environmental data obtained for the population of genotypes is from a mega-environment. By a “mega-environment” is generally meant a cluster of geographical regions that have a reasonably homogenous environment in which most genotypes behave similarly across regions. In a preferred embodiment, a mega-environment may be at least two geographical regions. Also in preferred embodiments, the population is across a plurality of mega-environments; the genetic, phenotypic, and environmental data obtained for the population of genotypes is from more than one mega-environment, preferably wherein the mega-environments differ in their conditions.
By a “reference population” is meant a subset of the population of organism genotypes. The reference population, or more accurately the genetic, phenotypic, and environmental data thereof, may be used as a reference against which the data for the validation population may be analysed. In preferred embodiments, the reference population may be of at least 100 genotypes. By a “validation population” is meant a different subset of the population of organism genotypes. The validation population, or more accurately the genetic, phenotypic, and environmental data thereof, may be used in comparison with the data of the reference population to extract information, for example, on certain features, characteristics or trends of the data. In preferred embodiments, the validation population may be at least 100 genotypes. To be clear, the step of obtaining genetic, phenotypic, and environmental data for a population of organism genotypes and then dividing it into a reference population and a validation population encompasses separately obtaining data for each of a reference population and a validation population. The step of dividing the data is simply to be taken to mean that the data for the reference and validation populations, however obtained, relates to the same type of data and is suitably comparable. For example, it may include the same type of genetic, phenotypic, and environmental data obtained for genotypes of the same organism species.
A genotype plus genotype×environment (GGE) analysis assesses genotype by environment interactions (GEI) of two-way data, GEI being a change in a phenotype of two or more genotypes measured in two or more environments. The Principal Component (PC) is determined by singular value decomposition. Mathematically, GGE is the genotype by environment data matrix after the environment means have been removed. Certain methods for calculating the GGE PC are known to those skilled in the art.
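As an illustration of the definition above, the following is a minimal sketch that removes the environment means from a genotype by environment matrix and obtains the PCs by singular value decomposition. It uses NumPy's standard SVD rather than any particular GGE software, and the function name, variable names and toy data are assumptions made for the example only.

```python
import numpy as np

def gge_principal_components(Y):
    """Y: (g x e) matrix of phenotypes, genotypes in rows, environments in columns."""
    # Remove the environment main effect by subtracting each column mean.
    phi = Y - Y.mean(axis=0, keepdims=True)
    # Singular value decomposition: phi = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(phi, full_matrices=False)
    scores = U * s                    # genotype scores on each PC
    rotation = Vt.T                   # environment loadings (rotation matrix)
    explained = s**2 / np.sum(s**2)   # proportion of G + GEI variation per PC
    return phi, scores, rotation, explained

# Toy data (hypothetical): 12 genotypes in 4 environments.
rng = np.random.default_rng(0)
Y = rng.normal(loc=5.0, scale=1.0, size=(12, 4))
phi, scores, rotation, explained = gge_principal_components(Y)
```

Keeping all PCs, rather than the first two or three used in a biplot, means the product of the scores and the transposed rotation matrix reproduces the GGE matrix exactly.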
In preferred embodiments, the GGE PC is calculated using a non-linear iterative partial least squares method, preferably based on Equation 1 as follows
where Φij is the genotype×environment two-way matrix of GGE effects; i is the range between 1 and g (total number of genotypes); j is the range between 1 and e (total number of environments);
By a “polymorphism” is meant a genetic variation present at one or more positions of a nucleotide sequence which allows for genetic distinction when two or more are compared. Polymorphisms may be present on coding or non-coding regions, as well as regulatory or non-regulatory regions, of the nucleotide sequence. A polymorphism may be for example an insertion, deletion, substitution, or combination thereof. In preferred embodiments, the polymorphism is at least one, if not several, single nucleotide polymorphisms (SNPs). An SNP is a variation in a single nucleotide. Methods for identifying polymorphisms including SNPs are known to those skilled in the art.
The step of calculating the polymorphism effect for each GGE PC essentially determines the weight of the identified polymorphisms; the likelihood of the polymorphisms contributing to the GGE PC. Methods for calculating the polymorphism effect are known to those skilled in the art. A representative method utilises the Bayesian Ridge Regression (BRR) model.
By a “genomic estimated breeding value” (GEBV) is meant the measurable extent to which a genotype influences the expression of a phenotype. Calculating a genomic estimated breeding value (GEBV) for each genotype of the validation population using the calculated polymorphism effect for each PC essentially determines to what extent the identified polymorphisms influence the expression of a phenotype. In preferred embodiments, the GEBV is a G matrix (n×e), wherein n is the number of validation genotypes and e is the number of environments. Preferably, calculating a GEBV is based on Equation 2 as follows:
$$\mathrm{GEBV} = Z\hat{\beta} \qquad \text{(Equation 2)}$$
where $Z$ is the SNP allelic dosage matrix for the validation population and $\hat{\beta}$ is the calculated polymorphism effect for each GGE PC.
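The following sketch illustrates Equation 2 on toy data. The marker effects are estimated here with a closed-form ridge regression, which is a simple stand-in and not the Bayesian Ridge Regression model itself; the fixed shrinkage value `lam`, the function name and the simulated dosages are all assumptions made for illustration.

```python
import numpy as np

def ridge_marker_effects(Z_ref, pc_scores, lam=1.0):
    """Ridge estimate of marker effects, one column of effects per GGE PC.

    Z_ref:     (n_ref x m) allelic dosage matrix (0/1/2) of the reference population.
    pc_scores: (n_ref x k) GGE PC scores of the reference genotypes.
    lam:       shrinkage parameter (hypothetical fixed value).
    """
    n = Z_ref.shape[0]
    # Dual form, beta = Z' (Z Z' + lam I)^-1 y, which is cheap when m >> n.
    K = Z_ref @ Z_ref.T + lam * np.eye(n)
    return Z_ref.T @ np.linalg.solve(K, pc_scores)

rng = np.random.default_rng(42)
n_ref, n_val, m, k = 20, 8, 200, 3
Z_ref = rng.integers(0, 3, size=(n_ref, m)).astype(float)
Z_val = rng.integers(0, 3, size=(n_val, m)).astype(float)
pc_scores = rng.normal(size=(n_ref, k))

beta_hat = ridge_marker_effects(Z_ref, pc_scores, lam=10.0)
gebv = Z_val @ beta_hat   # Equation 2: one GEBV per validation genotype and PC
```

The dual form is used only because the number of markers usually far exceeds the number of genotypes; it gives the same effects as the primal ridge solution.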
By a “phenotypic genomic estimated breeding value” (pGEBV) is meant the measurable extent to which environment influences the GEBV; or in other words it relates environment to phenotype. In preferred embodiments, each pGEBV is calculated based on Equation 3 as follows:
$$\mathrm{pGEBV} = G \times R^{-1} \qquad \text{(Equation 3)}$$
where $G$ is an (n×e) matrix of GEBVs for the GGE PCs scaled by multiplying each PC by its standard deviation; n is the number of validation genotypes; e is the number of environments; and $R^{-1}$ is the inverse of the rotation matrix (e×e), or the environment coordinate matrix scaled by dividing each column by the standard deviation of the corresponding PC.
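A small numerical sketch of Equation 3 follows. It assumes that the PC scores are held on a unit-variance scale, so that multiplying by the PC standard deviations restores their original scale, and that the rotation matrix is the matrix of right singular vectors; both conventions, and all names and toy data, are assumptions for illustration. Applying the back-transformation to the reference scores themselves recovers the GGE matrix, which serves as a check on the algebra.

```python
import numpy as np

rng = np.random.default_rng(1)
g, e, n = 15, 4, 6                      # reference genotypes, environments, validation genotypes
Y_ref = rng.normal(size=(g, e))
phi = Y_ref - Y_ref.mean(axis=0)        # GGE matrix (environment means removed)

U, s, Vt = np.linalg.svd(phi, full_matrices=False)
sd = s / np.sqrt(g - 1)                 # standard deviation of each PC
unit_scores = U * np.sqrt(g - 1)        # PC scores standardised to unit variance
R = Vt.T                                # rotation matrix (e x e)

def pgebv_from_pc_gebv(G_unit, sd, R):
    G = G_unit * sd                     # scale each PC by its standard deviation
    return G @ np.linalg.inv(R)         # Equation 3

# Check: the reference PC scores map back onto the GGE matrix.
reconstructed = pgebv_from_pc_gebv(unit_scores, sd, R)

# Hypothetical PC-level GEBVs for n validation genotypes.
G_val = rng.normal(size=(n, e))
pgebv = pgebv_from_pc_gebv(G_val, sd, R)   # one column per environment
```

Because the rotation matrix from a singular value decomposition is orthogonal, its inverse equals its transpose, so the explicit inversion could be replaced by `R.T` in this setting.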
In context of a method for determining pGEBVs, by an “organism” is meant a living being, whether an animal, plant, single-celled organism or other. In context of producing or obtaining an improved organism as described herein, by an “organism” is meant the same, except that its reference to an animal does not include a human being. That is, the present invention is not intended to relate to biological processes for the generation of a human being. In preferred embodiments, the organism is an animal other than a human being, or a plant.
In the context of a plant, the plant may be any cultivable plant. In preferred embodiments, the plant is a crop plant which can be cultivated and harvested for food, animal feed, fibre, oil, any other material or industrial use. For example, the plant may be for the production of pomes, citrus, and other fruits, nuts, cereals, legumes, vegetables, herbs, spices and commodities including oil. This may include, for example, plants belonging to the genus Triticum, including T. aestivum (wheat), Hordeum, including H. vulgare (barley), Zea, including Z. mays (maize or corn), Oryza, including O. sativa (rice), Saccharum, including S. officinarum (sugarcane), Sorghum, including S. bicolor (sorghum), Panicum, including P. virgatum (switchgrass), Helianthus (sunflower), Brassica (canola), Vigna, Cicer, Lens, Pisum (beans), Coffea (coffee), Miscanthus, Paspalum, Pennisetum, Poa, Eragrostis, Agrostis, Brachiaria, Lolium and Festucae (grasses).
In the context of an animal, the animal may be any productive animal. In preferred embodiments, the animal is one to which practices of animal husbandry are applicable for the production of food, animal feed, fibre, or any other material or for industrial use. For example, the animal may be for the production of meat and meat-derived products, poultry, eggs, dairy, fish, wool and leather. This may include, for example, animals belonging to the genus Bos (cattle), Equus (horse), Ovis (sheep), Sus (pig), Capra (goat) and Gallus (chicken).
This method provides the advantage of a significantly more accurate method for calculating in which environment a genotype excels. For example, as determined against the reference population of the obtained data, the method calculates with much greater accuracy in which environment the genotypes of the obtained data were more productively yielding, than the two sub-models (termed GE and GxE) of the standard GBLUP model with the following Equation A:
$$y = \mu + E + g + gE + \epsilon$$
where $y$ is the phenotype; $\mu$ is the intercept; $E$ is the environmental effect, $E \sim N(0, V_E\sigma_E^2)$, $V_E = Z_E Z_E'$, where $Z_E$ is the incidence matrix allocating genotypes to environments; $g$ is the genotypic effect, $g \sim N(0, V_g\sigma_g^2)$, $V_g = Z_g G Z_g'$, where $Z_g$ is the incidence matrix allocating phenotypes to genotypes and $G$ is the genomic relatedness matrix estimated following the first method described in VanRaden (2008); and $\epsilon$ is the residual, $\epsilon \sim N(0, \sigma_\epsilon^2)$. There are two sub-models, with and without GEI. $gE$ represents the GEI effect and was equal to 0 for the model without GEI (named GE). For the model that fitted GEI (named GxE), $gE \sim N(0, V_{gE}\sigma_{gE}^2)$, $V_{gE} = V_g \odot V_E$, where $\odot$ is the Hadamard or cell-to-cell product.
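To make the covariance structures of Equation A concrete, the sketch below builds the three matrices for a toy layout of three genotypes each recorded in two environments. The relatedness values in `G`, the record layout and the helper name `incidence` are hypothetical, and the incidence matrices are taken here to map phenotypic records to environments and to genotypes respectively.

```python
import numpy as np

def incidence(labels, n_levels):
    """One row per record, one column per level; 1 where the record belongs to the level."""
    Z = np.zeros((len(labels), n_levels))
    Z[np.arange(len(labels)), labels] = 1.0
    return Z

# Six records: genotypes 0..2 observed in environment 0, then in environment 1.
geno = np.array([0, 1, 2, 0, 1, 2])
env = np.array([0, 0, 0, 1, 1, 1])

# Hypothetical genomic relatedness matrix for the three genotypes.
G = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])

Z_g = incidence(geno, 3)
Z_E = incidence(env, 2)

V_g = Z_g @ G @ Z_g.T   # covariance of the genotypic main effect
V_E = Z_E @ Z_E.T       # 1 where two records share an environment, else 0
V_gE = V_g * V_E        # Hadamard (cell-to-cell) product for the GEI term
```

The Hadamard product zeroes every cell linking records from different environments, so in the GxE sub-model the interaction term only lets related genotypes share information within the same environment.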
The accuracy advantage arises from the calculations that assume that all variation attributed to genotypes and GEI is captured by all PCs of the GGE analysis. For this reason, the GS on these PCs (instead of the actual phenotypes) is applied before converting the GEBVs of the PCs back to the original phenotypes.
In another advantage, the method finds particular utility in selecting a genotype for a given environment. That is, it can calculate for example, phenotype potential of genotypes in unobserved environments with improved accuracy. In the context of a plant, this may include selecting a plant genotype based on its pGEBV for cultivation in a new environment.
In preferred embodiments, the method is used for selecting a genotype for producing an improved organism in a given environment, by one or more of the following:
$$\mathrm{pGEBV} = G \times R^{-1}$$
using the reordered U matrix instead of the R matrix, where G is an (n×e) matrix of GEBVs for the GGE PCs scaled by multiplying each PC with its standard deviation; n is the number of validation individuals; and e is the number of environments; and adding a column of zeros to the end of the G matrix to match its dimensions; and
Accordingly, in another aspect of the present invention, there is provided a method for selecting a genotype for producing an improved organism in a given environment, wherein the method may comprise the steps of:
$$\mathrm{pGEBV} = G \times R^{-1}$$
Preferably, when reordering columns of the U matrix to match an order of the rotation matrix, the column of the U matrix that has the highest absolute correlation coefficient value is ordered with the first column in the rotation matrix. In this process, the extra column in the U matrix that does not have high correlation with any column in the rotation matrix would become the last column in the U matrix.
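The column matching described above might be sketched as follows. This is an illustrative reading only: it assumes the U matrix and the rotation matrix have been aligned to the same rows so that column correlations can be computed, that U has one more column than the rotation matrix, and that matching proceeds greedily from the first rotation column onward. The function name and toy data are hypothetical.

```python
import numpy as np

def reorder_to_match(U, R):
    """Reorder the columns of U (m x (k+1)) to follow the columns of R (m x k)."""
    k = R.shape[1]
    # Absolute correlation between every column of R (rows) and of U (columns).
    corr = np.abs(np.corrcoef(R.T, U.T)[:k, k:])
    order, used = [], set()
    for j in range(k):
        # Best not-yet-assigned column of U for the j-th rotation column.
        ranked = [int(c) for c in np.argsort(-corr[j]) if int(c) not in used]
        order.append(ranked[0])
        used.add(ranked[0])
    # The unmatched extra column of U goes last.
    order.extend(c for c in range(U.shape[1]) if c not in used)
    return U[:, order], order

# Toy check: U holds sign-flipped, shuffled copies of R's columns plus one noise column.
rng = np.random.default_rng(7)
R = rng.normal(size=(8, 3))
noise = rng.normal(size=8)
U = np.column_stack([-R[:, 2], noise, R[:, 0], -R[:, 1]])
U_reordered, order = reorder_to_match(U, R)
```

Absolute correlations are used because the sign of a singular vector is arbitrary, so a column and its negation describe the same axis.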
In preferred embodiments, the given environment is a new environment that is not included in the reference population.
The third method is particularly accurate and accordingly advantageous. It assumes that the pGEBVs for the unobserved environment can be calculated from its correlation with the reference environments, as well as the GEBVs of the GGE principal components for the reference environments. This method showed the highest average accuracy for calculating new environments. It can also be applied in breeding programs where massive populations get screened in multiple environments or seasons with high-throughput phenotyping techniques. Moreover, this method is computationally more efficient in terms of memory and time requirements.
In preferred embodiments of selecting a genotype, an organism of the selected genotype is located to the given environment, and an improved organism is produced. In context, an “improved organism” encompasses a single organism and also a plurality. The improved organism may have any advantageous phenotype, generally as compared to an organism with a lower pGEBV or GEI correlation for the given environment. For example, in the context of a plant, preferably a plant of the selected genotype is planted in the given environment, and an improved plant is produced. In context, an “improved plant” encompasses a single plant and also a cultivar or crop. The improvement may be for example by way of a larger yield which may be characterised by a larger, denser or otherwise higher producing plant or plant part.
Accordingly, in another aspect of the present invention, there is provided a method for producing an improved organism, comprising the steps of:
Of course, locating an organism in a given environment encompasses tending to it; for example in the context of a plant, planting, cultivating, cropping, fertilising etc. as appropriate. In preferred embodiments of this aspect of the present invention, the method is for obtaining an improved plant.
In this specification, the term ‘comprises’ and its variants are not intended to exclude the presence of other integers, components or steps.
In this specification, reference to any prior art is not and should not be taken as an acknowledgement or any form of suggestion that this prior art forms part of the common general knowledge in Australia or any other jurisdiction or that this prior art could reasonably be expected to be combined by a person skilled in the art.
The present invention will now be more fully described with reference to the accompanying Examples and drawings. It should be understood, however, that the description following is illustrative only and should not be taken in any way as a restriction on the generality of the invention described above.
A computationally efficient model has been developed that combines GGE analysis with genomic selection, named 3GS, to improve the accuracy for calculating GEI. The model first estimates marker effects for all PCs produced by a GGE analysis, before using the effects to calculate GEBVs for new genotypes. Then it converts the GEBVs to pGEBV by multiplying them with the inverse of the rotation matrix.
The performance of 3GS was compared to standard GBLUP, with and without modeling GEI, using wheat grain yield phenotypes in 20 diverse environments. Environments were grouped in two major clusters with pairwise phenotypic correlation coefficients ranging from −0.28 to 0.77. On average, 3GS showed 12% higher accuracy compared to the best GBLUP model over all environments. The accuracy advantage occurs primarily in one cluster, in which low to negative correlations are present between environments and the accuracy is around 31% higher than that of GBLUP. A statistical method was also developed to calculate unobserved genotypes in unobserved environments with good accuracy based on their correlations with the reference environments. When run as a multithreaded version, the 3GS model is about 80 times faster than the GBLUP model implemented in the BGLR package (30 seconds for 3GS vs 40 minutes for BGLR). This computational efficiency is expected to further increase for larger datasets. The 3GS model improves calculation accuracy for traits with complex GEI and exhibited enhanced performance for negatively correlated environments.
The phenotypic and genotypic data for a total of 367 spring wheat genotypes were downloaded from the TCAP database (https://triticeaetoolbox.org/wheat). The phenotypic data included grain yield records for 20 field trials conducted between 2011 and 2014 with irrigation and rain-fed treatments. The trials were distributed in seven geographical locations across the United States (Davis, Imperial, Bozeman, Huntley and Othello), Mexico (Obregon) and Canada (Saskatoon) with at least 250 genotypes per trial. Trial names were coded with the first two letters of the location name followed by the season (11 to 14) followed by the treatment (IRR for irrigation and RF for rainfed). A total of 144 genotypes with phenotypic records in almost all trials (missing rate of phenotypic records=0.8%) were used as a reference population. The remaining genotypes were used for validation to avoid overlap between the reference and validation populations. The population was genotyped with 90K Infinium single nucleotide polymorphism (SNP) chip which resulted in 22,214 SNPs after filtering for a minor allele frequency <5% and call rate <10%. Narrow sense heritability was estimated using the genomic-relatedness-based restricted maximum likelihood (GREML) analysis by fitting the genomic-relatedness matrix in the mixed linear model implemented in MTG2 software (Lee et al., 2012; Lee and van der Werf, 2016).
GGE analysis was conducted with the nonlinear iterative partial least squares (NIPALS) method implemented in the R package ‘GGE’ (http://kwstat.github.io/gge/). The general equation for the GGE model following Yan (2000) is:
where Φij is the genotype x environment two-way matrix of GGE effects; i is the range between 1 and g (total number of genotypes); j is the range between 1 and e (total number of environments);
The 3GS model implements the following major steps:
1—Calculate the GGE PCs for the phenotypic data of the reference population (144 genotypes) using equation [1];
2—Estimate the SNP effects for each PC. The Bayesian Ridge Regression (BRR) model was used to calculate SNP effects as implemented in the R package BGLR (Pérez and de Los Campos, 2014). The analysis was run with 10,000 iterations with the first 5,000 iterations considered as burn-in. The analysis was multithreaded by running each PC on a different core;
3—Calculate GEBVs for the validation population (n=223 genotypes) using the SNP effects for the PCs as $\mathrm{GEBV} = Z\hat{\beta}$ (Equation 2), where $Z$ is the SNP allelic dosage matrix for the validation population and $\hat{\beta}$ is the SNP effects estimated in Step 2; and
4—Convert the GEBVs of the GGE PCs into pGEBV for each environment with the following equation:
$$\mathrm{pGEBV} = G \times R^{-1} \qquad \text{(Equation [3])}$$
where $G$ is an (n×e) matrix of GEBVs for the GGE PCs scaled by multiplying each PC by its standard deviation; n is the number of validation individuals; $R^{-1}$ is the inverse of the rotation matrix (e×e), or the environment coordinate matrix which was scaled by dividing each column by the standard deviation of the corresponding PC. pGEBV is an (n×e) matrix, so each environment had its own pGEBV values. Accuracy of genomic calculation was calculated as the Pearson correlation between pGEBV and the actual phenotypic record for each environment. To calculate standard deviations for accuracy estimations, accuracies were calculated on 100 replicates, each a randomly selected 80% of the validation population. Only scenarios of calculating untested genotypes in observed or unobserved (new) environments were considered for validation.
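The accuracy measure described above can be sketched as follows for a single environment. The function name, seed and simulated values are assumptions for illustration, and missing phenotypic records are not handled here.

```python
import numpy as np

def accuracy_with_sd(pgebv, phenotype, n_reps=100, frac=0.8, seed=0):
    """Mean and standard deviation of the Pearson correlation over random subsamples."""
    rng = np.random.default_rng(seed)
    n = len(phenotype)
    size = int(round(frac * n))
    accuracies = np.empty(n_reps)
    for r in range(n_reps):
        # Draw 80% of the validation genotypes without replacement.
        idx = rng.choice(n, size=size, replace=False)
        accuracies[r] = np.corrcoef(pgebv[idx], phenotype[idx])[0, 1]
    return accuracies.mean(), accuracies.std(ddof=1)

# Toy data (hypothetical): 223 validation genotypes in one environment.
rng = np.random.default_rng(5)
pgebv_env = rng.normal(size=223)
phenotype_env = pgebv_env + rng.normal(scale=2.0, size=223)
mean_acc, sd_acc = accuracy_with_sd(pgebv_env, phenotype_env)
```

Running the function once per column of the pGEBV matrix gives one accuracy and one standard deviation per environment.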
The GE, GxE and 3GS analyses were repeated 20 times after excluding one environment in each run to be used for validation and to assess the capability of these models to calculate new environments that were not included in the reference. The GE model resulted in a single GEBV per individual over all environments, while the GxE and 3GS models produced environment specific GEBVs for each reference environment. The following three approaches to calculate new environments were compared, of which the first two are also applicable for the GxE model:
The 3GS model was compared to the standard GBLUP model with the following equation:
$$y = \mu + E + g + gE + \epsilon \qquad \text{(Equation [A])}$$
where $y$ is the phenotype; $\mu$ is the intercept; $E$ is the environmental effect, $E \sim N(0, V_E\sigma_E^2)$, $V_E = Z_E Z_E'$, where $Z_E$ is the incidence matrix allocating genotypes to environments; $g$ is the genotypic effect, $g \sim N(0, V_g\sigma_g^2)$, $V_g = Z_g G Z_g'$, where $Z_g$ is the incidence matrix allocating phenotypes to genotypes and $G$ is the genomic relatedness matrix estimated following the first method described in VanRaden (2008); and $\epsilon$ is the residual, $\epsilon \sim N(0, \sigma_\epsilon^2)$.
$gE$ represents the GEI effect and was equal to zero for the model without GEI (named GE). For the model that fitted GEI (named GxE), $gE \sim N(0, V_{gE}\sigma_{gE}^2)$, $V_{gE} = V_g \odot V_E$, where $\odot$ is the Hadamard or cell-to-cell product. Both models were fitted in BGLR (Pérez and de Los Campos, 2014).
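For reference, the first method of VanRaden (2008) for the genomic relatedness matrix is commonly written as $G = WW' / (2\sum_j p_j(1-p_j))$, where $W$ is the dosage matrix centred by twice the allele frequency at each marker. A minimal sketch is given below; the function name and simulated dosages are assumptions, and allele frequencies are estimated from the sample itself.

```python
import numpy as np

def vanraden_g(M):
    """Genomic relatedness matrix from an (n x m) allelic dosage matrix coded 0/1/2."""
    p = M.mean(axis=0) / 2.0                 # allele frequency at each marker
    W = M - 2.0 * p                          # centre each marker by twice its frequency
    return (W @ W.T) / (2.0 * np.sum(p * (1.0 - p)))

# Toy data (hypothetical): 10 genotypes scored at 500 markers.
rng = np.random.default_rng(11)
M = rng.integers(0, 3, size=(10, 500)).astype(float)
G_mat = vanraden_g(M)
```

Markers that are monomorphic in the sample contribute nothing to either the numerator or the denominator, so they are usually filtered out beforehand, for example by a minor allele frequency threshold.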
The twenty environments had a narrow sense heritability (h2) value ranging between 0.11 and 0.62 with an average of 0.31 and they were clustered in two major groups (Table 1;
The 3GS model was compared to the standard GBLUP model without (GE) and with (GxE) modelling GEI considering the 20 environments in the reference population. The results clearly demonstrated increased calculation accuracy when using 3GS compared to both GBLUP models. On average over all environments, applying 3GS increased the accuracy by 70% compared to the GE model and by 12% compared to the GxE model (0.252 for 3GS vs 0.164 for GE and 0.226 for GxE; Table 1). The calculation accuracy advantage occurred prominently in environments belonging to Cluster 2, where the accuracy of the 3GS model (r=0.293) was more than double that of the GE model (r=0.132) and was 31% higher than the GxE model (r=0.224). The average calculation accuracies of Cluster 1 environments were comparable between the 3GS, GE and GxE models: 0.217, 0.196 and 0.227, respectively (Table 1).
The pGEBV solutions produced by the 3GS model for environments within Cluster 1 were very comparable to the solutions produced by the GxE model. The average correlation coefficient between the two models was 0.95, with a range from 0.91 to 0.99 (
Comparisons of the correlations between pGEBVs produced by the 3GS and GxE models to the phenotypic correlations showed that the GxE model tended to overestimate the correlation among environments, while 3GS produced more realistic estimates (Tables 2A-C). In Tables 2A-C, positive values indicate positive pairwise correlations, and negative values indicate negative pairwise correlations. The depth of shading is indicative of the strength of correlation, with deeper shades representing stronger pairwise correlations. A positive correlation indicates that genotypes tend to perform similarly in both environments. For example, genotypes that have high phenotypes in one environment would be expected to have high phenotypes in another environment. The higher the correlation value, the stronger the relationship between the environments. On the other hand, negative correlations indicate the reverse. For example, genotypes that have high phenotypes in one environment are expected to have a low phenotype in another environment. Almost all pairwise correlations for the pGEBVs of the GxE model were higher than those of the phenotypic data, with an average increase of 0.35. Environments in Cluster 1 showed a higher average increase (0.43) compared to environments in Cluster 2 (0.32). The pGEBVs of the 3GS model showed higher correlations only for environments within Cluster 1 (average 0.26 increase), while differences for Cluster 2 and inter-cluster correlations ranged from −0.41 to 0.65, with an average of zero. The average absolute difference between the correlations of the pGEBVs of the 3GS model and the phenotypic correlations was equal to 0.21, which was smaller than that of the GxE model (inferred from Table 2C to be equal to 0.35).
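The comparison described above, between the environment correlations implied by the pGEBVs and those observed in the phenotypes, could be computed along the following lines. The function name and simulated matrices are hypothetical; both inputs are assumed to hold the same genotypes in rows and the same environments in columns, without missing values.

```python
import numpy as np

def correlation_differences(pgebv, phenotypes):
    """Mean signed and mean absolute difference of pairwise environment correlations."""
    c_pred = np.corrcoef(pgebv, rowvar=False)       # (e x e) correlations among pGEBV columns
    c_obs = np.corrcoef(phenotypes, rowvar=False)   # (e x e) phenotypic correlations
    upper = np.triu_indices_from(c_pred, k=1)       # each environment pair counted once
    diff = c_pred[upper] - c_obs[upper]
    return diff.mean(), np.abs(diff).mean()

# Toy data (hypothetical): 50 genotypes in 4 environments.
rng = np.random.default_rng(21)
phenotypes = rng.normal(size=(50, 4))
pgebv = phenotypes + rng.normal(scale=0.5, size=(50, 4))
mean_diff, mean_abs_diff = correlation_differences(pgebv, phenotypes)
```

A positive mean signed difference would indicate that the model overestimates the correlation among environments, while the mean absolute difference summarises how far the model departs from the observed correlations in either direction.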
Omitting one environment from the reference population at a time showed that 3GS can calculate new environments with good accuracy. As the initial models for 3GS and GxE only produce pGEBVs for the reference environments, three novel methods were assessed to calculate new environments. The first and third methods further increased overall calculation accuracy for both the GxE and 3GS models, compared to the GE model. This improvement resulted mainly from higher calculation accuracy for environments in Cluster 2 (Table 1). The first method is the simplest to implement because it takes the pGEBVs of the reference environment that has the highest correlation with the unobserved environment and uses them directly for that environment. This method performed comparably in both the GxE and 3GS models, with an average calculation accuracy of 0.180 and 0.176, respectively (Table 1). The second method calculates the mean of pGEBVs within each cluster of environments (or mega-environment) for each individual. The first method calculated new environments more accurately than this second method (Table 1). The third method assumes that the pGEBVs for the unobserved environment can be calculated from its correlation with the reference environments, as well as the GEBVs of the GGE principal components for the reference environments. For this reason, it is specific to the 3GS model. This method showed the highest average accuracy for calculating new environments (average r=0.196) compared to the other two methods, regardless of whether they were applied on the GxE or 3GS model (Table 1).
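The first two methods described above lend themselves to a short sketch. How the correlation between the new environment and each reference environment is obtained is not restated here, so it is simply passed in as a vector; that input, the cluster labels, the function names and the toy values are all assumptions for illustration. The third method is not sketched.

```python
import numpy as np

def predict_from_nearest_environment(pgebv_ref, corr_with_new):
    """Method 1: reuse the pGEBVs of the most correlated reference environment."""
    best = int(np.argmax(corr_with_new))
    return pgebv_ref[:, best], best

def predict_from_cluster_mean(pgebv_ref, cluster_of_ref, new_env_cluster):
    """Method 2: average the pGEBVs over the reference environments of one cluster."""
    mask = np.asarray(cluster_of_ref) == new_env_cluster
    return pgebv_ref[:, mask].mean(axis=1)

# Toy data (hypothetical): 5 validation genotypes, 3 reference environments.
pgebv_ref = np.arange(15, dtype=float).reshape(5, 3)
corr_with_new = np.array([0.1, 0.7, -0.2])   # correlation of the new environment with each reference
cluster_of_ref = np.array([1, 2, 2])         # mega-environment label of each reference environment

nearest_pred, nearest_env = predict_from_nearest_environment(pgebv_ref, corr_with_new)
cluster_pred = predict_from_cluster_mean(pgebv_ref, cluster_of_ref, new_env_cluster=2)
```

Both functions return one value per validation genotype, which can then be correlated with the observed phenotypes in the held-out environment to obtain an accuracy.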
The 3GS model was computationally very efficient in terms of memory and time requirements. Calculating each PC required less than 30 seconds and is a process that can be easily parallelized. Hence, if the number of threads was equal to the number of environments, the entire analysis would require the same amount of time needed to analyze a single PC. The analysis also required a maximum of 2.6 GB of RAM per thread, which is slightly larger than the size of the genotypic data. In contrast, the GE model required slightly less than 3.5 minutes and 2.6 GB of RAM to run, while the GxE model required around 40 minutes and a maximum memory of 4.5 GB.
Several models have previously been proposed to fit GEI with genomic prediction. Burgueño et al. (2012) were the first to fit genetic correlation with GBLUP to account for GEI in crops, while Jarquin et al. (2014) were the first to extend the GBLUP model to fit environmental covariates and their interactions with genetic variants. The linear GBLUP model proposed by López-Cruz et al. (2015) decomposes variant effects into main or stability effects and environment-specific deviations, while Cuevas et al. (2016) implemented the same model in a nonlinear Gaussian Kernel (GK) framework. GK models are less practical because they capture epistatic effects that get disrupted over generations (He et al. 2017; Jiang et al. 2018). The latter two models assume the environments are positively correlated because the correlation is inferred as a proportion of the total variation, which makes them inefficient for uncorrelated and negatively correlated environments. Cuevas et al. (2017) modeled genetic effects as the Kronecker product of the correlations between environments and the genomic relatedness matrix. This approach showed comparable accuracies to the model proposed by Burgueño et al. (2012) when implemented in the GBLUP method. They also extended the model with a parameter representing the random effect among environments, which improved the calculation accuracy. However, this parameter cannot be applied to new genotypes that are not included in the training population, which limits its application in breeding programs. Compared to these previously described models, 3GS offers several advantages in terms of the complexity of the correlation structure among environments, the ability to calculate new genotypes in unobserved environments, and computational resource requirements.
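The Kronecker-product covariance structure mentioned above can be illustrated with a small numerical sketch. The matrices below are hypothetical, and the records are assumed to be ordered environment by environment; neither is taken from the cited work.

```python
import numpy as np

# Hypothetical correlation matrix among 2 environments (negatively correlated).
E = np.array([[1.0, -0.3],
              [-0.3, 1.0]])
# Hypothetical genomic relatedness matrix among 3 genotypes.
G = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.0]])

# Covariance of genetic effects for all genotype-by-environment records.
# Block (a, b) equals E[a, b] * G, so the entry for genotype i in
# environment a and genotype j in environment b is E[a, b] * G[i, j].
K = np.kron(E, G)
```

The resulting matrix has one row and column per genotype-by-environment record, so its size grows with the product of the number of genotypes and the number of environments.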
The 3GS model gave higher calculation accuracy compared to the GxE model for environments that are less related to other environments in the reference. 3GS is therefore more robust in calculating complex interactions of quantitative trait loci with environments (Hayes et al. 2016). This was further confirmed by the ability of the 3GS model to produce pGEBVs with comparable pairwise correlation values to those calculated using the original phenotypic data. In contrast, the GxE model consistently overestimated the relatedness among environments and flipped negatively correlated environments into positively correlated ones. Another advantage of the 3GS model is the calculation of the principal components of the GGE analysis, which allows all phenotyped and unphenotyped individuals in a GGE biplot to be compared for better selection decisions.
One of the main difficulties for GS models that fit GEI is the ability to calculate unobserved genotypes in unobserved environments. Most previously published models calculate their accuracies by calculating the performance of new genotypes in the reference environments or by calculating incomplete field trials only (Burgueño et al. 2012; Jarquin et al. 2014; López-Cruz et al. 2015; Cuevas et al. 2016, 2017). Jarquin et al. (2017) described different models to exploit GEI that allowed calculation of unobserved environments, but the accuracies of their models were very low when calculating new genotypes in new environments. In contrast, 3GS is very promising for calculating the performance of populations in new environments or future climates. The third method as detailed above for calculating unobserved environments, and which is specific to the 3GS model, showed an enhanced performance compared to the other two methods applied to both models. The concept behind this method assumes that the variance of the extra PC representing the special variance component of the new environment is equal to zero; in other words, it assumes that the new environment does not add any new variation to the dataset so it will be completely dependent on the reference environments given its correlation with them.
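One possible reading of the zero-variance assumption behind the third method is sketched below. It is an interpretation and is not confirmed by the description: the new environment is treated as a linear combination of the principal components of the reference environments, with loadings derived from its assumed correlations with those environments and no residual component of its own. The function name, the standardization step and the simulated data are hypothetical.

```python
import numpy as np

def project_new_environment(Z, r):
    """Z: standardized values (individuals x reference environments).
    r: assumed correlations of the new environment with each reference
    environment. Returns standardized values for the new environment."""
    n = Z.shape[0]
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                      # PC scores of the reference environments
    eigvals = s**2 / (n - 1)            # variance carried by each PC
    loadings_new = (Vt @ r) / eigvals   # loadings of the new environment on each PC
    return scores @ loadings_new

rng = np.random.default_rng(2)
raw = rng.normal(size=(50, 3))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Sanity check: a "new" environment whose correlations match those of
# reference environment 0 should reproduce that environment's values.
z_new = project_new_environment(Z, R[:, 0])
```

Under this reading, the quality of the result depends entirely on how well the correlations in `r` are known for the new environment.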
The first method for calculating new environments as detailed above was more biased than the third method because it infers its calculation from only one environment (the reference environment that has the highest correlation with the target new environment), which might not be the true calculator of the unobserved environment. This bias was noticed in the data, as the method calculated many environments with zero accuracy even though they were well calculated with the other methods (Table 1). For this reason, implementing the third method to calculate new environments in breeding programs is recommended. The second method did not perform as well on the current dataset. However, in very large multi-environmental trials where each mega-environment is well represented in the reference dataset and distinguished from other mega-environments, this method could have better accuracy.
In general, the complexity of analyzing multi-environmental trials increases exponentially when moving from a univariate model (single environment) to a multivariate (multi-environment) model. The R package BGGE (Granato et al. 2018) exploits the sparsity of covariance matrices to reduce the computational demand and was shown to be up to five times faster than the classical solver implemented in the R package BGLR (Pérez and de Los Campos, 2014). The multi-trait deep learning (MTDL) model proposed by Montesinos-López et al. (2018) can be parallelized to reduce computational time, while the variational Bayes model (BVM) proposed by Montesinos-López et al. (2017) was around 10 times faster than conventional Bayesian approaches. Nevertheless, each of these models is still computationally demanding because of its higher dimensionality compared to the univariate models. In contrast, calculating each PC in 3GS is equivalent to running a univariate GS model, meaning that the complexity of the analysis increases linearly with an increasing number of environments. This makes the 3GS model highly efficient from a computational standpoint in terms of both memory and time requirements. Moreover, because the GGE PCs are independent, the 3GS model can be easily parallelized and, with enough CPU cores, the total computational time can be reduced to the time required to analyze a single environment. The ability to calculate SNP effects for the PCs of the GGE analysis also makes it easier to calculate new genotypes using these effects, instead of repeating the entire analysis as in GBLUP (Lourenco et al. 2015).
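The idea of one univariate fit per principal component, followed by reuse of the SNP effects for new genotypes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used here: ridge regression stands in for the univariate GS model, the penalty `lam` is arbitrary, and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def fit_gge_pcs(Y):
    """Centre Y (genotypes x environments) by environment and take its SVD."""
    env_means = Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Y - env_means, full_matrices=False)
    return env_means, U * s, Vt         # means, PC scores, PC loadings

def ridge_snp_effects(X, y, lam):
    """Univariate ridge regression of one PC score vector on the markers."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def fit_sketch(X, Y, lam=1.0):
    env_means, scores, Vt = fit_gge_pcs(Y)
    x_means = X.mean(axis=0)
    Xc = X - x_means
    # One independent univariate fit per PC; these fits could run in parallel.
    effects = np.column_stack(
        [ridge_snp_effects(Xc, scores[:, k], lam) for k in range(scores.shape[1])]
    )
    return {"env_means": env_means, "x_means": x_means,
            "effects": effects, "loadings": Vt}

def predict(model, X_new):
    """Calculate PC values from SNP effects, then map back to environments."""
    pc_values = (X_new - model["x_means"]) @ model["effects"]
    return model["env_means"] + pc_values @ model["loadings"]

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(100, 20)).astype(float)   # simulated marker genotypes
B = rng.normal(size=(20, 3))                           # simulated marker effects
Y = X @ B + rng.normal(scale=0.1, size=(100, 3))       # 3 simulated environments

model = fit_sketch(X, Y, lam=1.0)
pred = predict(model, X)
X_new = rng.integers(0, 3, size=(5, 20)).astype(float) # unphenotyped genotypes
pred_new = predict(model, X_new)
```

Because the number of PCs cannot exceed the number of environments, the number of univariate fits grows linearly with the number of environments, and new genotypes only require a matrix product with the stored effects.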
The 3GS model runs optimally for a ‘semi-balanced’ dataset across environments. However, while the reference population is expected to have phenotypic records in all environments, PC imputation algorithms such as nonlinear iterative partial least squares (NIPALS) can be used to infer some missing phenotypic data in 3GS with minimal effect on calculation accuracy. Preda et al. (2010) reported that less than 5% of missing data had no significant effect, while up to 15% of missing data had an acceptable effect on the accuracy of calculating different PCs. Sattari et al. (2017) showed that imputing 10% missing precipitation data with the NIPALS method had an accuracy of 0.94.
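A minimal sketch of NIPALS-style imputation is given below for a single principal component. It assumes the data matrix has already been centred as required, that every row and column has at least one observed value, and that one component is sufficient; the function names and the simulated rank-one data are hypothetical and chosen so that the result is easy to check.

```python
import numpy as np

def nipals_first_pc(Y, n_iter=500, tol=1e-10):
    """First PC of Y (rows x columns), skipping NaN cells in every regression."""
    observed = ~np.isnan(Y)
    W = observed.astype(float)
    Y0 = np.where(observed, Y, 0.0)
    t = Y0[:, 0].copy()                      # initial scores: first column
    for _ in range(n_iter):
        p = (Y0.T @ t) / (W.T @ t**2)        # loadings from observed cells only
        p /= np.linalg.norm(p)
        t_new = (Y0 @ p) / (W @ p**2)        # scores from observed cells only
        converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
        t = t_new
        if converged:
            break
    return t, p

def impute(Y, t, p):
    """Fill missing cells with the rank-one reconstruction t * p'."""
    return np.where(np.isnan(Y), np.outer(t, p), Y)

rng = np.random.default_rng(3)
a = rng.normal(size=30)                      # simulated genotype scores
b = rng.uniform(1.0, 2.0, size=5)            # simulated environment loadings
Y_true = np.outer(a, b)
Y_missing = Y_true.copy()
for i in range(0, 30, 2):
    Y_missing[i, i % 5] = np.nan             # hide 15 of 150 cells (10%)

t, p = nipals_first_pc(Y_missing)
Y_filled = impute(Y_missing, t, p)
```

Real phenotypic data are not exactly rank one, so in practice several components would be extracted and the imputed values would only approximate the missing records.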
A novel computational model called 3GS has been developed that combines genomic selection with genotype plus genotype×environment interaction (GGE) analysis. The new model improved calculation accuracy above previously reported models that exploit GEI. It also has more elasticity to model complex relationships among environments without inflating the correlation coefficients and does not appear to be impacted by negative correlations among environments. Unlike previous models that consider GEI, 3GS is sufficiently flexible to calculate new genotypes in unobserved environments with good accuracy. Moreover, 3GS has a computational advantage over existing models, especially for massive datasets, because its complexity increases linearly with an increasing number of environments. For this reason, 3GS can be optimally applied in current modern breeding programs where massive populations are screened in multiple environments or seasons with high-throughput phenotyping techniques.
Finally, it is to be understood that various alterations, modifications and/or additions may be made without departing from the spirit of the present invention as outlined herein.
Burgueño, J., de los Campos, G., Weigel, K., & Crossa, J. (2012). Genomic prediction of breeding values when modeling genotype×environment interaction using pedigree and dense molecular markers. Crop Science, 52(2), 707-719.
Cuevas, J., Crossa, J., Soberanis, V., Pérez-Elizalde, S., Pérez-Rodríguez, P., de los Campos, G., et al. (2016). Genomic prediction of genotype×environment interaction kernel regression models. The Plant Genome, 9(3), 1-20.
Number | Date | Country | Kind |
---|---|---|---|
2020904770 | Dec 2020 | AU | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/AU2021/051511 | 12/17/2021 | WO |