This invention is in the field of plant breeding. More specifically, this invention relates to the use of expression profiling technology in activities related to germplasm improvement and to methods and compositions for breeding crop plants with enhanced yield.
A sequence listing containing the file named “54066seq.txt” which is 422 bytes (measured in MS-Windows®) and created on Jul. 28, 2008, comprises 438 nucleotide sequences, and is herein incorporated by reference in its entirety.
A key objective of plant breeding is to increase genetic gain, often in terms of yield. Plant breeders have historically relied on phenotypic measurements as the basis of selection in order to achieve yield gain. There remains a need in the art for the identification of methods and compositions to increase the efficiency of plant breeding.
The present invention utilizes expression profiling technology in combination with a modified association study approach to map and identify targets for yield in corn plants. The compositions provided herein are useful in marker-assisted selection and as expression constructs in transgenic corn plants. The methods provided herein are exemplified in corn plants though the inventors contemplate utility in other crop plants, as well as in traits including but not limited to yield.
In one embodiment, the present invention is directed to a method of plant breeding. The method comprises determining the expression profile for at least one nucleic acid within the genome of at least one crop plant in a breeding population; associating the determined expression profile with at least one numerical value wherein the numerical value is related to one or more phenotypic traits; and making a plant breeding decision for the one or more crop plants based on the association. In more particular embodiments, the step of making the plant breeding decision may comprise selecting among breeding populations based on the at least one numerical value; selecting progeny in one or more breeding populations based on the at least one numerical value; predicting progeny performance of parental lines and selecting parental lines based on the predicted progeny performance; advancing lines in germplasm improvement activities based on the at least one numerical value; and selecting for at least one phenotypic trait based on the at least one numerical value associated with another phenotypic trait.
In another embodiment, the invention is directed to a method of plant breeding comprising providing at least two plants wherein at least one expression profile is assayed for at least one locus for each plant; determining an expression profile effect estimate for at least one phenotypic trait for at least two expression profiles for the at least one locus; and making plant breeding decisions based on the determined expression profile effect estimate for the at least one phenotypic trait.
In another embodiment, the invention is directed to a method of plant breeding comprising establishing a fingerprint map defining a plurality of loci within the genome of a breeding population; associating a QTL allele with known map location with a phenotypic trait; and assaying at least one other plant for presence of the QTL allele using at least one marker corresponding to at least one of the plurality of expression profiles to predict expression of the phenotypic trait.
In another embodiment, the invention is directed to a method of marker assisted breeding comprising providing a breeding population comprising at least two plants; associating at least one phenotypic trait with at least one locus of the genome of the plants of the breeding population, wherein the locus is defined by at least one nucleic acid sequence; and assaying for the presence of the at least one nucleic acid sequence of the at least one locus to predict expression of at least one phenotypic trait in a progeny plant of the breeding population.
In another embodiment, the invention is directed to a method of selecting a breeding population for use in a breeding program. The method comprises providing at least two distinct breeding populations; using a plurality of breeding values for at least one phenotypic trait for at least two expression profiles for at least one locus for the breeding populations; and selecting at least one breeding population based on at least one breeding value.
In a further embodiment, the invention is directed to an isolated nucleic acid marker comprising at least 20 consecutive nucleotides corresponding to a nucleic acid designated as SEQ ID NO 1-438, associated with enhanced yield.
In another embodiment, the invention is directed to a plant or parts thereof comprising at least one nucleic acid at least 80% identical to a nucleic acid selected from the group consisting of SEQ ID NO 1-438.
In another embodiment, the invention is directed to a substantially purified nucleic acid molecule comprising a nucleic acid sequence wherein said nucleic acid sequence exhibits 95% or greater identity to a nucleic acid sequence selected from the group consisting of SEQ ID NO: 1 through SEQ ID NO: 338 and sequences complementary to SEQ ID NO: 1 through SEQ ID NO: 338.
In still a further embodiment, the invention is directed to a transgenic plant comprising a nucleic acid molecule, wherein the nucleic acid molecule comprises: (a) an exogenous promoter region that functions in a plant cell to cause the production of an mRNA molecule, operably linked to (b) a structural nucleic acid molecule comprising a nucleic acid sequence having 95% or greater identity to a nucleic acid sequence selected from the group consisting of: (i) SEQ ID NO: 1 through SEQ ID NO: 338, (ii) a sequence complementary to SEQ ID NO: 1 through SEQ ID NO: 338, and (iii) any fragment of a nucleic acid sequence from “(i)” and/or “(ii)”; operably linked to (c) a 3′ non-translated sequence that functions in said plant cell to cause termination of transcription and/or addition of polyadenylated ribonucleotides to the 3′ end of said mRNA molecule.
The present invention includes a method for breeding of a crop plant, such as maize (Zea mays), soybean (Glycine max), cotton (Gossypium hirsutum), peanut (Arachis hypogaea), barley (Hordeum vulgare); oats (Avena sativa); orchard grass (Dactylis glomerata); rice (Oryza sativa, including indica and japonica varieties); sorghum (Sorghum bicolor); sugar cane (Saccharum sp); tall fescue (Festuca arundinacea); turfgrass species (e.g. species: Agrostis stolonifera, Poa pratensis, Stenotaphrum secundatum); wheat (Triticum aestivum), and alfalfa (Medicago sativa), members of the genus Brassica, broccoli, cabbage, carrot, cauliflower, Chinese cabbage, cucumber, dry bean, eggplant, fennel, garden beans, gourd, leek, lettuce, melon, okra, onion, pea, pepper, pumpkin, radish, spinach, squash, sweet corn, tomato, watermelon, ornamental plants, and other fruit, vegetable, tuber, oilseed, and root crops, wherein oilseed crops include soybean, canola, oil seed rape, oil palm, sunflower, olive, corn, cottonseed, peanut, flaxseed, safflower, and coconut, with enhanced traits comprising at least one sequence of interest, further defined as conferring a preferred property selected from the group consisting of herbicide tolerance, disease resistance, insect or pest resistance, altered fatty acid, protein or carbohydrate metabolism, increased grain yield, increased oil, increased nutritional content, increased growth rates, enhanced stress tolerance, preferred maturity, enhanced organoleptic properties, altered morphological characteristics, other phenotypic traits, traits for industrial uses, or traits for improved consumer appeal, wherein said traits may be nontransgenic or transgenic.
Non-limiting examples of silage quality traits include brown midrib (BMR) traits, in vitro digestibility of dry matter, leafiness, horny endosperm, crude protein, neutral detergent fiber, neutral detergent fiber digestibility, starch content, starch availability, kernel texture, milk/ton, fat content of milk, readily available energy, soluble carbohydrate digestibility, nonsoluble carbohydrate digestibility, reduced phytate production, reduced waste production, and silage yield.
Non-limiting examples of grain quality traits for biofuel yield include total biomass, fermentation yield, fermentation kinetics, total starch, extractable starch, starch morphology, phosphorus availability, waxy traits, glucose content, total oil content, germ oil content, endosperm oil content, fatty acid composition, kernel or seed morphology, amylose content, amylopectin content, protein composition and content (in particular, for end-use in animal feed following fractionation).
The present invention also provides for plants and parts thereof with compositions of one or more preferred nucleic acid sequences as described herein.
The definitions and methods provided define the present invention and guide those of ordinary skill in the art in the practice of the present invention. Unless otherwise noted, terms are to be understood according to conventional usage by those of ordinary skill in the relevant art. Definitions of common terms in molecular biology may also be found in Alberts et al., Molecular Biology of The Cell, 5th Edition, Garland Science Publishing, Inc.: New York, 2007; Rieger et al., Glossary of Genetics: Classical and Molecular, 5th edition, Springer-Verlag: New York, 1991; King et al., A Dictionary of Genetics, 6th ed., Oxford University Press: New York, 2002; and Lewin, Genes IX, Oxford University Press: New York, 2007. The nomenclature for DNA bases as set forth at 37 CFR §1.822 is used.
An “allele” refers to an alternative sequence at a particular locus; the length of an allele can be as small as 1 nucleotide base, but is typically larger. Allelic sequence can be denoted as nucleic acid sequence or as amino acid sequence that is encoded by the nucleic acid sequence.
A “locus” is a position on a genomic sequence that is usually found by a point of reference; e.g., a short DNA sequence that is a gene, or part of a gene or intergenic region. The loci of this invention comprise one or more polymorphisms in a population; i.e., alternative alleles present in some individuals.
As used herein, an “expression variant” refers to cases where two or more distinct RNA variants are transcribed from a DNA sequence. Expression variants can comprise alternative ribonucleotide arrangements wherein one or more introns are retained or one or more exons are spliced out. Expression variants can result in variant phenotypes, i.e., distinct polypeptides.
As used herein, “expression profile” means the characterization of a transcription product of a nucleic acid. In one aspect, an expression profile is presented as the relative hybridization of at least one target nucleic acid with at least one test nucleic acid, wherein the hybridization level may indicate up-regulation, down-regulation, or no change in expression level.
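By way of non-limiting illustration, the following minimal Python sketch shows one way a relative hybridization measurement might be converted into an up-regulation, down-regulation, or no-change call. The function name, the use of a reference signal, and the two-fold threshold are assumptions made for illustration only and are not specified herein.

```python
import math


def classify_expression(test_signal, reference_signal, fold_threshold=2.0):
    """Return (log2 ratio, call) for one target nucleic acid.

    fold_threshold is a hypothetical cutoff; the text does not prescribe one.
    """
    if test_signal <= 0 or reference_signal <= 0:
        raise ValueError("signals must be positive")
    # Relative hybridization expressed as a log2 ratio of test to reference.
    log_ratio = math.log2(test_signal / reference_signal)
    cutoff = math.log2(fold_threshold)
    if log_ratio >= cutoff:
        call = "up-regulated"
    elif log_ratio <= -cutoff:
        call = "down-regulated"
    else:
        call = "no change"
    return log_ratio, call
```

A symmetric log2 scale is used in this sketch so that a two-fold increase and a two-fold decrease lie at the same distance from zero.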
As used herein, “expression profile assay” means a method to query at least one target nucleic acid with at least one test nucleic acid using any means available in the art, including future platforms designed for measuring relative hybridization between two or more pools of nucleic acids. Non-limiting examples of platforms for detection of expression levels include DNA microarrays, DNA arrays, DNA chips, gene chips, bead-based arrays, RT-PCR, quantitative RT-PCR, and Northern blots.
As used herein, “target nucleic acid” means at least one nucleic acid physically associated with a substrate such as a bead, glass slide, or other microarray platform. The target nucleic acid may comprise an oligonucleotide sequence, a partial gene sequence, a full gene sequence, a non-coding sequence, and may overlap in sequence with other target nucleic acids.
As used herein, “test nucleic acid” means at least one nucleic acid that is combined with the target nucleic acid in order to evaluate level of hybridization, i.e., level of complementarity. This allows for the identification of polymorphisms between the test and target sequence, inference of regulatory modules, comparison of at least two test nucleic acids, and, in conjunction with phenotype data, association of expression profile levels with trait values to map QTL associated with expression (eQTL) or hybridization level (hQTL). In certain aspects, a test nucleic acid may be RNA or cDNA.
Further, any nucleic acid, as used herein, may comprise one or more haplotypes, portions of one or more haplotypes, one or more genes, portions of one or more genes, one or more QTL, and portions of one or more QTL. In addition, a plurality of nucleic acids can comprise one or more haplotypes, portions of one or more haplotypes, one or more genes, portions of one or more genes, one or more QTL, and portions of one or more QTL. The nucleic acid may originate from a DNA or RNA template, either directly or indirectly (i.e., cDNA obtained from reverse transcription of mRNA).
As used herein, “expression profile effect estimate” means a predicted effect estimate for an expression profile reflecting association with one or more phenotypic traits, wherein said associations can be made de novo or by leveraging historical expression profile-trait association data or genotype-trait association data wherein at least one genotype is identified to be associated with at least one expression profile.
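By way of non-limiting illustration, one simple way an effect estimate could be obtained de novo is to regress trait values on expression levels measured across plants. The sketch below assumes ordinary least squares with a single expression measurement per plant; the function name and inputs are hypothetical, and the definition above is not limited to this model.

```python
def effect_estimate(expression_levels, trait_values):
    """Least-squares slope of trait value on expression level.

    Illustrative assumption: one expression value and one trait value per plant.
    """
    n = len(expression_levels)
    if n != len(trait_values) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(expression_levels) / n
    mean_y = sum(trait_values) / n
    # Sum of squares of expression and cross-products with the trait.
    sxx = sum((x - mean_x) ** 2 for x in expression_levels)
    if sxx == 0:
        raise ValueError("expression levels show no variation")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(expression_levels, trait_values)
    )
    return sxy / sxx
```

The returned slope can be read as the expected change in trait value per unit change in expression level under the stated assumptions.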
As used herein, “breeding value” means a calculation based on expression profile effect estimates and expression profile frequency values, wherein expression profile frequency can be measured in terms of the underlying nucleic acid sequence associated with the expression profile. The breeding value of a specific expression profile relative to other expression profiles at the same locus (i.e., haplotype window), or across loci (i.e., haplotype windows), can also be determined. In other words, the change in population mean by fixing a nucleic acid sequence associated with a particular expression profile is determined. In addition, in the context of evaluating the effect of substituting a specific region in the genome, either by introgression or a transgenic event, breeding values provide the basis for comparing specific nucleic acid sequences, and their corresponding expression profiles, for substitution effects. Also, in hybrid crops, the breeding value of expression profiles can be calculated in the context of the expression profile in the tester used to produce the hybrid.
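By way of non-limiting illustration, the change in population mean from fixing a sequence associated with a particular expression profile can be sketched as follows. This sketch assumes, for illustration only, that the current population mean at a single locus is the frequency-weighted average of the effect estimates, and that the breeding value of a profile is its effect estimate minus that mean. The function name and dictionary inputs are hypothetical.

```python
def breeding_values(effects, frequencies):
    """Map each expression profile to (effect - frequency-weighted mean).

    effects, frequencies: dicts keyed by a profile label at one locus.
    This formula is an illustrative assumption, not a prescribed calculation.
    """
    if set(effects) != set(frequencies):
        raise ValueError("effects and frequencies must cover the same profiles")
    if abs(sum(frequencies.values()) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    # Current population mean under the assumed additive model.
    population_mean = sum(frequencies[p] * effects[p] for p in effects)
    # Fixing profile p moves the mean from population_mean to effects[p].
    return {p: effects[p] - population_mean for p in effects}
```

Under this assumption, a positive value indicates that fixing the profile would raise the population mean for the trait, and a negative value indicates that it would lower it.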
As used herein, “genotype” means the genetic component of the phenotype and it can be indirectly characterized using markers or directly characterized by nucleic acid sequencing. Suitable markers include a phenotypic character, a metabolic profile, a genetic marker, or some other type of marker. A genotype may constitute an allele for at least one genetic marker locus or a haplotype for at least one haplotype window. In some embodiments, a genotype may represent a single locus and in others it may represent a genome-wide set of loci. In another embodiment, the genotype can reflect the sequence of a portion of a chromosome, an entire chromosome, a portion of the genome, and the entire genome. As opposed to a genetic marker such as a SNP, where the genotype comprises a single nucleotide, the genotype identified with the expression profiling may constitute a plurality of nucleotides, where the length of the genotype is contingent on the length of the target nucleic acid sequence and the number of nucleotides at any locus is contingent on the number of probes (i.e., bi-allelic vs. all four possible base pairs for DNA or RNA). Notably, a genetic marker assay as known in the art (e.g., SNP detection via TaqMan) detects only two alleles. An advantage of the present invention is the potential to directly query all four nucleotides (adenine, A; thymine, T; cytosine, C; and guanine, G) simultaneously at any one nucleotide position based on the representation of target nucleic acids in the expression profile assay. That is, for any one base pair position, there will be twice the information when using direct nucleic acid sequencing versus a genetic marker assay. This can be very important in determining whether two lines share DNA that is identical by descent. With a SNP genotype, one can only assess whether a pair of alternative nucleic acid bases exist at a single nucleotide locus.
As used herein, an expression profile or a nucleic acid associated with an expression profile from each of two or more individual plants from the same genomic region, that may or may not be associated with one or more phenotypic trait values, provides the basis for decisions related to germplasm improvement activities, wherein one or more loci can be evaluated. Knowing whether two sequences at a locus are completely identical or if they contain combinations of identical and non-identical loci can aid in determining whether the loci have the same trait value, are linked to the same traits or are identical by descent. Therefore in another aspect, one or more nucleic acid sequences from one or more individual plants that are associated with a phenotypic trait value can provide the basis for decisions related to germplasm improvement activities.
As used herein, “polymorphism” means the presence of one or more variations of a nucleic acid sequence at one or more loci in a population of one or more individuals. The variation may comprise but is not limited to one or more base changes, the insertion of one or more nucleotides or the deletion of one or more nucleotides. A polymorphism may arise from random processes in nucleic acid replication, through mutagenesis, as a result of mobile genomic elements, from copy number variation and during the process of meiosis, such as unequal crossing over, genome duplication and chromosome breaks and fusions. The variation can be commonly found, or may exist at low frequency within a population, the former having greater utility in general plant breeding and the latter may be associated with rare but important phenotypic variation. Useful polymorphisms may include single nucleotide polymorphisms (SNPs), insertions or deletions in DNA sequence (Indels), simple sequence repeats of DNA sequence (SSRs), a restriction fragment length polymorphism, and a tag SNP. A genetic marker, a gene, a DNA-derived sequence, a haplotype, an RNA-derived sequence, a promoter, a 5′ untranslated region of a gene, a 3′ untranslated region of a gene, microRNA, siRNA, a QTL, a satellite marker, a transgene, mRNA, ds mRNA, a transcriptional profile, and a methylation pattern may comprise polymorphisms. In addition, the presence, absence, or variation in copy number of the preceding may comprise a polymorphism.
As used herein, the term “single nucleotide polymorphism,” also referred to by the abbreviation “SNP,” means a polymorphism at a single site wherein said polymorphism constitutes a single base pair change, an insertion of one or more base pairs, or a deletion of one or more base pairs.
As used herein, “marker” means a detectable characteristic that can be used to discriminate between organisms. Examples of such characteristics may include genetic markers, protein composition, protein levels, oil composition, oil levels, carbohydrate composition, carbohydrate levels, fatty acid composition, fatty acid levels, amino acid composition, amino acid levels, biopolymers, pharmaceuticals, starch composition, starch levels, fermentable starch, fermentation yield, fermentation efficiency, energy yield, secondary compounds, metabolites, morphological characteristics, and agronomic characteristics. As used herein, “genetic marker” means polymorphic nucleic acid sequence or nucleic acid feature.
As used herein, “marker assay” means a method for detecting a polymorphism at a particular locus using a particular method, e.g., measurement of at least one phenotype (such as seed color, flower color, or other visually detectable trait), restriction fragment length polymorphism (RFLP), single base extension, electrophoresis, sequence alignment, allele-specific oligonucleotide hybridization (ASO), random amplified polymorphic DNA (RAPD), microarray-based technologies, and nucleic acid sequencing technologies.
As used herein, the term “haplotype” means a chromosomal region within a haplotype window. Typically, the unique marker fingerprint combinations in each haplotype window define and differentiate individual haplotypes for that window. As used herein, a haplotype is defined and differentiated by one or more nucleic acid sequences at one or more loci within a “haplotype window,” wherein a nucleic acid sequence may represent a single base pair genotype (i.e., a SNP), an insertion or deletion, one or more contiguous base pairs (i.e., a nucleic acid sequence), or a corresponding expression profile.
As used herein, the term “haplotype window” means a chromosomal region that is established by statistical analyses known to those of skill in the art and is in linkage disequilibrium. In the art, identity by state between two inbred individuals (or two gametes) at one or more molecular marker loci located within this region is taken as evidence of identity-by-descent of the entire region, wherein each haplotype window includes at least one polymorphic molecular marker. As used herein, haplotype windows are defined by at least two polymorphisms for at least one locus. Haplotype windows can be mapped along each chromosome in the genome and do not necessarily need to be contiguous. Haplotype windows are not fixed per se and, given the ever-increasing amount of nucleic acid sequence information, this invention anticipates the number and size of haplotype windows to evolve, with the number of windows increasing and their respective sizes decreasing, thus resulting in an ever-increasing degree of confidence in ascertaining identity by descent based on the identity by state of genotypes. The advantage of using haplotype windows to delineate genomic regions of interest is that these genomic regions tend to be inherited as linkage blocks and thus are informative for association mapping and for tracking across multiple generations.
As used herein, “phenotype” means the detectable characteristics of a cell or organism which can be influenced by gene expression.
As used herein, “consensus sequence” means a constructed DNA sequence which identifies single nucleotide and indel polymorphisms in alleles at a locus. Consensus sequence can be based on either strand of DNA at the locus and states the nucleotide base of either one of each SNP in the locus and the nucleotide bases of all Indels in the locus. Thus, although a consensus sequence may not be a copy of an actual DNA sequence, a consensus sequence is useful for precisely designing primers and probes for actual polymorphisms in the locus.
As used herein, “linkage” refers to the relative frequency at which types of gametes are produced in a cross. For example, if locus A has genes “A” or “a” and locus B has genes “B” or “b”, a cross between parent I with AABB and parent II with aabb will produce four possible gametes where the genes are segregated into AB, Ab, aB and ab. The null expectation is that there will be independent equal segregation into each of the four possible genotypes, i.e., with no linkage ¼ of the gametes will be of each genotype. Segregation of gametes into genotypes at frequencies differing from ¼ is attributed to linkage.
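By way of non-limiting illustration, the departure from the ¼ expectation can be summarized as a recombination fraction, i.e., the proportion of gametes that carry a non-parental combination. The following minimal sketch uses hypothetical gamete counts and a hypothetical function name; the parental gamete types AB and ab follow the AABB by aabb example above.

```python
def recombination_fraction(gamete_counts, parental=("AB", "ab")):
    """Fraction of gametes carrying a non-parental allele combination.

    gamete_counts: dict mapping gamete type (e.g. "AB") to an observed count.
    """
    total = sum(gamete_counts.values())
    if total == 0:
        raise ValueError("no gametes observed")
    recombinant = sum(
        count for gamete, count in gamete_counts.items()
        if gamete not in parental
    )
    return recombinant / total
```

With no linkage, each of the four gamete types is expected at ¼, so the recombination fraction is 0.5; values below 0.5 are consistent with linkage between the two loci.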
As used herein, “linkage disequilibrium” is defined in the context of the relative frequency of gamete types in a population of many individuals in a single generation. If the frequency of allele A is p, a is p′, B is q and b is q′, then the expected frequency (with no linkage disequilibrium) of genotype AB is pq, Ab is pq′, aB is p′q and ab is p′q′. Any deviation from the expected frequency is called linkage disequilibrium. Two loci are said to be “genetically linked” when they are in linkage disequilibrium.
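By way of non-limiting illustration, the deviation described above is commonly expressed as the coefficient D, the observed frequency of AB minus the expected frequency pq. The sketch below computes D from the four gamete frequencies; the function name and argument names are hypothetical.

```python
def linkage_disequilibrium(f_AB, f_Ab, f_aB, f_ab):
    """Return D = freq(AB) - p*q, where p = freq(A) and q = freq(B)."""
    if abs(f_AB + f_Ab + f_aB + f_ab - 1.0) > 1e-9:
        raise ValueError("gamete frequencies must sum to 1")
    p = f_AB + f_Ab  # frequency of allele A
    q = f_AB + f_aB  # frequency of allele B
    # Expected frequency of AB with no linkage disequilibrium is p*q.
    return f_AB - p * q
```

A value of zero corresponds to the expected frequencies given above, and any nonzero value indicates linkage disequilibrium between the two loci.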
As used herein, “quantitative trait locus (QTL)” means a locus that controls to some degree numerically representable traits that are usually continuously distributed.
As used herein, “nucleic acid sequencing” means the determination of the order of nucleotides in a sample of nucleic acids, wherein nucleic acids include DNA and RNA molecules. “High throughput nucleic acid sequencing” means an automated and massively parallel approach for the determination of nucleotides in a sample of nucleic acids wherein examples of high throughput nucleic acid sequencing technology include, but are not limited to, platforms provided by 454 Life Sciences, Agencourt Bioscience, Applied Biosystems, LI-COR Biosciences, Microchip Biotechnologies, Network Biosystems, NimbleGen Systems, Illumina (Solexa), and VisiGen Biotechnologies, comprising but not limited to formats such as parallel bead arrays, sequencing by synthesis, sequencing by ligation, capillary electrophoresis, electronic microchips, “biochips,” microarrays, parallel microchips, and single-molecule arrays, as reviewed by Service (Science 2006 311:1544-1546).
As used herein, “aligning” or “alignment” of two or more nucleic acid sequences is the comparison of the nucleic acid sequences found at the same locus. Several methods of alignment are known in the art and are included in most of the popular bioinformatics packages.
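By way of non-limiting illustration, once two sequences have been aligned, a percent identity such as the 80% or 95% thresholds recited herein can be computed by counting matching positions. The sketch below assumes the sequences are already aligned to equal length by an external tool and that a gap opposite a base counts as a mismatch; both choices, and the function name, are assumptions, since identity conventions differ between bioinformatics packages.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two pre-aligned sequences of equal length.

    Columns where both sequences have a gap are skipped; a gap opposite a
    base is counted as a mismatch (an illustrative convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared
```

For example, two aligned 10-base sequences that differ at a single position have 90% identity under this convention.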
As used herein, the term “primer” means a single strand of synthetic oligonucleotide, preferably from 10 to 120 nucleotides, which can be synthesized chemically or assembled from several chemically synthesized oligonucleotides. As used herein, primers may be used to initiate sequencing reactions and polymerase reactions, such as in gap fill reactions and PCR. As used herein, a primer will hybridize under the assay conditions specifically to a desired target sequence. As used herein, primers may be used to introduce a DNA tag, to introduce chemically modified bases, such as biotin labeled bases, or to introduce a hybridization sequence that can subsequently be used for capture, such as capture to a sequencing matrix or to an avidin-containing surface.
As used herein, “DNA amplification” means the in vitro synthesis of double stranded DNA through the use of a DNA polymerase. Typically, this is accomplished in a polymerase chain reaction (PCR) assay but may also include other methods such as a gap-fill reaction, mis-match repair, Klenow reaction, etc. DNA amplification is used to provide detectable or excess amounts of a specific DNA. It can also be used to incorporate into a target nucleic acid, hybridized probes, annealed adaptors and primers which may include specific functionality or information.
As used herein, the term “biomarker” refers to a marker representing at least one phenotype of interest in an organism, such as a crop plant, wherein the marker constitutes the characterization of a nucleic acid expression level or protein level such that it is present or absent, elevated, decreased, temporally variable, spatially variable, or quantifiable in some other aspect.
As used herein, the term “corn” means Zea mays or maize and includes all plant varieties that can be bred with corn, including wild maize species.
As used herein, the term “soybean” means Glycine max and includes all plant varieties that can be bred with soybean, including wild soybean species.
As used herein, the term “canola” means Brassica napus and B. campestris and includes all plant varieties that can be bred with canola, including wild Brassica species and other agricultural Brassica species.
As used herein, the term “elite line” means any line that has resulted from breeding and selection for superior agronomic performance. An elite plant is any plant from an elite line.
In accordance with the present invention, Applicants have discovered methods for breeding plants wherein breeding decisions can be made genotypically on the expression profile of at least one nucleic acid sequence associated with at least one phenotypic trait of interest. For example, the methods of the present invention provide for use of expression profiling technology to identify targets for selection and to serve as a secondary marker for making selection decisions for at least one locus of interest. Further, the methods of the present invention allow for improved flexibility, wherein the entire genome can be queried without reliance on pre-determined genetic markers and the development of genetic marker detection assays. In addition, any length of sequence from any locus can be leveraged to 1) determine genotype-trait associations, 2) discriminate between two or more lines, 3) predict line performance or hybrid performance and, ultimately, 4) provide the basis for improved breeding activities related to germplasm improvement. Further, this invention provides compositions for increased yield in a crop plant, specifically maize. This invention also provides methods and compositions for use of expression profiling technology as a biomarker to screen plants for response to stress or biological processes.
Breeding has advanced from selection for economically important traits in plants and animals based on phenotypic records of an individual and its relatives to the application of molecular genetics to identify genomic regions that contain valuable genetic traits. Inclusion of genetic markers in breeding programs has accelerated the genetic accumulation of valuable traits into a germplasm compared to that achieved based on phenotypic data only. Herein, “germplasm” includes breeding germplasm, breeding populations, collection of elite inbred lines, populations of random mating individuals, and biparental crosses. Genetic marker alleles (an “allele” is an alternative sequence at a locus) are used to identify plants that contain a desired genotype at one marker locus, several loci, or a haplotype, and that are expected to transfer the desired genotype, along with a desired phenotype, to their progeny. This process has been widely referenced and has served to greatly economize plant breeding by accelerating the fixation of advantageous alleles and also eliminating the need for phenotyping every generation.
Molecular breeding is often referred to as marker-assisted selection (MAS) and marker-assisted breeding (MAB), wherein MAS refers to making breeding decisions on the basis of molecular marker genotypes and MAB is a general term representing the use of molecular markers in plant breeding. In these types of molecular breeding programs, genetic marker alleles can be used to identify plants that contain the desired genotype at one marker locus, several loci, or a haplotype, and that would be expected to transfer the desired genotype, along with a desired phenotype to their progeny. Markers are highly useful in plant breeding because once established, they are not subject to environmental or epistatic interactions. Furthermore, certain types of markers are suited for high throughput detection, enabling rapid identification in a cost effective manner.
Marker discovery and development in crops provides the initial framework for applications to MAB (U.S. Pat. No. 5,437,697; US Patent Applications 2005/0204780, 2005/0216545, 2005/0218305). The resulting “genetic map” is the representation of the relative position of characterized loci (DNA markers or any other locus for which alleles can be identified) along the chromosomes. The measure of distance on this map is relative to the frequency of crossover events between non-sister chromatids of homologous chromosomes at meiosis. As a set, polyallelic markers have served as a useful tool for fingerprinting plants to inform the degree of identity of lines or varieties (U.S. Pat. No. 6,207,367). These markers form the basis for determining associations with phenotype and can be used to drive genetic gain. The implementation of MAS, wherein selection decisions are based on marker genotypes, is dependent on the ability to detect underlying genetic differences between individuals.
Many individuals and companies have developed their own proprietary and trademarked versions of molecular breeding. One aspect in common is that they are based on markers to report differences which are then used to make selections. However, these markers provide little or no information on the differences at the DNA sequence level; for example, a typical biallelic SNP marker provides information on only one base pair position and it can only distinguish between 2, rather than 4, nucleotides. Using expression profile assays gives the power to query 4 nucleotides at any given position within a nucleic acid sequence as directed by inclusion of target nucleic acid sequences. Furthermore, this power will be useful to fingerprint plant populations or lineages to allow genome wide discovery of useful variation, build pedigrees or calculate breeding values.
The development of markers and the association of markers with phenotypes, or quantitative trait loci (QTL) mapping, for marker-assisted breeding has advanced in recent years. Examples of genetic markers are Restriction Fragment Length Polymorphisms (RFLP), Amplified Fragment Length Polymorphisms (AFLP), Simple Sequence Repeats (SSR), Single Nucleotide Polymorphisms (SNP), Insertion/Deletion Polymorphisms (Indels), Variable Number Tandem Repeats (VNTR), Random Amplified Polymorphic DNA (RAPD), and others known to those skilled in the art. Marker discovery and development in crops provides the initial framework for applications to marker-assisted breeding activities (US Patent Applications 2005/0204780, 2005/0216545, 2005/0218305, and 2006/00504538). The resulting “genetic map” is the representation of the relative position of characterized loci (DNA markers or any other locus for which alleles can be identified) along the chromosomes. The measure of distance on this map is relative to the frequency of crossover events between non-sister chromatids of homologous chromosomes at meiosis.
As a set, polymorphic markers serve as a useful tool for fingerprinting plants to inform the degree of identity of lines or varieties (U.S. Pat. No. 6,207,367). These markers form the basis for determining associations with phenotype and can be used to drive genetic gain. The implementation of marker-assisted selection is dependent on the ability to detect underlying genetic differences between individuals.
Genetic markers for use in the present invention include “dominant” or “codominant” markers. “Codominant markers” reveal the presence of two or more alleles (two per diploid individual). “Dominant markers” reveal the presence of only a single allele. The presence of the dominant marker phenotype (e.g., a band of DNA) is an indication that one allele is present in either the homozygous or heterozygous condition. The absence of the dominant marker phenotype (e.g., absence of a DNA band) is merely evidence that “some other” undefined allele is present. In the case of populations where individuals are predominantly homozygous and loci are predominantly dimorphic, dominant and codominant markers can be equally valuable. As populations become more heterozygous and multiallelic, codominant markers often become more informative of the genotype than dominant markers.
Nucleic acid molecules or fragments thereof are capable of specifically hybridizing to other nucleic acid molecules under certain circumstances. As used herein, two nucleic acid molecules are capable of specifically hybridizing to one another if the two molecules are capable of forming an anti-parallel, double-stranded nucleic acid structure. A nucleic acid molecule is the “complement” of another nucleic acid molecule if they exhibit complete complementarity. As used herein, molecules are said to exhibit “complete complementarity” when every nucleotide of one of the molecules is complementary to a nucleotide of the other. Two molecules are “minimally complementary” if they can hybridize to one another with sufficient stability to permit them to remain annealed to one another under at least conventional “low-stringency” conditions. Similarly, the molecules are “complementary” if they can hybridize to one another with sufficient stability to permit them to remain annealed to one another under conventional “high-stringency” conditions. Conventional stringency conditions are described by Sambrook et al., In: Molecular Cloning, A Laboratory Manual, 2nd Edition, Cold Spring Harbor Press, Cold Spring Harbor, N.Y. (1989), and by Haymes et al., In: Nucleic Acid Hybridization, A Practical Approach, IRL Press, Washington, D.C. (1985). Departures from complete complementarity are therefore permissible, as long as such departures do not completely preclude the capacity of the molecules to form a double-stranded structure. In order for a nucleic acid molecule to serve as a primer or probe it need only be sufficiently complementary in sequence to be able to form a stable double-stranded structure under the particular solvent and salt concentrations employed.
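By way of a non-limiting illustration, the definition of “complete complementarity” given above can be sketched in Python as follows. The function names are hypothetical and are not part of the disclosure; the sketch assumes unmodified DNA bases (A, C, G, T) and sequences of equal length, and it does not model hybridization stability or stringency.

```python
# Illustrative sketch only; function names are hypothetical.
# Assumes plain DNA sequences containing only A, C, G and T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the anti-parallel complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_complete_complement(seq_a, seq_b):
    """True when every nucleotide of seq_a pairs with a nucleotide of seq_b
    in an anti-parallel duplex (seq_b is the reverse complement of seq_a)."""
    return len(seq_a) == len(seq_b) and reverse_complement(seq_a) == seq_b.upper()


def mismatch_count(seq_a, seq_b):
    """Count positions that fail to pair when two equal-length sequences
    are aligned in anti-parallel orientation."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    target = reverse_complement(seq_a)
    return sum(1 for x, y in zip(target, seq_b.upper()) if x != y)
```

A pair with a mismatch count of zero exhibits complete complementarity in the sense defined above; a small non-zero count corresponds to a permissible departure only if the molecules can still form a stable double-stranded structure under the conditions employed, which this sketch does not evaluate.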
As used herein, a substantially homologous sequence is a nucleic acid sequence that will specifically hybridize to the complement of the nucleic acid sequence to which it is being compared under high stringency conditions. The nucleic-acid probes and primers of the present invention can hybridize under stringent conditions to a target DNA sequence. The term “stringent hybridization conditions” is defined as conditions under which a probe or primer hybridizes specifically with a target sequence(s) and not with non-target sequences, as can be determined empirically. The term “stringent conditions” is functionally defined with regard to the hybridization of a nucleic-acid probe to a target nucleic acid (i.e., to a particular nucleic-acid sequence of interest) by the specific hybridization procedure discussed in Sambrook et al., 1989, at 9.52-9.55. See also, Sambrook et al., 1989 at 9.47-9.52, 9.56-9.58; Kanehisa 1984 Nucl. Acids Res. 12:203-213; and Wetmur et al. 1968 J. Mol. Biol. 31:349-370. Appropriate stringency conditions that promote DNA hybridization, for example, 6.0× sodium chloride/sodium citrate (SSC) at about 45° C., followed by a wash of 2.0×SSC at 50° C., are known to those skilled in the art or can be found in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y., 1989, 6.3.1-6.3.6. For example, the salt concentration in the wash step can be selected from a low stringency of about 2.0×SSC at 50° C. to a high stringency of about 0.2×SSC at 50° C. In addition, the temperature in the wash step can be increased from low stringency conditions at room temperature, about 22° C., to high stringency conditions at about 65° C. Both temperature and salt may be varied, or either the temperature or the salt concentration may be held constant while the other variable is changed.
A fragment of a nucleic acid molecule can be any sized fragment and illustrative fragments include fragments of nucleic acid sequences set forth in SEQ ID NO: 1-176 and complements thereof. In one aspect, a fragment can be between 15 and 25, 15 and 30, 15 and 40, 15 and 50, 15 and 100, 20 and 25, 20 and 30, 20 and 40, 20 and 50, 20 and 100, 25 and 30, 25 and 40, 25 and 50, 25 and 100, 30 and 40, 30 and 50, or 30 and 100 nucleotides in length. In another aspect, the fragment can be greater than 10, 15, 20, 25, 30, 35, 40, 50, 100, or 250 nucleotides.
Additional genetic markers can be used to select plants with an allele of a QTL associated with transgene modulating loci of the present invention. Examples of public marker databases include the Maize Genome Database, Agricultural Research Service, United States Department of Agriculture, and Soybase, Agricultural Research Service, United States Department of Agriculture.
In another embodiment, markers, such as simple sequence repeat markers (SSR), AFLP markers, RFLP markers, RAPD markers, phenotypic markers, isozyme markers, single nucleotide polymorphisms (SNPs), insertions or deletions (Indels), single feature polymorphisms (SFPs, for example, as described in Borevitz et al. 2003 Genome Res. 13:513-523), microarray transcription profiles, DNA-derived sequences, and RNA-derived sequences that are genetically linked to or correlated with alleles of a QTL of the present invention can be utilized.
In one embodiment, nucleic acid-based analyses for the presence or absence of the genetic polymorphism can be used for the selection of seeds in a breeding population. A wide variety of genetic markers for the analysis of genetic polymorphisms are available and known to those of skill in the art. The analysis may be used to select for genes, portions of genes, QTL, alleles, or genomic regions (haplotypes) that comprise or are linked to a genetic marker.
Herein, nucleic acid analysis methods are known in the art and include, but are not limited to, PCR-based detection methods (for example, TaqMan assays), microarray methods, and nucleic acid sequencing methods. In one embodiment, the detection of polymorphic sites in a sample of DNA, RNA, or cDNA may be facilitated through the use of nucleic acid amplification methods. Such methods specifically increase the concentration of polynucleotides that span the polymorphic site, or include that site and sequences located either distal or proximal to it. Such amplified molecules can be readily detected by gel electrophoresis, fluorescence detection methods, or other means.
A method of achieving such amplification employs the polymerase chain reaction (PCR) (Mullis et al. 1986 Cold Spring Harbor Symp. Quant. Biol. 51:263-273; European Patent 50,424; European Patent 84,796; European Patent 258,017; European Patent 237,362; European Patent 201,184; U.S. Pat. No. 4,683,202; U.S. Pat. No. 4,582,788; and U.S. Pat. No. 4,683,194), using primer pairs that are capable of hybridizing to the proximal sequences that define a polymorphism in its double-stranded form.
Polymorphisms in DNA sequences can be detected or typed by a variety of effective methods well known in the art including, but not limited to, those disclosed in U.S. Pat. Nos. 5,468,613; 5,217,863; 5,210,015; 5,876,930; 6,030,787; 6,004,744; 6,013,431; 5,595,890; 5,762,876; 5,945,283; 6,090,558; 5,800,944; 5,616,464; 7,312,039; 7,238,476; 7,297,485; 7,282,355; 7,270,981; and 7,250,252, all of which are incorporated herein by reference in their entireties. However, the compositions and methods of this invention can be used in conjunction with any polymorphism typing method to type polymorphisms in genomic DNA samples. The genomic DNA samples used include, but are not limited to, genomic DNA isolated directly from a plant, cloned genomic DNA, or amplified genomic DNA.
The present invention provides methods for the identification of nucleic acids of interest. In one embodiment, the one or more nucleic acid sequences can be associated with a marker, wherein the marker may be a genetic marker and the genetic marker may comprise some portion of the nucleic acid sequence of interest or be genetically linked to the nucleic acid sequence of interest. Further, an expression profile assay can be used to detect the genetic marker. In one aspect, microarrays can also be used for polymorphism detection, wherein oligonucleotide probe sets are assembled in an overlapping fashion to represent a single sequence such that a difference in the target sequence at one point would result in partial probe hybridization (Borevitz et al., Genome Res. 13:513-523 (2003); Cui et al., Bioinformatics 21:3852-3858 (2005)). On any one microarray, it is expected there will be a plurality of target sequences, which may represent genes and/or noncoding regions wherein each target sequence is represented by a series of overlapping oligonucleotides, rather than by a single probe. This platform provides for high throughput screening of a plurality of polymorphisms. A single-feature polymorphism (SFP) is a polymorphism detected by a single probe in an oligonucleotide array, wherein a feature is a probe in the array. Typing of target sequences by microarray-based methods is disclosed in U.S. Pat. Nos. 6,799,122; 6,913,879; and 6,996,476.
For the purpose of QTL mapping, the markers included should be diagnostic of origin in order for inferences to be made about subsequent populations. SNP markers are ideal for mapping because the likelihood that a particular SNP allele is derived from independent origins in the extant populations of a particular species is very low. As such, SNP markers are useful for tracking and assisting introgression of QTLs, particularly in the case of haplotypes.
As used herein, a “nucleic acid molecule,” be it a naturally occurring molecule or otherwise may be “substantially purified”, if desired, referring to a molecule separated from substantially all other molecules normally associated with it in its native state. More preferably a substantially purified molecule is the predominant species present in a preparation. A substantially purified molecule may be greater than 60% free, preferably 75% free, more preferably 90% free, and most preferably 95% free from the other molecules (exclusive of solvent) present in the natural mixture. The term “substantially purified” is not intended to encompass molecules present in their native state.
The agents of the present invention will preferably be “biologically active” with respect to either a structural attribute, such as the capacity of a nucleic acid to hybridize to another nucleic acid molecule, or the ability of a protein to be bound by an antibody (or to compete with another molecule for such binding). Alternatively, such an attribute may be catalytic, and thus involve the capacity of the agent to mediate a chemical reaction or response.
The agents of the present invention may also be recombinant. As used herein, the term recombinant means any agent (e.g. DNA, peptide etc.), that is, or results, however indirect, from human manipulation of a nucleic acid molecule.
The agents of the present invention may be labeled with reagents that facilitate detection of the agent (e.g. fluorescent labels (Prober et al. 1987 Science 238:336-340; European Patent 144914), chemical labels (U.S. Pat. No. 4,582,789; U.S. Pat. No. 4,563,417), and modified bases (European Patent 119448)).
The present invention provides methods for identification of loci associated with yield using mapping techniques. By establishing performance as a phenotype, genotypes associated with preferred performance are identified. The methods of the present invention are useful for comparing two or more germplasm entries. Exemplary methods for the detection of marker-trait associations are set forth below.
Because of allelic differences in genetic markers, QTL can be identified by statistical evaluation of the genotypes and phenotypes of segregating populations. Processes to map QTL are well-described (WO 90/04651; U.S. Pat. Nos. 5,492,547, 5,981,832, 6,455,758; reviewed in Flint-Garcia et al. 2003 Ann. Rev. Plant Biol. 54:357-374). The statistical significance of a correlation between a phenotype and a genotype, whether a genetic marker or haplotype, may be determined by any statistical test known in the art, with any accepted threshold of statistical significance being required. The application of particular methods and thresholds of significance is well within the skill of the ordinary practitioner of the art. Notably, any type of marker can be correlated with the causative genotype and selection decisions can be made based on a genetic or phenotypic marker.
Using markers to infer a phenotype of interest results in the economization of a breeding program by substituting costly, time-intensive phenotyping with genotyping or a cheaper phenotyping platform, such as an early emerging phenotypic character. Further, breeding programs can be designed to explicitly drive the frequency of specific, favorable phenotypes by targeting particular genotypes (U.S. Pat. No. 6,399,855). Fidelity of these associations may be monitored continuously to ensure maintained predictive ability and, thus, informed breeding decisions (US Patent Application 2005/0015827).
An allele of a QTL can comprise multiple genes or other genetic factors even within a contiguous genomic region or linkage group, such as a haplotype. As used herein, an allele of a QTL or transgene modulating locus can therefore encompass more than one gene or other genetic factor where each individual gene or genetic component is also capable of exhibiting allelic variation and where each gene or genetic factor is also capable of eliciting a phenotypic effect on the quantitative trait in question. In an aspect of the present invention, the allele of a QTL comprises one or more genes or other genetic factors that are also capable of exhibiting allelic variation. The use of the term “an allele of a QTL” is thus not intended to exclude a QTL that comprises more than one gene or other genetic factor. Specifically, an “allele of a QTL” in the present invention can denote a haplotype within a haplotype window wherein a phenotype can be disease resistance. A haplotype window is a contiguous genomic region that can be defined, and tracked, with a set of one or more polymorphic markers wherein the polymorphisms indicate identity by descent. A haplotype within that window can be defined by the unique fingerprint of alleles at each marker. As used herein, an allele is one of several alternative forms of a gene occupying a given locus on a chromosome. When all the alleles present at a given locus on a chromosome are the same, that plant is homozygous at that locus. If the alleles present at a given locus on a chromosome differ, that plant is heterozygous at that locus. Plants of the present invention may be homozygous or heterozygous at any particular transgene modulating locus or for a particular polymorphic marker.
The identification of marker-trait associations has evolved to the application of genetic markers as a tool for the selection of “new and superior plants” via introgression of preferred genomic regions as determined by statistical analyses (U.S. Pat. No. 6,219,964). Marker-assisted introgression involves the transfer of a chromosomal region, defined by one or more markers, from one germplasm to a second germplasm. The initial step in that process is the localization of the genomic region or transgene by gene mapping, which is the process of determining the position of a gene or genomic region relative to other genes and genetic markers through linkage analysis. The basic principle for linkage mapping is that the closer together two genes are on a chromosome, the more likely they are to be inherited together. Briefly, a cross is generally made between two genetically compatible but divergent parents relative to the traits of interest. Genetic markers can then be used to follow the segregation of these traits in the progeny from the cross, often a backcross (BC1), F2, or recombinant inbred population.
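As a non-limiting illustration of the linkage principle described above, the recombination fraction between two loci can be estimated from the proportion of recombinant progeny, for example in a backcross, and converted to a map distance. The sketch below is an assumption-laden example: the function names are hypothetical, and the Haldane and Kosambi map functions are standard conversions that are not named in this disclosure.

```python
import math

# Illustrative sketch only; function names are hypothetical and the
# Haldane/Kosambi map functions are not named in the source text.


def recombination_fraction(n_recombinant, n_total):
    """Estimate the recombination fraction r as the proportion of
    recombinant progeny among all scored progeny."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_recombinant / n_total


def haldane_cm(r):
    """Map distance in centiMorgans, assuming no crossover interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Map distance in centiMorgans, allowing for crossover interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))
```

For closely linked loci both functions return values close to 100 × r, consistent with the principle that loci lying closer together on a chromosome are more likely to be inherited together.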
In plant breeding populations, linkage disequilibrium (LD) is the level of departure from random association between two or more loci in a population, and LD often persists over large chromosomal segments. Although it is possible for one to be concerned with the individual effect of each gene in the segment, for a practical plant breeding purpose the emphasis is typically on the average impact the region has for the trait(s) of interest when present in a line, hybrid or variety. The amount of pair-wise LD is calculated (using the r² statistic) against the distance in centiMorgans (cM) using a set of genetic markers and a set of germplasm entries. A cM is one hundredth of a Morgan, and a Morgan corresponds on average to one recombination per meiosis. Recombination is the result of the reciprocal exchange of chromatid segments between homologous chromosomes paired at meiosis, and it is usually observed through the association of alleles at linked loci from different grandparents in the progeny.
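By way of example only, the pair-wise r² statistic referred to above can be computed for two biallelic loci as sketched below. The sketch assumes phased haplotypes, or fully homozygous inbred lines, with alleles coded 0 and 1; the function name and the data layout are hypothetical and are not part of the disclosure.

```python
# Illustrative sketch only; the function name and input format are hypothetical.
# Assumes phased (or fully homozygous) entries with alleles coded 0/1.


def ld_r_squared(haplotypes):
    """Compute pair-wise LD (r squared) between two biallelic loci.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples,
    one tuple per germplasm entry, each allele coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p_a = sum(a for a, _ in haplotypes) / n       # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n       # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                          # disequilibrium coefficient D
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom
```

Plotting such r² values for many marker pairs against their separation in cM gives the LD decay profile for the set of germplasm entries analyzed.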
The genetic linkage of additional genetic marker molecules can be established by a gene mapping model such as, without limitation, the flanking marker model reported by Lander et al. (Lander et al. 1989 Genetics, 121:185-199), and the interval mapping, based on maximum likelihood methods described therein, and implemented in the software package MAPMAKER/QTL (Lincoln and Lander, Mapping Genes Controlling Quantitative Traits Using MAPMAKER/QTL, Whitehead Institute for Biomedical Research, Massachusetts, (1990)). Additional software includes Qgene, Version 2.23 (1996), Department of Plant Breeding and Biometry, 266 Emerson Hall, Cornell University, Ithaca, N.Y. Use of Qgene software is a particularly preferred approach.
A maximum likelihood estimate (MLE) for the presence of a genetic marker is calculated, together with an MLE assuming no QTL effect, to avoid false positives. A log10 of an odds ratio (LOD) is then calculated as: LOD=log10 (MLE for the presence of a QTL/MLE given no linked QTL). The LOD score essentially indicates how much more likely the data are to have arisen assuming the presence of a QTL versus in its absence. The LOD threshold value for avoiding a false positive with a given confidence, say 95%, depends on the number of genetic markers and the length of the genome. Graphs indicating LOD thresholds are set forth in Lander et al. (1989), and further described by Arús and Moreno-González, Plant Breeding, Hayward, Bosemark, Romagosa (eds.) Chapman & Hall, London, pp. 314-331 (1993).
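The LOD calculation described above can be illustrated, without limitation, by the following Python sketch. The first function applies the stated ratio directly. The second is a simplifying assumption rather than the interval mapping method of Lander et al.: it tests a single marker with two genotype classes under a normal residual model, for which the maximized likelihood ratio reduces to (RSS_null/RSS_QTL)^(n/2). All function names are hypothetical.

```python
import math

# Illustrative sketch only; function names are hypothetical.
# single_marker_lod is a single-marker simplification, not interval mapping.


def lod_from_likelihoods(likelihood_qtl, likelihood_no_qtl):
    """LOD = log10(MLE for the presence of a QTL / MLE given no linked QTL)."""
    return math.log10(likelihood_qtl / likelihood_no_qtl)


def single_marker_lod(genotypes, phenotypes):
    """LOD for a marker effect, assuming normally distributed residuals.

    genotypes: marker class per individual (e.g. 0 or 1 in inbred lines).
    phenotypes: trait value per individual, in the same order.
    """
    n = len(phenotypes)
    mean_all = sum(phenotypes) / n
    rss_null = sum((y - mean_all) ** 2 for y in phenotypes)   # no-QTL model
    rss_qtl = 0.0                                             # one mean per class
    for g in set(genotypes):
        group = [y for gi, y in zip(genotypes, phenotypes) if gi == g]
        group_mean = sum(group) / len(group)
        rss_qtl += sum((y - group_mean) ** 2 for y in group)
    if rss_qtl == 0.0:
        raise ValueError("perfect fit; LOD is unbounded")
    return (n / 2.0) * math.log10(rss_null / rss_qtl)
```

A LOD of 3, for instance, indicates that the data are 1000 times more likely to have arisen in the presence of a QTL than in its absence; whether that exceeds the threshold depends on the number of markers and the genome length, as noted above.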
Additional models can be used. Many modifications and alternative approaches to interval mapping have been reported, including the use of non-parametric methods (Kruglyak et al. 1995 Genetics, 139:1421-1428). Multiple regression methods or models can also be used, in which the trait is regressed on a large number of genetic markers (Jansen, Biometrics in Plant Breed, van Oijen, Jansen (eds.) Proceedings of the Ninth Meeting of the Eucarpia Section Biometrics in Plant Breeding, The Netherlands, pp. 116-124 (1994); Weber and Wricke, Advances in Plant Breeding, Blackwell, Berlin, 16 (1994)). Procedures combining interval mapping with regression analysis, whereby the phenotype is regressed onto a single putative QTL at a given genetic marker interval, and at the same time onto a number of genetic markers that serve as ‘cofactors,’ have been reported by Jansen et al. (Jansen et al. 1994 Genetics, 136:1447-1455) and Zeng (Zeng 1994 Genetics 136:1457-1468). Generally, the use of cofactors reduces the bias and sampling error of the estimated QTL positions (Utz and Melchinger, Biometrics in Plant Breeding, van Oijen, Jansen (eds.) Proceedings of the Ninth Meeting of the Eucarpia Section Biometrics in Plant Breeding, The Netherlands, pp. 195-204 (1994)), thereby improving the precision and efficiency of QTL mapping (Zeng 1994). These models can be extended to multi-environment experiments to analyze genotype-environment interactions (Jansen et al. 1995 Theor. Appl. Genet. 91:33-3). Association study approaches such as transmission disequilibrium tests may be useful for detecting marker-trait associations (Stich et al. 2006 Theor. Appl. Genet. 113:1121-1130).
An alternative to traditional QTL mapping involves achieving higher resolution by mapping haplotypes, versus individual genetic markers (Fan et al. 2006 Genetics 172:663-686). This approach tracks blocks of DNA known as haplotypes, as defined by polymorphic genetic markers, which are assumed to be identical by descent in the mapping population. This assumption results in a larger effective sample size, offering greater resolution of QTL. The statistical significance of a correlation between a phenotype and a genotype, in this case a haplotype, may be determined by any statistical test known in the art, with any accepted threshold of statistical significance being required. The application of particular methods and thresholds of significance is well within the skill of the ordinary practitioner of the art.
Selection of appropriate mapping populations is important to map construction. The choice of an appropriate mapping population depends on the type of marker systems employed (Tanksley et al., Molecular mapping in plant chromosomes. chromosome structure and function: Impact of new concepts J. P. Gustafson and R. Appels (eds.). Plenum Press, New York, pp. 157-173 (1988)). Consideration must be given to the source of parents (adapted vs. exotic) used in the mapping population. Chromosome pairing and recombination rates can be severely disturbed (suppressed) in wide crosses (adapted×exotic) and generally yield greatly reduced linkage distances. Wide crosses will usually provide segregating populations with a relatively large array of polymorphisms when compared to progeny in a narrow cross (adapted×adapted).
An F2 population is the first generation of selfing after the hybrid seed is produced. Usually a single F1 plant is selfed to generate a population segregating for all the genes in Mendelian (1:2:1) fashion. Maximum genetic information is obtained from a completely classified F2 population using a codominant genetic marker system (Mather, Measurement of Linkage in Heredity: Methuen and Co., (1938)). In the case of dominant markers, progeny tests (e.g. F3, BCF2) are required to identify the heterozygotes, thus making it equivalent to a completely classified F2 population. However, this procedure is often prohibitive because of the cost and time involved in progeny testing. Progeny testing of F2 individuals is often used in map construction where phenotypes do not consistently reflect genotype (e.g. disease resistance) or where trait expression is controlled by a QTL. Segregation data from progeny test populations (e.g. F3 or BCF2) can be used in map construction. Marker-assisted selection can then be applied to cross progeny based on marker-trait map associations (F2, F3), where linkage groups have not been completely disassociated by recombination events (i.e., maximum disequilibrium).
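As a non-limiting example, whether a codominant marker scored in an F2 population segregates in the expected Mendelian 1:2:1 ratio can be checked with a chi-square goodness-of-fit test, as sketched below. The function names are hypothetical, and the choice of a 5% significance level (critical value 5.991 for two degrees of freedom) is an assumption made for illustration.

```python
# Illustrative sketch only; function names are hypothetical and the 5%
# significance level is an assumption, not a requirement of the method.

CHI2_CRITICAL_5PCT_DF2 = 5.991  # chi-square critical value, 2 degrees of freedom


def chi_square_1_2_1(n_aa, n_ab, n_bb):
    """Chi-square statistic for observed F2 genotype counts against 1:2:1."""
    total = n_aa + n_ab + n_bb
    if total <= 0:
        raise ValueError("at least one individual is required")
    observed = (n_aa, n_ab, n_bb)
    expected = (0.25 * total, 0.50 * total, 0.25 * total)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_mendelian_ratio(n_aa, n_ab, n_bb):
    """True when the counts do not deviate significantly from 1:2:1."""
    return chi_square_1_2_1(n_aa, n_ab, n_bb) < CHI2_CRITICAL_5PCT_DF2
```

Markers showing significant segregation distortion under such a test may warrant closer inspection before being used in map construction.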
Recombinant inbred lines (RIL) (genetically related lines; usually >F5, developed from continuously selfing F2 lines towards homozygosity) can be used as a mapping population. Information obtained from dominant markers can be maximized by using RIL because all loci are homozygous or nearly so. Under conditions of tight linkage (i.e., about <10% recombination), dominant and co-dominant genetic markers evaluated in RIL populations provide more information per individual than either marker type in backcross populations (Reiter et al. 1992 Proc. Natl. Acad. Sci. (USA) 89:1477-1481). However, as the distance between markers becomes larger (i.e., loci become more independent), the information in RIL populations decreases dramatically.
Backcross populations (e.g., generated from a cross between a successful variety (recurrent parent) and another variety (donor parent) carrying a trait not present in the former) can be utilized as a mapping population. A series of backcrosses to the recurrent parent can be made to recover most of its desirable traits. Thus a population is created consisting of individuals nearly like the recurrent parent but each individual carries varying amounts of genomic regions from the donor parent. Backcross populations can be useful for mapping dominant genetic markers if all loci in the recurrent parent are homozygous and the donor and recurrent parent have contrasting polymorphic marker alleles (Reiter et al. 1992 Proc. Natl. Acad. Sci. (USA) 89:1477-1481). Information obtained from backcross populations using either codominant or dominant markers is less than that obtained from F2 populations because one, rather than two, recombinant gametes are sampled per plant. Backcross populations, however, are more informative (at low marker saturation) when compared to RILs as the distance between linked loci increases in RIL populations (i.e. about 0.15% recombination). Increased recombination can be beneficial for resolution of tight linkages, but may be undesirable in the construction of maps with low marker saturation.
Near-isogenic lines (NIL) created by many backcrosses to produce an array of individuals that are nearly identical in genetic composition except for the trait or genomic region under interrogation can be used as a mapping population. In mapping with NILs, only a portion of the polymorphic loci are expected to map to a selected region.
Bulk segregant analysis (BSA) is a method developed for the rapid identification of linkage between genetic markers and traits of interest (Michelmore et al. 1991 Proc. Natl. Acad. Sci. (U.S.A.) 88:9828-9832). In BSA, two bulked DNA samples are drawn from a segregating population originating from a single cross. These bulks contain individuals that are identical for a particular trait (resistant or susceptible to particular disease) or genomic region but arbitrary at unlinked regions (i.e. heterozygous). Regions unlinked to the target region will not differ between the bulked samples of many individuals in BSA.
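The logic of BSA can be illustrated, without limitation, by comparing marker allele frequencies between the two bulks: markers linked to the target region are expected to differ sharply between bulks, whereas unlinked markers are not. In the sketch below the function names, the 0/1 allele coding, and the 0.8 frequency-difference threshold are all assumptions introduced for illustration.

```python
# Illustrative sketch only; names, data layout and the default threshold
# are hypothetical and not taken from the source text.


def allele_frequency(calls):
    """Frequency of the allele coded 1 among the calls for one bulk."""
    return sum(calls) / len(calls)


def candidate_linked_markers(bulk_a, bulk_b, min_difference=0.8):
    """Return (marker, frequency difference) for markers that differ
    strongly between two bulks.

    bulk_a, bulk_b: dict mapping marker name to a list of 0/1 allele
    calls from the individuals pooled in that bulk.
    """
    hits = []
    for marker in sorted(bulk_a):
        if marker not in bulk_b:
            continue  # marker not scored in both bulks
        diff = abs(allele_frequency(bulk_a[marker]) - allele_frequency(bulk_b[marker]))
        if diff >= min_difference:
            hits.append((marker, diff))
    return hits
```

In this sketch a marker fixed for opposite alleles in the two bulks yields a difference of 1.0 and is reported as a candidate, while a marker segregating equally in both bulks yields a difference near zero and is not.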
In another embodiment, plants can be screened for one or more markers using high throughput, non-destructive seed sampling, wherein one or more of the markers is associated with at least one transgene modulating locus. In a preferred aspect, seed is sampled in this manner and only seed with at least one genotype of interest is advanced. Apparatus and methods for the high-throughput, non-destructive sampling of seeds have been described which would overcome the obstacles of statistical samples by allowing for individual seed analysis. For example, U.S. patent application Ser. No. 11/213,430 (filed Aug. 26, 2005); U.S. patent application Ser. No. 11/213,431 (filed Aug. 26, 2005); U.S. patent application Ser. No. 11/213,432 (filed Aug. 26, 2005); U.S. patent application Ser. No. 11/213,434 (filed Aug. 26, 2005); U.S. patent application Ser. No. 11/213,435 (filed Aug. 26, 2005); and U.S. patent application Ser. No. 11/680,611 (filed Mar. 2, 2007), which are incorporated herein by reference in their entirety, disclose apparatus and systems for the automated sampling of seeds as well as methods of sampling, testing and bulking seeds.
Plants of the present invention can be part of or generated from a breeding program. The choice of breeding method depends on the mode of plant reproduction, the heritability of the trait(s) being improved, and the type of cultivar used commercially (e.g., F1 hybrid cultivar, pureline cultivar, etc). A cultivar is a race or variety of a plant species that has been created or selected intentionally and maintained through cultivation.
The present invention provides for parts of the plants of the present invention.
Selected, non-limiting approaches for breeding the plants of the present invention are set forth below. A breeding program can be enhanced using marker assisted selection (MAS) on the progeny of any cross. It is understood that nucleic acid markers of the present invention can be used in a MAS (breeding) program. It is further understood that any commercial and non-commercial cultivars can be utilized in a breeding program. Factors such as, for example, emergence vigor, vegetative vigor, stress tolerance, disease resistance, branching, flowering, seed set, seed size, seed density, standability, and threshability etc. will generally dictate the choice.
In one aspect, MAB programs use a plurality of markers to identify higher performing selections that have, on average, a higher frequency of favorable alleles at one or more loci. Fingerprinting was developed to determine the genome-wide marker distribution. Using the resulting marker distance and/or marker similarities indices between two or more lines, it is possible to build pedigrees and to calculate the breeding value across all assessed loci. Herein, breeding values are calculated based on expression profile effect estimates and expression profile (i.e., allele) frequency, wherein the expression profile breeding value represents the effect of fixing a particular nucleic acid sequence (i.e., allele) underlying the expression profile in a population, thus providing the basis for ranking nucleic acid sequences, based on corresponding expression profiles.
For highly heritable traits, a choice of superior individual plants evaluated at a single location will be effective, whereas for traits with low heritability, selection should be based on mean values obtained from replicated evaluations of families of related plants. Popular selection methods commonly include pedigree selection, modified pedigree selection, mass selection, and recurrent selection. In a preferred aspect, a backcross or recurrent breeding program is undertaken.
The complexity of inheritance influences choice of the breeding method. Backcross breeding can be used to transfer one or a few favorable genes for a highly heritable trait into a desirable cultivar. This approach has been used extensively for breeding disease-resistant cultivars. Various recurrent selection techniques are used to improve quantitatively inherited traits controlled by numerous genes.
Breeding lines can be tested and compared to appropriate standards in environments representative of the commercial target area(s) for two or more generations. The best lines are candidates for new commercial cultivars; those still deficient in traits may be used as parents to produce new populations for further selection.
For hybrid crops, the development of new elite hybrids requires the development and selection of elite inbred lines, the crossing of these lines and selection of superior hybrid crosses. The hybrid seed can be produced by manual crosses between selected male-fertile parents or by using male sterility systems. Additional data on parental lines, as well as the phenotype of the hybrid, influence the breeder's decision whether to continue with the specific hybrid cross.
Pedigree breeding and recurrent selection breeding methods can be used to develop cultivars from breeding populations. Breeding programs combine desirable traits from two or more cultivars or various broad-based sources into breeding pools from which cultivars are developed by selfing and selection of desired phenotypes. New cultivars can be evaluated to determine which have commercial potential.
Backcross breeding has been used to transfer genes for a simply inherited, highly heritable trait into a desirable homozygous cultivar or inbred line, which is the recurrent parent. The source of the trait to be transferred is called the donor parent. After the initial cross, individuals possessing the phenotype of the donor parent are selected and repeatedly crossed (backcrossed) to the recurrent parent. The resulting plant is expected to have most attributes of the recurrent parent (e.g., cultivar) and, in addition, the desirable trait transferred from the donor parent.
The single-seed descent procedure in the strict sense refers to planting a segregating population, harvesting a sample of one seed per plant, and using the one-seed sample to plant the next generation. When the population has been advanced from the F2 to the desired level of inbreeding, the plants from which lines are derived will each trace to different F2 individuals. The number of plants in a population declines each generation due to failure of some seeds to germinate or some plants to produce at least one seed. As a result, not all of the F2 plants originally sampled in the population will be represented by a progeny when generation advance is completed.
The doubled haploid (DH) approach achieves isogenic plants in a shorter time frame. DH plants provide an invaluable tool to plant breeders, particularly for generating inbred lines and quantitative genetics studies. For breeders, DH populations have been particularly useful in QTL mapping, cytoplasmic conversions, and trait introgression. Moreover, there is value in testing and evaluating homozygous lines for plant breeding programs. All of the genetic variance is among progeny in a breeding cross, which improves selection gain.
Most research and breeding applications rely on artificial methods of DH production. The initial step involves the haploidization of the plant which results in the production of a population comprising haploid seed. Non-homozygous lines are crossed with an inducer parent, resulting in the production of haploid seed. Seed that has a haploid embryo, but normal triploid endosperm, advances to the second stage. That is, haploid seed and plants are any plant with a haploid embryo, independent of the ploidy level of the endosperm.
After selecting haploid seeds from the population, the selected seeds undergo chromosome doubling to produce doubled haploid seeds. A spontaneous chromosome doubling in a cell lineage will lead to normal gamete production or the production of unreduced gametes from haploid cell lineages. Application of a chemical compound, such as colchicine, can be used to increase the rate of diploidization, i.e., doubling of the chromosome number. Colchicine binds to tubulin and prevents its polymerization into microtubules, thus arresting mitosis at metaphase. The resulting chimeric plants are self-pollinated to produce diploid (doubled haploid) seed. This DH seed is cultivated and subsequently evaluated and used in hybrid testcross production.
Descriptions of other breeding methods that are commonly used for different traits and crops can be found in one of several reference books (Allard, “Principles of Plant Breeding,” John Wiley & Sons, NY, U. of CA, Davis, Calif., 50-98, 1960; Simmonds, “Principles of crop improvement,” Longman, Inc., NY, 369-399, 1979; Sneep and Hendriksen, “Plant breeding perspectives,” Wageningen (ed), Center for Agricultural Publishing and Documentation, 1979; Fehr, In: Soybeans: Improvement, Production and Uses, 2nd Edition, Monograph., 16:249, 1987; Fehr, “Principles of variety development,” Theory and Technique, (Vol. 1) and Crop Species Soybean (Vol. 2), Iowa State Univ., Macmillan Pub. Co., NY, 360-376, 1987).
In a preferred aspect of the present invention, an expression profile is determined for at least one locus in at least one plant genome and the expression profile is associated with at least one phenotypic trait. The material evaluated can include at least one inbred plant, at least one hybrid plant, and at least one tester. In certain aspects, at least two alleles for at least one locus are evaluated. In other aspects, at least two expression variants for at least one allele are evaluated. Methods for calculating associations between two or more data sets are known in the art. For example, in multiple regression analysis, a linear equation is fit so as to describe the response of one or more dependent variables to variation in a set of explanatory variables. In another aspect, ordinary least-squares regression can be used, wherein the objective is to minimize the prediction error of response in the sample of response and explanatory variables under immediate consideration. Another method, partial least-squares (PLS) regression, has the objective of minimizing prediction error of response when the linear prediction equation is applied to a new sample drawn from the same population from which the original sample was obtained.
When the explanatory variables are highly correlated, least-squares regression performs very poorly in prediction of dependent variable response in future samples. Partial least-squares multiple regression overcomes this deficiency by focusing on directions in the predictor space of response and explanatory variables that are well sampled. The well-sampled directions are determined by factorization of the covariance matrix of the response and explanatory variables. Cross validation is then employed to determine which factors are best sampled and, therefore, are the best predictors of response.
In another aspect, the partial least-squares technique can be used. The first applications of PLS regression were to problems in fitting calibrated response curves of chemical elements in a sample to highly collinear spectral emission data. A similar problem arises when one attempts to fit observed phenotypic variation in yield to variation in gene expression or variation in marker genotype. Gene expression or marker variables can, by their very nature, be highly collinear due to linkage disequilibrium between loci physically linked along the chromosomes. In addition, many more gene expression and marker variables are usually available than can be accommodated by any reasonable sample of observations. For ordinary least-squares, this means that the normal equations will comprise a set of consistent equations that have an infinite number of solutions, any one of which is perfect for the sample at hand, but of absolutely no use in predicting response in future samples. PLS regression circumvents this problem by identifying a very few highly predictive factors, many fewer than the number of observations in the sample. PLS regression has been shown to perform very well on small samples with very large numbers of explanatory variables (Boulesteix and Strimmer 2006 Briefings in Bioinformatics 8: 32-4). Consequently, it was a natural choice for exploring the relationships between phenotypic variation in yield with variation in gene expression and marker genotype.
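The case of more explanatory variables than observations can be illustrated with a minimal Python sketch. It fits a single-factor PLS model for one response by hand and is not a substitute for a full PLS routine in statistical software. The function name `pls1_one_factor` and the toy data (three observations, four collinear predictors) are assumptions made purely for illustration.

```python
def center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def pls1_one_factor(X, y):
    """Fit a one-factor PLS model for a single response.

    X is a list of observation rows, y a list of responses.
    Returns (coef, scores, q, y_centered), where coef applies to centered X.
    """
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]  # centered columns
    yc = center(y)
    # Weight vector: direction in predictor space with maximal covariance with y.
    w = [sum(cols[j][i] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0:
        raise ValueError("predictors have zero covariance with the response")
    w = [v / norm for v in w]
    # Factor scores: one latent value per observation.
    t = [sum(cols[j][i] * w[j] for j in range(p)) for i in range(n)]
    # Regress the centered response on the single factor.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    # Map the factor coefficient back to the original explanatory variables.
    coef = [q * wj for wj in w]
    return coef, t, q, yc

# Hypothetical data: 3 observations, 4 highly collinear predictors (p > n).
X = [[1.0, 1.1, 0.9, 2.0],
     [2.0, 2.1, 1.9, 1.0],
     [3.0, 2.9, 3.2, 4.0]]
y = [1.0, 2.1, 2.9]

coef, scores, q, yc = pls1_one_factor(X, y)
print(coef)
```

Ordinary least squares has no unique solution for these data because there are four predictors and only three observations, whereas the one-factor fit returns a single coefficient per predictor.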
Once the predictive factors are identified, the regression coefficients on these factors are transformed back into the space of the original explanatory variables. Then, selection among the original variables to attain an even more predictive model can be accomplished by calculation of the VIP (Variable Importance for Prediction) statistic. It can be shown, through calculation of the empirical distribution of the statistic, that a cutoff value of VIP=0.8 is generally appropriate for explanatory variable selection. That is, when the cutoff is imposed, only explanatory variables with VIP values equal to or greater than 0.8 are retained in the regression model.
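The cutoff rule described above can be sketched in a few lines of Python. The VIP values and probeset names below are hypothetical, and `select_by_vip` is an illustrative helper, not part of any PLS package.

```python
def select_by_vip(vip_scores, cutoff=0.8):
    """Return names of explanatory variables whose VIP is >= cutoff."""
    return [name for name, vip in vip_scores.items() if vip >= cutoff]

# Hypothetical VIP values for four probesets (illustrative numbers only).
vip_scores = {
    "probeset_A": 1.42,
    "probeset_B": 0.80,
    "probeset_C": 0.79,
    "probeset_D": 0.31,
}

retained = select_by_vip(vip_scores)
print(retained)  # ['probeset_A', 'probeset_B']
```

Only the retained variables would then be carried into the refitted regression model.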
A number of versions of PLS regression are known. One of the most useful is the SIMPLS routine (de Jong 1993 Chemometrics and Intelligent Laboratory Systems 18:251-263). The utility of this version is due to the fact that different models constituted with different explanatory variables are directly comparable. In accordance with this benefit, the SIMPLS version of SAS Proc PLS was employed in all PLS regression applications described below.
Gain from selection on a trait is dependent on selection intensity, the amount of genetic variability for the trait in the population under selection, and the repeatability of the trait over diverse environmental conditions. Repeatability of the trait is measured as the ratio of genetic variation to phenotypic variation, where phenotypic variation is the sum total of genetic and environmental variation.
Selection can be applied simultaneously to multiple traits. For multiple traits selection, the equation for expected genetic advance, or gain, is, in matrix notation

$$\Delta H = k\,(B' G w)^{1/2}$$

where

$$B = P^{-1} G w$$

and
k = selection intensity
G = the genotypic covariance matrix among traits
P = the phenotypic covariance matrix among traits
w = weight vector with elements assigned according to the importance of the traits.
Multiple trait selection as described by the equations above is called index selection (Falconer 1960 Introduction to quantitative genetics. Ronald Press, NY), with the index determined by the elements of the w vector assigned in accordance with the breeder's estimation of the importance of the traits.
Genetic and phenotypic correlations are derived from the G and P matrices, respectively, while the repeatabilities (also called heritabilities) are obtained from the ratios of the G and P matrix diagonal elements.
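As a small illustration of these quantities, the following Python sketch computes repeatabilities and correlations for two traits. The G and P matrices are hypothetical numbers chosen only for the example, and the function names are illustrative.

```python
import math

def repeatabilities(G, P):
    """Ratio of genotypic to phenotypic variance for each trait (diagonals)."""
    return [G[i][i] / P[i][i] for i in range(len(G))]

def correlation(C, i, j):
    """Correlation between traits i and j implied by covariance matrix C."""
    return C[i][j] / math.sqrt(C[i][i] * C[j][j])

# Hypothetical covariance matrices for two traits.
G = [[4.0, 3.0],
     [3.0, 6.0]]
P = [[16.0, 4.0],
     [4.0, 8.0]]

print(repeatabilities(G, P))   # [0.25, 0.75]
print(correlation(G, 0, 1))    # genetic correlation
print(correlation(P, 0, 1))    # phenotypic correlation
```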
For selection on a single trait, the G and P matrices and the B vector devolve into scalar quantities, while the w vector becomes just a scalar equal to 1.
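As a check on this reduction, substituting scalars into the gain equation above with w = 1 gives the expression below. The symbols h² = G/P (repeatability) and σ_P = P^{1/2} (phenotypic standard deviation) are introduced here only for illustration.

$$B = \frac{G}{P}, \qquad \Delta H = k\left(\frac{G}{P}\,G\right)^{1/2} = k\,\frac{G}{\sqrt{P}} = k\,h^{2}\,\sigma_{P}$$

This is the familiar single-trait response to selection.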
For correlated response in one trait to selection on the others, the G matrix becomes a row vector corresponding to the response trait, the element of the w vector corresponding to the response trait is set equal to zero while the elements corresponding to the traits under selection are set equal to one, and the P matrix contains only rows and columns corresponding to the traits under selection.
To examine the efficacy of including probeset and marker models of predicted yield in selection for yield, three scenarios may be envisioned: (1) selection on observed yield only; (2) correlated response in yield to selection on probeset or marker model predicted yield; (3) index selection in which the element of the weight vector corresponding to observed yield is set equal to 1 and the element corresponding to model predicted yield is set equal to zero. Scenario (2) is often referred to as indirect selection, and scenario (3) is known as index selection with a secondary trait. The secondary trait is given a weight of zero in the index since it of itself has no value, but if its repeatability is high and its genetic correlation with the primary trait is substantial, it can be an effective aid in selection for the primary trait.
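The three scenarios can be compared numerically with a short Python sketch built on the gain equation given earlier. Trait 0 is taken to be observed yield and trait 1 model-predicted yield. The G and P matrices, the value of k, and all function names are hypothetical and serve only as an illustration.

```python
import math

def solve_2x2(P, y):
    """Solve P b = y for a 2x2 matrix P using Cramer's rule."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("P is singular")
    return [(d * y[0] - b * y[1]) / det, (a * y[1] - c * y[0]) / det]

def gain_direct(k, G, P):
    """Scenario 1: selection on observed yield only (scalar G and P)."""
    return k * G[0][0] / math.sqrt(P[0][0])

def gain_indirect(k, G, P):
    """Scenario 2: response in yield to selection on predicted yield.

    G reduces to the yield row, P to the predicted-yield entry, w = (0, 1).
    The signed form is returned so a negative covariance gives a negative response.
    """
    return k * G[0][1] / math.sqrt(P[1][1])

def gain_index(k, G, P):
    """Scenario 3: index selection with weights w = (1, 0)."""
    w = [1.0, 0.0]
    Gw = [G[0][0] * w[0] + G[0][1] * w[1],
          G[1][0] * w[0] + G[1][1] * w[1]]
    b = solve_2x2(P, Gw)                      # B = P^-1 G w
    return k * math.sqrt(b[0] * Gw[0] + b[1] * Gw[1])

# Hypothetical covariance matrices: trait 0 = observed yield, trait 1 = predicted yield.
G = [[4.0, 3.0],
     [3.0, 6.0]]
P = [[16.0, 4.0],
     [4.0, 8.0]]
k = 1.0  # selection intensity (assumed)

print(gain_direct(k, G, P), gain_indirect(k, G, P), gain_index(k, G, P))
```

With these illustrative numbers the index that includes the secondary trait yields a larger expected gain than either direct or indirect selection alone, which is the situation described above for a secondary trait with high repeatability and substantial genetic correlation with yield.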
Given finished sequence of the maize genome, one could first identify markers proximal to eQTL but in trans with respect to a probeset significantly associated with a phenotypic trait, locate the marker on a BAC or contig, scan the region for open reading frames (ORF), BLAST sequence from the ORFs against gene databases for possible function, and then lay down candidate oligos on a new chip for a second round of discovery. The cycle could be extended to three or more rounds.
The methods of the present invention take advantage of recent technology developments in the field of expression profiling assays, e.g., U.S. Pat. No. 7,163,792, U.S. Pat. No. 7,081,339, U.S. Pat. No. 6,996,476, U.S. Pat. No. 6,548,257, US Patent Application 20070172854, US Patent Application 20070172831, US Patent Application 20070148690, US Patent Application 20070037189, and US Patent Application 20070003953.
As provided by the present invention, the knowledge of nucleic acid sequences as inferred by expression profiles targeting the nucleic acid sequences can be applied to make decisions at multiple stages of the breeding program:
a) Among segregating progeny, as a pre-selection method, to increase the selection index and drive the frequency of favorable nucleic acid sequences among breeding populations, wherein pre-selection is defined as selection among offspring of a breeding cross based on the genotype of these progenies at a selected set of two or more nucleic acid sequences, and leveraging of expression profile-trait associations identified in previous breeding crosses.
b) Among segregating progeny from a breeding population, to increase the frequency of the favorable nucleic acid sequences for the purpose of line or variety development.
c) Among segregating progeny from a breeding population, to increase the frequency of the favorable nucleic acid sequences prior to QTL mapping within this breeding population.
d) For hybrid crops, among parental lines from different heterotic groups to predict the performance potential of different hybrids.
In another embodiment, the present invention provides a method for improving plant germplasm by accumulation of nucleic acid sequences of interest in a germplasm comprising determining expression profiles for at least two loci in the genome of a species of plant, and associating the expression profiles with at least one trait, and using these expression profile effect estimates to direct breeding decisions. These expression profile effect estimates can be derived using historical expression profile-trait associations or de novo from mapping populations. The expression profile effect estimates for one or more traits provide the basis for making decisions in a breeding program. This invention also provides an alternative basis for decision-making using breeding value calculations based on the estimated effect and frequency of the underlying nucleic acid sequences in the germplasm. These breeding values can be used to rank a specified set of nucleic acid sequences. In the context of the specified set of nucleic acid sequences, these breeding values form the basis for calculating an index to rank the alleles both within and between loci.
For example, any given chromosome segment can be represented in a given population by a number of nucleic acid sequences that can vary from 1 (region is fixed), to the size of the population times the ploidy level of that species (2 in a diploid species), in a population in which every chromosome has a different nucleic acid sequence. Identity-by-descent among nucleic acid sequences carried by multiple individuals in a non-fixed population will result in an intermediate number of different nucleic acid sequences and possibly a differing frequency among the different nucleic acid sequences. New nucleic acid sequences may arise through recombination at meiosis between existing nucleic acid sequences in heterozygous progenitors. The frequency of each nucleic acid sequence may be estimated by several means known to one versed in the art (e.g. by direct counting, or by using an EM algorithm). Let us assume that “k” different nucleic acid sequences, wherein a nucleic acid sequence represents at least one nucleotide and may constitute an allele or haplotype, identified as “ni” (i=1, . . . , k), are known, that their frequency in the population is “fi” (i=1, . . . , k), and for each of these nucleic acid sequences we have an effect estimate “Esti” (i=1, . . . , k). If we call the “breeding value” (BVi) the effect on that population of fixing that nucleic acid sequence, then this breeding value corresponds to the change in mean for the trait(s) of interest of that population between its original state of haplotypic distribution at the window and a final state at which nucleic acid sequence “ni” is present at a frequency of 100%. The breeding value estimate of ni in this population can be calculated as:
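Consistent with this definition, the breeding value can be written as the effect estimate of ni minus the frequency-weighted mean effect in the current population. The expression is a reconstruction from the surrounding definitions, chosen because it reproduces the inbred-line and F2 special cases described in this section; the summation index j is introduced here for illustration.

$$BV_i = Est_i - \sum_{j=1}^{k} f_j\,Est_j$$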
One skilled in the art will recognize that nucleic acid sequences that are rare in the population in which effects are estimated tend to be less precisely estimated; this difference in confidence may lead to adjustment in the calculation. For example, one can ignore the effects of rare nucleic acid sequences by calculating the breeding value of the better known nucleic acid sequences after adjusting their frequencies (by dividing each by the sum of the frequencies of the better known nucleic acid sequences). One could also provide confidence intervals for the breeding value of each nucleic acid sequence.
This breeding value will change according to the population for which it is calculated, as a function of differences in nucleic acid sequence frequencies. The term population can therefore assume different meanings; two special cases follow. First, it can be a single inbred line in which one intends to replace the current nucleic acid sequence nj with a new nucleic acid sequence ni, in which case BVi=Esti-Estj. Second, it can be an F2 population in which the two parental nucleic acid sequences ni and nj are originally present in equal frequency (50%), in which case BVi=½ (Esti-Estj).
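The two special cases can be reproduced with a minimal Python sketch. It assumes the breeding value takes the form of the effect estimate of ni minus the frequency-weighted mean effect in the population, which is an assumption consistent with both cases above. The effect estimates, the frequency threshold, and the function names are hypothetical.

```python
def breeding_value(target, est, freq):
    """Effect of fixing `target`: its estimate minus the population mean effect.

    est maps sequence name -> effect estimate;
    freq maps sequence name -> current frequency in the population (sums to 1).
    """
    mean_effect = sum(f * est[name] for name, f in freq.items())
    return est[target] - mean_effect

def drop_rare(freq, min_freq):
    """Ignore sequences rarer than min_freq and rescale the rest to sum to 1."""
    kept = {name: f for name, f in freq.items() if f >= min_freq}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no sequence meets the frequency threshold")
    return {name: f / total for name, f in kept.items()}

# Hypothetical effect estimates for two sequences.
est = {"n_i": 5.0, "n_j": 2.0}

# Inbred line currently fixed for n_j: BV_i = Est_i - Est_j.
bv_inbred = breeding_value("n_i", est, {"n_j": 1.0})

# F2 population with both parental sequences at 50%: BV_i = (Est_i - Est_j) / 2.
bv_f2 = breeding_value("n_i", est, {"n_i": 0.5, "n_j": 0.5})

print(bv_inbred, bv_f2)  # 3.0 1.5
```

The `drop_rare` helper illustrates the frequency adjustment for poorly estimated rare sequences: the remaining frequencies are divided by their sum before the breeding value is computed.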
These statistical approaches enable expression profile effect estimates to inform breeding decisions in multiple contexts. Other statistical approaches to calculate breeding values are known to those skilled in the art and can be used in substitution without departing from the spirit and scope of this invention.
Further, the statistical significance of a correlation between a phenotype and a genotype, in this case an expression profile, may be determined by any statistical test known in the art and with any accepted threshold of statistical significance being required. The application of particular methods and thresholds of significance is well within the skill of the ordinary practitioner of the art.
Expression profile effect estimates and/or breeding values for one or more traits of interest provide the basis for determining one or more nucleic acid sequences of interest in comparisons of two or more expression profiles. With this a priori information, breeding selections are conducted on a nucleic acid sequence basis, wherein a first plant is crossed with a second plant that contains at least one nucleic acid sequence that is different from the first plant nucleic acid sequence or nucleic acid sequences; and at least one progeny plant is selected by detecting the nucleic acid sequence or set of nucleic acid sequences of the first plant, wherein the progeny plant comprises in its genome one or more nucleic acid sequences of interest of the first plant and at least one nucleic acid sequence of interest of the second plant; and the progeny plant is used in activities related to germplasm improvement, herein defined as including using the plant for line and variety development, hybrid development, transgenic event selection, making breeding crosses, testing and advancing a plant through self fertilization, using plant or parts thereof for transformation, using plants or parts thereof for candidates for expression constructs, and using plant or parts thereof for mutagenesis.
In one aspect, this invention uses expression profiles to identify nucleic acids, in one or more regions of a plant genome, that provide a basis to compare two or more germplasm entries. Regions of contiguous nucleic acid sequence are indicative of the conservation of genetic identity of all intervening genes from a common progenitor. In cases where conserved sequence segments are coincident with segments in which QTL have been identified it is possible to deduce with high probability that QTL inferences can be extrapolated to other germplasm having an identical sequence in that locus. This a priori information provides the basis to select for favorable QTLs prior to QTL mapping within a given population.
For example, plant breeding decisions could comprise:
a) Selection among new breeding populations to determine which populations have the highest frequency of favorable nucleic acid sequences, wherein sequences are designated as favorable based on coincidence with previous QTL mapping; or
b) Selection of progeny containing said favorable nucleic acid sequences in breeding populations prior to, or in substitution for, QTL mapping within that population, wherein selection could be done at any stage of breeding and could also be used to drive multiple generations of recurrent selection; or
c) Prediction of progeny performance for specific breeding crosses; or
d) Selection of lines for germplasm improvement activities based on said favorable nucleic acid sequences, including line development, hybrid development, selection among transgenic events based on the breeding value of the haplotype that the transgene was inserted into, making breeding crosses, testing and advancing a plant through self fertilization, using plant or parts thereof for transformation, using plants or parts thereof for candidates for expression constructs, and using plant or parts thereof for mutagenesis.
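Decision (a) above, selection among breeding populations by frequency of favorable sequences, can be sketched as follows. The population contents, sequence identifiers, and function names are hypothetical. Each population is represented simply as a list of the sequence observed on each sampled chromosome copy at one locus.

```python
def favorable_frequency(population, favorable):
    """Fraction of sampled sequence copies that belong to the favorable set."""
    if not population:
        raise ValueError("population has no sampled sequences")
    return sum(1 for seq in population if seq in favorable) / len(population)

def rank_populations(populations, favorable):
    """Order population names from highest to lowest favorable frequency."""
    return sorted(
        populations,
        key=lambda name: favorable_frequency(populations[name], favorable),
        reverse=True,
    )

# Hypothetical data: sequences sampled at one locus in two breeding populations.
populations = {
    "pop_1": ["h1", "h1", "h2", "h3"],
    "pop_2": ["h1", "h1", "h1", "h2"],
}
favorable = {"h1"}  # designated favorable from prior QTL mapping (assumed)

print(rank_populations(populations, favorable))  # ['pop_2', 'pop_1']
```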
An additional unique aspect of this invention is the ability to select for specific genes or gene alleles, as they are targeted by expression profile assays. For example, in cases where the target nucleic acid sequence is coincident with segments in which genes have been identified it is possible to deduce with high probability that gene inferences can be extrapolated to other germplasm having an identical genotype in that locus. This a priori information provides the basis to select for favorable genes or gene alleles on the basis of an expression profile assay comprising the target nucleic acid within a given population.
For example, plant breeding decisions could comprise:
a) Selection among new breeding populations to determine which populations have the highest frequency of favorable nucleic acid sequences, wherein sequences are designated as favorable based on coincidence with previous gene mapping; or
b) Selection of progeny containing said favorable nucleic acid sequences in breeding populations, wherein selection is effectively enabled at the gene level, wherein selection could be done at any stage of inbreeding and could also be used to drive multiple generations of recurrent selection; or
c) Prediction of progeny performance for specific breeding crosses; or
d) Selection of lines for germplasm improvement activities based on said favorable nucleic acid sequences, including line development, hybrid development, selection among transgenic events based on the breeding value of the haplotype that the transgene was inserted into, making breeding crosses, testing and advancing a plant through self fertilization, using plant or parts thereof for transformation, using plants or parts thereof for candidates for expression constructs, and using plant or parts thereof for mutagenesis.
Further, in another preferred embodiment of this invention, the a priori information on the frequency of favorable nucleic acid sequences, as identified by an expression profile assay, in breeding populations enables pre-selection. That is, parental lines are selected based on the historical genotype-phenotype association information, wherein the genotype comprises an expression profile, for the purpose of driving favorable nucleic acid frequency for multiple traits simultaneously. In pre-selection, breeders can predict the phenotypic contribution for multiple traits of any line based on that line's fingerprint information, which corresponds to a composition of pre-defined expression profiles and the corresponding nucleic acid sequences. This multi-trait sequence selection approach economizes a breeding program by initiating selection at the initial stage of choosing parental crosses and it also reduces the need for costly, time-consuming phenotyping of progeny.
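A minimal Python sketch of multi-trait pre-selection follows. It assumes, purely for illustration, that the effect estimates of the sequences in a line's fingerprint are additive within each trait and that traits are combined with breeder-assigned weights. The sequence names, effect values, weights, and function names are hypothetical.

```python
def predict_traits(fingerprint, effects, traits):
    """Sum the effect estimates of a line's sequences for each trait (additive assumption)."""
    return {
        trait: sum(effects.get(seq, {}).get(trait, 0.0) for seq in fingerprint)
        for trait in traits
    }

def index_score(predicted, weights):
    """Combine predicted trait contributions into a single weighted index."""
    return sum(weights[trait] * predicted[trait] for trait in weights)

# Hypothetical effect estimates from historical expression profile-trait associations.
effects = {
    "s1": {"yield": 2.0, "moisture": -0.5},
    "s2": {"yield": 1.0, "moisture": 0.5},
    "s3": {"yield": -1.0, "moisture": 0.0},
}
# Hypothetical fingerprints of two candidate parental lines.
lines = {"line_A": ["s1", "s2"], "line_B": ["s2", "s3"]}
# Assumed weights: higher yield preferred, higher moisture penalized.
weights = {"yield": 1.0, "moisture": -1.0}

scores = {
    name: index_score(predict_traits(fp, effects, list(weights)), weights)
    for name, fp in lines.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

Ranking candidate parents this way happens before any cross is made, which is where the economy described above comes from.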
A preferred sequence provides a preferred property to a parent plant and to the progeny of the parent when selected by a marker means or phenotypic means. The method of the present invention provides for selection of preferred sequences, or sequences of interest, and the accumulation of these sequences in a breeding population.
In another embodiment, this invention enables indirect selection through selection decisions for at least one nucleic acid sequence based on at least one expression profile effect estimate such that additional phenotypes are indirectly selected upon due to the additional expression profile effect estimates for other phenotypic traits.
Another preferred embodiment of the present invention is to build additional value by selecting a composition of nucleic acid sequences wherein each corresponding expression profile has an estimated associated phenotype that is not negative with respect to yield, or is not positive with respect to maturity, or is null with respect to maturity, or amongst the best 50 percent with respect to an agronomic trait, transgene, and/or a multiple trait index when compared to any other nucleic acid sequence at the same locus in a set of germplasm, or amongst the best 50 percent with respect to an agronomic trait, transgene, and/or a multiple trait index when compared to any other loci across the entire genome in a set of germplasm, or the nucleic acid sequence being present with a frequency of 75 percent or more in a breeding population or a set of germplasm can be taken as evidence of its high value, or any combination of these.
This invention anticipates a stacking of nucleic acid sequences from at least two loci into plants or lines by crossing parent plants or lines containing different nucleic acid sequences with different corresponding expression profiles, that is, different genotypes. The value of the plant or line comprising in its genome stacked nucleic acid sequences from two or more loci can be estimated by a composite breeding value, which depends on a combination of the value of the traits and the value of the nucleic acid sequence(s) to which the traits are linked. The present invention further anticipates that the composite breeding value of a plant or line can be improved by modifying the components of one or each of the nucleic acid sequences. Additionally, the present invention anticipates that additional value can be built into the composite breeding value of a plant or line by selection of at least one recipient nucleic acid sequence with a preferred expression profile effect estimate or, in conjunction with the frequency of said nucleic acid sequence in the germplasm pool, breeding value to which one or any of the other nucleic acid sequences are linked, or by selection of plants or lines for stacking two or more nucleic acid sequences from two or more loci by breeding.
Another embodiment of this invention is a method for enhancing breeding populations by accumulation of one or more nucleic acid sequences in one or more loci, in a germplasm. Loci include genetic information and provide phenotypic traits to the plant. Variations in the genetic information can result in variation of the phenotypic trait and the value of the phenotype can be measured. The genetic mapping of the nucleic acid sequences allows for a determination of linkage across sequences. The nucleic acid sequence of interest is novel in the genome of the progeny plant and can in itself serve as a genetic marker of a locus of interest. Notably, this nucleic acid sequence can also be used as an identifier for a gene or QTL. For example, in the event of multiple traits or trait effects associated with the nucleic acid sequence, only one marker would be necessary for selection purposes. Additionally, the locus of interest may provide a means to select for plants that have the linked locus.
In another embodiment, the present invention provides methods and compositions for biomarkers. Those markers might be useful for optimizing screening conditions. A biomarker is a characteristic that can be used as an indicator of biologic processes or responses to internal and external interventions. Herein, characteristic gene expression profiles can be related to different biological processes and stress responses, such as drought and nitrogen deficiency. The information is not only useful for understanding the underlying molecular mechanisms and identifying lead genes, but also can be used to develop biomarkers as indicators for various purposes. In this context, the TxP biomarkers under investigation are the genes with indicative expression profiles for the following aspects:
a. Biological pathways (e.g. ABA biosynthesis): these include key pathway genes that indicate a particular mode of action that a transgene may have, or a given stress type may affect.
b. Stress responses (e.g. nitrogen deficiency): These biomarkers may monitor the types and severity of stresses imposed on plants. They can be used for evaluation and optimization of screening conditions.
c. Transgene activity: These biomarkers are indicators of the cellular effect of transgenes and could be used in event evaluation and selection. They can also indicate the possible mode of action (MOA) of the transgenes.
d. Physiology efficacy: They are indicators for physiology aspects of plants, which could replace certain types of greenhouse or field assays.
e. Yield efficacy: They are predictors of yield efficacy with strong association with field test results for individual genes or gene groups under specific or general conditions. They could be used to aid event selection to increase the hit rate and save field testing cost and time.
f. Trait (e.g. oil content): These are the biomarkers associated with specific traits not necessarily in transgenic plants. They could be used as complement to genetic markers for breeding selection in given germplasm populations.
g. Consistency of Treatment: These are the biomarkers used to evaluate stress and drought tolerance, to monitor environmental microheterogeneity, and to define and classify microenvironments, providing increased precision in data analyses and aiding decisions for plant advancement.
The present invention also provides for the screening of progeny plants' loci of interest and using the expression profile effect estimate as the basis for selection for use in a breeding program to enhance the accumulation of preferred nucleic acid sequences.
Using this method, the present invention contemplates that nucleic acid sequences of interest are selected from a large population of plants. Additionally, these nucleic acid sequences can be used in the described breeding methods to accumulate other beneficial and preferred loci and maintain these in a breeding population to enhance the overall germplasm of the crop plant. Crop plants considered for use in the method include but are not limited to, corn, soybean, cotton, wheat, rice, canola, oilseed rape, sugar beet, sorghum, millet, alfalfa, forage crops, oilseed crops, grain crops, fruit crops, ornamental plants, vegetable crops, fiber crops, spice crops, nut crops, turf crops, sugar crops, beverage crops, tuber crops, root crops, and forest crops.
The present invention provides methods of use for a portable, isothermal nucleic acid detection technology. Applications include methods for nucleic acid screening during breeding. This genotyping capability can be used for pre-selection, progeny selection, and stacking of genomic traits. In particular, this is especially useful for genotyping “on the fly” wherein potential resource efficiency is possible by only growing to maturity plants comprising one or more preferred genotypes. In one aspect, seedlings are raised in a greenhouse, leaf tissue is sampled, and only preferred plants are further propagated, either in the greenhouse or transplanted to the field. In particular, this tool is used in a multi-season program (i.e., winter nursery) where facilities and resources are not conducive to a typical lab-based genotyping facility but this tool allows for genotypic data to be collected and subsequently populated in a database, allowing decision-making by the breeder regardless of his/her geographic location.
In still another embodiment, the present invention acknowledges that preferred nucleic acids identified by the methods presented herein may be advanced as candidate genes for inclusion in expression constructs, i.e., transgenes.
Nucleic acids underlying expression profiles of interest may be expressed in plant cells by operably linking them to a promoter functional in plants. Nucleic acids for proteins disclosed in the present invention can be expressed in plant cells by operably linking them to a promoter functional in plants. Tissue specific and/or inducible promoters may be utilized for appropriate expression of a nucleic acid for a particular trait. The 3′ un-translated sequence, 3′ transcription termination region, or polyadenylation region means a DNA molecule linked to and located downstream of a structural polynucleotide molecule responsible for a trait and includes polynucleotides that provide a polyadenylation signal and other regulatory signals capable of affecting transcription, mRNA processing or gene expression. The polyadenylation signal functions in plants to cause the addition of polyadenylate nucleotides to the 3′ end of the mRNA precursor. The polyadenylation sequence can be derived from the natural gene, from a variety of plant genes, or from T-DNA genes. A 5′ UTR that functions as a translation leader sequence is a DNA genetic element located between the promoter sequence and the coding sequence. The translation leader sequence is present in the fully processed mRNA upstream of the translation start sequence. The translation leader sequence may affect processing of the primary transcript to mRNA, mRNA stability or translation efficiency.
In another aspect, nucleic acids underlying expression profiles of interest may have their expression modified by double-stranded RNA-mediated gene suppression, also known as RNA interference (“RNAi”), which includes suppression mediated by small interfering RNAs (“siRNA”), trans-acting small interfering RNAs (“ta-siRNA”), or microRNAs (“miRNA”). Examples of RNAi methodology suitable for use in plants are described in detail in U.S. patent application publications 2006/0200878 and 2007/0011775.
Methods are known in the art for assembling and introducing constructs into a cell in such a manner that the nucleic acid molecule for a trait is transcribed into a functional mRNA molecule that is translated and expressed as a protein product. For the practice of the present invention, conventional compositions and methods for preparing and using constructs and host cells are well known to one skilled in the art, see for example, Molecular Cloning: A Laboratory Manual, 3rd edition Volumes 1, 2, and 3 (2000) J. F. Sambrook, D. W. Russell, and N. Irwin, Cold Spring Harbor Laboratory Press. Methods for making transformation constructs particularly suited to plant transformation include, without limitation, those described in U.S. Pat. Nos. 4,971,908, 4,940,835, 4,769,061 and 4,757,011, all of which are herein incorporated by reference in their entirety. Transformation methods for the introduction of expression units into plants are known in the art and include electroporation as illustrated in U.S. Pat. No. 5,384,253; microprojectile bombardment as illustrated in U.S. Pat. Nos. 5,015,580; 5,550,318; 5,538,880; 6,160,208; 6,399,861; and 6,403,865; protoplast transformation as illustrated in U.S. Pat. No. 5,508,184; and Agrobacterium-mediated transformation as illustrated in U.S. Pat. Nos. 5,635,055; 5,824,877; 5,591,616; 5,981,840; and 6,384,301.
In summary, this invention describes the novel combination of expression profile analysis and molecular breeding methodologies to enable the use of expression profile information to carry out molecular plant breeding. Taken together, this invention enables the plant breeder to use expression profile information in parent selection, progeny selection, choosing tester combinations, developing pedigrees, fingerprinting samples, screening for haplotype diversity, and for building databases of sequence associations to trait and performance data.
The present example provides the results and analyses used to identify transcript profiles and SNP markers associated with yield in maize. SEQ IDs are provided for the ORF corresponding to probesets associated with yield, wherein these nucleic acids are useful both as molecular markers for either SNP development or sequence-assisted breeding (U.S. Patent Application Ser. No. 60/942,707) and as nucleic acid expression constructs for use in transgenic breeding. Further, data are provided that support the value of these nucleic acid sequences for the prediction of yield, via regression analyses for probesets, expression QTL (eQTL) and combinations thereof.
The plant materials used were testcrosses of a set of maize inbred lines crossed to two testers. The inbred line set was comprised of 82 dihaploid progeny emanating from random crosses among a large number of parental inbreds belonging to the female heterotic group. The dihaploid progeny were comprised of sib-pair or single-individual families, each emanating from a different cross. The testers were comprised of two inbred lines. The tester lines are unrelated to each other and to all of the lines in the dihaploid progeny set. Testcrosses of 80 dihaploid lines by inbred A, and two dihaploid lines by inbred B, were evaluated at three locations in central Iowa over two years. Ear leaf tissue was sampled from each hybrid at each location at anthesis, flash frozen in liquid nitrogen and stored in frozen condition. The leaf samples were then fine-ground and mRNA extracted. The extracted mRNA from each sample was labeled with a fluorescent dye and hybridized to an Affymetrix GeneChip containing 2204 probesets. Each probeset was comprised of 16 25-mer oligonucleotide probes, each probeset representing an expressed sequence tag (EST) from a unigene library. The 2204 probesets were, on the basis of earlier studies, or probeset annotation, thought to be representative of genes associated with yield variation.
To measure gene expression associated with a particular probeset in a sample in a TxP experiment, first mRNA is extracted from the sample, cRNA is transcribed, then sheared to an average size of approximately 100 bases, and then labeled with a fluor. The labeled sheared cRNA prep is then hybridized to the chip, and the level of hybridization as measured by fluorescence is recorded for each of the 1.3 million probes on the chip. Fluorescence intensity from each probe of the probeset was recorded, and the composite fluorescence intensity, estimated as the Tukey mean of all probes in the probeset, was calculated. Fluorescence intensity is assumed to be proportional to mRNA concentration in the sample and, hence, is an indicator of transcriptional activity of the gene represented by the probeset. Transcript activity is commonly referred to as gene transcript expression (TxP).
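The composite intensity calculation described above can be illustrated with a minimal sketch. The sketch assumes that "Tukey mean" refers to a one-step Tukey biweight location estimate; the function name `tukey_biweight` and the tuning values `c` and `eps` are illustrative assumptions and are not taken from the present study.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight estimate of the center of `values`.

    Assumption: this is one plausible reading of the "Tukey mean" of the
    probes in a probeset; c and eps are illustrative tuning constants.
    """
    values = list(values)
    m = median(values)
    # Median absolute deviation from the median (robust spread).
    mad = median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Probes far from the median get weight 0; near ones get weight ~1.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

In this sketch a single aberrant probe intensity has little or no influence on the composite value for the probeset, which is the usual reason for preferring a robust estimate over an arithmetic mean.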
Testcrosses to inbred B of sixty-five members of the dihaploid line set, in which ear leaf samples had been taken at anthesis, had been evaluated in year 1. These testcrosses also had been evaluated at three central Iowa locations. Leaf samples were collected and processed in year 1 in a manner identical to that in year 2.
Each of the dihaploid lines had been genotyped for a large number of SNP markers. Markers from 653 SNP loci were selected for analysis in the study; SNP markers of the present invention are provided in US Patent Application 20060141495 and WO 2008/021225, both incorporated herein by reference in their entirety. The criteria for selection comprised: (1) minimal missing data and (2) informativeness.
All plots in both years were machine-harvested, and grain yield in bushels per acre adjusted to 15.5% grain moisture was recorded.
In the data structure of the study, the experimental design was unbalanced to some extent. Consequently, for further analysis, a line least-squares mean for yield, and for gene expression of each probeset, was calculated using SAS Proc GLM. Values of least-squares means are those which would be obtained if the experimental design was balanced. The model used for the calculation of the least-squares means was
y=location+line+residual
where y stands either for yield or for gene expression of a particular probeset.
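The least-squares mean calculation can be illustrated with a minimal sketch. The study used SAS Proc GLM; the sketch below is not that procedure but an assumed equivalent for the additive model y=location+line+residual, fitted by alternating updates of line and location effects. The function name `line_lsmeans` and the record layout are hypothetical.

```python
from collections import defaultdict

def line_lsmeans(records, n_iter=500):
    """records: iterable of (line, location, y). Returns {line: LS mean}.

    Sketch only: fits y = location + line + residual by alternating
    least-squares updates, assuming lines and locations are connected.
    """
    records = list(records)
    line_eff = defaultdict(float)
    loc_eff = defaultdict(float)
    locations = sorted({loc for _, loc, _ in records})
    for _ in range(n_iter):
        # Update line effects given the current location effects.
        sums, counts = defaultdict(float), defaultdict(int)
        for line, loc, y in records:
            sums[line] += y - loc_eff[loc]
            counts[line] += 1
        for line in sums:
            line_eff[line] = sums[line] / counts[line]
        # Update location effects given the current line effects.
        sums, counts = defaultdict(float), defaultdict(int)
        for line, loc, y in records:
            sums[loc] += y - line_eff[line]
            counts[loc] += 1
        for loc in sums:
            loc_eff[loc] = sums[loc] / counts[loc]
    # LS mean: line effect plus the equally weighted average over ALL
    # locations, including locations where the line was not observed.
    loc_avg = sum(loc_eff[loc] for loc in locations) / len(locations)
    return {line: eff + loc_avg for line, eff in line_eff.items()}
```

For a line missing from a low-yielding location, the raw mean overstates the line's performance, whereas the value returned here is the mean that would be expected had the line been grown at every location.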
Gene expression for each probeset can be regarded as a phenotypically observed trait, no differently than for yield. Hence, hereinafter, ‘probeset’ will variously be understood to mean the name of the probeset or the line least-squares mean for gene expression of the probeset. Which meaning is appropriate will be understood from the context. Also, hereinafter, ‘yield’ will be understood to mean the line least-squares mean of yield.
Simple single-factor regressions of yield on each probeset and yield on each SNP marker locus were calculated. In addition, single-factor regressions of each probeset on each SNP marker locus were calculated.
The model for each regression was of the form
y=a+xb+residual
where y stands either for the yield or probeset expression, a is the intercept, x is either probeset expression or SNP genotype (coded as 1 for allele 1, −1 for allele 2, and 0 for missing), b is the regression coefficient, and residual is deviation from regression. An F test for significance of the regression coefficient was calculated for each regression analysis. Because many regression analyses were calculated for yield and each probeset, false positives due to multiple testing needed to be taken into account. To guard against false positives, the false discovery rate (FDR) was calculated (Benjamini and Hochberg 1995) in addition to the F test for significance in each single-factor analysis. Because the results from the single-factor analyses are quite voluminous, they are not presented here but can be accessed from the SAS data sets Single_Factor_M and Single_Factor_P.
A set of 68 probesets was identified that, on the basis of single factor analysis of probeset expression on marker genotype, mapped very closely to one or more marker loci.
A summary of the results of PLS regression of yield on the 68 probesets appears in Table 1. The rsquare value for the model was 47%, meaning that variation in the 68 probesets explained 47% of the variation in yield. The inference, under the assumptions relevant to PLS regression, is that the model comprised of the 68 probesets is the most predictive model for prediction of yield in future samples which are drawn from the same population of testcrosses, and which are evaluated across a sample of environments drawn from the same population of environments typical of those in the present sample. Also reported is the variable importance for projection (VIP) as measured in SAS. This value indicates the value of the probeset as a predictor of yield. VIP values greater than 0.8 are considered indicative of strong predictive power.
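The single-factor analysis and the false discovery rate adjustment can be sketched as follows. The function names are hypothetical, and the sketch stops at the F statistic: obtaining its p-value requires an F distribution with 1 and n−2 degrees of freedom, which is assumed to come from a statistics package. The adjustment shown is the Benjamini and Hochberg step-up procedure applied to a list of such p-values.

```python
def single_factor_f(x, y):
    """Fit y = a + x*b + residual; return (a, b, F) with F on (1, n-2) df.

    x may be probeset expression or SNP genotype coded 1, -1, or 0 (missing).
    Assumes x is not constant and the fit is not perfect (residual SS > 0).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    ssr = b * b * sxx  # regression sum of squares, 1 degree of freedom
    return a, b, ssr / (sse / (n - 2))

def benjamini_hochberg(pvalues):
    """Return FDR-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A regression would then be declared significant when its adjusted value falls below the chosen FDR level; the level used in the study is not stated here, so none is assumed.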
The regression coefficients for the probesets listed were obtained from transformation of the regression coefficients from the underlying factor model. This factor model found to be most predictive of yield was comprised of just a single factor, suggesting a parsimonious relationship between yield and underlying gene expression variation. That is, the underlying biological relationship may be quite direct.
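The relationship between a single-factor PLS model, its regression coefficients, and VIP can be sketched as follows. This is an assumed, simplified PLS fit with one response and one factor on centered and scaled data; it is not the SAS procedure used in the study, and it does not include the variable selection that produced the final set of probesets. The function names are hypothetical, and the VIP formula is one common definition reduced to the one-factor case.

```python
from math import sqrt

def standardize(col):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def pls1_single_factor(columns, y):
    """columns: list of predictor columns; y: response values.

    Returns (coefficients on the standardized scale, rsquare, VIP list).
    """
    xs = [standardize(c) for c in columns]
    ys = standardize(y)
    n, p = len(ys), len(xs)
    # Weights: direction in predictor space of maximum covariance with y.
    cov = [sum(xj[i] * ys[i] for i in range(n)) for xj in xs]
    norm = sqrt(sum(c * c for c in cov))
    w = [c / norm for c in cov]
    # Scores of the single underlying factor.
    t = [sum(w[j] * xs[j][i] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t)
    q = sum(t[i] * ys[i] for i in range(n)) / tt  # regression of y on factor
    # Transform the factor-model coefficient back to one per predictor.
    coef = [wj * q for wj in w]
    r2 = q * q * tt / sum(v * v for v in ys)
    # With one factor, VIP depends only on the normalized weights.
    vip = [sqrt(p) * abs(wj) for wj in w]
    return coef, r2, vip
```

Under this definition the squared VIP values average to one across predictors, which is why a fixed threshold such as 0.8 can be applied regardless of the number of predictors.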
The results of PLS regression of yield on SNP markers per se are given in Table 2. The results for markers are remarkably similar to those for probesets. Again, a model with only a single underlying factor was found to be most predictive of yield. Of the total of 653 markers, 24 markers comprised the final model in which variation among the marker genotypes accounted for 48% of the variation in yield.
Approximate map positions of the 68 probesets most predictive of yield appear in Table 3. The map positions were arrived at by assuming that the probeset locus would be proximal to the marker locus having the lowest p-value for type I error in the single-factor regressions of probesets on markers. If this assumption is valid, precision of placement of the probeset will then be proportional to the size of the corresponding p-value.
Similarly to the PLS regression of yield on marker genotypes, PLS regression of probeset on marker genotypes was calculated for each of the 68 probesets found to be most predictive of yield. As with yield, the final model for each probeset contained a reduced number of marker loci. The retained marker loci were then assumed to be proximal to an expression QTL (eQTL) for the probeset, facilitating construction of an eQTL map for each probeset (Table 4).
Next, by including markers proximal to probesets and markers proximal to eQTL, both probesets and eQTL were placed on the same map, and the cis or trans orientation of eQTL relative to the probesets was determined (Table 5). A cis orientation was declared if the marker proximal to the probeset and the marker proximal to the eQTL were one and the same. In Table 5, a value of 1 for the cis match variable indicates this condition. A value of 0 indicates trans orientation. The one-to-one marker match rule for cis orientation has, as a result, a declaration of trans orientation of some eQTL in close proximity to the putative probeset locus. Alternatively, one could have declared cis orientation if the proximal probe marker and proximal eQTL marker were located within the limits of some arbitrarily prescribed map distance.
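The placement and orientation rules described above can be sketched as follows. The function names, probeset and marker identifiers, and data layout are hypothetical; the logic follows the stated rules, namely that a probeset is placed at the marker with the lowest p-value in the single-factor regressions, and that an eQTL is declared cis only when its proximal marker is that same marker.

```python
def proximal_marker(pvalues_by_marker):
    """Return the marker with the lowest p-value for one probeset.

    pvalues_by_marker: {marker name: p-value from the regression of the
    probeset on that marker}.
    """
    return min(pvalues_by_marker, key=pvalues_by_marker.get)

def cis_match(probeset_marker, eqtl_markers):
    """Return {eQTL marker: 1 for cis, 0 for trans} for one probeset.

    One-to-one rule: cis only if the eQTL marker IS the probeset's marker.
    """
    return {m: int(m == probeset_marker) for m in eqtl_markers}

# Hypothetical example: one probeset, three markers, two retained eQTL markers.
pvals = {"M1": 0.20, "M2": 0.001, "M7": 0.03}
placed_at = proximal_marker(pvals)
orientation = cis_match(placed_at, ["M2", "M7"])
```

The alternative rule mentioned above would replace the equality test with a comparison of map distance against a prescribed limit, which would require map positions not included in this sketch.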
PLS Regression of Yield on Cis or Trans Oriented Proximal eQTL Markers
PLS regressions of yield on marker genotype were calculated separately for cis and trans oriented markers. That is, one regression analysis was calculated for proximal eQTL markers having a “cis match” value equal to 1 in Table 5, and another regression analysis was calculated for proximal eQTL markers having a “cis match” value equal to 0.
Variation in cis oriented proximal eQTL marker genotypes accounted for only 38% of the variation in yield whereas the trans oriented counterparts accounted for 46% of yield variation, suggesting that variation in trans acting factors on gene expression at a locus is more important than variation at the locus itself (Table 6).
Note that some markers proximal to a probeset were declared to be in both cis and trans orientation. This occurs because the marker in question is proximal to an eQTL in cis orientation to the particular probeset locus while at the same time being proximal to an eQTL in trans orientation to another probeset.
All markers, classified cis, trans, or both, were entered into another PLS regression analysis with yield (Table 7). In the final model, variation in marker genotype accounted for 47% of the variation of yield, exactly the same rsquare value obtained from the regression of yield on trans oriented eQTL markers only. This result reinforces the conclusion that variation in trans acting elements may be more important than variation at the structural genes themselves.
The cis-trans orientation of all proximal eQTL markers to probesets is more fully portrayed in Table 8.
Table 9 provides a summary of the significant markers, including the corresponding SEQ ID No.
This example provides nucleic acid sequences associated with grain yield in maize. SEQ IDs are provided for the ORF corresponding to probesets associated with yield, wherein these nucleic acids are useful both as molecular markers for either SNP development or sequence-assisted breeding (U.S. Patent Application Ser. No. 60/942,707) and as nucleic acid expression constructs for use in transgenic breeding. Further, data are provided that support the value of these nucleic acid sequences for the prediction of yield, via correlation analyses for the probesets.
These yield-associated nucleic acids were identified from a transcript profiling experiment referred to as Design II involving corn elite hybrid lines. The method uses inbreds which may or may not be related by descent and uniquely combines them in a Design II mating scheme to generate hybrids. The method further uses the hybrids derived by combining related inbreds to identify genetic gain for gene expression in relation to key economic traits like increased yield, nitrogen and stress tolerance.
The uniqueness of the design allows for the simultaneous evaluation of: general combining ability for gene expression in relation to key economic traits; specific combining ability for gene expression in relation to key economic traits; additivity in gene expression with respect to key economic traits; dominance in gene expression with respect to key economic traits; and heterosis and genetic gain for gene expression in relation to key economic traits. In addition, this design allows for identification of: key regulatory pathways affecting genes that impact key economic traits; mutations in genes leading to a change in gene expression in the process of breeding new inbreds which may contribute for a greater performance with respect to key economic traits; and genes which have a significantly different gene expression in the newer and more elite inbred lines (offspring) compared to parents based on differential gene expression in hybrids of which these parents are a part.
In the Design II experiment, 30 hybrids from six female and five male inbred lines were grown in three locations (JA, JE, and CA). Both leaf and kernel tissues were harvested at R2 and R4 stages for the TxP experiment. In addition, kernel tissue at mature stage was also sampled at the JA and JE locations for TxP. Hybrid yield data were collected at individual locations.
Correlation analysis was conducted between gene expression level and yield across the 30 hybrids by individual tissue and stage on each location. The genes showing the strongest correlations (both positive and negative) at all of the locations were selected. The cutoff for correlation coefficients was 0.5 or larger (−0.5 or lower) at each location. To this end, a total of 22 probesets were identified (5 positive, 16 negative). Very interestingly, 21 of the 22 strongest correlations are from the kernel rather than the leaf tissue.
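The selection rule described above can be sketched as follows for a single tissue and stage. The function names and data layout are hypothetical; the sketch assumes that a probeset is retained only when its correlation with yield meets the cutoff with the same sign at every location.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def consistent_probesets(expr, yields, cutoff=0.5):
    """Select probesets correlated with yield at every location.

    expr:   {probeset: {location: [expression per hybrid]}}
    yields: {location: [yield per hybrid]}, hybrids in the same order.
    Returns {probeset: "positive" or "negative"}.
    """
    selected = {}
    for probeset, by_loc in expr.items():
        rs = [pearson(by_loc[loc], yields[loc]) for loc in yields]
        if all(r >= cutoff for r in rs):
            selected[probeset] = "positive"
        elif all(r <= -cutoff for r in rs):
            selected[probeset] = "negative"
    return selected
```

Running the function once per tissue and stage combination and pooling the results would give a candidate list of the kind summarized above.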
These nucleic acid sequences are good candidates for use as markers in breeding as well as for testing in a corn transgenic pipeline. Biotechnological manipulation of these genes (overexpression, regulated expression, suppression, or expression of protein variants) is expected to impart enhanced agronomic traits to plants, e.g., increased yield, enhanced stress tolerance, increased nutrient utilization, increased seed oil, etc.
The use of expression profile data enables detection of rare alleles or haplotypes in the genome of a plant. This is particularly important for leveraging rare but important genomic regions in a breeding program, such as a disease resistance locus from exotic or unadapted germplasm, wherein rare alleles are defined as occurring in low frequency within the germplasm pool and potentially being previously undetected within the germplasm pool. The present example provides methods for rare allele detection, experimental design (i.e., selecting exotic germplasm, germplasm with known phenotype of interest, screening non-elite germplasm), and utility (i.e., introgression programs for beneficial rare variants for specific traits and/or to expand germplasm diversity in one or more specific germplasm pools such as per maturity zone).
A set of germplasm comprising at least 2 germplasm entries is provided. Non-limiting factors influencing inclusion in a sequencing project for at least one locus include germplasm origin or geography, at least one genotype of interest, at least one phenotype of interest, performance in hybrid crosses, performance of a transgene, and other observations of the germplasm or predictions relating the germplasm and its performance.
Using the methods and approaches presented herein, at least one base pair position is queried using isolated nucleic acids and expression profile technology for at least 2 germplasm entries. Using methods known in the art for expression profile statistical methods, bioinformatics, and in silico evaluation, differences and similarities are identified and linked to the source germplasm entry. Following identification of alleles of interest, selection decisions can be made.
In the case of rare allele mining, the rare allele may be associated with a known phenotype. In addition, the identification of the rare allele can provide the basis for additional phenotyping, association studies, and other assays to evaluate the effect of the rare allele on plant phenotype and breeding performance. Further, the nucleic acid sequence of the rare allele can be immediately leveraged for use as a marker via methods known in the art and described herein to detect this rare allele in additional germplasm entries, to be used as a basis for selection, and to facilitate introgression of the rare allele in germplasm entries lacking the rare allele. In other aspects, the rare allele is isolated and the isolated nucleic acid is transformed into a plant using methods known in the art in order to confer a preferred phenotype to the recipient plant. The recipient plant can subsequently be used as a donor for conversion programs to cross with elite germplasm for trait integration purposes.
The identification of rare alleles is useful for leveraging the full genetic potential of any germplasm pool, i.e., set of 2 or more germplasm entries. This is useful for determining breeding cross strategy, increasing the diversity between 2 or more germplasm pools, evaluating heterotic pools, and informing breeding decisions. High throughput sequencing both accelerates the identification of the alleles and allows simultaneous detection of rare alleles and identification of associated markers.
This application claims priority to U.S. Provisional Application No. 61/085,501 (filed Aug. 1, 2008), which is incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
61085501 | Aug 2008 | US