The present application claims priority from Japanese patent application JP 2018-3697 filed on Jan. 12, 2018, the content of which is hereby incorporated by reference into this application.
The present disclosure relates to an apparatus and a method for transcriptome analysis comprising analyzing information concerning a transcriptome.
As an attempt of estimating an organism's phenotype based on gene expression, a method of multiple regression analysis using gene expression data and phenotype data is known (Nature Communications 5, Article number: 5792, 2014; and WO 2016/148107). In the method disclosed in Nature Communications 5, Article number: 5792, 2014, limited gene expression data were used by, for example, selectively adopting data of the same operon exhibiting the highest expression level so as to avoid overlapping of gene expression data.
In general, the term “transcriptome” refers to all transcription products existing in tissue or cells under given conditions. Transcriptomes encompasses a transcription product from a coding region of the genome (i.e., mRNA) and a transcription product from a non-coding region of the genome (i.e., ncRNA). Through transcriptome analysis, a novel finding based on a state of gene expression, such as variation in gene expression affected by environmental factors or identification of genes expressed associated with phenotypes, can be achieved.
When analyzing a transcriptome, for example, transcription products existing in tissue or cells are extensively assayed with the use of microarray techniques or next-generation sequencing techniques. The assayed data are large quantity of nucleotide sequence data, which are typical big data.
As a method of statistical analysis of the data obtained, as disclosed in JP Patent No. 5854346, a method of adopting a technique of principal component analysis, which is one of technique of multivariate analyses, is known. Such method comprises conducting main component analysis concerning the training data instead of the nucleotide sequence data obtained by analysis, to achieve results that are comparable among samples under different conditions.
As a method of transcriptome analysis, as disclosed in JP 2017-51118 A, a method of generating the property variable estimation model for a target of analysis based on gene expression information (condition variable) and trait information (property variable) is known. In the method disclosed in JP 2017-51118 A, a regression model having regularization terms is generated by designating a property variable as an objective variable (i.e., a dependent variable) and each of condition variables as an explanatory variable. As a formula for determining a regression model, the least absolute shrinkage and selection operator regression (LASSO) method is exemplified.
LASSO regression is a technique of regularization (L1 regularization) adopted for prevention of overfitting in the fields of statistics, machine learning, and the like. LASSO regression is a technique of regression modeling based on sparse regularization, which is carried out by designating the parameter of less important data selected from a large quantity of data as zero and removing the same from the data (Robert Tibshirani, Journal of the Royal Statistical Society, Series B (Methodological) Vol. 58, No. 1, 1996, pp. 267-288).
In the transcriptome analysis described above, the number of transcription products with the available nucleotide sequences is very large relative to the number of samples to be analyzed. By the method disclosed in Nature Communications 5, Article number: 5792, 2014, accordingly, it was difficult to attain the meaningful results of analysis. In the case of the method of analysis that adopts the LASSO regression analysis disclosed in JP 2017-51118 A, satisfactory results of analysis are expected even if the number of transcription products with the available nucleotide sequences is very large relative to the number of samples to be analyzed. However, transcriptome analysis was required to show further improvement in accuracy of the results of analysis.
Under the above circumstances, the present disclosure provides an apparatus and a method for transcriptome analysis that enable more accurate transcriptome analysis with the use of the nucleotide sequence data concerning a transcription product.
The present disclosure encompasses the following.
a data set-generator that generates the first to the mth subdata sets (m≥2) by randomly deleting gene expression level data from a plurality of data sets including objective variable data and gene expression level data;
an estimation formula-calculator that applies the method of regression analysis having regularization terms to each of the first to the mth subdata sets to calculate the first to the mth estimation formulae using objective variable data as objective variables and gene expression level data as explanatory variables; and
a list of genes-generator that generates a list of genes corresponding to the gene expression level data included in the first to the mth estimation formulae.
a step of generating a subdata set that generates a subdata set by randomly deleting gene expression level data from a plurality of data sets including objective variable data and gene expression level data;
a step of calculating an estimation formula that applies the method of regularization to the subdata set to calculate an estimation formula using objective variable data as objective variables and gene expression level data as explanatory variables;
a step of gene recording that records genes corresponding to the gene expression level data included in the estimation formula; and
a step of generating a list of genes that repeats the step of generating the subdata set, the step of calculating an estimation formula and the step of gene recording m number of times (m≥2) to generate a list of the recorded genes.
The apparatus and the method for transcriptome analysis according to the present disclosure enable transcriptome analysis with high accuracy. With the use of the apparatus and the method for transcriptome analysis according to the present disclosure, accordingly, analysis of variations in gene expression caused by particular factor such as a given state or condition, expression analysis of genes correlated with phenotypes, prediction analysis of traits based on gene expression, and the like can be performed with high accuracy.
Hereafter, the apparatus and/or the method for transcriptome analysis of the present disclosure are described in detail with reference to the drawings.
As shown in
The data set to be inputted into the apparatus 1 for transcriptome analysis includes given objective variable data and gene expression level data (p-dimensional data). The term “objective variable data” refers to numerical data concerning phenotypes including quantitative or qualitative traits, numerical data concerning the presence or absence of conditions of circumferential environment and an extent thereof, and numerical data concerning the presence or absence of the treatment provided for the target organism of analysis and an extent thereof More specific examples of objective variable data include data concerning the amount of target organisms of analysis, such as plants, produced (e.g., the weight of the terrestrial part, the weight of the root, the area of the leaf, and the seed yield) and data concerning stress imposed on target organisms of analysis (e.g., the duration of high-temperature treatment, the duration of low-temperature treatment, the concentration of drug treatment, and the duration of pest stress exposure).
The gene expression level data are numerical data indicating the relative expression level of the transcription product to be observed. Specific examples of gene expression level data include microarray data obtained with the use of commercially available microarrays for gene expression analysis (for transcriptome analysis) and the contract service for gene expression analysis provided in the market and sequence data obtained via expression analysis (RNA-Seq) using the next-generation sequence. In particular, the gene expression level data are preferably sequence data obtained via expression analysis (RNA-Seq) using the next-generation sequencer because the sequence data obtained via expression analysis (RNA-Seq) using the next-generation sequencer encompasses transcription products of target organisms.
With the use of the apparatus 1 for transcriptome analysis, a list of genes capable of describing the objective variable data can be generated by analyzing the data set described above. The list of genes is not limited to the list of genes encoding proteins in a narrow sense, and it encompasses a list of transcription products from the non-coding region.
When the early-stage growth level of a plant (the plant weight during a given period) is designated as the objective variable data, for example, a list of genes capable of describing the early-stage growth level can be generated with the use of the apparatus 1 for transcriptome analysis. When the concentration of a drug used to treat a plant is designated as the objective variable data, a list of genes expressed in accordance with the concentration of the drug can be generated with the use of the apparatus 1 for transcriptome analysis. When the temperature at the time of sampling is designated as the objective variable data, a list of genes expressed in accordance with the growth temperature can be generated with the use of the apparatus 1 for transcriptome analysis.
The apparatus 1 for transcriptome analysis of the constitution shown in
At the outset, the gene expression level data outputted by a microarray apparatus, next-generation sequencer, or the like and the objective variable data are inputted (Step S1). The inputted gene expression level data are designated as the p-dimensional vector of explanatory variables (x={x1, . . . . . . , xp}). The inputted objective variable is designated as “y.” In this example, n sets of data sets ({(yi, x1)|i=1, . . . . . , n}) composed of the p-dimensional vector of explanatory variables x and the objective variable y are to be inputted.
In the data set generation unit 2, subsequently, the gene expression level data contained in the inputted n sets of data sets are sampled at random, so as to generate p−1 dimensional or lower subdata sets (Step S2). In this step, the mth subdata set designating the initial value as m=1 is generated. In other words, the generated mth subdata set is defined as a data set including a smaller number of gene expression level data than the inputted data sets as a result of random deletion of the gene expression level data contained in the inputted data set.
The generated mth subdata set may be a part of the gene expression level data (p-dimensional) included in the inputted data sets. For example, the mth subdata set may be 5% to 90%, 5% to 70%, 5% to 50%, 10% to 50%, 10% to 25%, or 10% to 15% of the p-dimensional gene expression level data.
When the number of gene expression level data is 30,000 (i.e., P=30,000: 30,000 transcription products), for example, the mth subdata set generated in the data set generation unit 2 can include randomly selected 1,000 to 20,000 gene expression level data, preferably 1,500 to 15,000 gene expression level data, more preferably 1,500 to 7,500 gene expression level data, and further preferably 1,500 to 4,500 gene expression level data.
In the estimation formula calculation unit 3, subsequently, the mth subdata set generated in the data set generation unit 2 is subjected to the method of regression analysis having regularization terms to calculate the mth estimation formula having the objective variable data as objective variables and the gene expression level data as explanatory variables (Step S3). The method of regression analysis having regularization terms is also referred to as the “regularization regression model,” which is a method of analysis comprising adding restrictions (penalties) to the least squares to shrink the estimator. Specific examples of methods of regression analysis having regularization terms include the method of LASSO regression analysis, the method of Ridge regression analysis, and the method of elastic net regression analysis. It is particularly preferable that the estimation formula be calculated in accordance with the method of LASSO regression analysis. When the method of LASSO regression analysis is adopted, in particular, the parameter of the gene expression level data that are not important to describe objective variables is designated as 0 in the estimation formula calculated in this step.
When calculating an estimation formula in accordance with the method of LASSO regression analysis, a reference may made to Friedman et al., Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, January 2010, Volume 33. Issue I.
In the gene list generation unit 4, subsequently, the gene expression level data included in the estimation formula calculated in the estimation formula calculation unit 3 are extracted and genes corresponding to the extracted gene expression level data are recorded (Step S4). Since the estimation formula is calculated by the method of regression analysis having regularization terms, specifically, the gene expression level data that are important to describe objective variables can be selectively extracted. When the method of LASSO regression analysis is adopted as the method of regression analysis having regularization terms, for example, gene expression level data other than the gene expression level data with the parameter designated as 0 are extracted.
In Step S5, subsequently, whether or not the processing of Steps S2 to S4 were repeated the predetermined number of times (in times) is determined. When the number of repetition was determined as 10,000 in advance (m=10,000), for example, the first subdata set with the initial value designated as 1 was subjected to the processing of Steps S2 to S4. In Step S5, subsequently, the value (m=1) is compared with 10,000 and the processing proceeds to Step S6. In Step S6, the in value is increased by 1, and Steps S2 to S5 are repeated until the m value reaches 10,000.
By repeating the Steps S2 to S6 the in number of times, the 1st to the mth subdata sets of the n sets of data sets inputted in Step S1 can be appropriately subjected to the first to the mth estimation formulae.
In Step S7, subsequently, a list of genes recorded in Step S4 with respect to the first to the mth estimation formulae is outputted in the gene list generation unit 4. The list of genes generated in the gene list generation unit 4 is not particularly limited, genes corresponding to the gene expression level data extracted in the gene list generation unit 4 may be enumerated, or the list may have format comprising the correlation between the genes corresponding to the extracted gene expression level data and the appearance probability of the gene. The appearance probability can be calculated as the frequency of a particular gene to be included relative to the number of all the genes included in the first to the mth estimation formulae.
The list of genes generated in the gene list generation unit 4 may selectively include genes exhibiting the appearance probability exceeding a given level. Alternatively, the list may include genes in descending order of the appearance probability.
As shown in
According to the lists of the genes shown in
The apparatus and the method for transcriptome analysis of the present disclosure are not limited to those according to the first embodiment described above. As shown in
The apparatus 10 for transcriptome analysis shown in
As shown in
In the estimation model formula generation unit 11, an estimation model formula describing given objective variable data may be generated through multiple regression analysis or machine learning with the use of objective variables y and explanatory variables x concerning all the genes included in the list generated in the gene list generation unit 4. Alternatively, an estimation model formula describing given objective variable data may be generated through multiple regression analysis or machine learning with the use of objective variables y and explanatory variables x concerning a part of the genes included in the list. Examples of a part of genes included in the list include genes exhibiting the expression frequency over the threshold or genes correlated with given annotation information.
A method for constructing an estimation formula is not particularly limited, and examples include multiple regression analysis, such as the method of LASSO regression analysis, the method of Ridge regression analysis, and the method of elastic net analysis, and machine learning, such as Random forest analysis and deep learning.
For example, an estimation model formula can be prepared by the Random forest method in the estimation model formula generation unit 11. The estimation model formula prepared by the Random forest method is a model formula in the form of a decision tree that calculates a given objective variable y and is prepared as a function for the expression level data x of the gene included in the list generated in the gene list generation unit 4. With the estimation model formula generated in the estimation model formula generation unit 11, the estimated values for the objective variables concerning a given organism can be determined on the basis of the gene expression level data obtained for the organism.
When the estimation model formula is constructed by designating the plant's seed yield as an objective variable, for example, the plant's seed yield can be estimated with the use of the gene expression level data obtained from the target plant. Specifically, the objective variable, such as the seed yield, can be estimated based on the gene expression level data that are readily obtained using a next-generation sequencer without performing cultivation tests of the target plant.
With the use of the apparatus for transcriptome analysis according to the present embodiment, as described above, an estimation model formula concerning a given objective variable can be prepared. With the use of the estimation model formula prepared, for example, the property evaluation system 20 for the target organism can be constructed as shown in
The property evaluation system shown in
When the gene expression level data concerning the target organism are inputted, the estimation unit 22 assigns the gene expression level data into the relevant estimation model formula stored in the storage unit 21, and it can determine the estimated values for various objective variables.
As described above, the property evaluation system 20 stores estimation model formulae for various objective variables in the storage unit 21. On the basis of the gene expression level data of the target organism, accordingly, the estimated values concerning the objective variables can be outputted. When the gene expression level data concerning a given plant are inputted, for example, estimated values concerning the objective variables, such as the weight of the terrestrial part, the weight of the root, the leaf area, the seed yield, the duration of high-temperature treatment, the duration of low-temperature treatment, the concentration of drug treatment, and the duration of exposure to disease and pest stress, can be collectively or selectively obtained.
The apparatuses and the methods for transcriptome analysis according to the first embodiment and the second embodiment described above can be realized by, for example, a computer comprising a central processing unit (CPU), a main storage unit, an auxiliary storage unit, an output unit, and an input unit. For example, objective variable data and gene expression level data can be inputted through the input unit under the control of CPU and stored in the main storage unit or in the auxiliary storage unit. For example, the mth subdata set can be generated under the control of CPU in accordance with a given algorithm. In addition, the mth estimation formula based on the mth subdata set can be generated under the control of CPU in accordance with a given algorithm. Thus, the apparatuses and the methods for transcriptome analysis according to the first embodiment and the second embodiment described above can be realized under the control of CPU.
The apparatuses and the methods for transcriptome analysis according to the first embodiment and the second embodiment described above can be realized by so-called cloud computing. In cloud computing, for example, objective variable data and gene expression level data stored in a cloud server can be used, and the generated estimation formulae and gene lists can be stored in the cloud server.
Hereafter, the present disclosure is described in greater detail with reference to the following examples, although the technical scope of the present disclosure is not limited to these examples.
In this example, Ouu 365/Arroz da Terral/Guu 365 backcross inbred lines (Bits) described in Fukuda et al., 2014, Plant Production Science 17: 41-46 were used. Seeds of the lines were disinfected with a 50-fold dilution of hypochlorite, washed 3 times with tap water, and then soaked in water at 30° C. for 2 days for germination. Germinated seeds (24 seeds per line) were sowed in a hydroponic float system (Fukuda et al., 2012, Plant Production Science 15: 183-191) and allowed to grow in a hydroponic solution (Hayashi and Chino, 1986, Plant and Cell Physiology 27: 1387-1393). The hydroponic solution was replaced with a fresh solution every 2 days, and the plants were allowed to grow at 20° C. in a growth chamber in a light-dark cycle of 12 hours for 14 days.
Seedlings of the BIL104 lines and 2 parent lines were sampled 14 days after germination and dehydrated in a dryer at 80° C. for 2 days, Thereafter, seeds and roots were removed and the resulting plants were weighed. An experiment was composed of 3 biological replicates, and the average dry weight of the terrestrial parts of the seedlings was used for QTL analysis. BIL genotypes were analyzed with the use of 124 types of SSR markers (Fukuda et al., 2014, Plant Production Science 17: 41-46), and QTL analysis was carried out with the use of MAPMAKER/EXP 3.0 (Lander et al., 1987, Genomics 1: 174-181, doi: 10.1016/0888-7543 (87) 90010-3) and QTL Cartographer 2.5 (Wang et al., 2010, Statistical Genetics & Bioinformatics, provided by North Carolina State University).
Parent lines (Ouu365 and Arroz da Terra) and BIL20 lines with different early-stage grow levels were selected and subjected to RNA-seq analysis. Seeds and roots were removed from the seedlings 14 days after germination, the fresh weight of the terrestrial parts of the seedlings was measured, and the seedlings were frozen in liquid nitrogen, followed by storage at −80° C. before analysis. RNA was extracted using the RNeasy mini Kit (manufactured by Qiagen) and RNA-seq analysis was then performed. After quantitative and qualitative RNA analyses were carried out with the use of the 2100-Bioanalyzer (manufactured by Agilent Technologies), a sequencing library was prepared with the use of the TruSeq RNA LT Sample Preparation Kit v2 (manufactured by Illumina Inc.). The library was subjected to 100-bp single-end read cycles of sequencing on Illumina Hiseq 2000. Fastq files of the sequencing results are shown in DDBJ Sequence Read Archive (DRA), Accession No. DRA006312.
The sequence data were subjected to mapping with the use of TopHat2 (Kim et al., 2013, Genome Biology 14: 13, doi: 10.1186/gb-20 13-14-4-r36; Trapnell et al., 2009, Bioinformatics 25: 1105-1111. doi:10.1093/bioinformatics/btp120) using the Oryza saliva-Nipponbare-Reference-IRGSP-1.0 genome (Oryza saliva.IRGSP-1.0.24.dna.toplevel.fa.gz, ftp://ftp.ensemblgenomes.org/pub/release-24/plants/fasta/oryza_sativa/dna/) and the gene set (Oryza sativa, IRGSP-1.0.24.gtf.gz, ftp://ftp.ensemblgenomes.org/pub/release-24/plants/gtf/oryza_sativa/) as the reference sequences. Each gene expression level was calculated in terms of the fragments per kilobase million (FPKM) value.
The biomarker for the expression level representing the fresh weight of the terrestrial part of the seedling and the appearance probability of gene were examined in the manner described below. The genes (37,043 species) exhibiting the average expression level of 0.01 or higher were subjected to the analysis in the manner described below. Each gene expression level was converted into the Log 2 value after 0.01 was added to the FPKM value. The biomarker for the expression level was selected using the L1-formed regression model by the LASSO method (Tibshirani, 1996, Journal of the Royal Statistical Society Series, B-Methodological 58: 267-288). In order to calculate the biomarker gene appearance probability, biomarker selection was repeated with the use of a transcriptome subset. From among 37,043 genes, 10% thereof were randomly selected and used for LASSO analysis as variables. From among the inputted variables, 8 genes were selected as appropriate explanatory variables exhibiting a coefficient other than zero and designated as the biomarkers for the expression level. Selection of the subset and quantification of the biomarkers for the expression levels were repeated 10,000 times. The frequency for each gene to be selected as a biomarker in 10,000 trials was determined. Analysis was carried out using the R glmnet package (R. Core Team, 2015, R: A language and environment for statistical computing, https://www.R-project.org/).
The SG1 gene coding regions of Ouu365 and of Arroz da Terra and upstream 2,108-bp regions were amplified via PCR using the primers: 5′-GGGACGTGATAACCGACTCA-3′ (SEQ ID NO: 1) and 5′CCCCACTGTACGTTCTCTCC-3′ (SEQ ID NO: 2). The PCR products were purified using the illustra ExoProStar kit and sent to Fasmac Co., Ltd. for sequencing.
In order to detect single nucleotide substitution at a site 1,948-bp upstream from the translation initiation site, the PCR product was amplified using the following primers: 5′-GGGACGTGATAACCGACTCA-3′ (SEQ ID NO: 3) and 5′- TTCAGGTCACCTAGCCCATC-3′ (SEQ ID NO: 4) and cleaved with the restriction enzyme HaeIII. While the Arroz da Terra sequence (GGCC) was cleaved with HaeIII, the Ouu365 sequence (AGCC) was not cleaved.
Total RNA was extracted from the terrestrial part of the seedling in the manner described above. cDNA was synthesized. from 1 μg of total RNA using the PrimeScript RT reagent Kit with gDNA Eraser (Takara Bio). The SG1 cDNA level was quantified via real-time PCR using the Thermal Cycler Dice Real Time System III, SYBR Premix Ex Taq, and the primer set OA045647 (Takara Bio). Real-time PCR was composed of 3 technical replicates. In order to determine the copy number of SG1 mRNA, the PCR product of SG1 was amplified using cDNA of Ouu365 as a template and the following primers: 5′-CGACCAGCTGATCTCCAA-G3′ (SEQ ID NO: 5) and 5-CATTTTTACTGGCCCTTCCA-3′ (SEQ ID NO: 6), and the amplified product was used as a standard for real-time quantitative PCR. The standard PCR product was quantified using the Qubit fluorometer (Thermo Fisher Scientific), and the copy number was determined based on the molecular weight. The SG1 expression level (copies per ng RNA) was converted into the Log 2 value and then used for QTL analysis.
The average dry weight of the terrestrial part of Arroz da Terra and that of Ouu365 were 5.11 mg and 2.91mg, respectively, 14 days after germination. That is Arroz da Terra was significantly heavy (t-test; significance: 5%). The dry weight of the terrestrial part of the BIL104 line was between 2.52 mg and 5.47 mg (
In order to search for transcription products correlated with the early-stage growth level, 2 parent lines and 20 BILs having different early-stage growth levels were used (
The number of reads was 41.6 M on average per sample, and 40.0 M reads/sample accounting for 96.1% of the total reads were mapped on the Os-Nipponbare-Reference-IRGSP-1.0 genome. The gene expression level was determined in terms of FPKM (fragments per kilobase of coding sequence per million reads). The appearance probabilities for genes serving as biomarkers indicating the fresh weights of the terrestrial parts of the seedlings was determined in the manner described below by the method described in “1-2-3. Biomarker for expression level and calculation of appearance probability of gene” above. From among all the expressed genes, 10% of the genes was randomly selected to constitute a subset, 8 genes were selected from the subset via LASSO analysis as explanatory variables indicating the fresh weights of the terrestrial parts of the seedlings (biomarkers for the expression levels). Selection of the subset and quantification of the biomarkers for the expression levels were repeated 10,000 times, and the frequency for each gene to be selected as a biomarker for the expression level was determined. A gene selected at high frequency indicates that the expression level thereof is correlated with the fresh weight of the terrestrial part of the seedling. At high frequency, 158 genes were selected (with a probability of 1% or higher).
2-3. Biomarker Genes Selected at High Frequency Located Inside QTL Associated with the Weight of the Terrestrial Part of the Seedling
The genes selected at high frequency and the QTL associated with the weight of the terrestrial part of the seedling were examined, and the number of the genes located inside QTL of the chromosomes 3, 7, and 10 was 5, 6, and 4, respectively. The known gene associated with low-temperature germination (gLTG3-1) was included in the gene in the chromosome 3 (RAP ID: Os03g0103300, Fujino et al., 2008, Theoretical and Applied Genetics 108: 794-799, doi: 10.1007/s00122-003-1509-4). A significant positive correlation was observed between the qLTG3-1 expression level and the fresh weight of the terrestrial part of the lines used for RNA-seq 14). While a parent line (Arroz da Terra) is reported to have a functional gLTG3-1 gene (Fujino and Iwata, 2011, Theoretical and Applied Genetics 123: 1089-1097, doi: 10.1007/s00122-011-1650-4), another parent line (Ouu365) is found to lack a 71-bp site in the qLTG3-1 gene coding region and lack functions thereof (Fukuda et al., 2014, Plant Production Science 17: 41-46). The qLTG3-1 genotype of the BIL line used for RNA-seq was examined. As a result, the fresh weight of the terrestrial part and the qLTG3-1 expression level of the line having Arroz da Terra-type qLTG3-1 were found to be significantly higher than those of the line having the Ouu365-type qLTG3-1(t-test: significance: 1%).
2-4. Biomarker Genes Selected at High Frequency Located Outside QTL Associated with the Weight of the Terrestrial Part of the Seedling
Among the genes selected at high frequency that were not located inside the QTL associated with the weight of the terrestrial part of the seedling, the known gene associated with inhibition of tissue elongation, SGI (Short Grain 1, RAP ID: Os09g045920 0, Nakagawa et al., 2012, Plant Physiology 158: 1208-1219, doi: 10.1104/pp. 111.187567), was included. A significant negative correlation was observed between the SG1 gene expression level and the fresh weight of the terrestrial part of the lines used for RNA-seq (
All the BIL104 lines and parent lines were subjected to quantitative real-time PCR to measure the SG1 expression levels in order to examine whether or not BIL lines other than those subjected to RNA-seq analysis would show the correlation between the SGI expression level and the weight of the terrestrial part of the seedlings. As a result, a significant negative correlation was observed between the SGI expression level and the fresh weight of the terrestrial part of the BIL104 lines and the parent lines (
This example demonstrates that candidate biomarker genes serving as indicators for the early-stage growth can be selected via RNA-seq analysis using the BIL20 lines and the parent lines. In addition, the known gene associated with inhibition of tissue elongation (SG1) was found to be included therein. While SGI is found to have effects of inhibiting tissue elongation in SG1-overexpressing transformants by activation tagging (Nakagawa et al., 2012, Plant Physiology 158: 1208-1219, doi: 10.1104/pp. 111.187567), whether or not SG1 expression levels vary between lines in a natural state remains unknown. The transcriptome analysis performed in this example demonstrates that there is a negative correlation between the SG1 expression level and the fresh weight of the terrestrial part of the seedling of the BIL lines (
2-6. Usefulness of transcriptome analysis demonstrated in this example
While an extensive analysis of transcription products (i.e., transcriptome analysis) is a powerful tool capable of detecting transcription products that would affect various morphological and physiological traits, transcription products are influenced by many environmental and genetic factors in complicated ways. In order to statistically select biomarkers for expression levels indicating particular traits, accordingly, it is considered preferable that a large number of samples (e.g., hundreds or more) be used, so as to remove noises. However, it is often difficult to prepare a large number of samples as many as hundreds of samples and perform gene expression analysis such as RNA-Seq.
In the transcriptome analysis described in this example, a relatively small number of samples (22 lines; i.e., 20 BIL lines and 2 parent lines) were used, so as to detect biomarkers for expression levels indicating the seedling weight. As a result of the transcriptome analysis described in this example, 2 types of known genes; i.e., qLTG3-1 and SG1, with or without a genomic mutation were detected as candidate biomarkers. The results demonstrate that the transcriptome analysis described in this example may enable effective selection of biomarkers for expression levels through analysis of a relatively small number of samples.
In this example, the list of genes selected at high frequency prepared in Example 1 (
Among the 158 genes in the list of genes selected at high frequency prepared in Example 1 (
A 5-fold cross validation was repeated 20 times to determine the estimated fresh weights of the terrestrial parts of the seedlings.
Number | Date | Country | Kind |
---|---|---|---|
2018-003697 | Jan 2018 | JP | national |