This application claims priority to Chinese application number 201811582470.8, filed Dec. 24, 2018 with a title of METHOD FOR HIGH-THROUGHPUT DETECTION OF DIFFERENTIAL EXPRESSION OF PLANT CIRCRNA ALLELIC LOCI, which is incorporated herein by reference in its entirety.
The present invention relates to the technical field of gene expression detection, and in particular, to methods for high-throughput detection of differential expression of plant circRNA allelic loci.
An allele (also called allelomorph) generally refers to a pair of genes that regulate phenotypic traits at the same position on homologous chromosomes
Allelic Expression Imbalance (AEI) is in the same cell, and each gene generally has two copies. The cis-acting or trans-acting shifts the expression ratio of the two copies of the gene by 1:1. The phenomenon of AEI is ubiquitous. In addition to the absolute imbalance expression of genetic imprinted gene, a considerable number of genes have AEI in different individuals and in the same individual at different developmental phase and tissues. It is also associated with polymorphic loci in specific regions of the genome.
At present, the commonly used AEI detection mainly focuses on the genes coding for proteins. However, there is no high-throughput accurate analysis method for the allelic expression of circRNA widely existed in transcriptome data.
In view of the above, an objective of the present invention is to provide a method for high-throughput detection of differential expression of plant circRNA allelic loci.
To achieve the above purpose, the present invention provides the following technical solution.
A method for high-throughput detection of differential expression of plant circRNA allelic loci, which includes the following steps:
1) extracting total RNA of a plant sample, and constructing a strand-specific library with the total RNA;
2) performing paired-end sequencing on the strand-specific library in step 1) with Illumina HiSeq to obtain raw sequencing data;
3) screening circRNAs data from the raw sequencing data obtained in step 2);
4) extracting reverse splicing reads at a circRNAs looping position from the circRNAs data obtained in step 3);
5) performing single nucleotide variation detection on the reverse splicing reads to obtain SNP loci in the reverse splicing reads; and
6) collecting statistics about the number of reads of different genotypes, which are aligned to the SNP loci in step 5, in the reverse splicing reads in step 4, to align the proportion of the number of reads of different genotypes to the proportion of expression quantities of different genotypes.
Preferably, the screening circRNAs data in step 3) includes the following steps: 3.1) performing transcript splicing on the raw sequencing data according to a reference genome;
3.2) extracting 18-22 nt (nucleotides) from both ends of each read in the raw sequencing data that is not aligned to the reads on the reference genome, and forming a pair of anchor sequences, the anchor sequence including a 5′-terminal sequence and a 3′-terminal sequence;
3.3) re-aligning the anchor sequence to a reference genome, where the 5′-terminal sequence of the anchor sequence is aligned to the 3′-terminal of the reference sequence, and the 3′-terminal sequence of the anchor sequence is aligned to the upstream of a matching locus of the 5′-terminal sequence of the anchor sequence in the reference sequence, and a splicing locus GT-AG exists between the matching locus of the 5′-terminal sequence of the anchor sequence and a matching locus of the 3′-terminal sequence of the anchor sequence in the reference sequence, and this splicing locus read is used as circRNA data.
Preferably, the screening circRNAs data is implemented by means of the find_circ software and the CIRIexplorer software.
Preferably, the circRNAs are screened by using the find_circ software and the CIRIexplorer software, respectively, to obtain candidate data of circRNAs screened by the find_circ software and candidate data of circRNAs screened by the CIRIexplorer software, and an intersection of the candidate data of circRNAs screened by the find_circ software and the candidate data of circRNAs screened by the CIRIexplorer software is used as the circRNAs data.
Preferably, the extracting reverse splicing reads at a circRNAs looping position in the circRNAs obtained in step 4) is implemented using a samtools view—R instruction in the find_circ software.
Preferably, the single nucleotide variation detection in step 5) is performed using a SNP calling in the GATK software.
Preferably, after extracting total RNA of a plant sample and before constructing a strand-specific library in step 1), the method further includes a step of removing rRNA and a step of linear RNA digestion that are performed sequentially.
Preferably, a reaction system of the linear RNA digestion is 50 μL, including the following components: 5 μg of RNA, 5 μL of 10× Reaction Buffer, 20 U of RNase R, and the balance of RNase-Free water.
Preferably, the temperature for the linear RNA digestion is 36-38° C., and the time for the linear RNA digestion is 1-2 h.
Preferably, the plant is a forest tree.
The advantageous effects of the present invention: the method of the present invention can perform high-throughput and accurate analysis of differential expression of allelic loci for circRNAs widely existing in transcriptome data and provide a new research strategy for systematic analysis of the transcriptome data.
The present invention provides a method for high-throughput detection of differential expression of plant circRNA allelic loci, including the following steps:
1) total RNA of a plant sample is extracted, and a strand-specific library is constructed with the total RNA;
2) paired-end sequencing is performed on the strand-specific library in step 1) with Illumina HiSeq to obtain raw sequencing data;
3) circRNAs data is screened from the raw sequencing data obtained in step 2);
4) reverse splicing reads at a circRNAs looping position are extracted from the circRNAs data obtained in step 3);
5) single nucleotide variation detection is performed on the reverse splicing reads to obtain SNP loci in the reverse splicing reads; and
6) the number of reads of different genotypes, which are aligned to the SNP loci in step 5, in the reverse splicing reads in step 4 are counted to align the proportion of the number of reads of different genotypes to the proportion of expression quantities of different genotypes.
The present invention extracts total RNA of the plant sample and constructs a strand-specific library with the total RNA. The present invention has no special requirement for the types of plant samples, and conventional plants may be used, preferably forest trees. In the specific implementation process of the present invention, Populus tomentosa is selected. The plant sample in the present invention is preferably a leaf tissue. The present invention has no particular limitation on the method for extracting total RNA of the plant sample, and the conventional total RNA extraction method in the art may be used. In the specific implementation process of the present invention, the total RNA is extracted by using an RNA extraction kit (e.g., MagJ ET Plant RNA Purification Kit, No. K2772).
After extracting the total RNA and before constructing the strand-specific library, the present invention preferably further includes a step of removing rRNA and a step of linear RNA digestion that are performed sequentially. The step of removing rRNA is preferably performed by using Ribo-Zero™ rRNA Removal Kits (Plant) (No. MRZPL116). In the present invention, the method for removing rRNA preferably includes: mixing 30-50 μl of total RNA with 50-70 μl of magnetic beads, vortexing for 8-12 s, standing at room temperature for 4-6 min, incubating at 49-51° C. for 4-6 min, placing on a magnetic frame until a supernatant is clarified, and collecting the supernatant; and more preferably, mixing 40 μl of total RNA with 60 μl of magnetic beads, vortexing for 10 s, standing at room temperature for 5 min, incubating at 50° C. for 5 min, placing on a magnetic frame until a supernatant is clarified for 2 min, and collecting the supernatant.
The present invention obtains a Poly (A)-RNA sample, i.e., a linear RNA, after the step of removing the rRNA, preferably digests the obtained linear RNA. The digestion is preferably performed by using RNase Rd. A reaction system of the linear RNA digestion is preferably 50 μL, including the following components: 5 μg of RNA, 5 μL of 10× Reaction Buffer, 20 U of RNase R, and the balance of RNase-Free water. The temperature for the linear RNA digestion is preferably 36-38° C., and more preferably 37° C. The time for the linear RNA digestion is preferably 1-2 h, and more preferably 1.5 h. In the present invention, after the digestion, a strand-specific library is constructed using the digested RNA. In the present invention, a SMART Kit (SMART cDNA Library Construction Kit, NO. 634901) is preferably used to construct the strand-specific library.
After obtaining the strand-specific library, the present invention performs paired-end sequencing on the strand-specific library with Illumina HiSeq to obtain raw sequencing data. The read length of the sequencing described in the present invention is preferably 150 nt. The amount of data for sequencing is preferably greater than 12 G. The sequencing in the present invention may be entrusted to Novogene Company.
After obtaining the raw sequencing data, the present invention screens circRNAs data from the obtained raw sequencing data. In the particular implementation process of the present invention, a linker and the redundant sequence in the raw sequencing data are first removed. In the present invention, the screening circRNAs data includes the following steps:
3.1) transcript splicing is performed on the raw sequencing data;
3.2) 18-22 nt are extracted from both ends of each read in the raw sequencing data that is not aligned to the reads on the reference genome, and a pair of anchor sequences is formed, the anchor sequence including a 5′-terminal sequence and a 3′-terminal sequence; and
3.3) the anchor sequence is re-aligned to a reference genome, where the 5′-terminal sequence of the anchor sequence is aligned to the 3′-terminal of the reference sequence, and the 3′-terminal sequence of the anchor sequence is aligned to the upstream of a matching locus of the 5′-terminal sequence of the anchor sequence in the reference sequence, and a splicing locus GT-AG exists between the matching locus of the 5′-terminal sequence of the anchor sequence and a matching locus of the 3′-terminal sequence of the anchor sequence in the reference sequence, and this splicing locus read is used as circRNA data.
In the present invention, the transcript splicing is performed preferably using default parameters of the Cufflinks software. Steps 3.2) and 3.3) are preferably implemented by the find_circ software and the CIRIexplorer software. More preferably, the circRNAs are screened by using the find_circ software and the CIRIexplorer software, respectively, to obtain candidate data of circRNAs screened by the find_circ software and candidate data of circRNAs screened by the CIRIexplorer software, and an intersection of the candidate data of circRNAs screened by the find_circ software and the candidate data of circRNAs screened by the CIRIexplorer software is used as the circRNAs data.
In the specific implementation process of the present invention, the screening parameters of the find_circ software and the CIRIexplorer software for screening circRNAs include -q 5, -a 20, -m 2, -d 2, and --noncanonical. The screening criteria of the foregoing parameters are preferably: (1) -q 5: the minimum support number of anchor sequence alignment is 5; (2) -a 20: the anchor sequence is 20 bp; (3) -m 2: a branch point cannot occur elsewhere in the range of two nucleic acids in the anchor sequence; (4) -d 2: sequence alignment only supports two mismatches; and (5) --noncanonical: GU/AG appears on both sides of a splicing locus and a clear breakpoint can be detected.
Since the find_circ software and the CIRIexplorer software generate false positive data in the process of screening circRNAs, the intersection of the candidate data of circRNAs screened by the find_circ software and the candidate data of circRNAs screened by the CIRIexplorer software can greatly reduce the false positives, thereby ensuring the authenticity and accuracy of the screened circRNAs data.
After obtaining the circRNAs data, the present invention extracts reverse splicing reads at a circRNAs looping position in the circRNAs data. In the present invention, the extracting reverse splicing reads at the circRNAs looping position from the obtained circRNAs data is preferably implemented using a samtools view—R instruction in the find_circ software.
After obtaining the reverse splicing reads, the present invention performs single nucleotide variation detection on the reverse splicing reads to obtain SNP loci in the reverse splicing reads. The single nucleotide variation detection is preferably performed using a SNP calling in the GATK software.
After obtaining the SNP loci in the reverse splicing reads, the present invention collects statistics about the number of reads of different genotypes in the reverse splicing reads that are aligned to the SNP loci, to align the proportion of the number of reads of different genotypes to the proportion of the expression quantities of different genotypes.
The method of the present invention can realize high-throughput and high-accuracy analysis of the differential expression of the circRNAs allelic loci through the foregoing steps: provide technical support for the analysis of the allelic expression patterns of subsequent circRNAs, and lay a foundation for comprehensively deciphering the regulation effect of gene plant genome allelic expression and the genetic effect of genomic imprinting. Thus, the present invention has a great application value in the analysis of the genetic effects of plant complex traits and molecular design breeding.
The technical solution provided by the present invention will be described below in detail with reference to examples. However, the examples should not be construed as limiting the protection scope of the present invention.
Fresh leaves of P. tomentosa are used. The total RNA is extracted using an RNA extraction kit (MagJ ET Plant RNA Purification Kit, No. K2772). The rRNA is removed using the Ribo-Zero™ rRNA Removal Kits (Plant) (No. MRZPL116) to obtain a Poly (A)-RNA sample. The linear RNA is digested with RNase Rd (the reaction system includes 5 μg of RNA, 5 μL of 10× Reaction Buffer, 20 U of RNase R, and RNase-Free water supplemented to 50 μL) to obtain the Poly (A)-/Ribo-RNA sample. The strand-specific cDNA library is constructed using a SMART kit (SMART cDNA Library Construction Kit, NO. 634901).
Paired-end sequencing is performed using Illumina HiSeq™ 2500 with a sequencing data volume of 12 G. A linker and a redundant sequence are removed; and the transcript is spliced through the default parameters of the Cufflinks software. 20-nt is extracted, with find_circ, at either end of a sequence that is not aligned to a reference sequence (the reference sequence is the populus trichocarpa V3.0 genomic sequence https://phytozome.jgi.doe.gov/pz/portal.html) as a pair of anchor sequences. Each pair of anchor sequences is re-aligned to the reference sequence. If the 5′ terminal of the anchor sequence is aligned to the reference sequence (the starting and terminating loci are denoted as A3 and A4, respectively), and the 3′ terminal of the anchor sequence is aligned to the upstream of a matching locus of the 5′ terminal of the anchor sequence (the starting and ending loci are denoted as A1 and A2, respectively), and a splicing locus (GT-AG) exists between A2 and A3 of the reference sequence, then this read is used as a candidate circRNA. The screening parameters are -q 5, -a 20, -m 2, -d 2, and --noncanonical. The screening criteria are (1) 1-q 5: yjr minimum support number of the anchor sequence alignment is 5; (2) -a 20: the anchor sequence is 20 bp; (3) -m 2: a branch point cannot occur elsewhere in the range of two nucleic acids in the anchor sequence; (4) -d 2: sequence alignment only supports two mismatches; and (5) --noncanonical: GU/AG appears on both sides of a splicing locus and a clear breakpoint can be detected. Moreover, the circRNAs are screened using the default parameters of the CIRIexplorer software. 887 circRNAs are obtained through the find_circ software analysis, and 920 circRNAs are obtained by the CIRIexplorer software. According to the reverse splicing reads of the circRNAs, the intersection of the two prediction results is obtained, and 97 circRNAs are obtained (Table 1).
Based on the results of the find_circ analysis, the reverse splicing reads at the circRNAs looping position are extracted using the samtools view—R instruction for subsequent nucleic acid variation analysis.
SNP calling is performed on the extracted reads sequence using the GATK (version: 4.0.1.0) software in the following steps: first, variation detection is performed on two samples by using a HaplotypeCaller tool in the software; --pair-hmm-gap-continuation-penalty parameter is set as 10, and the remaining parameters are set as default values to obtain variation information of each sample; the variation files of each sample are combined by a CombineGVCFs tool; and finally, allelic variation detection is performed on each sample by using a GenotypeGVCFs tool, and a vcf file is generated, where the vcf file contains the mutation loci and genotype information of all samples (Table 2).
The SNPs in the reverse splicing reads are used as markers, and the number of reverse splicing reads on the SNPs is collected and aligned as the expression quantity of the candidate circRNA allelic loci (Table 2).
The results show that only 44.7% of the circRNA allelic loci in the leaves of P. tomentosa are balanced and the remaining loci are unbalanced.
High-temperature treated leaves of P. simonii are used for total RNA extraction using the RNA extraction kit (MagJ ET Plant RNA Purification Kit, No. K2772); and rRNA is removed using the Ribo-Zero™ rRNA Removal Kits (Plant) (No. MRZPL116). Then, the Poly (A) RNAs are combined by using a magnetic bead method, to obtain a Poly (A)-RNA sample; and the linear RNA is digested with RNase Rd (the reaction system includes 5 μg of RNA, 5 μL of 10× Reaction Buffer, 20 U of RNase R, and RNase-Free water supplemented to 50 μL), to obtain a Poly (A)-/Ribo-RNA sample. The strand-specific cDNA library is constructed using a SMART Kit (a SMART cDNA Library Construction Kit, NO. 634901).
Paired-end sequencing is performed using Illumina HiSeq™2500 with a sequencing data volume of 12 G. The linker and the redundant sequence are removed, and the transcript is spliced by the Cufflinks software. 20-nt of anchor sequence is extracted, with find_circ, from either end of the reads that are not aligned to the reference sequence. Each pair of anchor sequences is re-aligned to the reference sequence. If the 5′ terminal of the anchor sequence is aligned to the reference sequence (the starting and terminating locus are denoted as A3 and A4, respectively), and the 3′ terminal of the anchor sequence is aligned to the upstream of this locus (the starting and terminating loci are denoted as A1 and A2, respectively), and a splicing locus (GT-AG) exists between A2 and A3 in the reference sequence, then this read is used as a candidate circRNA. The screening parameters are -h, -v, -s, -G, -n, -p, -q, -a, -m, -d, --noncanonical, --randomize, --allhits, --stranded, --strandpref, and --halfunique. The screening parameters include -q 5, -a 20, -m 2, -d 2, and --noncanonical. The screening criteria of the foregoing parameters are preferably: (1) -q 5: the minimum support number of anchor sequence alignment is 5; (2) -a 20: the anchor sequence is 20 bp; (3) -m 2: a branch point cannot occur elsewhere in the range of two nucleic acids in the anchor sequence; (4) -d 2: sequence alignment only supports two mismatches; and (5) --noncanonical: GU/AG appears on both sides of the splicing locus, and a clear breakpoint can be detected. The circRNAs are screened using the default parameters of the CIRIexplorer software. 804 circRNAs are obtained by the find_circ software analysis, and 670 circRNAs are obtained by the CIRIexplorer software. The intersection of the two prediction results is obtained according to the reverse splicing reads of the circRNAs to obtain 121 circRNAs (Table 3).
The reverse splicing reads data tag files at the circRNAs looping position are sorted according to the result of the find_circ analysis, and the reverse splicing reads at the circRNAs looping position are extracted using the samtools view—R instruction for subsequent nucleic acid variation analysis.
SNP calling is performed on the extracted reads sequence using the GATK (version: 4.0.1.0) software in the following steps: first, variation detection is performed on two samples by using a HaplotypeCaller tool, --pair-hmm-gap-continuation-penalty parameter is set as 10, and the remaining parameters are default values to obtain variation information of each sample; the variation files of each sample are combined by a CombineGVCFs tool; and finally, allelic variation detection is performed on each sample by using a GenotypeGVCFs tool, and a vcf file is generated, where the vcf file contains the mutation loci and genotype information of all samples (Table 4).
The SNPs in the reverse splicing reads are used as markers, the number of reverse splicing reads on the SNPs is collected and aligned as the expression quantity of the candidate circRNA allelic loci (Table 4).
The results show that only 25.8% of the circRNA allelic loci in the leaf tissues of P. simonii treated by high temperature stress are of balance expression, and the remaining loci are of imbalance expression.
It can be seen from the above examples that the method provided by the present invention uses strand-specific library RNA sequencing, combines with circRNA analysis software and nucleic acid variation analysis software, to accurately analyze the expression pattern of plant circRNA allelic loci in a high-throughput manner. It can be seen from the results in the examples that the method can achieve high-throughput analysis of allelic expression of plant circRNAs, and provides a new research approach for systematically analyzing the allelic expression pattern of the circRNAs by fully utilizing the results of transcriptome sequencing.
The foregoing descriptions are only preferred implementation manners of the present invention. It should be noted that for a person of ordinary skill in the art, several improvements and modifications may further be made without departing from the principle of the present invention. These improvements and modifications should also be deemed as falling within the protection scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
201811582470.8 | Dec 2018 | CN | national |