Embodiments of the present disclosure generally relate to the field of bioinformatics, and more particularly to an effective and accurate bioinformatics analysis method for studying plant genome methylation.
DNA methylation is an important aspect of epigenetics research and plays a role in many biological phenomena and processes, for example dosage compensation, DNA site polymorphism and transposon silencing. Current methods of studying DNA methylation in combination with high-throughput sequencing technology comprise: bisulfite sequencing (BS-sequencing); methyl-binding protein (MBD) sequencing, by means of a methylated-cytosine binding protein; methylated DNA immunoprecipitation (MeDIP), by means of antibody capture of methylated sites; reduced representation bisulfite sequencing (RRBS), by means of site-specific enzyme digestion; etc. MBD sequencing is more sensitive to regions with hypermethylation and a medium CpG density, and MeDIP-sequencing is more sensitive to regions with hypermethylation and a high CpG density; however, neither is accurate enough. Although BS-sequencing can accurately analyze the methylation status of each C base and plot a DNA methylation map at single-base resolution, it requires a large volume of sequencing data and therefore has a high sequencing cost. Reduced representation bisulfite sequencing (RRBS) is based on bisulfite sequencing (BS) and comprises first selecting a partial region of the whole genome by enzyme digestion and then performing BS-sequencing. It has some cost advantage over BS-sequencing; however, it has difficulty enriching the large amounts of methylation in the mCHG and mCHH forms found in a plant sample.
Therefore, an effective and accurate method for studying plant genome methylation still needs to be developed.
In order to detect DNA methylation by massive sequencing without BS sequencing, the present disclosure provides a bioinformatics analysis method for detecting DNA methylation based on MspJI digestion, in which MspJI is a modification-dependent restriction enzyme. A method of enriching methylated sites by MspJI digestion does not need to subject the whole genome to a bisulfite treatment, and obtains only information on the methylated site and its nearby sequence. Such a method yields a lower data volume relative to whole genome bisulfite sequencing, and is a simple and convenient methylation sequencing method with moderate operating conditions. Accordingly, a corresponding bioinformatics analysis method is designed to determine the recognition site, the methylation site and the type thereof in an enzyme-digested fragment, and embodiments of the subsequent analysis method are also provided.
In one aspect, there is provided a method of detecting genome DNA methylation, comprising the following steps:
In another aspect, there is provided a method of analyzing genome methylation, comprising the following steps:
A further detailed description is given below in combination with the following Figures and embodiments, to make the purpose, technical solution and advantages clearer. It should be understood that the specific examples described herein are used for explaining, not limiting, the present disclosure.
In DNA sequence of the present disclosure,
In the present disclosure, reads refer to the sequencing fragments output by the sequencer, before they are joined together.
The restriction endonuclease MspJI, which is sensitive to methylation and is a more divergent homolog of E. coli Mrr, is used in the present disclosure. It is commercially available, for example from New England Biolabs (NEB).
As shown in
In step S1, although any sequencing technology commonly used in the art may be used, SE50 is preferred because the enzyme-digested fragments are relatively short sequences. Other existing high-throughput sequencing technologies may also be used in the present disclosure, for example Illumina GA sequencing technology.
In step S2, the raw sequencing output is preferably filtered to remove unqualified reads. For example, a read is unqualified in either of the following two cases: more than 50% of its bases have a sequencing quality below a certain threshold; or more than 10% of its bases are uncertain (such as N in an Illumina GA sequencing result). The low-quality threshold may be determined by those skilled in the art according to the specific sequencing technology and sequencing environment. After the unqualified reads have been removed, the qualified reads are preferably screened to retain intact reads without a sequencing adaptor and reads having a length of 28 bp to 34 bp after the sequencing adaptor is trimmed off.
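The two filtering criteria and the length screen of step S2 may be sketched as follows. This is a minimal illustration only: the function names, the representation of qualities as a list of integer scores, and the default threshold `min_q=20` are assumptions, since the disclosure leaves the low-quality threshold to those skilled in the art.

```python
def is_qualified(seq, quals, min_q=20, max_low_frac=0.5, max_n_frac=0.1):
    """Return True if a read passes both filters of step S2.

    seq   -- read sequence as a string
    quals -- per-base quality scores (integers), same length as seq
    min_q -- hypothetical low-quality threshold (not fixed by the disclosure)
    """
    if not seq or len(seq) != len(quals):
        return False
    low_quality = sum(1 for q in quals if q < min_q)
    uncertain = seq.upper().count("N")
    # Unqualified: more than 50% low-quality bases, or more than 10% N bases.
    return (low_quality / len(seq) <= max_low_frac
            and uncertain / len(seq) <= max_n_frac)


def in_length_window(trimmed_seq, shortest=28, longest=34):
    """Return True if an adaptor-trimmed read is 28 bp to 34 bp long."""
    return shortest <= len(trimmed_seq) <= longest
```

Adaptor trimming itself is not shown; `in_length_window` is meant to be applied to a read after the adaptor has already been removed by whatever trimming tool is in use.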
The filtered and/or screened reads are preferably aligned to a genome sequence of the species to which the DNA sample belongs, so as to locate each read, i.e., each enzyme-digested fragment, in the whole genome. Because the reads are generally relatively short, a read may fail to be located either because it has no alignment or because it aligns to multiple positions. Alignment software is therefore preferably used in two passes, for example Soap2.20: 1) by setting the software parameters, the reads are aligned to the reference sequence with 2 mismatches allowed in each seed sequence and a maximum of 4 mismatches in each read, to obtain a first aligned result; 2) by resetting the Soap2.20 parameters, the reads aligned to multiple positions and the unaligned reads in the first aligned result are aligned to the reference sequence with no mismatches allowed, to obtain a second aligned result; 3) the first aligned result and the second aligned result are merged, for calculating an aligning rate and a unique aligning rate. Other short-sequence mapping programs may also be used to realize the alignment.
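The merging of the two aligned results and the calculation of the two rates may be sketched as follows. The sketch assumes that each aligned result has already been parsed into a mapping from a read identifier to the number of positions at which that read aligned; this representation, the function names, and the rule that a second-pass result replaces the first-pass result only when it yields at least one hit are all assumptions, and no Soap2.20 options or output formats are modeled.

```python
def merge_passes(first, second):
    """Merge per-read hit counts from the two alignment passes.

    first, second -- dicts mapping read id -> number of aligned positions.
    Reads that were unique in the first pass keep their first-pass result.
    """
    merged = dict(first)
    for read_id, hits in second.items():
        if first.get(read_id, 0) != 1 and hits >= 1:
            merged[read_id] = hits
    return merged


def alignment_rates(merged, total_reads):
    """Return (aligning rate, unique aligning rate) as fractions of total_reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    aligned = sum(1 for hits in merged.values() if hits >= 1)
    unique = sum(1 for hits in merged.values() if hits == 1)
    return aligned / total_reads, unique / total_reads
```

For instance, a read with three hits in the first pass and exactly one hit in the mismatch-free second pass is counted as uniquely aligned after merging.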
In step S3, the position of a methylated cytosine on a uniquely aligned read may be determined in accordance with the relationship between the type and the length of the enzyme recognition site, and categorized according to the features of the read on which the methylated cytosine is located. Firstly, whether a methylated cytosine exists in a uniquely aligned read is determined according to the MspJI digestion features: if a corresponding MspJI recognition site is found within a certain distance of a digested end, then the cytosine in that recognition site is a methylated cytosine. Allowing for a fluctuation of 1 to 2 bases at the digestion site, the enzyme-digested fragments having a length of 28 bp to 34 bp are classified into 8 types of fragments containing a fully methylated recognition site (the corresponding C and G sites on the complementary strand are all methylated sites): YNCGNR, YCNGR, CNNG, GNNC, CYNRG, CNYRNG, YNNGCNNR and YNNGNCNNR, as well as 2 types of fragments containing a semi-methylated recognition site: CNNR and YNNG, for a total of 10 types, each of which corresponds to one fragment length. It should be noted that, when the calculation combines the digestion site with the type of the read on which the methylated cytosine is located, the two types CHG and CHH cannot be accurately categorized, and the types overlap (for example, a TCCGGA fragment may belong to either of the two types YNCGNR and YCNGR). Even so, such classification still provides great convenience for searching for and locating a methylated cytosine site based on the relationship between fragment length and type of recognition site.
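The overlap between recognition-site types noted above can be illustrated by expanding the degenerate site patterns with the standard IUPAC codes (R = A or G, Y = C or T, N = any base) and testing a sequence against each of them. This is a simplified sketch: the names are hypothetical, and it searches for a site anywhere in the sequence rather than at a fixed distance from the digested end, so it is looser than the positional check of step S3.

```python
import re

# Standard IUPAC codes used by the site patterns in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

FULL_SITES = ["YNCGNR", "YCNGR", "CNNG", "GNNC",
              "CYNRG", "CNYRNG", "YNNGCNNR", "YNNGNCNNR"]
SEMI_SITES = ["CNNR", "YNNG"]


def site_regex(site):
    """Compile a degenerate site pattern such as YNCGNR into a regex."""
    return re.compile("".join(IUPAC[base] for base in site))


def matching_sites(seq, sites=FULL_SITES):
    """Return the site patterns that occur somewhere in seq."""
    seq = seq.upper()
    return [site for site in sites if site_regex(site).search(seq)]
```

Calling `matching_sites("TCCGGA")` returns a list that includes both YNCGNR and YCNGR, which is the ambiguity described in the text.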
In step S4, the position of a methylated cytosine in the genome is located according to the type of recognition site in each read, in combination with the aligned position in the Arabidopsis reference genome (TAIR8), and the basic type of the methylated cytosine is then finally determined (i.e., CG, CHG or CHH). The distributions of every recognition site and cytosine type are calculated, and the features of each sequence type are described using SeqLogo.
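Determining the basic type of a located cytosine amounts to reading the two bases downstream of it on its own strand, where H denotes A, C or T. A minimal sketch is given below; the function name, the 0-based coordinate convention and the choice to return `None` when the context runs off the sequence or contains N are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def cytosine_context(ref, pos, strand="+"):
    """Classify the cytosine at 0-based position pos of ref as CG, CHG or CHH.

    For strand "-", pos is the position of the G on the forward strand that
    pairs with the cytosine on the reverse strand.
    """
    ref = ref.upper()
    if strand == "+":
        tri = ref[pos:pos + 3]
    else:
        start = max(pos - 2, 0)
        # Reverse-complement the forward-strand window ending at pos.
        tri = ref[start:pos + 1].translate(COMPLEMENT)[::-1]
    if not tri or tri[0] != "C":
        raise ValueError("no cytosine at the given position and strand")
    if len(tri) >= 2 and tri[1] == "G":
        return "CG"
    if len(tri) < 3 or "N" in tri[1:]:
        return None  # context cannot be determined
    return "CHG" if tri[2] == "G" else "CHH"
```

The reverse-strand branch matters because a methylated cytosine reported on the reverse strand has its downstream bases to the left of it in forward-strand coordinates.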
In step S5, after the methylated cytosines are determined and classified, the sequencing depth of each determined methylated cytosine site is calculated, to yield a file similar to the methylated single nucleotide annotation in BS sequencing. The file describes in detail information such as the chromosome on which each methylated cytosine site is located, the sequence coordinate, the forward or reverse strand, the coverage depth, the enzyme recognition site and the cytosine type. This information is then used to determine the total amount and the coverage status of the determined methylated cytosine sites, so as to provide the status of whole genome MspJI-digested methylation. An exemplary file layout similar to methylated single nucleotide annotation in BS sequencing is specifically shown below:
In the present disclosure, other related analyses may also be performed. For example, in combination with the characteristics of the plant genome used, the distribution of methylated cytosines in the genome may be analyzed, such as the distribution in each gene element, the distribution in repetitive sequence regions and the distribution in some local regions.
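As an illustration only of how the per-site depth could be tallied and written out, the following sketch counts one call per read and emits one tab-separated line per methylated cytosine site. The column names and their order are assumptions chosen to mirror the fields listed in the text, not the layout of the disclosure, and the function names are hypothetical.

```python
from collections import Counter


def site_depths(calls):
    """Count supporting reads per site.

    calls -- iterable of (chrom, pos, strand, site, ctype) tuples, one per
             read in which a methylated cytosine was determined.
    """
    return Counter(calls)


def format_sites(depths):
    """Render the per-site depths as tab-separated text, sorted by position."""
    lines = ["chrom\tpos\tstrand\tdepth\tsite\tctype"]  # hypothetical header
    for (chrom, pos, strand, site, ctype), depth in sorted(depths.items()):
        lines.append(f"{chrom}\t{pos}\t{strand}\t{depth}\t{site}\t{ctype}")
    return "\n".join(lines)
```

The total number of determined sites is then simply `len(depths)`, and the coverage status can be summarized from the depth values.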
Sample: one whole genome sample of Columbia Arabidopsis leaves;
Sequencing strategy: single-end (SE) Illumina sequencing datasets;
Specific operational procedure was illustrated below combining with
Step S1 comprised several steps: DNA extraction, enzyme digestion, selection and recovery of enzyme-digested fragments, SE library construction, and sequencing on the sequencer. Genome DNA was extracted from the Arabidopsis leaves using the cetyltrimethylammonium bromide (CTAB) method followed by phenol:chloroform extraction and ethanol precipitation. The genome DNA sample, after being checked by 1% agarose gel electrophoresis to obtain those qualified (
In step S2, the raw sequencing output was filtered to remove unqualified reads, which comprised the following two cases: more than 50% of the bases of a read having a sequencing quality below a certain threshold; and more than 10% of the bases of a read being uncertain (such as N in an Illumina GA sequencing result). After the unqualified reads had been removed, the qualified reads were screened to retain intact reads without a sequencing adaptor and reads having a length of 28 bp to 34 bp after the sequencing adaptor was trimmed off.
The filtered and/or screened reads were aligned to a genome sequence of the species to which the DNA sample belonged, so as to locate each read, i.e., each enzyme-digested fragment, in the whole genome. Because the reads were generally relatively short, a read could fail to be located either because it had no alignment or because it aligned to multiple positions. The alignment software Soap2.20 (obtained from soap.genomics.org.cn/) was therefore used in two passes: 1) by setting the software parameters, the reads were aligned to the reference sequence with 2 mismatches allowed in each seed sequence and a maximum of 4 mismatches in each read, to obtain a first aligned result; 2) by resetting the Soap2.20 parameters, the reads aligned to multiple positions and the unaligned reads in the first aligned result were aligned to the reference sequence with no mismatches allowed, to obtain a second aligned result; 3) the first aligned result and the second aligned result were merged, for calculating an aligning rate and a unique aligning rate, referring to Table 1. Table 1 showed the raw data volume output by the sequencer, the data volume obtained after filtration and screening, and the total number of sequences uniquely aligned to the Arabidopsis genome for the Arabidopsis sample. Because the enzyme-digested sequences were relatively short, and because of the actual distribution of the methylated sites, the unique aligning rate was relatively low.
Arabidopsis
In step S3, the position of a methylated cytosine on a uniquely aligned read was determined in accordance with the relationship between the type and the length of the enzyme recognition site, and categorized according to the features of the read on which the methylated cytosine was located. Firstly, whether a methylated cytosine existed in a uniquely aligned read was determined according to the MspJI digestion features (
In step S4, the position of a methylated cytosine in the genome was located according to the type of recognition site in each read, in combination with the aligned position in the Arabidopsis reference genome (TAIR8), and the basic type of the methylated cytosine was then finally determined (i.e., CG, CHG or CHH). The distributions of every recognition site and cytosine type were calculated, and the features of each sequence type were described using SeqLogo, referring to
In step S5, after the methylated cytosines were determined and classified, the sequencing depth of each determined methylated cytosine site was calculated, to yield a file similar to the methylated single nucleotide annotation in BS sequencing. The file described in detail information such as the chromosome on which each methylated cytosine site was located, the sequence coordinate, the forward or reverse strand, the coverage depth, the enzyme recognition site and the cytosine type. This information was then used to determine the total amount and the coverage status of the determined methylated cytosine sites, so as to provide the status of whole genome MspJI-digested methylation, referring to
The upper left panel in
An exemplary file layout similar to methylated single nucleotide annotation in BS sequencing is specifically shown below:
The following experimental steps were performed using the same genome DNA sample as described above, to obtain BS sequencing data.
The above descriptions are merely general examples of the present disclosure and are not to be construed as limiting the present disclosure. Any amendments, equivalent replacements, improvements, etc. can be made in the embodiments without departing from the spirit, principles and scope of the present disclosure.
This Application is a Section 371 National Stage Application of International Application No. PCT/CN2011/002242, filed Dec. 31, 2011 and published as WO/2013/097060 A1 on Jul. 4, 2013, in English, the contents of which are hereby incorporated by reference in their entirety.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2011/002242 | 12/31/2011 | WO | 00 | 6/27/2014 |