The present disclosure relates to fields of genetic engineering technology, genetics and bioinformatics, and more particularly to a method of optimizing an assembly result of sequencing data using a genetic map. Thus, the present disclosure provides a novel method of assembling reads of an individual, comprising a step of constructing a genetic map using genetic markers. In addition, the present disclosure also provides a method of assembling genomic sequencing data into genomic sequence, such as a chromosomal sequence.
The Next-Generation DNA sequencing technology is a high-throughput, low-cost sequencing technique whose principle is sequencing by synthesis. As an example, the Solexa sequencing method comprises: firstly, randomly breaking DNA double strands using a physical method; secondly, ligating a specific adaptor to both ends of the obtained DNA fragments, in which the specific adaptor comprises an amplifying primer sequence; thirdly, sequencing the DNA fragments ligated with the specific adaptor. During sequencing, a DNA polymerase synthesizes a strand complementary to the fragment to be tested, starting from the adaptor, and the fluorescence signal carried by each newly incorporated base is detected to obtain the base sequence, so that the sequence of the fragment to be tested may be obtained. These obtained sequences are known as reads. For the basic process of the Solexa sequencing method, reference may be made to, for example, http:/www.illumina.com.
To restore the complete sequence of a genome (for example, to assemble reads into a genomic sequence, such as a chromosomal sequence), the Next-Generation sequencing method generally connects reads stepwise. Firstly, using the overlapping relationship between reads, the reads are extended as far as possible (i.e. connected together) to form contigs. Secondly, using the distance relationship between the two ends of a read pair in double-ended sequencing, different contigs containing the two ends of a read pair are connected together by adding a certain number of N in the middle, and the fragments so formed are known as scaffolds. In each scaffold, the sequential relationship of the contigs on either side of an N-region is already known, and their distance in the DNA sequence is also known. Lastly, the N-regions are restored into ATCG by means of “filling holes”. One method of “filling holes” comprises: finding double-ended sequencing reads with one end falling into a known sequence of the scaffold and the other end falling into the N-region of the scaffold; collecting all reads falling into the N-region; and performing a local assembly using the overlapping relationship to obtain the sequence information of the N-region. For the general process of sequence connecting, reference may be made to, for example, Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20, 265-72 (2010).
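The scaffolding step described above can be illustrated with a minimal sketch. The function names `estimate_gap` and `build_scaffold`, and the way the gap size is derived from the insert size of a read pair, are assumptions made for illustration only; they are not taken from any particular assembly software.

```python
def estimate_gap(insert_size, left_span, right_span):
    """Estimate the gap between two contigs bridged by one read pair.

    insert_size: expected distance between the outer ends of the read pair.
    left_span:   bases from the start of the first read to the end of its contig.
    right_span:  bases from the start of the second contig to the end of the second read.
    All three parameters are hypothetical inputs for this sketch.
    """
    return max(0, insert_size - left_span - right_span)


def build_scaffold(contigs, gaps):
    """Join ordered contigs into one scaffold, padding each gap with N."""
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gaps):
        parts.append("N" * gap)  # unknown bases between the two contigs
        parts.append(contig)
    return "".join(parts)
```

For instance, `build_scaffold(["ACGT", "TTGA"], [3])` places three N between the two contigs; the N-region would later be resolved by “filling holes”.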
Although known software may be used to connect the sequencing data (i.e. reads) obtained using the Next-Generation sequencing method, the reads obtained using the Next-Generation sequencing method are generally short (generally only 100 nt). Thus, there is a certain limitation when connecting the data: it is difficult to connect the reads into a genomic sequence, such as a chromosomal sequence, relying only on assembly software.
Thus, there is an urgent need to improve the method of assembling the sequencing data (i.e. reads), so as to further optimize the assembly result of the sequencing data, for example, so that the reads are connected into a genomic sequence, such as a chromosomal sequence.
In the present disclosure, unless otherwise stated, technical and scientific terms used herein have the meanings commonly understood by those skilled in the art. In addition, the laboratory procedures of genetics, molecular biology and nucleic acid chemistry used herein are all conventional procedures widely used in the related field. To better understand the present disclosure, definitions and explanations of relevant terms are provided below.
As used herein, the term “genetic map”, also known as a linkage map or a chromosomal map, shows a relative distance (i.e. a genetic distance) between genes or genetic markers, but does not show a physical distance of the genes or the genetic markers on a chromosome. In the genetic map, the positional relationship between the genes and the genetic markers is described using the genetic distance, and the genetic distance is calculated using the recombination rate. In general, the longer the distance between two genes or genetic markers on the same chromosome, the greater the probability that a recombination occurs between them in the course of meiosis, and the smaller the probability that they are inherited together. Based on the segregation of characters in the offspring, the recombination rate may be calculated, so as to calculate the genetic distance on the genetic map. In the case of the recombination rate of two genes or genetic markers being 1%, the genetic distance between them may be defined as 1 cM (centimorgan).
At present, commonly-used genetic markers mainly comprise restriction fragment length polymorphism (RFLP), simple sequence repeats (SSR), sequence-tagged site (STS) and single nucleotide polymorphism (SNP). These genetic markers are all well-known to those skilled in the art; reference may be made to, for example, Agarwal, M., Shrivastava, N. & Padh, H. Advances in molecular marker techniques and their applications in plant sciences. Plant cell reports 27, 617-631 (2008).
As used herein, the term “SNP” refers to a polymorphism of a DNA sequence caused by a variation of a single nucleotide. SNP is the most common type of heritable biological variation, accounting for at least 90% of all known polymorphisms. SNP sites exist widely in the genome of each species. Specifically, in the human genome there is on average one SNP site every 500 to 1,000 base pairs, and the total number thereof is estimated to be 3 million or more.
As used herein, the term “reads” refers to sequencing data obtained by performing sequencing using various sequencing methods. For example, a Next-Generation sequencing method, such as the Solexa sequencing method, is an optimized method for providing reads.
As used herein, term “scaffold” refers to fragments obtained by connecting the reads by means of an overlapping relationship and a physical distance relationship between the reads.
As used herein, expression “assembling reads into a chromosomal sequence” refers to clustering together the reads derived from an individual, and arranging them in accordance with their order or relative position on a chromosome (optionally, the reads are firstly connected into the scaffolds, and then the steps of clustering and arranging are performed), to obtain a status of the relative position of each fragment on the chromosome, and then to obtain the chromosomal sequence of the individual or a part of the chromosomal sequence thereof. Accordingly, the expression involves a process of clustering and arranging. In the case of the reads completely covering an entire chromosome, an intact chromosomal sequence may be obtained. However, if the reads cannot cover the entire chromosome, then the status of the relative position of these fragments on the chromosome may be obtained, as well as the part of the chromosomal sequence (i.e. there is a part of the chromosomal sequence still unknown, which needs to be determined by further sequencing).
As used herein, expression “assemble reads (or scaffolds)” refers to arranging each read (or scaffold) in accordance with a relationship of the relative position.
As used herein, the expression “arrange” refers not only to arranging each read (or scaffold) in accordance with a relationship of the relative position, but also to determining a connecting direction of each fragment.
In the present disclosure, the inventors combine the genetic map with the assembly of the reads, to provide a novel method of assembling the sequencing data (i.e. reads), which optimizes the assembly result of the sequencing data and enables assembling the reads into the genomic sequence, such as the chromosomal sequence.
The present disclosure is at least partially based on the following principle: if the genetic distance between two genes or genetic markers is very short, such two genes or genetic markers may be then regarded as being linked. Usually, the physical distance of the two linked genes or genetic markers on a sequence is also close, and the two linked genes or genetic markers belong to a same chromosome. Thus, the linkage relationship between the genetic markers in the genetic map may be used to cluster together the reads or scaffolds, comprising a linked marker, in accordance with the chromosome, and a size relationship and a relative position between the genetic markers may be used to orderly connect the reads or the scaffolds into the chromosomal sequence, or part sequence of the chromosome.
Specifically, in the present disclosure, the inventors exemplarily construct a genetic map using SNP genetic markers. The obtained genetic map comprises a large number of SNP markers, and provides the linkage relationship among these SNP markers. Accordingly, based on the linkage relationship among the SNP markers in the genetic map, reads or scaffolds comprising linked SNP markers may be clustered together in accordance with a chromosome. Further, based on the genetic distance and the relative position between the SNP markers, the reads or the scaffolds belonging to the same chromosome may be arranged in order, to realize assembling the reads into the chromosomal sequence.
Thus, in one aspect, the present disclosure provides a method of assembling reads of an individual. The method may comprise:
constructing a genetic map using genetic markers, in which the genetic map is used to cluster and arrange the reads comprising the genetic markers, to assemble the reads.
In a preferred embodiment, optionally, prior to clustering and arranging the reads, the reads are connected into scaffolds, and then the genetic map is used to cluster and arrange the scaffolds. Methods well-known in the art may be used to connect the reads into the scaffolds, for example, SOAPdenovo assembly software is used.
In a preferred embodiment, the genetic markers are SNP site markers.
In a preferred embodiment, the reads derived from a progeny population of the individual are aligned to the scaffolds of the individual, to search and determine the SNP site markers.
In a preferred embodiment, a SOAP software and a SOAPSnp software are used to search and determine the SNP site markers.
In a preferred embodiment, a Next-Generation sequencing method, such as a Solexa sequencing method, is used to sequence a genome of the individual, to obtain the reads of the individual.
In a preferred embodiment, the individual is an animal (such as mammal) or a plant (such as monocotyledon, dicotyledon and the like).
In another aspect, the present disclosure provides a method of assembling reads of an individual into a chromosomal sequence. The method may comprise:
1) providing the reads of the individual;
2) optionally, connecting the reads into scaffolds;
3) constructing a genetic map using genetic markers;
4) determining a linkage relationship between the genetic markers using a genetic distance between the genetic markers in the genetic map, to cluster together the reads or the scaffolds comprising the genetic markers in accordance with a chromosome;
5) arranging the reads or the scaffolds, belonging to a same chromosome, in a sequential order using the genetic distance between the genetic markers in the genetic map, and determining a connecting direction of each fragment, to assemble the reads into the chromosomal sequence.
In a preferred embodiment, in step 1), a Next-Generation sequencing method, for example a Solexa sequencing method, is used to sequence a genome of the individual, to provide the reads of the individual.
In a preferred embodiment, in step 2), SOAPdenovo assembly software is used to connect the reads into the scaffolds.
In a preferred embodiment, in step 3), the used genetic markers are SNP site markers.
In a preferred embodiment, in step 3), the reads derived from a progeny population of the individual are aligned to the scaffolds of the individual, to search and determine the SNP site markers.
In a preferred embodiment, in step 3), a SOAP software and a SOAPSnp software are used to search and determine the SNP site markers.
In a preferred embodiment, at least three genetic markers are selected from each read or each scaffold for steps 4) and 5).
The linkage relationship between the genetic markers may be determined based on methods well-known in the art (See, for example, Botstein, D., White, R. L., Skolnick, M. & Davis, R. W. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. American Journal of Human Genetics 32, 314 (1980)).
In a preferred embodiment, in step 4), the linkage relationship between the genetic markers is determined by following steps:
1) calculating a genetic distance between every two of all genetic markers;
2) setting a threshold value according to the distribution of all genetic distances, for example the threshold value is set as the lower limit of a confidence interval of the distribution of 95% or more (for example, 99%);
wherein two genetic markers between which the genetic distance is below the threshold value are regarded as being linked and belonging to the same chromosome.
In a preferred embodiment, the same number of the genetic markers (such as at least 3) is selected from each read or each scaffold for step 4), and in step 4), the reads or the scaffolds are clustered together in accordance with the chromosome by following steps:
1) clustering together the reads or the scaffolds comprising linked genetic markers, to form linkage groups;
optionally, performing steps 2) and 3):
2) for all reads or all scaffolds which cannot be clustered together to form any linkage groups in step 1),
calculating a quadratic sum of a genetic distance of the genetic markers in each unclustered fragment and a genetic distance of the genetic markers in each fragment of all linkage groups respectively;
selecting an unclustered fragment having a minimal quadratic sum and a corresponding fragment which has been clustered into the linkage groups; and
clustering the unclustered fragment into the linkage group to which the corresponding clustered fragment belongs;
3) repeating step 2) until the total genetic distance of the linkage groups reaches the total genetic map distance of the species to which the individual belongs; in the case of the total genetic map distance of the species being unknown, clustering all scaffolds into the linkage groups.
The above-described method may realize clustering most (for example, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or more) or all of the reads or the scaffolds together in accordance with the chromosome.
In a preferred embodiment, in step 5), an MSTmap software is used to arrange the genetic markers, to determine the sequential order of each scaffold comprising the genetic markers and belonging to the same chromosome.
In a preferred embodiment, the individual is an animal (such as mammal) or a plant (such as monocotyledon, dicotyledon and the like).
In another aspect, the present disclosure provides usage of a genetic marker in assembling reads of an individual.
In a preferred embodiment, the genetic markers are SNP site markers.
In a preferred embodiment, the reads of the individual are obtained by sequencing a genome of the individual using a Next-Generation sequencing method, such as a Solexa sequencing method.
In a preferred embodiment, the reads of the individual are firstly connected into scaffolds, for example using SOAPdenovo assembly software, and then further assembly is performed using the genetic markers.
In a preferred embodiment, the genetic markers are used to assemble the reads of the individual into a chromosomal sequence.
In a preferred embodiment, the individual is an animal (such as mammal) or a plant (such as monocotyledon, dicotyledon and the like).
General methods of constructing a genetic map using genetic markers, such as SNP, are known to those skilled in the art. (See, for example Shifman, S. et al. A high-resolution single nucleotide polymorphism genetic map of the mouse genome. PLoS biology 4, e395 (2006) and Groenen, M. A. M. et al. A high-density SNP-based linkage map of the chicken genome reveals sequence features correlated with recombination rate. Genome research 19, 510 (2009)). In the present disclosure, SNP is taken as an example, which exemplarily provides a method of constructing a genetic map.
For constructing an SNP genetic map, it is usually necessary to determine the SNP sites and to calculate the genetic distance (i.e. recombination rate) between the SNP sites. Accordingly, a progeny population of the target individual, whose reads are to be assembled, is usually obtained first (for example, the target individual as a parent is hybridized with a reference, and the offspring is then subjected to self-breeding, to provide the progeny population), and then the SNP sites are determined and the genetic distance (i.e. recombination rate) between the SNP sites is calculated by means of such progeny population.
Determination of the SNP Site
Taking a plant as an example, a plurality of individuals in the progeny population of the target individual, whose reads are to be assembled, are sequenced. In general, the sequencing depth of each progeny individual is about 2× to 3× (i.e. the total data volume of the reads reaches about 2 to 3 times the genome size) or more, to basically cover the entire genomic sequence. Thus, the respective sequencing data (i.e. reads) of the plurality of progeny individuals of the target individual may be obtained.
Then using, for example, SOAP software (Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966-7 (2009)), the reads of each progeny individual are aligned back to the scaffolds of the parent (i.e. the target individual); and, for example, SOAPSNP software (Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Research 19, 1124 (2009)) is used to search for SNP sites (i.e. sites at which a single base differs between a parent individual and a progeny individual).
Prior to performing alignment, optionally, the reads of each progeny individual may be filtered, to remove unqualified reads in each individual. The unqualified reads include, but are not limited to, the following cases:
the number of bases whose sequencing quality is below a certain threshold value (determined by the specific technique and sequencing environment) exceeds 50% of the total number of bases in the read;
the number of bases with an uncertain sequencing result (i.e. N in the read) exceeds 5% of the total number of bases in the read;
an exogenous sequence is present in the reads (for example, an exogenous sequence introduced by experiment, other than an adaptor sequence of a sample).
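The filtering cases listed above can be sketched as a simple predicate. This is only an illustration under stated assumptions: the function name `is_unqualified`, the default quality cut-off of 20 and the `exogenous_sequences` argument are hypothetical, since the text leaves the quality threshold to the specific technique and does not enumerate the exogenous sequences; the two percentages are applied per read.

```python
def is_unqualified(seq, quals, min_quality=20, exogenous_sequences=()):
    """Return True if a read should be removed before alignment.

    seq:   base string of the read.
    quals: per-base quality scores, same length as seq.
    min_quality: assumed cut-off; the text leaves this to the platform.
    exogenous_sequences: hypothetical list of sequences introduced by experiment.
    """
    n = len(seq)
    if n == 0 or len(quals) != n:
        return True
    # Case 1: low-quality bases exceed 50% of the bases in the read.
    low_quality = sum(1 for q in quals if q < min_quality)
    if low_quality > 0.5 * n:
        return True
    # Case 2: uncertain bases (N) exceed 5% of the bases in the read.
    if seq.upper().count("N") > 0.05 * n:
        return True
    # Case 3: the read contains an exogenous sequence.
    return any(exo in seq for exo in exogenous_sequences)


def filter_reads(reads, **kwargs):
    """Keep only qualified (seq, quals) pairs."""
    return [(s, q) for s, q in reads if not is_unqualified(s, q, **kwargs)]
```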
When performing alignment, the default parameters of the software are generally used, no gap is allowed, and the number of mismatches is not more than 5 bases. In addition, those reads which can be aligned to a plurality of sites in the genome are generally filtered out.
Furthermore, the SOAPSNP result is processed, to search for those SNP sites which exist in the parent and segregate in the progeny. The scaffolds in which these SNP sites are located, and the coordinates of the SNP sites in the scaffolds, are both recorded. The process of searching and determining the SNP site is shown in
Calculation of a Genetic Distance Between SNP Sites
According to the information of the SNP site of each progeny individual, a base of the SNP site in the progeny individual derived from a male parent or a female parent (i.e. genotype information) may be determined, which can further determine a distribution of the base of the SNP site in the parent individual among all progeny individuals (See
in which, same is the number of progeny individuals in which the bases of the two SNP sites derive from the same parent individual, and total is the total number of the individuals.
According to the above formula, the genetic distance between every two SNP sites may be calculated, from which an SNP genetic map may be constructed. On this basis, the linkage relationship between every two SNP marker sites may be determined. Normally, two SNP sites between which the genetic distance is very short are regarded as being linked, and their physical distance in the chromosome is not too far, i.e., such two SNP sites may be basically regarded as belonging to the same chromosome.
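The pairwise calculation can be sketched as follows. Because the formula itself is not reproduced in this text, the sketch assumes that the recombination rate is the fraction of progeny individuals in which the two sites derive from different parents, i.e. (total - same) / total, and that a recombination rate of 1% corresponds to 1 cM as defined earlier. The genotype encoding ("A" and "B" for the two parents, "-" for a missing call) and the function names are hypothetical.

```python
from itertools import combinations


def genetic_distance_cm(genotypes_1, genotypes_2):
    """Genetic distance in cM between two SNP sites.

    Each argument is a string with one character per progeny individual:
    'A' or 'B' for the parent of origin, '-' for a missing call (assumption).
    Assumes recombination rate = (total - same) / total and 1% = 1 cM.
    """
    same = total = 0
    for a, b in zip(genotypes_1, genotypes_2):
        if a == "-" or b == "-":
            continue  # skip individuals without a call at either site
        total += 1
        if a == b:
            same += 1
    if total == 0:
        raise ValueError("no progeny individual has calls at both sites")
    return 100.0 * (total - same) / total


def pairwise_distances(genotypes):
    """Map each unordered marker pair to its genetic distance in cM."""
    return {
        (m1, m2): genetic_distance_cm(genotypes[m1], genotypes[m2])
        for m1, m2 in combinations(sorted(genotypes), 2)
    }
```

With five progeny individuals of which four carry bases from the same parent at both sites, this sketch yields a distance of 20 cM.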
Clustering of the Scaffolds
On the basis of constructed genetic map, by means of the relative position relationship and the linkage relationship between the genetic markers in the genetic map, the scaffolds of the parent individual (target individual) may be clustered in accordance with a chromosome. An exemplary method of clustering the scaffolds in accordance with the chromosome is provided below.
In order to simplify the analysis, it is not necessary to subject all identified SNP sites to clustering. In general, three SNP site markers may be selected from each scaffold: two of them are located at the two ends of the scaffold respectively (one at the front-end of the scaffold, and the other at the back-end of the scaffold), while the third SNP site marker is located in the middle of the scaffold. The genetic distances between the SNP site located in the middle of the scaffold and the several SNP sites surrounding it are usually not very long; the two SNP site markers located at the two ends are as close as possible to the respective ends of the scaffold, and the genetic distance between these two SNP site markers is greater than zero.
The genetic distance between every two SNP sites is calculated, and the number of SNP site marker pairs at each genetic distance is counted, from which a graph is plotted with the genetic distance as the X-coordinate and the number of SNP site marker pairs as the Y-coordinate. Using the qqplot (Wilk, M. B. & Gnanadesikan, R. Probability plotting methods for the analysis of data. Biometrika 55, 1 (1968)) function of the R software, it has been found that the plotted distribution follows a normal distribution. An abscissa value bounding a confidence interval of at least 95% of the distribution is taken as a threshold value, and two SNP site markers whose genetic distance is less than the threshold value are regarded as belonging to the same chromosome.
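One way to read the threshold selection is sketched below. It is an assumption of this sketch that the threshold is the lower bound of a two-sided interval of a normal distribution fitted to all pairwise genetic distances; the function name `linkage_threshold` and the default confidence level are hypothetical.

```python
from statistics import NormalDist


def linkage_threshold(distances, confidence=0.99):
    """Lower bound of a two-sided interval of a fitted normal distribution.

    distances:  all pairwise genetic distances (cM); at least two values.
    confidence: coverage of the interval, e.g. 0.95 or 0.99.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    fitted = NormalDist.from_samples(distances)  # sample mean and std. dev.
    return fitted.inv_cdf((1.0 - confidence) / 2.0)
```

A wider interval gives a smaller threshold, so fewer marker pairs are declared linked and the clustering becomes more conservative.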
Thus, if the genetic distance between two SNP site markers which are located in different scaffolds is less than the threshold value, then these two scaffolds are regarded as belonging to the same chromosome. Based on this, all scaffolds may be clustered, and those scaffolds clustered together are regarded as a linkage group.
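The clustering rule above amounts to finding connected components over scaffolds, which can be sketched with a union-find structure. The function name `cluster_scaffolds` and the input layout (a marker-to-scaffold mapping plus a dictionary of pairwise distances) are assumptions made for illustration.

```python
def cluster_scaffolds(marker_scaffold, distances, threshold):
    """Group scaffolds whose markers are closer than the threshold.

    marker_scaffold: dict marker id -> scaffold id.
    distances:       dict (marker, marker) -> genetic distance in cM.
    threshold:       linkage threshold in cM.
    Returns a list of sorted scaffold lists; a list holding a single
    scaffold corresponds to a scaffold that could not be clustered.
    """
    parent = {s: s for s in set(marker_scaffold.values())}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (m1, m2), d in distances.items():
        if d < threshold:
            r1, r2 = find(marker_scaffold[m1]), find(marker_scaffold[m2])
            if r1 != r2:
                parent[r2] = r1  # merge the two linkage groups

    groups = {}
    for scaffold in parent:
        groups.setdefault(find(scaffold), []).append(scaffold)
    return sorted(sorted(members) for members in groups.values())
```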
In some cases, there may be some scaffolds which cannot be clustered into any linkage group. In these cases, the scaffolds which cannot be clustered into any linkage group may need to be further clustered into the linkage groups. Accordingly, following method may be used for further clustering:
1) calculating a quadratic sum of a genetic distance of the genetic markers in each unclustered fragment and a genetic distance of the genetic markers in each fragment of all linkage groups respectively;
selecting an unclustered fragment having a minimal quadratic sum and a corresponding fragment which has been clustered into the linkage groups; and
clustering the unclustered fragment into the linkage group to which the corresponding clustered fragment belongs;
2) repeating step 1) until the total genetic distance of the linkage groups reaches the total genetic map distance of the species to which the individual belongs (if the total genetic map distance of the species is unknown, all scaffolds are clustered into the linkage groups), which may realize clustering the scaffolds, which cannot be clustered into any linkage group, into the linkage groups. Thus, all scaffolds or at least most scaffolds (for example, at least 50% of the scaffolds, at least 60% of the scaffolds, at least 70% of the scaffolds, at least 80% of the scaffolds, at least 90% of the scaffolds, at least 95% of the scaffolds, at least 96% of the scaffolds, at least 97% of the scaffolds, at least 98% of the scaffolds, at least 99% of the scaffolds, or more scaffolds) of the parent individual (the target individual) may be clustered.
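The further clustering can be sketched as a greedy loop. This sketch interprets the “quadratic sum” as the sum of squared genetic distances over all marker pairs between an unclustered scaffold and a clustered scaffold, and it covers only the case in which the total genetic map distance of the species is unknown, so that the loop runs until every scaffold is assigned. The function name and input layout are hypothetical.

```python
def assign_unclustered(groups, unclustered, scaffold_markers, dist):
    """Greedily add unclustered scaffolds to existing linkage groups.

    groups:           dict group id -> list of scaffold ids (must be non-empty).
    unclustered:      iterable of scaffold ids not yet in any group.
    scaffold_markers: dict scaffold id -> list of marker ids.
    dist:             function (marker, marker) -> genetic distance in cM.
    """
    groups = {g: list(members) for g, members in groups.items()}
    pending = set(unclustered)

    def quadratic_sum(u, c):
        return sum(
            dist(a, b) ** 2
            for a in scaffold_markers[u]
            for b in scaffold_markers[c]
        )

    while pending:
        # Pick the (unclustered, clustered) pair with the minimal quadratic sum.
        _, best_scaffold, best_group = min(
            (quadratic_sum(u, c), u, g)
            for u in sorted(pending)
            for g, members in groups.items()
            for c in members
        )
        groups[best_group].append(best_scaffold)
        pending.remove(best_scaffold)
    return groups
```

Because a newly assigned scaffold becomes a clustered scaffold, later assignments may attach to it, which mirrors the repetition of step 1).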
Sorting of the Scaffolds
After clustering the scaffolds, the genetic distance between the genetic markers (for example, the SNP site markers) may be used to sort the scaffolds belonging to the same chromosome. For example, MSTmap software (Wu, Bhat et al. 2008) may be used to sort the SNP site markers located in the middle of the scaffolds. The MSTmap software is able to sort the scaffolds by constructing a minimum spanning tree according to the genetic distances between the genetic markers. In general, the actual sequential order of the genetic markers may be obtained by calculating a minimum spanning tree of the graph. Based on this, the relative positions in the linkage group of the genetic markers located in the middle of the scaffolds may be obtained, which may further determine the sequential order of the scaffolds belonging to the same chromosome.
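The idea of ordering markers from a minimum spanning tree can be illustrated with a small sketch. This is not the MSTmap implementation; it is a simplified illustration that builds a minimum spanning tree with Prim's algorithm and reads off the marker order only when the tree is a simple path. The function name and inputs are hypothetical.

```python
def order_markers_by_mst(markers, dist):
    """Order markers along a linkage group using a minimum spanning tree.

    markers: list of marker ids from one linkage group.
    dist:    function (marker, marker) -> genetic distance in cM.
    Raises ValueError when the tree branches, i.e. the order is ambiguous.
    """
    markers = list(markers)
    if len(markers) < 2:
        return markers
    in_tree = {markers[0]}
    adjacency = {m: [] for m in markers}
    # Prim's algorithm: repeatedly attach the closest marker outside the tree.
    while len(in_tree) < len(markers):
        _, a, b = min(
            (dist(a, b), a, b)
            for a in in_tree
            for b in markers
            if b not in in_tree
        )
        adjacency[a].append(b)
        adjacency[b].append(a)
        in_tree.add(b)
    leaves = sorted(m for m in markers if len(adjacency[m]) == 1)
    if len(leaves) != 2:
        raise ValueError("minimum spanning tree is not a simple path")
    # Walk the path from one end to the other.
    order, previous, current = [], None, leaves[0]
    while current is not None:
        order.append(current)
        following = [n for n in adjacency[current] if n != previous]
        previous, current = current, (following[0] if following else None)
    return order
```

When the genetic distances are consistent with a linear arrangement, the minimum spanning tree is a path and its traversal gives the marker order, up to reversal of the whole group.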
Determination of a Connected Direction of the Scaffolds
Furthermore, the genetic distance between the genetic markers (such as SNP site marker) may be used to determine a connected direction of various scaffolds.
For example, after sorting the scaffolds belonging to the same chromosome, the genetic distances between the SNP site markers located at both ends of one scaffold (front-end and back-end) and the SNP site marker located in the middle of the previous scaffold may be calculated, which may determine the connected direction of the one scaffold relative to the previous scaffold. If the genetic distance between the SNP site marker located at one end of the scaffold and the SNP site marker located in the middle of the previous scaffold is the smaller of the two, then that end of the one scaffold connects to the previous scaffold, which determines the connected direction of the one scaffold. Optionally, any other suitable marker combination (for example, the SNP site markers located at the front-end and in the middle of the scaffold with a pending connected direction, or the SNP site markers located at the back-end and in the middle of the scaffold with a pending connected direction, as well as any one of the SNP site markers of the previous scaffold) may be used to determine the connected direction of the scaffold.
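The direction rule can be sketched as a comparison of two distances. The function names, the "+" and "-" labels and the `(front, middle, back)` tuple layout are assumptions of this sketch; the first scaffold of a linkage group has no previous scaffold and is left undetermined here.

```python
def choose_direction(front_to_prev_mid, back_to_prev_mid):
    """Return '+' if the front-end faces the previous scaffold, '-' if the
    back-end does, and None when the two distances are equal."""
    if front_to_prev_mid < back_to_prev_mid:
        return "+"
    if back_to_prev_mid < front_to_prev_mid:
        return "-"
    return None


def orient_scaffolds(order, markers, dist):
    """Determine the connected direction of each scaffold after sorting.

    order:   scaffold ids in their sorted order within one linkage group.
    markers: dict scaffold id -> (front, middle, back) marker ids.
    dist:    function (marker, marker) -> genetic distance in cM.
    """
    directions = {}
    for previous, current in zip(order, order[1:]):
        front, _, back = markers[current]
        prev_mid = markers[previous][1]
        directions[current] = choose_direction(
            dist(front, prev_mid), dist(back, prev_mid)
        )
    return directions
```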
After clustering and sorting the scaffolds as well as determining the connected direction (for example, according to above-mentioned steps), most scaffolds may be clustered together and then aligned to a chromosome or a certain fragment of a chromosome, so as to assemble reads into a chromosomal sequence.
The present disclosure innovatively combines a genetic map with reads, to provide a novel method of assembling sequencing data (i.e., reads). Compared with the prior art, the technical solution of the present disclosure has the following advantageous effects:
1) it overcomes the bottleneck that reads cannot be assembled into a genomic sequence (such as a chromosomal sequence) using read assembly software alone, which optimizes the assembly result of sequencing data;
2) it realizes the assembly of reads into a genomic sequence, such as a chromosomal sequence, which provides a more powerful tool for genomics.
Reference will be made in detail to embodiments and examples of the present disclosure, and it would be appreciated by those skilled in the art that the following figures and examples are explanatory and illustrative, are used to generally understand the present disclosure, and shall not be construed to limit the present disclosure. According to the following detailed descriptions of figures and preferred embodiments, various purposes and advantages of the present disclosure will become apparent to those skilled in the art.
These and other aspects and advantages of embodiments of the present disclosure will become apparent and more readily appreciated from the following descriptions made with reference to the accompanying drawings, in which:
In order to make the purpose, technical solution and advantages of the present disclosure more apparent, the present disclosure is further described in detail below. It would be appreciated by those skilled in the art that the specific examples described herein are explanatory of the present disclosure, and shall not be construed to limit the present disclosure.
In the present example, 9311 rice was taken as an example to exemplarily describe the method of assembling reads according to the present disclosure.
Obtaining Scaffolds of 9311 Rice
The genome of 9311 rice was sequenced using the Solexa sequencing platform (Illumina), to provide reads of 9311 rice. Then, using methods well-known in the art, for example SOAPdenovo assembly software (http://soap.genomics.org.cn/soapdenovo.html), the reads of 9311 rice were connected into scaffolds; for the sequence information of these scaffolds, reference may be made to Yu, Hu et al. 2002.
Obtaining Progeny Population of 9311 Rice
The 9311 rice (Yu, J. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79 (2002)) was hybridized with pa64 rice (Wei, G et al. A transcriptomic analysis of superhybrid rice LYP9 and its parents. Proc Natl Acad Sci USA 106, 7695-701 (2009)), to obtain an F1 generation, and then the F1 generation was self-bred for 16 generations, to obtain a progeny population of 9311 rice. 135 progeny individuals were selected randomly from the progeny population obtained from self-breeding for 16 generations, and each was sequenced individually at a sequencing depth of 2× (a data volume of twice the genome size), to provide reads of the progeny individuals.
Searching and Determining SNP Site
Taking the scaffolds from the parent 9311 rice as a reference sequence, using SOAP software (Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966-7 (2009)), the reads of the 135 progeny individuals were aligned back to the reference sequence.
Based on the alignment result obtained using SOAP software, SOAPSnp software (See, for example, http://soap.genomics.org.cn/soapsnp.html or Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Research. 19, 1124 (2009)) was used to search for SNP sites, and to determine the genotype of each SNP site in each progeny individual (i.e. to determine whether the base of the SNP site in the progeny individual derived from the 9311 rice or the pa64 rice).
The statistical result of the SNP sites from the 9311 rice is shown in Table 1.
As can be seen from the statistical result in Table 1, the SNP site markers were not only very numerous, but also basically uniformly distributed over the entire genome. Moreover, these SNP site markers basically covered the entire genome, so that they could be used in assembling the scaffolds into a genomic sequence (for example a chromosomal sequence).
Clustering and Arranging Scaffolds
In order to clustering the scaffolds, three SNP site markers were selected out from each scaffold, in which, two of them located at two ends of the scaffolds respectively (one located at a front-end of the scaffold, and the other located at a back-end of the scaffold), while the third SNP site marker located in the middle of the scaffold. The genetic distances between every two of all selected SNP site markers were calculated. The number of the pairwise SNP site markers having the same genetic distance was subjected to statistics, with which a graph was plotted taken the genetic distance as X-coordinate and taken the number of pairwise SNP site markers as Y-coordinate (See
A 99% confidence interval of the distribution was calculated, and its lower limit was taken as the threshold value, giving a genetic distance threshold of about 3 cM. Thus, if the genetic distance between two SNP site markers was less than 3 cM, these two SNP site markers were regarded as linked and as belonging to the same chromosome. Accordingly, the scaffolds on which these two SNP site markers were located were also regarded as belonging to the same chromosome.
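As a hedged illustration of how a pairwise genetic distance might be obtained and compared with the threshold, the sketch below estimates the recombination fraction between two markers from progeny genotype calls and converts it to centimorgans with the Kosambi mapping function, d = 25 ln((1 + 2r)/(1 - 2r)). The use of the Kosambi function at this step, the `A`/`B`/`-` genotype coding and the toy data are assumptions; the sketch also ignores any correction for the repeated generations of self-breeding.

```python
import math


def recombination_fraction(calls_a, calls_b):
    """Fraction of progeny whose parental origin differs between two markers.
    Progeny with a missing call ('-') at either marker are skipped."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b) if a != "-" and b != "-"]
    if not pairs:
        return None  # no progeny informative at both markers
    return sum(a != b for a, b in pairs) / len(pairs)


def kosambi_cm(r):
    """Kosambi mapping function: recombination fraction -> distance in cM."""
    if r >= 0.5:
        return math.inf  # markers behave as unlinked
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


# Toy genotype calls for two markers across ten progeny individuals.
marker_1 = list("AAAABBBBAB")
marker_2 = list("AAAABBBBAA")

r = recombination_fraction(marker_1, marker_2)
d = kosambi_cm(r)
linked = d < 3.0  # threshold of about 3 cM taken from the text
print(r, round(d, 2), linked)
```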
According to the above-described threshold value of genetic distance, all scaffolds were clustered. The results showed that, after clustering, 12 linkage groups were obtained (corresponding to the haploid chromosome number of rice).
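The threshold-based clustering can be sketched as a single-linkage grouping: any two scaffolds carrying a pair of markers closer than the threshold are merged into one linkage group. The union-find structure, the function name and the toy marker names below are assumptions chosen for illustration, not the implementation used in the example.

```python
def cluster_scaffolds(marker_scaffold, distances_cm, threshold_cm=3.0):
    """marker_scaffold: marker name -> scaffold name.
    distances_cm: (marker, marker) -> genetic distance in cM.
    Returns the linkage groups as sorted lists of scaffold names."""
    parent = {s: s for s in set(marker_scaffold.values())}

    def find(s):
        # Follow parent links to the group representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (m1, m2), d in distances_cm.items():
        if d < threshold_cm:
            # Linked markers: merge the groups of the scaffolds they lie on.
            parent[find(marker_scaffold[m1])] = find(marker_scaffold[m2])

    groups = {}
    for s in parent:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(g) for g in groups.values())


marker_scaffold = {
    "s1_back": "scaf1", "s2_front": "scaf2", "s2_back": "scaf2",
    "s3_front": "scaf3", "s4_middle": "scaf4",
}
distances_cm = {
    ("s1_back", "s2_front"): 1.2,
    ("s2_back", "s3_front"): 2.5,
    ("s3_front", "s4_middle"): 40.0,
}
groups = cluster_scaffolds(marker_scaffold, distances_cm)
print(groups)
```

In this toy input, scaf1, scaf2 and scaf3 are chained together through linked marker pairs, while scaf4 has no marker within the threshold of any other scaffold and remains on its own.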
Furthermore, those scaffolds which could not be clustered into any linkage group were clustered by the following steps:
1) calculating the sum of squares of the genetic distances between the SNP site markers in each unclustered scaffold and the SNP site markers in each scaffold of all linkage groups; selecting the unclustered scaffold having the minimal sum of squares, together with the corresponding scaffold which had already been clustered into a linkage group; and clustering the unclustered scaffold into the linkage group to which the corresponding clustered scaffold belonged;
2) repeating step 1), until the total genetic distance of all linkage groups reached the total genetic map distance of the rice species.
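The two steps above can be sketched as a greedy loop. This is a simplified sketch under stated assumptions: the function names, the data layout and the toy distance function are hypothetical, and the loop stops when every scaffold has been assigned rather than when the total genetic distance reaches the total map distance of the species.

```python
def sum_of_squares(markers_u, markers_c, dist):
    """Sum of squared genetic distances over all marker pairs of two scaffolds."""
    return sum(dist(a, b) ** 2 for a in markers_u for b in markers_c)


def assign_unclustered(unclustered, clustered, group_of, dist):
    """unclustered / clustered: scaffold name -> list of marker names.
    group_of: clustered scaffold name -> linkage group id.
    Returns an updated copy of group_of including the assigned scaffolds."""
    unclustered, clustered, group_of = dict(unclustered), dict(clustered), dict(group_of)
    while unclustered:
        # Step 1: find the (unclustered, clustered) pair with the minimal score.
        _, u, c = min(
            (sum_of_squares(mu, mc, dist), u, c)
            for u, mu in unclustered.items()
            for c, mc in clustered.items()
        )
        group_of[u] = group_of[c]          # join the group of the closest scaffold
        clustered[u] = unclustered.pop(u)  # step 2: repeat with the updated groups
    return group_of


# Toy distance: absolute difference of made-up map positions (cM).
position = {"c1": 0.0, "c2": 50.0, "u1": 4.0, "u2": 47.0}
dist = lambda a, b: abs(position[a] - position[b])

result = assign_unclustered(
    unclustered={"scafX": ["u1"], "scafY": ["u2"]},
    clustered={"scafA": ["c1"], "scafB": ["c2"]},
    group_of={"scafA": 1, "scafB": 2},
    dist=dist,
)
print(result)
```

Because a newly assigned scaffold is added to the clustered set, later assignments can attach to it, which mirrors the repeated application of step 1).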
According to the above steps, a total of 444 scaffolds were clustered, with a total length of 338,305,001 bp, which accounted for 88.2% of the genome size. Most scaffolds were thus clustered together in accordance with their chromosomes.
After the clustering steps were completed, MSTmap software (Wu, Y, Bhat, P. R., Close, T. J. & Lonardi, S. Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS Genet 4, e1000212 (2008)) was used to sort the clustered scaffolds, to determine their sequential relationship in the linkage groups. Then, the relative genetic distances between the SNP site markers located at both ends of each scaffold and the SNP site marker located in the middle of the preceding scaffold were used to determine the connection direction of each scaffold. By the above-described assembly method, 12 linkage groups (corresponding to the 12 chromosomes of the 9311 rice) were obtained, the detailed information of which is shown in Table 2. In addition,
As can be seen from the above result, the method of the present example, using a genetic map comprising SNP site markers, overcame the bottleneck that assembly software based on the Next-Generation sequencing technique cannot connect reads into a chromosomal sequence, and successfully connected the reads of the 9311 rice genome into chromosomal sequences, thereby providing a more powerful tool for genomics.
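The orientation step used in the example, in which the two end markers of each scaffold are compared against the middle marker of the preceding scaffold, can be sketched as follows. The marker naming, the `+`/`-` orientation codes and the toy map positions are assumptions for illustration only.

```python
def orient_scaffolds(order, dist):
    """order: scaffold names in their sorted order within one linkage group.
    dist: function giving the genetic distance (cM) between two markers,
    each marker being a (scaffold, 'front' | 'middle' | 'back') tuple.
    Returns scaffold -> '+' (as is), '-' (reversed) or None (undetermined)."""
    orientations = {order[0]: None}  # the first scaffold has no predecessor
    for prev, cur in zip(order, order[1:]):
        d_front = dist((prev, "middle"), (cur, "front"))
        d_back = dist((prev, "middle"), (cur, "back"))
        # The end nearer to the preceding scaffold should be joined to it.
        orientations[cur] = "+" if d_front <= d_back else "-"
    return orientations


# Toy map positions (cM) for the markers of three ordered scaffolds.
position = {
    ("s1", "middle"): 5.0,
    ("s2", "front"): 12.0, ("s2", "middle"): 15.0, ("s2", "back"): 18.0,
    ("s3", "front"): 30.0, ("s3", "back"): 24.0,
}
dist = lambda a, b: abs(position[a] - position[b])

orientations = orient_scaffolds(["s1", "s2", "s3"], dist)
print(orientations)
```

Here the back end of s3 lies closer to the middle of s2 than its front end does, so s3 is marked as reversed before being connected.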
In addition, the above-described method was also used to assemble the reads of an individual derived from watermelon, which is a species with a smaller genome (11 chromosomes). The assembly result of the reads of this individual was shown in
Although specific embodiments of the present disclosure have been described in detail, the above embodiments cannot be construed to limit the present disclosure. It would be appreciated by those skilled in the art that various modifications and changes can be made to the embodiments according to all teachings which have already been disclosed, all of which are within the scope of the present disclosure. The full scope of the present disclosure is given by the claims and any equivalents thereof.
In the present text, additional details of publications and other materials for illustrating the present disclosure or for providing implementations of the present disclosure are all incorporated herein by reference, and the following references are provided for convenience.
1. Kosambi, D. (1944). “The estimation of map distances from recombination values.” Ann. Eugen. 12: 172-175.
2. Li, R., Y. Li, et al. (2009). “SNP detection for massively parallel whole-genome resequencing.” Genome Research 19(6): 1124.
3. Li, R., Y. Li, et al. (2008). “SOAP: short oligonucleotide alignment program.” Bioinformatics 24(5): 713.
4. Li, R., H. Zhu, et al. (2010). “De novo assembly of human genomes with massively parallel short read sequencing.” Genome Research 20(2): 265.
5. Wu, Y., P. R. Bhat, et al. (2008). “Efficient and Accurate Construction of Genetic Linkage Maps from the Minimum Spanning Tree of a Graph.” PLoS Genet 4(10): e1000212.
6. Yu, J., S. Hu, et al. (2002).“A draft sequence of the rice genome (Oryza sativa L. ssp. indica).” Science 296 (5565): 79.
7. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20, 265-72 (2010).
8. Agarwal, M., Shrivastava, N. & Padh, H. Advances in molecular marker techniques and their applications in plant sciences. Plant cell reports 27, 617-631 (2008).
9. Botstein, D., White, R.L., Skolnick, M. & Davis, R. W. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. American Journal of Human Genetics 32, 314 (1980).
10. Shifman, S. et al. A high-resolution single nucleotide polymorphism genetic map of the mouse genome. PLoS biology 4, e395 (2006).
11. Groenen, M. A. M. et al. A high-density SNP-based linkage map of the chicken genome reveals sequence features correlated with recombination rate. Genome research 19, 510 (2009).
12. Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966-7 (2009).
13. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Research 19, 1124 (2009).
14. Kosambi, D. The estimation of map distances from recombination values. Annals of Human Genetics 12, 172-175 (1943).
15. Wilk, M. B. & Gnanadesikan, R. Probability plotting methods for the analysis of data. Biometrika 55, 1 (1968).
16. Wu, Y., Bhat, P. R., Close, T. J. & Lonardi, S. Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS Genet 4, e1000212 (2008).
17. Wei, G et al. A transcriptomic analysis of superhybrid rice LYP9 and its parents. Proc Natl Acad Sci USA 106, 7695-701 (2009).
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2011/076840 | 7/5/2011 | WO | 00 | 1/3/2014 |