The present invention relates to a method for determining chromosome abnormalities, and more particularly, to a new method for determining chromosome abnormalities that includes obtaining next-generation sequencing (NGS) sequence data regardless of the NGS analysis platform, determining male or female by extracting unique reads from the sequenced data, and setting a threshold line through initial learning by linear discriminant analysis (LDA) of existing data, so that the method can be applied to both autosomes and sex-chromosomes and its accuracy and sensitivity improve as the number of diagnoses increases.
‘Prenatal diagnosis’ refers to the process of determining and diagnosing the presence or absence of fetal diseases before birth. According to recent statistics, children with congenital malformations account for about 3% of all neonates, and about 20% of these malformations are caused by chromosome abnormalities. Specifically, the widely known Down syndrome accounts for 26% of children with congenital malformations.
Due to the increased birth rate of malformed children and the development of various prenatal diagnostic devices, interest in prenatal diagnosis is increasing day by day. In particular, prenatal diagnosis is required when the pregnant woman is over 35 years of age, when the pregnant woman has previously given birth to a child with chromosome abnormalities, when one of the parents has a family history of genetic disease, when there is a risk of neural tube defects, or when fetal malformation is suspected from maternal serum screening or ultrasonography.
Prenatal diagnosis methods may be broadly divided into invasive and noninvasive methods. Examples of invasive methods include chorionic villus sampling (CVS), performed between 10 and 12 weeks of pregnancy; amniocentesis, performed between 15 and 20 weeks of pregnancy, in which fetal chromosomes are analyzed and the concentration of AFP in the amniotic fluid is measured by immunoassay; cordocentesis, in which fetal blood is drawn directly from the umbilical cord under ultrasound guidance between 18 and 20 weeks of pregnancy; and the like.
However, these invasive diagnostic methods may cause miscarriage, illness, or malformation by affecting the fetus during the examination. Obtaining fetal material by amniocentesis or chorionic villus sampling is invasive and carries a non-negligible risk to the pregnancy even when performed by skilled clinicians. In current practice, these invasive diagnostic methods are therefore generally used when maternal age, or pre-screening by biochemical testing or ultrasound examination, indicates an elevated probability of a Down syndrome pregnancy.
Noninvasive diagnostic methods have been developed to overcome the problems of these invasive diagnostic methods. For example, preimplantation genetic diagnosis is a technique for selecting embryos free of genetic defects before intrauterine implantation, using molecular genetic or cytogenetic techniques in in-vitro fertilization. In addition, quantitative fluorescent PCR (QF-PCR) is a rapid screening test for chromosome aneuploidy in which chromosome-specific short tandem repeats (STRs) are amplified with fluorescent labels by multiplex PCR and the amount of fluorescence-labeled amplified DNA is then measured and analyzed on an automated DNA sequence analyzer. In addition, the chromosome microarray (CMA) method, which inspects DNA sequences mapped onto a glass slide, is known for detecting copy number changes.
Meanwhile, with the development of sequencing technology, it has become possible to decode large-scale genome information, and genome analysis methods based on next-generation sequencing (NGS) technology are now used in the field of prenatal diagnosis as well. In particular, it is known that cell-free DNA in the plasma of pregnant women contains components of fetal origin (Lo et al., 1997, Lancet 350, 485-487). Of the cell-free plasma DNA (hereinafter referred to as “serum DNA”), 5% to 20% originates from the fetus, and the remainder is often formed of short DNA molecules (80 to 200 bp) of maternal origin (Birch et al., 2005, Clin Chem 51, 312-320; Fan et al., 2010, Clin Chem 56, 1279-1286).
Prenatal diagnosis methods that isolate fetal cells from the maternal blood and analyze chromosomes using these facts are known. In general, since conditions involving chromosome aneuploidy, caused by extra or missing chromosomes, produce a detectable imbalance in the population of fetal DNA molecules within the maternal cell-free plasma DNA, methods of analyzing chromosome abnormalities on this basis have been developed.
In principle, if the cell-free DNA in the plasma were not diluted by the maternal component, the extra chromosome that characterizes T21 would be expected to yield 50% more DNA molecules derived from that chromosome than in a normal pregnancy. However, given a typical value of 10% for the fetal component of the cell-free plasma DNA, the resulting imbalance is only 5%; that is, the relative number of chromosome 21-derived fragments is expected to be 1.05, compared with 1.00 for a normal pregnancy. Where the fetal component of plasma DNA is smaller or larger than 10%, the imbalance in the number of chromosome 21-derived molecules in the maternal plasma is correspondingly smaller or larger.
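The dilution arithmetic above can be checked with a short sketch. It assumes a simple mixing model that the text implies but does not write as a formula: maternal DNA contributes two copies of the chromosome, fetal DNA contributes three in a trisomy, and the two sources are mixed in proportion to the fetal fraction. The function name `expected_relative_count` and its parameters are illustrative only.

```python
def expected_relative_count(fetal_fraction, fetal_copies=3, maternal_copies=2):
    """Expected fragment count for one chromosome, relative to a euploid pregnancy.

    Assumed mixing model (not stated as a formula in the text): fragments are
    contributed in proportion to copy number, weighted by the fetal fraction.
    """
    if not 0.0 <= fetal_fraction <= 1.0:
        raise ValueError("fetal_fraction must be between 0 and 1")
    mixed = (1.0 - fetal_fraction) * maternal_copies + fetal_fraction * fetal_copies
    return mixed / 2.0  # a euploid pregnancy has two copies from both sources


# Undiluted fetal DNA (fraction 1.0) gives 1.5, i.e. 50% more fragments;
# a 10% fetal fraction gives approximately 1.05.
undiluted = expected_relative_count(1.0)
typical = expected_relative_count(0.10)
```

Under this model the relative excess is half the fetal fraction, which is why a low fetal fraction makes the signal hard to separate from noise.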
Thus, the basis of this non-invasive diagnostic test is obtaining nucleotide sequence data for DNA molecules from the maternal plasma (‘DNA sequence analysis’). After partial or complete nucleotide sequence information is obtained from individual DNA molecules, bioinformatics techniques are applied to assign each molecule to its chromosome of origin, most simply by comparison with the reference human genome(s).
Provided that nucleotide sequence data can be obtained for a sufficiently large number of plasma DNA molecules and that bioinformatic methods can reliably assign a sufficiently large number of them to their chromosomes of origin, statistical methods may be applied to determine, with statistical reliability, the presence or absence of chromosome imbalances in the population of plasma DNA molecules.
Up to now, in this diagnostic method, in order to obtain a sequence having a length enough to be assigned to a chromosome origin thereof, a large-scale parallel DNA sequencing technique which generates high-quality sequence data that is relatively error-free (known as next-generation sequencing or second-generation sequencing) was used.
This specific automated sequencing device generates sequence data that is substantially less than that normally required for general genomic sequencing. The sequence data generated as such is characterized by frequent errors. Types of these errors are various, but ‘insertion-deletion (indel)’ is most common and is an error caused by a sequencing device which delivers an inaccurate excess base (insertion) or a deleted base. In addition, it is difficult to effectively sequence a short homopolymer run (i.e., a run of several identical bases). In addition, the sequencing error may also include “mismatch” in which the base is incorrectly assigned, and tends to indicate various errors.
In addition, such massively parallel sequencing has the disadvantages that it takes a long time and that high-quality sequencing is performed on a full-service genome sequencer, mainly the Illumina HiSeq, which generates very large data sets requiring expensive bioinformatics. In addition, the specific analysis method varies with the kind of full-service genome sequencer, and the run and the analysis process together may take several weeks.
In order to solve the problems of the related art described above, an object of the present invention is to provide a new method for determining chromosome abnormalities that is not limited to the sequencing method of a specific automatic sequencer or to the normalization method thereof in the related art, that is able to use the generated sequence information, and that can be applied to both autosomes and sex-chromosomes.
In order to solve the above objects, an aspect of the present invention provides a method for determining chromosome abnormalities including:
a first step of extracting a unique read from sequenced sequencing data of a target chromosome;
a second step of setting a threshold line for determining chromosome aneuploidy by linear discriminant analysis (LDA) by dividing and labeling normality and aneuploidy of chromosome data pre-verified for the normality and aneuploidy; and
a third step of determining whether there is aneuploidy of the unique read-target chromosome gene extracted in the first step by the threshold line set in the second step.
In the method of determining the chromosome abnormalities according to the present invention, in the second step of setting the threshold line for determining the aneuploidy, the normality and the aneuploidy of the chromosome data pre-verified for the normality and the aneuploidy are divided and labeled to be initially learned by the LDA and a minimum value of the aneuploidy chromosome data among the pre-verified chromosome data is set as the threshold value.
In the method of determining the chromosome abnormalities according to the present invention, the LDA technique refers to a linear discriminant analysis method and refers to a method of setting an initial threshold value by analyzing the pre-verified chromosome data and setting a minimum value of the aneuploidy chromosome data as the threshold line by additionally analyzing the accumulated samples.
In the method of determining the chromosome abnormalities according to the present invention, in the step of determining whether there is the aneuploidy of the new target chromosome gene according to the criteria set by the LDA method, the presence or absence of chromosome abnormalities is determined by setting a range of a normal sample from the pre-verified chromosome data and setting a minimum value of the aneuploidic data as the threshold value.
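The threshold-setting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: each sample is reduced to a single hypothetical feature (for example, the fraction of unique reads on the target chromosome), the LDA is the one-dimensional two-class case with equal priors and a pooled variance, and the threshold is the minimum LDA score among the pre-verified aneuploid samples. All function names and numeric values are invented for illustration.

```python
from statistics import mean


def fit_lda_1d(normal, aneuploid):
    """Fit a two-class, one-feature LDA (equal priors, pooled variance)."""
    mu_n, mu_a = mean(normal), mean(aneuploid)
    ss = sum((x - mu_n) ** 2 for x in normal) + sum((x - mu_a) ** 2 for x in aneuploid)
    pooled_var = ss / (len(normal) + len(aneuploid) - 2)
    weight = (mu_a - mu_n) / pooled_var   # discriminant direction
    midpoint = (mu_a + mu_n) / 2.0        # LDA decision boundary
    return weight, midpoint


def lda_score(x, model):
    """Signed discriminant score; positive values lie on the aneuploid side."""
    weight, midpoint = model
    return weight * (x - midpoint)


def fit_threshold(normal, aneuploid):
    """Initial learning: fit the LDA, then take the minimum aneuploid score."""
    model = fit_lda_1d(normal, aneuploid)
    threshold = min(lda_score(x, model) for x in aneuploid)
    return model, threshold


def is_aneuploid(x, model, threshold):
    return lda_score(x, model) >= threshold


# Hypothetical pre-verified feature values (illustrative, not from the source).
normal_values = [1.30, 1.31, 1.29, 1.32]
aneuploid_values = [1.38, 1.40, 1.42]
model, threshold = fit_threshold(normal_values, aneuploid_values)
new_sample_call = is_aneuploid(1.39, model, threshold)
```

Because the score is signed by the direction from the normal mean to the aneuploid mean, the same sketch covers a decrease in reads as well as an increase.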
In the method for determining the chromosome abnormalities according to the present invention, in the step of extracting the unique read from the target chromosome, the unique read which is divided into a 90 kb bin region and has the GC content of 0.35 to 0.55 or less is extracted.
The method for determining the chromosome abnormalities according to the present invention further includes, after the first step, a 1-1 step of calculating UR(x) % (percentage of reads uniquely matched to a chromosome X) and UR(y) % (percentage of reads uniquely matched to a chromosome Y) represented by the following Formulas from the extracted unique read;
UR(x) %=Number of reads of chromosome X (chrX)/total number of (autosomes) reads×100
UR(y) %=Number of reads of chromosome Y (chrY)/total number of (autosomes) reads×100
a 1-2 step of discriminating gender from the UR(x) % and the UR(y) %; and
a 1-3 step of discriminating gender from the number of reads of the region matched to a Y-specific region in the step of discriminating the gender from the UR(x) % and the UR(y) %.
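The two percentages in step 1-1 can be computed directly from per-chromosome unique-read counts. The sketch below assumes the counts are available as a dictionary keyed by names such as `chr1` to `chr22`, `chrX`, and `chrY`; that data layout and the function name are assumptions made for illustration.

```python
AUTOSOMES = [f"chr{i}" for i in range(1, 23)]


def unique_read_percentages(read_counts):
    """Return (UR(x) %, UR(y) %) from a dict of unique-read counts per chromosome."""
    total_autosomal = sum(read_counts.get(chrom, 0) for chrom in AUTOSOMES)
    if total_autosomal == 0:
        raise ValueError("no autosomal reads; the percentages are undefined")
    ur_x = read_counts.get("chrX", 0) / total_autosomal * 100
    ur_y = read_counts.get("chrY", 0) / total_autosomal * 100
    return ur_x, ur_y


# Toy counts for illustration only.
toy_counts = {"chr1": 600, "chr2": 400, "chrX": 50, "chrY": 1}
ur_x, ur_y = unique_read_percentages(toy_counts)
```

Note that the denominator is the autosomal total, as in the Formulas, so the sex-chromosome reads do not dilute their own percentage.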
In the method for determining the chromosome abnormalities according to the present invention, in the step of discriminating the gender from the UR(x) % and the UR(y) %, the gender is discriminated from the number of reads in the region (Table 1) matched to the Y-specific region. The Y-specific region retains only the pure chrY region: the pseudoautosomal region is identified by comparing chrX and chrY and removed, and the chrX region is then removed.
In the method for determining the chromosome abnormalities according to the present invention, the chromosome is at least one chromosome selected from the group consisting of chromosome 13, chromosome 18, chromosome 21, chromosome 3, chromosome 7, and chromosome 12, a chromosome X or a chromosome Y.
In the method for determining the chromosome abnormalities according to the present invention, it is possible to be extended to whole autosomes when the autosomes are targeted, and in the method for determining the chromosome abnormalities according to the present invention, examples of the chromosome abnormalities include:
Down syndrome (Trisomy 21), Edwards syndrome (Trisomy 18), Patau syndrome (Trisomy 13), Trisomy 9, Warkany syndrome (Trisomy 8), Cat Eye syndrome (4 copies of chromosome 22), Trisomy 22, and Trisomy 16.
Additionally or alternatively, the detection of an abnormality of genes, chromosomes, or parts of chromosomes, and of the copy number, may include detection and/or diagnosis of a condition selected from the group consisting of: Wolf-Hirschhorn syndrome (4p−), Cri du chat syndrome (5p−), Williams-Beuren syndrome (7−), Jacobsen syndrome (11−), Miller-Dieker syndrome (17−), Smith-Magenis syndrome (17−), 22q11.2 deletion syndrome (also known as Velocardiofacial syndrome, DiGeorge syndrome, conotruncal anomaly face syndrome, congenital thymic dysplasia, and Strong's syndrome), Angelman syndrome (15−), and Prader-Willi syndrome (15−).
Additionally or alternatively, the detection of the abnormality of the chromosome copy number may include detection and/or diagnosis of a condition selected from the group consisting of Turner syndrome (Ullrich-Turner syndrome or single chromosome X), Klinefelter syndrome, 47,XXY or XXY syndrome, 48,XXXY syndrome, 49,XXXXY syndrome, Triple X syndrome, XXXX syndrome (also referred to as tetrasomic X, quadruple X, or 48,XXXX), XXXXX syndrome (also referred to as pentasomic X or 49,XXXXX), and XYY syndrome.
In the method for determining the chromosome abnormalities according to the present invention, the threshold line for determining the chromosome aneuploidy is set by the LDA method from existing sequenced data. Accordingly, the more sequenced data is used, the higher the accuracy and sensitivity of the determination, and the accuracy and sensitivity continue to improve as the method is performed repeatedly and data accumulates.
That is, in the method for determining the chromosome abnormalities according to the present invention, the first to third steps for determining the chromosome abnormalities may be performed N times while sequenced data is continuously added. When the chromosome data used at the (N−1)-th determination is referred to as Dn−1 and the chromosome data used at the N-th determination is referred to as Dn, the aneuploidy of the chromosome data Dn at the N-th determination is determined using a threshold value derived from the chromosome data Dn−1 used at the (N−1)-th determination.
The threshold value depends on the specific algorithm used; it may be set as a single value close to the aneuploid data or as two values, so that the determination can be flexibly improved.
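The accumulation scheme described above, in which the N-th determination uses a threshold derived from the data available at the (N−1)-th determination, might be sketched as below. The sketch simplifies heavily: it tracks only a single threshold equal to the minimum verified aneuploid value, assumes that a larger value indicates aneuploidy, and omits the LDA fit itself. The class name, method names, and numbers are hypothetical.

```python
class AccumulatingThreshold:
    """Threshold that is re-derived as verified samples accumulate.

    Assumptions: a higher feature value means 'more aneuploid' (e.g. a
    trisomy), and the threshold is the minimum verified aneuploid value.
    """

    def __init__(self, verified_aneuploid_values):
        self._aneuploid = list(verified_aneuploid_values)
        if not self._aneuploid:
            raise ValueError("at least one verified aneuploid value is required")

    @property
    def threshold(self):
        return min(self._aneuploid)

    def determine(self, value):
        """N-th determination, using the threshold derived from D(n-1)."""
        return value >= self.threshold

    def add_verified(self, value, is_aneuploid):
        """Fold an independently verified sample into the data set D(n)."""
        if is_aneuploid:
            self._aneuploid.append(value)
        # Verified normal samples do not move the threshold in this sketch.


tracker = AccumulatingThreshold([1.40, 1.42])
first_call = tracker.determine(1.39)           # judged against the D(n-1) threshold
tracker.add_verified(1.39, is_aneuploid=True)  # later confirmed by a standard method
second_call = tracker.determine(1.39)          # judged against the updated threshold
```

In this toy run the borderline sample is missed at first and caught after the verified result is added, which is the sense in which sensitivity can improve with accumulated data.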
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by a next-generation sequencing platform. It will be understood by those of ordinary skill in the art that the method for obtaining the sequence data according to the present invention is not limited to any specific technique.
Sequencing platforms are discussed and reviewed in the literature [Loman et al. (2012) Nature Biotechnology 30(5), 434-439]; [Quail et al. (2012) BMC Genomics 13, 341]; [Liu et al. (2012) Journal of Biomedicine and Biotechnology 2012, 1-11]; and [Meldrum et al. (2011) Clin Biochem Rev. 32(4): 177-195]; the sequencing platforms reviewed in this literature are included in the present application by reference.
In the method for determining the chromosome abnormalities according to the present invention, the next-generation sequencing platform is selected from a Roche 454 (i.e., Roche 454 GS FLX), a SOLiD system from Applied Biosystems (i.e., SOLiDv4), GAIIx, HiSeq 2500 and MiSeq sequencers from Illumina, Proton and S5 sequencers of Ion Torrent semiconductor sequencing platforms from Life Technologies, PacBio RS from Pacific Biosciences, and 3730xl from Sanger.
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by a sequencing platform including the use of a polymerase chain reaction.
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by a sequencing platform including the use of sequencing by synthesis.
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by a sequencing platform including the use of ions, for example, hydrogen ion release.
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by a sequencing platform including the use of a semiconductor-based sequencing method. The advantages of the semiconductor-based sequencing method are that the manufacturing cost of devices, chips, and reagents is low, that the sequencing process is rapid (although this is offset by emPCR), and that the system can be scaled, although it may be somewhat limited by the bead size used in the emPCR.
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by a sequencing platform including the use of a nanopore-based sequencing method. The nanopore-based method includes the use of organic-type nanopores that imitate conditions of a cell membrane and a protein channel of living cells, like a technique used by, for example, Oxford Nanopore Technologies (e.g., Literature [Branton D, Bayley H, et al. (2008). Nature Biotechnology 26 (10), 1146-1153]).
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by an Ion Torrent platform from Life Technologies or MiSeq from Illumina. Illumina's sequencing by synthesis (SBS) technique is currently a successful next-generation sequencing platform that is widely adopted worldwide. The TruSeq technique supports massively parallel sequencing using a proprietary reversible terminator-based method that detects single bases as they are incorporated into a growing DNA strand. A fluorescently labeled terminator is imaged as each dNTP is added and is then cleaved to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias.
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by an Ion Torrent personal genome machine (Ion Torrent PGM) from Life Technologies.
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data is obtained by an Ion Torrent platform from Life Technologies, for example, Ion Proton and S5 having PI or PII chips, and multiplex capable iteration based on additional derivative devices and components thereof.
In an additional embodiment, the next-generation sequencing platform is a personal genome machine (PGM), which is the Ion Torrent personal genome machine from Life Technologies. The Ion Torrent device uses a strategy similar to sequencing by synthesis (SBS), but detects signals by the release of hydrogen ions according to the activity of a DNA polymerase during the nucleotide introduction. Essentially, the Ion Torrent chip is a very sensitive pH meter. Each ion chip includes millions of ion-sensitive field effect transistor (ISFET) sensors that allow simultaneous detection of multiple sequencing reactions. The use of the ISFET device is well known to those skilled in the art and may be performed within a range of a technique which may be used to obtain the sequence data required by the method of the present invention. (Prodromakis et al. (2010) IEEE Electron Device Letters 31(9), 1053-1055; Purushothaman et al. (2006) Sensors and Actuators B 114, 964-968; Toumazou and Cass (2007) Phil. Trans. R. Soc. B, 362, 1321-1328; WO 2008/107014 (from DNA Electronics Ltd); WO 2003/073088 (from Toumazou); US 2010/0159461 (from DNA Electronics Ltd); each sequencing method is included in the present application by reference).
In the method for determining the chromosome abnormalities according to the present invention, the sequenced sequence data may or may not be normalized. That is, the method for determining the chromosome abnormalities according to the present invention is not limited to the sequencing method, and can determine the chromosome abnormalities whether or not standardization and normalization of the sequenced sequence data are performed.
The method for determining the chromosome abnormalities according to the present invention is not limited to the sequencing method of a specific automatic sequencing device in the related art or to the normalization method thereof. The method uses the generated sequence information, is applicable to both autosomes and sex-chromosomes, and becomes more accurate and sensitive as the number of diagnoses increases. It can therefore be usefully applied to non-invasive prenatal diagnosis for the early determination of the presence or absence of malformation due to abnormalities in the number of fetal autosomes and sex-chromosomes.
In the method according to the present invention, when a large amount of sequencing data and the corresponding abnormality determination data are accumulated, a precise threshold line can be set by the linear discriminant analysis (LDA) method, thereby obtaining a sensitivity much higher than that of the conventional method.
Hereinafter, the present invention will be described in more detail through Examples. These Examples merely exemplify the present invention, and it is apparent to those skilled in the art that the scope of the present invention is not to be interpreted as limited to these Examples.
Unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by those skilled in the art. In general, the nomenclature used and the experimental methods described below in this specification are well known and commonly used in the art.
Plasma was separated from blood collected from the mother, and a library was prepared by extracting 30 ng or more of cfDNA from the plasma. Adapters were attached for both the Life Tech and the Illumina platforms. Thereafter, size selection was performed by E-gel for the Life Tech equipment and by beads for Illumina, and the libraries were pooled and sequenced.
Sequenced fastq files were sorted and PCR duplication was removed to extract unique reads. Only the perfectly matched reads were sorted, and all the regions in the sorted sequence were divided into 90 kb bin regions and reads with a GC content of 0.35 to 0.55 or less were extracted.
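The read-filtering steps above (duplicate removal, 90 kb binning, GC window) can be illustrated with a small sketch. Several points are assumptions rather than statements from the text: reads are represented as hypothetical `(chromosome, position, sequence)` tuples that have already been aligned and filtered for perfect matches, duplicates are identified by identical chromosome and start position, GC content is evaluated per read, and the 0.35 to 0.55 window is treated as inclusive.

```python
from collections import Counter

BIN_SIZE = 90_000            # 90 kb bins, as in the text
GC_MIN, GC_MAX = 0.35, 0.55  # GC window from the text, treated here as inclusive


def gc_fraction(sequence):
    """Fraction of G and C bases in a read sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def remove_duplicates(reads):
    """Keep the first read seen at each (chromosome, position)."""
    seen = set()
    unique = []
    for chrom, pos, seq in reads:
        if (chrom, pos) not in seen:
            seen.add((chrom, pos))
            unique.append((chrom, pos, seq))
    return unique


def bin_counts(reads):
    """Count de-duplicated, GC-filtered reads per (chromosome, bin index)."""
    counts = Counter()
    for chrom, pos, seq in remove_duplicates(reads):
        if GC_MIN <= gc_fraction(seq) <= GC_MAX:
            counts[(chrom, pos // BIN_SIZE)] += 1
    return counts


# Toy reads for illustration only.
toy_reads = [
    ("chr21", 100, "GCGCATAT"),
    ("chr21", 100, "GCGCATAT"),     # duplicate of the first read, dropped
    ("chr21", 95_000, "ATATATAT"),  # GC content 0.0, filtered out
    ("chr21", 95_010, "GCATGCAT"),
]
counts = bin_counts(toy_reads)
```

If the GC window is instead meant to apply to each 90 kb bin of the reference, the filter would move from the read loop to a per-bin lookup; the text does not settle this.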
A percentage UR(x) % of free reads which are uniquely matched with a chromosome X and a percentage UR(y) % of free reads which are uniquely matched with a chromosome Y represented by the following Formulas were obtained.
UR(x) %=Number of reads of chromosome X (chrX)/total number of (autosomes) reads×100
UR(y) %=Number of reads of chromosome Y (chrY)/total number of (autosomes) reads×100
As shown in Table 1 below, a Y-specific region was set, and the number of reads was calculated based on the Y-specific region, and then when the number of reads was less than 2, it was determined as female and when the number of reads was 2 or more, it was determined as male.
In Table 1 below, the Y-specific region is defined as a pure chrY region obtained by comparing chrX and chrY, removing the pseudoautosomal region, and then removing the chrX region; the Y-specific region is selected as follows. The present invention is characterized in that male and female can be easily discriminated by counting the number of reads in a region mapped to the Y-specific region.
Table 1. Y-specific region: 2,649,520-59,373,566; chrX:154,931,044 = chrY:59,034,050-59,363,566
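The sex-determination rule stated above (fewer than 2 reads in the Y-specific region means female, 2 or more means male) can be written as a short sketch. The function names are illustrative, and the region coordinates in the usage example are toy values chosen for readability, not the regions of Table 1.

```python
def count_reads_in_regions(read_positions, regions):
    """Count chrY read positions that fall inside any (start, end) region."""
    return sum(
        1
        for pos in read_positions
        if any(start <= pos <= end for start, end in regions)
    )


def call_sex(y_specific_read_count, min_male_reads=2):
    """Fewer than 2 Y-specific reads: female; 2 or more: male (rule from the text)."""
    return "male" if y_specific_read_count >= min_male_reads else "female"


# Toy coordinates for illustration only; they are not the regions of Table 1.
toy_regions = [(1_000, 2_000), (5_000, 6_000)]
n_reads = count_reads_in_regions([1_500, 5_500, 9_000], toy_regions)
sex = call_sex(n_reads)
```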
In the present invention, the data identified by the standard method is initially learned using the LDA method, a minimum value of aneuploidic data is extracted as a threshold value, and normal, aneuploidy, and threshold of a target chromosome may be predicted from this.
Conventional methods such as Z-score and NCV of Illumina are typically used, but various normalization algorithms (QDNAseq, HMMcopy, Deeptools, etc.) for normalizing the entire data using low-depth data have been introduced.
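For comparison, the conventional Z-score approach mentioned above is commonly computed as the deviation of a sample's chromosome read fraction from the mean of a euploid reference set, in units of the reference standard deviation. The sketch below is that textbook form, not a description of any vendor's NCV algorithm; the cut-off of 3 is a widely used convention assumed here, and the numbers are invented.

```python
from statistics import mean, stdev


def z_score(sample_fraction, reference_fractions):
    """Standard score of a sample against a euploid reference set."""
    return (sample_fraction - mean(reference_fractions)) / stdev(reference_fractions)


def exceeds_cutoff(sample_fraction, reference_fractions, cutoff=3.0):
    """Flag a sample whose Z-score is above the assumed cut-off."""
    return z_score(sample_fraction, reference_fractions) > cutoff


# Hypothetical chromosome read fractions (percent) for illustration only.
reference = [1.30, 1.31, 1.29, 1.30]
z_elevated = z_score(1.36, reference)
z_typical = z_score(1.30, reference)
```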
In addition, in the method for determining the chromosome abnormalities of the present invention, it is possible to determine chromosome abnormalities without performing a separate normalization process with respect to the sequenced data regardless of a specific platform.
From this, in the case of the method for determining the chromosome abnormalities by the LDA technique according to the present invention, it can be seen that the same result can be obtained without using the known normalization algorithm or the Z-score.
The cases of chr21, chr18 and chr13 are discriminated from the data confirmed by the existing standard method of Example 2, and a minimum value of the aneuploidic data is extracted as a threshold value using the LDA method for each of the chr21, chr18 and chr13 data, thereby predicting and determining normal, aneuploidy, and threshold.
In the method of determining the chromosome abnormalities according to the present invention, that is, by performing the sorting sequence using existing data, performing normalization, and then setting a minimum value of the aneuploidic data selected by the LDA method as a threshold value, results of determining aneuploidy of chromosomes chr21, chr18, and chr13 based on the threshold value were shown in
It has been confirmed that the method for determining the chromosome abnormalities of the present invention is able to be applied not only to the most well-known chr13, chr18, and chr21, but also to other autosome abnormalities.
First, normalization was performed by a conventionally used method on the sequencing data of the three chromosomes chr3, chr7, and chr12, a z-score was calculated from the number of reads, and the results are shown in
It can be seen that in a total of six examples of three chromosomes (chr13, chr18, and chr21) and chr3, chr7, and chr12 among 22 autosomes, the normal and the aneuploidic samples are clearly discriminated. As a result, it can be seen that it is possible to extend the method for determining the chromosome abnormalities according to the present invention to all chromosomes.
With respect to 246 samples, UR.X and UR.Y indicated by the following Formulas were obtained, and the results were shown in
UR(x) %=Number of reads of chromosome X (chrX)/total number of (autosomes) reads×100
UR(y) %=Number of reads of chromosome Y (chrY)/total number of (autosomes) reads×100
In the case of the male sample, as shown in
When more data is accumulated, more learning is performed, so a more precise threshold line can be set; and because the threshold line can be set according to the data type, a much higher accuracy than that of the related art can be obtained.
The results of determining chromosome abnormalities of autosomes and sex-chromosomes by the method for determining the chromosome abnormalities according to the present invention are shown in Table 2 below. It can be seen that the results verified by the existing standard experimental methods and the results determined by the method according to the present invention are the same.
Number | Date | Country | Kind |
---|---|---|---|
10-2016-0007181 | Jan 2016 | KR | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/KR2017/000741 | 1/20/2017 | WO | 00 |