The following description relates to a technique for detecting genomic structural variations.
Genomic variations may be largely divided into sequence variations and structural variations. Structural variations refer to genetic segmental duplication greater than or equal to 1000 base pairs (bp; length of nucleic acid), copy number variation, translocation, inversion, insertion, or deletion.
Recently, along with the development of next-generation sequencing (NGS), techniques for discovering structural variations using sequence fragments (reads) generated by a sequencing apparatus have been introduced. For sequence variation analysis, various efficient algorithms have emerged based on large-scale sequence data. On the other hand, structural variation prediction, which has much higher complexity, has no market-dominant algorithm or program in terms of performance and speed.
Prediction of structural variations in cancer and major diseases is clinically urgent. In particular, as medical insurance is applied to the use of cancer genome panels in Korea, next-generation sequence data is being produced from a large number of cancer patients. However, a technique for predicting or classifying cancer-related structural variations is not supported.
Conventional commercial genomic structural variation analysis programs have limitations in detecting various types of structural variations. For example, BreakDancer is limited in detecting an insertion type because a structural variation is predicted using only information on discordant paired-end reads. Furthermore, the conventional analysis programs have a problem (false positive or false negative) in which a sequence difference due to racial differences is misinterpreted as a sequence associated with a structural variation because genome sequence differences (SNP) between individuals are not considered.
The following description is intended to provide a technique of detecting all types of structural variations through NGS-based analysis. Also, the following description is intended to provide a technique of detecting genomic structural variations in consideration of a genomic sequence difference due to racial differences.
A method of detecting a genomic structural variation based on a multi-reference genome includes receiving sample sequence data by a computer apparatus, comparing, by the computer apparatus, the sample sequence data to multi-reference genome data to determine at least one k-mer read that is not included in the multi-reference genome among reads of the sample sequence data, determining, by the computer apparatus, a breakpoint and a candidate region of a structural variation by mapping the at least one k-mer read to standard reference genome data, and predicting, by the computer apparatus, a structural variation type for the sample sequence data on the basis of a sequence mapping pattern and the breakpoint corresponding to the mapping result.
An apparatus for detecting a genomic structural variation based on a multi-reference genome includes an input device configured to receive sample sequence data, a storage device configured to store multi-reference genome data, standard reference genome data, and a program for comparing the multi-reference genome data and the standard reference genome data to the sample sequence data and predicting a structural variation type for the sample sequence data, and a computing device configured to compare the multi-reference genome data and the sample sequence data to determine at least one k-mer read that is not included in the multi-reference genome among reads of the sample sequence data and configured to predict the structural variation type on the basis of a sequence mapping pattern and a breakpoint determined by mapping the at least one k-mer read to the standard reference genome data.
With the technique described below, it is possible to effectively detect various structural variations using a complex mapping technique. Also, with the technique described below, it is possible to solve an erroneous detection problem due to sequence differences between races by using a multi-reference genome in the detection of the genomic structural variation. The technique described below is a genome analysis technique usable for NGS-based cancer diagnosis panels, whole-genome sequencing (WGS), whole-exome sequencing (WES), and targeted panel sequencing (TPS). Furthermore, with the technique described below, it is possible to detect NGS-based germ cell-related structural variations (hereditary) and somatic cell-related structural variations (non-hereditary).
As the following description may be variously modified and have several example embodiments, specific embodiments will be shown in the accompanying drawings and described in detail below. It should be understood, however, that there is no intent to limit the following description to the particular forms disclosed, but on the contrary, the following description is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.
Analysis techniques and terms used herein will be described below.
The NGS-based analysis includes a single-end library method and a paired-end library method. In general, the paired-end technique is more useful for discovering genomic structural variations because two sequence fragments of a sample genome specimen are mapped and compared to a reference genome sample.
A paired-end mapping (PEM)-based structural variation detection technique uses paired-end reads. Two paired reads generated from a genome (case) to be analyzed carry information on their distance from each other. For reference, in genome analysis, a patient group is generally marked “case,” and a normal group is marked “control.” When the two reads are mapped to a reference genome whose sequence is already known, a structural variation is detected by computing the difference between the distance at which the reads actually map on the reference genome and their expected distance in the case. In this case, since the reads are mapped to the reference genome in consideration of both forward and reverse directions, inversion detection is possible. PEM-based techniques that find and analyze paired reads support much higher resolution than single-end mapping-based methods. The PEM-based structural variation detection technique analyzes a form in which two reads are mapped. The form or feature in which two reads are mapped may also be referred to as a signature. Genomic structural variations are detected using the types and mapping forms of such signatures.
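The signature idea described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the `ReadPair` record, the `classify_pair` function, the expected insert size of 400 bp, and the tolerance of 100 bp are all assumptions chosen for illustration, and the rules are deliberately simplified (for example, everted pairs that suggest tandem duplication are not handled).

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical record for one mapped read pair (illustrative only)."""
    chrom1: str
    pos1: int      # leftmost mapped position of read 1 on the reference
    strand1: str   # '+' or '-'
    chrom2: str
    pos2: int      # leftmost mapped position of read 2 on the reference
    strand2: str


def classify_pair(pair, expected_insert=400, tolerance=100):
    """Return a coarse signature label for one read pair.

    expected_insert and tolerance are assumed library parameters.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation-like"
    # A normal pair maps to opposite strands; same-strand mapping
    # suggests that one read lies inside an inverted segment.
    if pair.strand1 == pair.strand2:
        return "inversion-like"
    span = abs(pair.pos2 - pair.pos1)
    # Reads mapping farther apart than expected: the case genome lacks
    # sequence that is present in the reference.
    if span > expected_insert + tolerance:
        return "deletion-like"
    # Reads mapping closer than expected: the case genome carries
    # extra sequence between the two reads.
    if span < expected_insert - tolerance:
        return "insertion-like"
    return "concordant"


if __name__ == "__main__":
    pair = ReadPair("chr1", 1000, "+", "chr1", 4000, "-")
    print(classify_pair(pair))
```

A pair labeled "concordant" carries no structural variation signal; all other labels are candidate signatures that would normally be confirmed by several supporting pairs.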
It may be more effective to detect a structural variation using a plurality of signatures than to compute a location where a structural variation has occurred using one signature. A clustering technique classifies (clusters) a plurality of signatures and computes a location of a structural variation that is representative of one cluster. The clustering technique can improve the reliability of prediction by removing a portion that is accidentally mapped. At this time, locations of both ends where a variation has occurred are called breakpoints. The clustering technique may be classified into several techniques depending on a signature determination method and an actual breakpoint computation method. For example, the clustering technique includes a standard clustering approach, a soft clustering approach, and a distribution-based clustering approach.
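As a rough illustration of clustering, the following Python sketch groups signatures whose left and right positions lie close together, discards clusters with too little support, and reports one representative breakpoint pair per cluster. The function name `cluster_signatures`, the `max_gap` and `min_support` thresholds, and the use of the median as the representative location are assumptions; the description does not prescribe a specific clustering rule.

```python
from statistics import median


def cluster_signatures(signatures, max_gap=200, min_support=2):
    """Group (left, right) signature positions into clusters.

    signatures: iterable of (left, right) reference positions, one per
    discordant read pair. max_gap and min_support are assumed values.
    Returns a list of (left_breakpoint, right_breakpoint, support).
    """
    clusters = []
    for left, right in sorted(signatures):
        if clusters:
            last_left, last_right = clusters[-1][-1]
            # Extend the current cluster only if both ends stay close
            # to the most recently added signature.
            if left - last_left <= max_gap and abs(right - last_right) <= max_gap:
                clusters[-1].append((left, right))
                continue
        clusters.append([(left, right)])

    results = []
    for cluster in clusters:
        # Clusters with too few signatures are treated as accidental mappings.
        if len(cluster) < min_support:
            continue
        left_bp = median(l for l, _ in cluster)
        right_bp = median(r for _, r in cluster)
        results.append((left_bp, right_bp, len(cluster)))
    return results


if __name__ == "__main__":
    sigs = [(1000, 5000), (1020, 5030), (1050, 4990), (90000, 120000)]
    print(cluster_signatures(sigs))
```

In this example the isolated signature at (90000, 120000) is dropped because it has no supporting neighbors, which corresponds to removing a portion that is accidentally mapped.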
There are also analysis methods different from the PEM technique. For example, there is a technique for detecting a structural variation based on depth of coverage (DOC). However, the DOC-based analysis method has difficulty in detecting a signature in a small area and has limitations in determining breakpoints.
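The limitation of the DOC-based approach can be seen in a small Python sketch. It computes the mean read depth per fixed-size window and flags windows whose depth deviates from the median. The function names, the window size, and the `low` and `high` ratio thresholds are assumptions for illustration; real DOC tools use statistical models that are not reproduced here.

```python
from statistics import median


def window_depths(read_starts, read_len, region_len, window):
    """Mean per-base depth in consecutive windows of a region."""
    depth = [0] * region_len
    for start in read_starts:
        for pos in range(max(start, 0), min(start + read_len, region_len)):
            depth[pos] += 1
    means = []
    for w in range(0, region_len, window):
        chunk = depth[w:w + window]
        means.append(sum(chunk) / len(chunk))
    return means


def flag_windows(means, window, low=0.75, high=1.5):
    """Flag windows whose depth ratio to the median is unusually low or high.

    low and high are assumed thresholds, not values from the description.
    """
    baseline = median(means)
    calls = []
    if baseline == 0:
        return calls
    for i, mean_depth in enumerate(means):
        ratio = mean_depth / baseline
        if ratio < low:
            calls.append((i * window, (i + 1) * window, "loss"))
        elif ratio > high:
            calls.append((i * window, (i + 1) * window, "gain"))
    return calls


if __name__ == "__main__":
    starts = [0, 0, 10, 10, 20, 30, 30]
    means = window_depths(starts, read_len=10, region_len=40, window=10)
    print(flag_windows(means, window=10))
```

Because a call is reported only as a window interval, the breakpoint can be located no more precisely than the window size, and a variation much smaller than the window barely changes the window mean.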
Meanwhile, there are commercial programs that detect genomic structural variations based on NGS. For example, the programs include MoDIL, SegSeq, PEMer, VariationHunter, Pindel, BreakDancer, the ABI SOLiD software tools, etc. The tools differ in terms of detectable signatures, a clustering method for detecting signatures, or a method of constructing and processing a window.
For convenience of description, it is assumed that the NGS-based genome analysis technique uses PEM. However, the method of detecting structural variations, which will be described below, is not limited to a specific genome analysis methodology.
Sample data, sample sequence data, or sample genome data refer to genome data of a target to be analyzed. For example, the sample sequence data may be genome data of a patient with a specific disease. The sample data may be genome data of a cancer patient (suspect). The sample sequence data is a result of the NGS apparatus analyzing the sequence. Accordingly, the sample sequence data has an NGS analysis data format. For example, the sample sequence data may be a file in a format such as “fastq.”
Reference data, reference sequence data, or reference genome data refer to data to be compared for analysis of the sample sequence data. A structural variation in the sample sequence data may be detected by comparing the difference between the sample sequence data and the reference genome data. The reference genome data is data prepared in advance through experimental results. As will be described later, there are pieces of reference genome data for various races. Also, the pieces of reference genome data differ from each other in terms of completeness. Reference genome data completed by many research institutes over a long period of time has a high degree of completeness. Here, the completeness may be a ratio (proportion) of a sequenced portion to the whole genome sequence. When there are many sequenced parts, it can be said that the degree of completion is relatively high. There is a piece of reference genome data having a degree of completion greater than or equal to a specific reference value. For example, here, the reference value may be 90%.
Standard genome data has a similar meaning to the reference genome data. However, the standard genome data is basically defined as single reference genome data published through research. For example, genome data such as hg19 may be standard genome data.
Multi-reference genome data is a reference genome data set constructed with a plurality of pieces of reference genome data. The multi-reference genome data may be constructed using comparative data (dbSNP, etc.) and filtering out analysis errors and reference genomes of various races. The multi-reference genome data will be described below.
The following description assumes that genomic structural variations are analyzed through a computer apparatus. A computer apparatus refers to a device that can calculate and process certain data, such as a personal computer (PC), a smart device, and a network server. The computer apparatus that analyzes genomic structural variations may be referred to as a structural variation detection apparatus. The computer apparatus and the structural variation detection apparatus will be described below. For convenience of description, the following description assumes that a computer apparatus performs each process of the analysis of the genomic structural variations.
First, the construction of the multi-reference genome data will be described. The multi-reference genome data should be prepared prior to the analysis of sample sequence data. The multi-reference genome data is prepared by the computer apparatus processing certain data.
(1) Basically, the multi-reference genome data includes reference genomes of a plurality of races. For example, the multi-reference genome data includes hg19, hg38, HuRef, NA12878, KOREF(1.0), AK1, YH(1.0), HX(1.1), Mongolian genome, Japanese genome(v2), and the like. The reference genome data of the plurality of races is intended to resolve interpretation errors occurring due to a sequence difference between races.
(2) Furthermore, the multi-reference genome data may further include dbSNP(INDEL), dbSNP(SNP), and reference genomes produced by users. dbSNP(INDEL) and dbSNP(SNP) are intended to resolve interpretation errors due to a sequence difference between individuals. The data may be referred to as data for filtering genomes.
The multi-reference genome data is constructed with a plurality of pieces of genome information, and a data structure for managing a plurality of pieces of genome data is necessary. To this end, the multi-reference genome data is composed of k-mers of the dbSNP data and the reference genomes of the plurality of races. Furthermore, the multi-reference genome data may be expressed as a hash table for a great deal of k-mers. For example, the multi-reference genome data may use, as a data structure, a hash table structure such as Sparsepp/KMC.
(3) Meanwhile, the multi-reference genome data may additionally use normal sequence data (NGS analysis result data of normal people). As an NGS analysis result, the normal sequence data may be data in a format such as fastq. When normal sequence data is available, k-mers of the normal sequence data are added to the hash table constructed with k-mers of the dbSNP data and the reference genomes of the plurality of races. Here, k is a natural number of a certain size. For example, k may be 31.
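The construction of the k-mer hash table can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a built-in Python `set` stands in for a dedicated hash table such as Sparsepp/KMC, the sequences are passed as plain strings rather than read from genome or fastq files, and the function names `kmers` and `build_multi_reference_index` are hypothetical. Skipping k-mers that contain the ambiguous base N is also an assumption.

```python
def kmers(seq, k):
    """Yield every k-mer of seq, skipping k-mers with ambiguous bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def build_multi_reference_index(sequences, k=31):
    """Collect the k-mers of all input sequences into one set.

    sequences may mix reference genomes of several races, sequences
    carrying dbSNP variants, and reads from normal samples.
    """
    index = set()
    for seq in sequences:
        index.update(kmers(seq, k))
    return index


if __name__ == "__main__":
    # k=4 is used only to keep the toy example readable; the
    # description gives k=31 as an example value.
    index = build_multi_reference_index(["ACGTACGT", "ACGTTCGT"], k=4)
    print(len(index), "GTTC" in index, "AAAA" in index)
```

A set gives constant-time membership tests on average, which is the property the filtering step relies on; at whole-genome scale a memory-efficient structure would be needed instead.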
The computer apparatus receives sample sequence data to be analyzed (120). The sample sequence data is an NGS analysis result. The sample sequence data may be in a format such as fastq. The sample sequence data may be a genome analysis result for a patient or a suspected patient (hereinafter referred to as a user). The sample sequence data includes sequence analysis data derived from a user's diseased tissue (e.g., a tumor). Also, the sample sequence data may include sequence analysis data derived from a user's blood. The sample sequence data may include all the sequence analysis data derived from the user's tissue and blood.
By using a hash table of the constructed multi-reference genome data, the computer apparatus determines whether a sample sequence data read is present in the hash table (130). This process may be referred to as a process of filtering the sample sequence data using the multi-reference genome data. The computer apparatus may determine that a k-mer read in the hash table among reads of the sample sequence data is a part having no structural variation (yes in 130). On the contrary, the computer apparatus may analyze the type of structural variation on the basis of a k-mer read that is not present in the hash table among the reads of the sample sequence data (no in 130).
The computer apparatus detects the k-mer read that is not included in the hash table among the reads of the sample sequence data (140). Among the reads of sample sequence data, the k-mer read that is not included in the hash table is hereinafter referred to as a target k-mer read.
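The filtering in steps 130 and 140 can be sketched as a membership test against the k-mer table. The sketch below makes one interpretive assumption that the description does not state explicitly: a read is kept as a target when at least one of its k-mers is absent from the table. The function name `find_target_reads` and the use of a Python `set` as the table are likewise illustrative.

```python
def find_target_reads(reads, index, k=31):
    """Return reads that contain at least one k-mer absent from index.

    reads: iterable of read sequences (strings).
    index: set of k-mers built from the multi-reference genome data.
    Returns a list of (read, novel_kmers) tuples.
    """
    targets = []
    for read in reads:
        read = read.upper()
        novel = [
            read[i:i + k]
            for i in range(len(read) - k + 1)
            if read[i:i + k] not in index
        ]
        # Reads fully explained by the table are treated as carrying
        # no structural variation and are filtered out.
        if novel:
            targets.append((read, novel))
    return targets


if __name__ == "__main__":
    index = {"ACGT", "CGTA", "GTAC", "TACG"}
    reads = ["ACGTAC", "ACGGAC"]
    for read, novel in find_target_reads(reads, index, k=4):
        print(read, novel)
```

Only the second read survives the filter here, because every k-mer of the first read is already present in the table.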
Then, the computer apparatus compares the target k-mer read to other reference genome data (150). The computer apparatus maps the target k-mer read to standard reference data (150). In this case, the standard reference data may use one piece of reference genome data with a high degree of completeness. For example, hg19 or hg38 may be used as the standard reference data. Alternatively, when the user is of a specific race, reference data of the corresponding race may be used. For example, in the case of structural variation analysis for a Korean user, KOREF may be used as the standard reference data. Furthermore, in some cases, the standard reference data may be composed of one or more pieces of reference data. Here, it is assumed that hg19, which is a piece of reference genome data with a relatively high degree of completeness, is used.
The computer apparatus maps the target k-mer read to hg19. The computer apparatus predicts a structural variation type for a sample on the basis of a result of the mapping to the standard reference data (e.g., hg19) (160). The computer apparatus may calculate a breakpoint list by mapping the target k-mer read to the standard reference data. Also, the computer apparatus may calculate a sequence matching result (signature) by mapping the target k-mer read to the standard reference data. Finally, the computer apparatus may predict a structural variation type for the sample sequence data on the basis of the breakpoint list and a feature, form, or pattern (signature) of the sequence matching. A criterion for predicting the structural variation type using breakpoints and the sequence mapping result may be similar to that of conventional structural variation detection techniques. All of the structural variation types may be predicted using the breakpoints and the sequence mapping result.
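How a breakpoint and a mapping pattern can lead to a type prediction is sketched below in Python. The sketch assumes that a target read has already been aligned to the standard reference as two segments by some external aligner; the `Segment` record, the `predict_sv` function, and the decision rules are illustrative assumptions in the spirit of conventional split-read criteria, not the criterion of the present disclosure. Coordinates are assumed to be 0-based and half-open, and same-strand segments are assumed to be reported in forward-strand orientation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical aligned piece of one read (illustrative only)."""
    chrom: str
    ref_start: int   # 0-based, inclusive
    ref_end: int     # 0-based, exclusive
    strand: str      # '+' or '-'
    read_start: int  # offset of the segment within the read
    read_end: int


def predict_sv(first, second):
    """Predict a structural variation type from two segments of one read.

    first and second are ordered by their position within the read.
    Returns (type, breakpoints).
    """
    breakpoints = ((first.chrom, first.ref_end), (second.chrom, second.ref_start))
    if first.chrom != second.chrom:
        return "translocation", breakpoints
    if first.strand != second.strand:
        return "inversion", breakpoints
    # The read continues at a reference position before the end of the
    # first segment: the same reference sequence is used twice.
    if second.ref_start < first.ref_end:
        return "duplication", breakpoints
    ref_gap = second.ref_start - first.ref_end
    read_gap = second.read_start - first.read_end
    if ref_gap > read_gap:
        return "deletion", breakpoints    # reference bases missing from the read
    if read_gap > ref_gap:
        return "insertion", breakpoints   # read bases missing from the reference
    return "none", breakpoints


if __name__ == "__main__":
    a = Segment("chr1", 100, 150, "+", 0, 50)
    b = Segment("chr1", 1150, 1200, "+", 50, 100)
    print(predict_sv(a, b))
```

The returned breakpoints correspond to one entry of the breakpoint list, and the branch that fires corresponds to the mapping pattern (signature) used for the prediction.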
The effects of this technique according to the present disclosure were compared to those of conventional predictive programs. The experiment results (data) for the structural variation detection technique of the present disclosure are indicated in black. The structural variation detection technique of the present disclosure uses a result of mapping to the standard reference genome after the k-mer filtering. The experiment results below are intended to determine whether an accurate structural variation type can be predicted through this process. It is verified whether it is possible to effectively detect various types of structural variations using the structural variation detection technique of the present disclosure. A total of 555 types of structural variations such as deletions, inversions, translocations, and duplications were used to conduct performance tests together with commercial programs.
The structural variation detection apparatus 200 includes a storage device 210, a memory 220, a computing device 230, an interface device 240, and a communication device 250.
The communication device 250 refers to a component for receiving and transmitting certain information through a wired or wireless network. The communication device 250 may receive sample sequence data, multi-reference genome data, or data for constructing multi-reference genome data (a plurality of pieces of reference genome data, dbSNP data, etc.) from an external object. The communication device 250 may receive certain data from a user terminal, an NGS analysis device, an NGS analysis server, etc. The communication device 250 may transmit structural variation type analysis results to a user terminal, a separate server, or the like.
The storage device 210 may store a program (code) for implementing the above-described structural variation analysis technique. The storage device 210 may store the multi-reference genome data, the sample sequence data, etc. The memory 220 may store information received by the structural variation detection apparatus 200 or data temporarily generated according to the operation of the computing device 230.
The interface device 240 is a device for receiving a certain instruction from an external user. The interface device 240 may receive a program or data basically required for operation of the structural variation detection apparatus 200 from an input device or an external storage device that is physically connected to the interface device 240. For example, the interface device 240 may receive sample sequence data to be analyzed. Also, the interface device 240 may receive the multi-reference genome data. Also, the interface device 240 may receive various pieces of reference data to construct the multi-reference genome data.
The communication device 250 and the interface device 240 are devices that receive certain data or instructions from the outside. The communication device 250 and the interface device 240 may be referred to as input devices.
The computing device 230 may generate multi-reference genome data using data input from an input device or data stored in the storage device 210. The computing device 230 may compare the multi-reference genome data and the sample sequence data and determine at least one target k-mer read that is not included in the multi-reference genome data among reads of the sample sequence data. The computing device 230 may predict a structural variation type on the basis of breakpoints and a candidate region of the structural variation determined by mapping the at least one target k-mer read to the standard reference genome data. The computing device 230 may be a device for processing data and performing certain computations, such as a processor, an application processor (AP), and a chip with an embedded program.
The user may request the service server 350 to analyze genomic structural variations through the user terminal 310. The user may receive sample sequence data from a sample DB 330. The sample DB 330 stores an NGS analysis result for a specific user. The sample DB 330 may be an object located in a network. Alternatively, the sample DB 330 may be a simple storage medium. The user delivers the sample sequence data to the service server 350 through the user terminal 310. When receiving the analysis request including the sample sequence data, the service server 350 predicts a structural variation type for the sample sequence data through the above-described process. It is assumed that the service server 350 constructs the multi-reference genome data for analysis and acquires the standard reference genome data in advance. The service server 350 may receive the reference genome data from a reference genome DB 360. The service server 350 may receive SNP and INDEL data from dbSNP 370. The service server 350 may construct the multi-reference genome data using the dbSNP data and a plurality of pieces of reference genome data by the above-described method. The service server 350 may transmit a generated structural variation analysis result to the user terminal 310. Alternatively, although not shown, the service server 350 may store the structural variation analysis result in a separate storage medium or may deliver the structural variation analysis result to a separate object.
In the NGS analysis process, the user may deliver the sample sequence data to the service server 350 through the user terminal 320. The user terminal 320 may receive the sample sequence data from the NGS analysis apparatus. When receiving the analysis request including the sample sequence data, the service server 350 predicts a structural variation type for the sample sequence data through the above-described process. It is assumed that the service server 350 constructs the multi-reference genome data for analysis and acquires the standard reference genome data in advance. The service server 350 may transmit a generated structural variation analysis result to the user terminal 320. Alternatively, although not shown, the service server 350 may store the structural variation analysis result in a separate storage medium or may deliver the structural variation analysis result to a separate object.
Also, the above-described genomic structural variation detection method may be implemented using a program (or application) including an executable algorithm that may be executed by a computer. The program may be stored and provided in a non-transitory computer-readable medium.
The non-transitory computer-readable medium refers to a medium that semi-permanently stores data and is readable by a device, rather than a medium that temporarily stores data, such as a register, a cache, or a memory. Specifically, the above-described various applications or programs may be provided while being stored in a non-transitory computer-readable medium such as a compact disc (CD), a digital versatile disc (DVD), a hard disk, a Blu-ray disc, a Universal Serial Bus (USB) drive, a memory card, a read-only memory (ROM), etc.
The above embodiments and drawings attached to the present specification are merely intended to clearly describe part of the technical spirit included in the present invention, and it is apparent that all modifications and detailed embodiments that can be easily derived by those skilled in the art within the scope of the technical spirit included in the specification and the drawings of the present invention are included in the scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
10-2018-0116410 | Sep 2018 | KR | national |
10-2018-0139875 | Nov 2018 | KR | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/KR2018/014079 | 11/16/2018 | WO | 00 |