This application claims the benefit of Indian Patent Application No. 2510/CHE/2015, filed on May 19, 2015, in the Office of the Controller General of Patents, Designs, and Trademarks of India, and Korean Patent Application No. 10-2016-0025763, filed on Mar. 3, 2016, in the Korean Intellectual Property Office, the disclosures of which are incorporated herein in their entireties by reference.
The present disclosure relates to data compression in next generation sequencing (NGS) in general, and more particularly to a mechanism for generating a pileup file from compressed NGS genomic data.
2. Description of the Related Art
In computational biology, next generation sequencing (NGS) refers to new, high-throughput technology for sequencing DNA or RNA. NGS may be used to analyze the genome of an individual or a collection of individuals to, for example, comprehensively catalog genetic variation in population samples. NGS-based diagnostics may have a significant impact on prescribing effective treatment to an individual. Such personalization is often based on a set of mutations obtained from analyzing an individual's DNA data through an NGS analysis pipeline. The mutations that characterize the individual's disease help clinicians tailor therapy to that individual. Typically, NGS methods amplify the DNA molecule being sequenced and divide the replicates into smaller strands called reads (made up of a few tens to a few thousands of contiguous base pairs). These reads are sequenced, and the output, consisting of the unaligned NGS sequence reads and their quality data, is stored in a FASTQ file.
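As a non-limiting illustration of the FASTQ output mentioned above, the following sketch reads records in the common four-line FASTQ layout (header, sequence, separator, quality). The names `FastqRecord` and `parse_fastq` are hypothetical, and the Phred+33 quality encoding is an assumption that is not stated in this disclosure.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(text: str, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry.

    Assumes the four-line layout and a Phred+33 quality encoding.
    """
    # Blank lines are dropped so trailing newlines do not break the grouping.
    lines = [line for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character maps to one integer score per base.
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)
```

Each yielded record pairs a read with one quality score per base, which is the information that is later carried into the SAM fields.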
To analyze the genomic data obtained through NGS sequencing, the reads are first aligned to a reference (a reference standard of the genomic data, which may be understood as the genomic data that represents each species) and then stored in a Sequence Alignment Map (SAM) file. For each read, the SAM file has multiple fields such as the read sequence, quality values, a read-level quality value, an alignment location relative to the reference, and a Compact Idiosyncratic Gapped Alignment Report (CIGAR) string. The CIGAR string contains a representation of differences between the read and the reference. The SAM file may range from several megabytes (MBs) to gigabytes (GBs) in size. Analysis of the genomic data requires steps such as variation calling, which in turn requires a pileup file of the genomic data to be analyzed.
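As a non-limiting illustration of the CIGAR string described above, the following sketch splits a CIGAR string into (length, operation) pairs and counts how many reference bases the alignment covers. The operation letters follow the SAM format; the function names `parse_cigar` and `reference_span` are hypothetical.

```python
import re
from typing import List, Tuple

# Operation letters defined by the SAM format for CIGAR strings.
_CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
_CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '3S5M2I4M2D' into (length, op) pairs."""
    tokens = _CIGAR_TOKEN.findall(cigar)
    # Rejoining the tokens must reproduce the input, else it had stray text.
    if "".join(f"{length}{op}" for length, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in tokens]


def reference_span(cigar: str) -> int:
    """Return the number of reference bases covered by the alignment."""
    return sum(
        length for length, op in parse_cigar(cigar) if op in _CONSUMES_REFERENCE
    )
```

Soft clips and insertions are present in the read but do not advance the reference position, which is why they are excluded from the span.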
In general, analysis or processing of compressed genomic data, such as pileup file generation and variation calling, is performed on a binary SAM file called a Binary Alignment Map (BAM) file. The BAM file size may range from a few MBs to a few GBs. When the SAM data is compressed, in order to perform variation calling and generate a pileup file, the compressed genomic data of the SAM file is first decompressed and then converted into the BAM format before pileup generation and variation calling are invoked. This consumes a large amount of memory space and processing power, thereby increasing the time required for genomic data analysis.
Provided are a method and device for generating a pileup file from a reference based compression file that compresses next generation sequencing (NGS) read information relative to a reference. The pileup file is generated by decompressing one or more reads from the reference based compression file. The decompression is partial decompression.
Also provided is a method of partially decompressing one or more reads by obtaining differential strings corresponding to each of the reads.
Also provided is a method of generating a pileup file by decoding the differential strings using one or more conversion rules.
Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments.
According to an embodiment, a method of generating a pileup file includes receiving a reference based compression file including a plurality of pieces of read data that are compressed; partially decompressing the plurality of pieces of read data to acquire a differential string associated with the plurality of pieces of read data; and generating the pileup file by decoding the differential string based on a plurality of conversion rules.
According to another embodiment, an apparatus for generating a pileup file includes a memory configured to store at least one instruction; and a processor configured to execute the at least one instruction stored in the memory. The processor is configured to receive a reference based compression file including a plurality of pieces of read data that are compressed, partially decompress the plurality of pieces of read data to acquire a differential string associated with the plurality of pieces of read data, and generate the pileup file by decoding the differential string based on a plurality of conversion rules.
According to yet another embodiment, a non-transitory computer-readable recording medium has recorded thereon a program which, when executed by a computer, performs a method of generating a pileup file, the method including receiving a reference based compression file including a plurality of pieces of read data that are compressed; partially decompressing the plurality of pieces of read data to acquire a differential string associated with the plurality of pieces of read data; and generating the pileup file by decoding the differential string based on a plurality of conversion rules.
These and/or other aspects will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings in which:
The terms used herein are, as far as possible, general terms in current wide use, selected in consideration of the functions in the present disclosure, but they may vary depending on the intentions of one of ordinary skill in the art, legal practice, the appearance of new technologies, etc. In some cases, terms arbitrarily selected by the applicant are also used, and in such cases, their meaning will be described in detail. Thus, it should be noted that the terms used in the specification should be understood not based on their literal names but by their given definitions and descriptions as used throughout the specification.
It will be further understood that the terms “comprises” and/or “comprising” used herein specify the presence of stated features or components, but do not preclude the presence or addition of one or more other features or components. In addition, the terms such as “unit,” “-er(-or),” and “module” described in the specification refer to an element for performing at least one function or operation, and may be implemented in hardware, software, or the combination of hardware and software.
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In this regard, the present embodiments may have different forms and should not be construed as being limited to the descriptions set forth herein. Accordingly, the embodiments are merely described below, by referring to the figures, to explain aspects. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list.
In the embodiments below, a method and device for generating a pileup file (pileup string) from a reference-based compression file will be described. The reference-based compression file is typically produced by reference-based next generation sequencing (NGS) data compression, in which sequencing is a process of determining the nucleotide order of a given DNA string. The reference-based compression file includes a plurality of pieces of NGS read data (reads). The NGS read data is represented as a differential string relative to a reference sequence. The read data includes spliced overlapping fragments of an amplified DNA strand or string that is to be sequenced. After sequencing, the read data is aligned to matching locations on the reference. The differential string provides difference information of the read with respect to the reference sequence where the read data aligns. The differential string is encoded along with other components of a Sequence Alignment Map (SAM) file for the read data which include, but are not limited to, quality values associated with a base of the read data as well as quality values (quality vectors) of the read data. In the reference-based compression file produced by the reference-based NGS data compression, the read data is completely defined by the reference sequence and the differential string.
A method of generating the pileup file according to an embodiment includes obtaining the differential string corresponding to the read data. The generating of the pileup file from the differential string may be performed using one or more conversion rules. Also, the generated pileup file may be used for a plurality of applications. For example, the pileup file may be used in variation calling.
Current pileup file generation methods require complete decompression or reconstruction of compressed read data in a SAM file and conversion to a stored BAM format. However, the method and device of generating the pileup file according to an embodiment do not require the complete decompression or reconstruction of the read data. The pileup file is generated by decompressing compression information to obtain a differential string for every read from a reference based compression file. The differential string may be obtained by partial decompression of read data. Compared to complete decompression or reconstruction of the read data, partial decompression of the read data may provide higher time efficiency for pileup file generation and reduce space complexity.
In an embodiment, usage of a reference-based compression mechanism that compresses the entire genome with random access may provide partial decompression of selective regions only. Also, even when the entire genome data has to be accessed from the reference-based compression file, less memory may be utilized in comparison to current methods that use an equivalent BAM file. This is due to efficient compression and partial decompression of the read data.
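As a non-limiting illustration of random access with partial decompression of selective regions, the following sketch groups per-read differential strings into position bins, compresses each bin separately, and opens only the bins that overlap a queried region. The binning scheme, the use of `zlib` as a stand-in for the entropy coder, and the names `build_index` and `query` are assumptions made for this sketch and are not the compression format of the embodiments.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (alignment position, differential string)


def build_index(records: List[Record], span: int) -> Dict[int, bytes]:
    """Compress records in independent bins of `span` reference positions."""
    bins: Dict[int, List[str]] = defaultdict(list)
    for position, diff in records:
        bins[position // span].append(f"{position}\t{diff}")
    return {
        bin_id: zlib.compress("\n".join(lines).encode())
        for bin_id, lines in bins.items()
    }


def query(index: Dict[int, bytes], span: int, lo: int, hi: int) -> Tuple[List[Record], int]:
    """Return records starting in [lo, hi] and the number of bins opened.

    Only bins overlapping the region are decompressed. Reads that start
    in an earlier bin but extend into the region are ignored here.
    """
    hits: List[Record] = []
    opened = 0
    for bin_id in range(lo // span, hi // span + 1):
        blob = index.get(bin_id)
        if blob is None:
            continue
        opened += 1
        for line in zlib.decompress(blob).decode().splitlines():
            position_text, diff = line.split("\t", 1)
            if lo <= int(position_text) <= hi:
                hits.append((int(position_text), diff))
    return hits, opened
```

In this sketch, a query over one region touches a single compressed bin, so the memory in use scales with the region rather than with the whole file.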
Hereinafter, the present embodiments will be described with reference to
In an embodiment, the network 104 may be a wireless network, a wired network, or a combination thereof. The network 104 may be implemented as one of different types of networks, such as an intranet, a local area network (LAN), a wide area network (WAN), the Internet, and the like.
The device 102 may generate the pileup file by partially decompressing the plurality of pieces of read data included in the reference-based compression file. Also, the device 102 may generate the pileup file by decoding the differential string of each piece of the read data based on a plurality of conversion rules. The conversion rules will be described later with reference to
The generated pileup file may be used for a plurality of applications including, but not limited to, variation calling.
The device 102 according to an embodiment may generate the pileup file based on partial decompression of the reads by performing the method of generating the pileup file including operations 302 to 306.
Various operations, actions, blocks, steps, and the like of the method of generating the pileup file may be performed in the aforementioned order, in a different order, or simultaneously. According to another embodiment, some operations, actions, blocks, steps, and the like may be omitted, added, modified, and the like without departing from the scope of the inventive concept.
The reference 402a is used as a reference sequence for the read 404a to generate a reference-based compression file. Difference information between the reference 402a and the read 404a is stored in the reference-based compression file as the differential string 408a in addition to other read related information. In the reference-based NGS data compression, difference information may be encoded, thereby saving a considerable amount of memory. A position of a difference may be encoded using differential offsets. A type of variation between the read and the reference may be encoded followed by a nucleotide sequence (except in the case of a deletion operation). An entropy coder (e.g., an arithmetic coder) may be used for compressing each of the above parameters. The quality values may be compressed using methods relevant to compressing a set of symbols such as those used in compression of unaligned NGS data (e.g., in a FASTQ file).
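As a non-limiting illustration of encoding difference positions using differential offsets, the following sketch stores, for each difference, the number of reference bases between the end of the previous difference and the start of the next one. This convention is an assumption inferred from the position arithmetic in the walkthrough of the embodiments; the names `encode_offsets` and `decode_offsets` and the start position used in the usage note are hypothetical.

```python
from typing import List, Tuple


def encode_offsets(differences: List[Tuple[int, int]], start: int) -> List[int]:
    """Turn (absolute position, reference bases consumed) pairs into offsets.

    Each offset counts the reference bases skipped since the previous
    difference ended (assumed convention, see the lead-in).
    """
    offsets: List[int] = []
    cursor = start
    for position, consumed in differences:
        if position < cursor:
            raise ValueError("differences must be sorted and non-overlapping")
        offsets.append(position - cursor)
        cursor = position + consumed
    return offsets


def decode_offsets(encoded: List[Tuple[int, int]], start: int) -> List[int]:
    """Recover absolute positions from (offset, reference bases consumed) pairs."""
    positions: List[int] = []
    cursor = start
    for offset, consumed in encoded:
        cursor += offset
        positions.append(cursor)
        cursor += consumed
    return positions
```

For example, with a hypothetical start position of 1024, differences at 1024, 1028, 1029 (an insertion that consumes no reference base), 1029, and 1031 encode to the small offsets 0, 3, 0, 0, 0, which are cheaper to entropy-code than absolute positions.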
The differential string that represents a difference between the read and the reference sequence may include indicators including substitution “S,” insertion “I,” deletion “D,” and soft-clipping “@.” According to an embodiment, the differential string 408a may be “0@AAA 0SC 3SA 0IAT 0SCG 0D2.”
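As a non-limiting illustration, a differential string may be split into segments before the conversion rules are applied. The following sketch assumes that each space-separated segment consists of a numeric offset, one indicator (“S,” “I,” “D,” or “@”), and a payload (bases, or a count for a deletion). The names `Segment` and `parse_differential_string` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    offset: int      # numeric differential offset
    indicator: str   # one of S, I, D, @
    payload: str     # bases, or a base count for a deletion


# Assumed segment layout: digits, one indicator, then the payload.
_SEGMENT = re.compile(r"(\d+)([SID@])(\w*)")


def parse_differential_string(text: str) -> List[Segment]:
    """Split a differential string into (offset, indicator, payload) segments."""
    segments: List[Segment] = []
    for token in text.split():
        match = _SEGMENT.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized segment: {token!r}")
        offset, indicator, payload = match.groups()
        segments.append(Segment(int(offset), indicator, payload))
    return segments
```

Applied to the string of six segments given above, written with numeric zero offsets, the sketch yields the indicators “@,” “S,” “S,” “I,” “S,” and “D” in that order.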
The device 102 according to an embodiment may generate the pileup file by using the conversion rules to decode the differential string. The conversion rules according to an embodiment will be described below with reference to Table 2. The fields of the pileup file may include a position field (indicates a position in relation to the reference), a reference field, a read base information field, and a quality information field.
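As a non-limiting illustration, the four fields of the pileup file named above may be held in a small record type. The class name `PileupRow`, the attribute names, and the tab-separated output are assumptions made for this sketch; the disclosure does not fix a serialization.

```python
from dataclasses import dataclass


@dataclass
class PileupRow:
    position: int          # position in relation to the reference
    reference_base: str    # base of the reference at that position
    read_bases: str = ""   # read base information field
    qualities: str = ""    # quality information field

    def to_line(self) -> str:
        """Serialize the row as one tab-separated line (assumed layout)."""
        return "\t".join(
            [str(self.position), self.reference_base, self.read_bases, self.qualities]
        )
```

The read base and quality fields start empty and grow as each read that covers the position is decoded.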
Also, in a second sub step, the segment “3SA” indicates that a base C is substituted with a base A at a position 1028 on the reference 402a. In the second sub step, the device 102 may insert a quality value (e.g., q) of a substituted base (e.g., base A) in the quality information field. Also, the device 102 may mark the internal state as ‘S,’ and change the position on the reference from 1028 to 1029.
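As a non-limiting illustration of the substitution handling described above, the following sketch records a quality value for each substituted base, marks the internal state as “S,” and advances the position on the reference by one per substituted base. Writing the substituted base itself into the read base information field is an assumption of this sketch, as are the name `apply_substitution` and the dictionary layout of the rows.

```python
from typing import Dict, Tuple

Row = Dict[str, str]


def apply_substitution(
    rows: Dict[int, Row], position: int, bases: str, qualities: str
) -> Tuple[int, str]:
    """Apply a substitution segment starting at `position`.

    Returns the new position on the reference and the internal state 'S'.
    """
    if len(bases) != len(qualities):
        raise ValueError("one quality value is needed per substituted base")
    for base, quality in zip(bases, qualities):
        row = rows.setdefault(position, {"read_bases": "", "qualities": ""})
        row["read_bases"] += base    # assumption: the read base is stored as-is
        row["qualities"] += quality  # quality value of the substituted base
        position += 1                # each substituted base consumes one position
    return position, "S"
```

For the segment discussed above, a call at position 1028 with base A and quality q returns position 1029 and state “S.”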
For the segment “0D2” of the differential string 408a, the first way may be used because a substitution has been previously performed. The device 102 may identify reference bases (in this case, A and A) at the position 1031 and the position 1032. According to the first way, the device 102 may insert −2AA in a read base information field corresponding to the previous position 1030. Also, the device 102 may insert “*” in read base information fields corresponding to the positions 1031 and 1032, and insert “!” in quality information fields corresponding to the positions 1031 and 1032. Also, the position on the reference may be increased from 1031 by the number of deleted bases, i.e., two, and changed to 1033, and the internal state may be marked as “D.”
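As a non-limiting illustration of the first way of handling a deletion described above, the following sketch appends the deletion marker to the read base information field of the previous position, writes “*” and “!” at the deleted positions, and advances the position on the reference by the number of deleted bases. The name `apply_deletion`, the dictionary layout of the rows, and the ASCII minus sign in the marker are assumptions of this sketch; the second way is not modeled.

```python
from typing import Dict, Tuple

Row = Dict[str, str]


def _row(rows: Dict[int, Row], position: int) -> Row:
    """Fetch the row for a position, creating an empty one if needed."""
    return rows.setdefault(position, {"read_bases": "", "qualities": ""})


def apply_deletion(
    rows: Dict[int, Row], reference: Dict[int, str], position: int, count: int
) -> Tuple[int, str]:
    """Apply a deletion of `count` reference bases starting at `position`.

    Returns the new position on the reference and the internal state 'D'.
    """
    # Reference bases that the read lacks, e.g. 'AA' at 1031 and 1032.
    deleted = "".join(reference[position + i] for i in range(count))
    # The marker goes on the position just before the deletion.
    _row(rows, position - 1)["read_bases"] += f"-{count}{deleted}"
    for i in range(count):
        row = _row(rows, position + i)
        row["read_bases"] += "*"   # placeholder for a deleted base
        row["qualities"] += "!"    # placeholder quality for a deleted base
    return position + count, "D"
```

With reference bases A and A at positions 1031 and 1032, a call at position 1031 with a count of two returns position 1033 and state “D,” matching the values given above.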
In an embodiment, the device 102 may generate the pileup string 402b by performing the first to tenth steps described with reference to
Also, the processing unit 508 may process instructions of an algorithm. The processing unit 508 may receive commands from the controller 504 to perform processing. Also, any logical and arithmetic operations involved in the execution of the instructions may be performed by the ALU 506.
The computing device 500 according to an embodiment may include a plurality of homogeneous and/or heterogeneous cores, different types of central processing units (CPUs), special media, and other accelerators. The processing unit 508 may be implemented as a single chip or a plurality of chips, but is not limited thereto.
An algorithm including instructions and codes required for execution may be stored in the memory 510, the storage unit 512, or both. At the time of execution, the instructions may be fetched from the memory 510 or the storage unit 512, and executed by the processing unit 508.
In a hardware implementation, the computing device 500 may be connected to various devices via the network interface 516. Commands may be received from a user, or processing results of the computing device 500 may be transmitted to the user, via the I/O interface 514.
The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the various components or elements. The components shown in the
The method of generating the pileup file according to the embodiments may be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium is any data storage device that can store programs or data which can be thereafter read by a computer system. Examples of the computer-readable recording medium include read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tapes, hard disks, floppy disks, flash memory, optical data storage devices, and the like. The computer-readable recording medium can also be distributed over network coupled computer systems so that the computer-readable code is stored and executed in a distributive manner.
It should be understood that embodiments described herein should be considered in a descriptive sense only and not for purposes of limitation. Descriptions of features or aspects within each embodiment should typically be considered as available for other similar features or aspects in other embodiments.
While one or more embodiments have been described with reference to the figures, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope as defined by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
2510/CHE/2015 | May 2015 | IN | national |
10-2016-0025763 | Mar 2016 | KR | national |