This disclosure provides a novel method of representation of genome sequencing data which reduces the utilized storage space and improves access performance by providing new functionality that are not available with known prior art methods of representation.
An appropriate representation of genome sequencing data is fundamental to enable efficient genomic analysis applications such as genome variants calling and all other analysis performed with various purposes by processing the sequencing data and metadata.
Human genome sequencing has become affordable by the emergence of high-throughput low cost sequencing technologies. Such opportunity opens new perspectives in several fields ranging from the diagnosis and treatment of cancer to the identification of genetic illnesses, from pathogen surveillance for the identification of antibodies to the creation of new vaccines, drugs and customization of personalized treatments.
Hospitals, genomic analysis providers, bioinformatics and large biological data storage centers are looking for affordable, fast, reliable and interconnected genomic information processing solutions which could enable scaling genomic medicine to a world-wide scale. Since one of the bottleneck in the sequencing process has become data storage, methods for representing genome sequencing data in a compressed form are increasingly investigated.
The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
A more sophisticated approach to genomic data compression that is less used, but more efficient than BAM is CRAM. CRAM provides more efficient compression for the adoption of differential encoding with respect to an existing reference (it partially exploits the data source redundance), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data.
These approaches generate poor compression ratios and data structures that are difficult to navigate and manipulate once compressed. Downstream analysis can be very slow due to the necessity of handling large and rigid data structures even to perform simple operation or to access selected regions of the genomic dataset. CRAM relies on the concept of the CRAM record. Each CRAM record encodes a single mapped or unmapped reads by encoding all the elements necessary to reconstruct it.
CRAM has the following drawbacks:
1. For CRAM, data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it's implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded bit stream.
2. In CRAM all core data blocks can contain any type of mapped reads (perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as “indels”)). There is no notion of classification and grouping of reads in classes according to the result of mapping with respect to a reference sequence
3. In the present invention there is no notion of record encapsulating each read because the data needed to reconstruct each read is scattered among several data containers called “layers”. This enables more efficient access to set of reads with specific biological characteristics (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) without the need of decoding each (block of) read(s) to inspect its features.
4. In a CRAM record each type of data is denoted by a specific flag. In the present invention there is no notion of flag denoting data because this is intrinsically defined by the “layer” the data belongs to. This implies a largely reduced number of symbols to be used and a consequent reduction of the information source entropy which results into a more efficient compression. This is due to the fact that the use of different “layers” enables the encoder to reuse the same symbol across each layer with different meanings. In CRAM each flag must always have the same meaning as there is no notion of contexts and each CRAM record can contain any type of data.
5. In CRAM substitutions, insertions and deletions are expressed according to different syntaxes, while the proposed approach uses a single alphabet and encoding for substitutions, insertions and deletions. This makes the encoding and decoding process simpler and produces a lower entropy source model which coding yields high compression bitstreams.
The present invention aims at compressing genomic sequences by organizing and partitioning data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are enabled.
One of the aspects of the presented approach is the definition of classes of data and metadata to be encoded separately and to be structured in different layers. The most important improvements of this approach with respect to existing methods consist in:
1. an increase of compression performance due to the reduction of the information source entropy constituted by providing an efficient model for each class of data or metadata;
2. the possibility of performing selective accesses to portions of the compressed data and metadata for any further processing purpose;
3. the possibility to incrementally (without the need of re-encoding) update encoded data and metadata with new sequencing data and/or metadata and/or new analysis results.
The features of the independent claims below solve the problem of existing prior art solutions by providing a method for classification of genome sequences and a method for compression using said classification. In one aspect, a method for the classification of genome sequence data produced by a sequencing machine, said genome sequence data comprising sequences of nucleotides “bases”, said classification being performed according to a reference sequence,
said method comprising the steps of:
identifying class P sequences, comprising matching regions in the reference sequence without mismatches;
identifying class N sequences, comprising matching regions in the reference sequence with a number of mismatches represented by positions where the sequencing machine was not able to call any “base”;
identifying class M sequences, comprising matching regions in the reference sequence with a number of mismatches represented by positions where the sequencing machine was not able to call any base or it called a different base than the reference sequence;
identifying class I sequences, comprising the same mismatches of class M plus the presence of insertions or deletions;
identifying class U sequences comprising all reads that do not find any valid mapping on the reference sequence.
In another aspect, a method for the compression of genome sequence data produced by a sequencing machine,
said genome sequence data comprising sequences of nucleotides,
said method comprising the steps of:
aligning said reads to a reference sequence thereby creating aligned reads;
classifying said aligned reads according to a multiplicity of matching accuracy degrees with the reference sequence thereby creating classes of aligned reads;
encoding said aligned reads as layers of syntax elements;
wherein said syntax elements are selected according to said classes of aligned reads.
In another aspect, a method for the decompression of a compressed genomic stream, said method comprising the steps of:
parsing said compressed genomic stream into genomic layers of syntax elements,
expanding said genomic layers into classified reads of sequences of nucleotides,
selectively decoding said classified reads of sequences of nucleotides with reference to one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.
A further aspect, a genomic encoder 2010 for the compression of genome sequence data 209, said genome sequence data 209 comprising reads of sequences of nucleotides, said genomic encoder 2010 comprising:
an aligner unit 201, configured to align said reads to one or more reference sequences thereby creating aligned reads,
a data classification unit 204, configured to classify said aligned reads according to matching accuracy degrees with the one or more reference sequences thereby creating classes of aligned reads;
one or more layers encoding units 205-207, configured to encode said classified aligned reads as layers of syntax elements by selecting said syntax elements according to said classes of aligned reads.
In another aspect, a genomic decoder 218 for the decompression of a compressed genomic stream 211 said genomic decoder 218 comprising:
parsing means 210, 212-214 configured to parse said compressed genomic stream into genomic layers of syntax elements 215,
one or more layer decoders 216-217, configured to decode the genomic layers into classified reads of sequences of nucleotides 2111,
genomic data classes decoders 213 configured to selectively decode said classified reads of sequences of nucleotides on one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.
The genomic or proteomic sequences referred to in this invention include, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences. Although the description herein is in considerable detail with respect to genomic information in the form of a nucleotide sequence, it will be understood that the methods and systems for compression can be implemented for other genomic or proteomic sequences as well, albeit with a few variations, as will be understood by a person skilled in the art.
Genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a. k. a. bases) represented by strings of letters from a defined vocabulary. The smallest vocabulary is represented by five symbols: {A, C, G, T, N} representing the 4 types of nucleotides present in DNA namely Adenine, Cytosine, Guanine, and Thymine. In RNA Thymine is replaced by Uracil (U). N indicates that the sequencing machine was not able to call any base and so the real nature of the position is undetermined. In case the IUPAC ambiguity codes are adopted by the sequencing machine, the alphabet used for the symbols is (A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or -).
The nucleotides sequences produced by sequencing machines are called “reads”. Sequence reads can be between a few dozens to several thousand nucleotides long. Some technologies produce sequence reads in pairs where one read is from one DNA strand and the second is from the other strand. In genome sequencing the term coverage is used to express the level of redundancy of the sequence data with respect to a reference sequence. For example, to reach a coverage of 30× on a human genome (3.2 billion bases long) a sequencing machine shall produce a total of 30×3.2 billion bases so that in average each position in the reference is “covered” 30 times.
Throughout this disclosure, a reference sequence is any sequence on which the nucleotides sequences produced by sequencing machines are aligned/mapped. One example of sequence could actually be a reference genome, a sequence assembled by scientists as a representative example of a species' set of genes. For example GRCh37, the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, N.Y. However, a reference sequence could also consist of a synthetic sequence conceived to merely improve the compressibility of the reads in view of their further processing.
Sequencing devices can introduce errors in the sequence reads such as
1. Use of a wrong symbol (i.e. representing a different nucleic acid) to represent the nucleic acid actually present in the sequenced sample; this is usually called “substitution error” (mismatch);
2. Insertion in one sequence read of additional symbols that do not refer to any actually present nucleic acid; this is usually called “insertion error”;
3. Deletion from one sequence read of symbols that represent nucleic acids that are actually present in the sequenced sample; this is usually called “deletion error”;
4. Recombination of one or more fragments into a single fragment which does not reflect the reality of the originating sequence;
The term “coverage” is used in literature to quantify the extent to which a reference genome or part thereof can be covered by the available sequence reads. Coverage is said to be:
This invention aims at defining a genomic information representation format where the relevant information is efficiently accessible and transportable and the weight of the redundant information is reduced.
The main aspects of the disclosed invention are:
2 The decomposition of the sequence read data and metadata into homogeneous layers of in order to obtain distinct information sources with reduced information entropy.
3 The possibility of modeling each separated source with distinct source model adapted to each statistical characteristics including the possibility of changing the source model within each class of reads and layer for each accessible data units (access units). Adoption of the appropriate context adaptive probability models and associated entropy coders according to the statistical properties of each source model.
4 The definition of correspondences and dependencies among the layers to enable selective access to the data without the need to decode all the layers if not all information is needed
5 Coding each sequence data class and associated metadata layers with respect to a reference sequence that can be modified so as to reduce the entropy of data classes and layers information sources. After a first encoding based on a reference, sequence the detected mismatches can be used to “adapt/modify” the reference sequence in order to further reduce the overall information entropy. This process that can be performed iteratively as long as the reduction of information entropy results relevant.
In the following, each of the above aspects will be further described.
Main File Header
Classification of the Sequence Reads
The sequence reads generated by sequencing machines are classified by the disclosed invention into five different “classes” according to the results of the alignment with respect to one or more given reference sequences.
When aligning a DNA sequence of nucleotides with respect to a reference sequence five are the possible results:
The remaining unmapped reads with respect to a reference sequence can be assembled into a single sequence using de-novo assembly algorithms. Once a newly assembled reference sequence has been created unmapped reads can be further mapped with respect to it and be classified in one of the 4 classes P, N, M and I.
Decomposition of the Information Necessary to Represent Sequence Reads into Layers of Descriptors Once the classification of reads is completed with the definition of the Classes, further processing consists in defining a set of distinct syntax elements which represent the remaining information enabling the reconstruction of the DNA read sequence when represented as being mapped on a given reference sequence. The data structure of these syntax elements requires the storage of global parameters and metadata to be used by the decoding engine. These data are structured in a main header described in the table below.
A DNA segment referred to a given reference sequence can be fully expressed by:
This classification creates groups of descriptors (syntax elements) that can be used to univocally represent genome sequence reads. The table below summarizes the syntax elements needed for each class of aligned reads.
Reads belonging to class P are characterized and can be perfectly reconstructed by only a position, a reverse complement information and an offset between mates in case they have been obtained by a sequencing technology yielding mated pairs, some flags and a read length.
The next section further details how these descriptors are defined.
Position Descriptor Layer
In the position (pos) layer only the mapping position of the first encoded read is stored as absolute value on the reference sequence. All the other position descriptors assume a value expressing the difference with respect to the previous position. Such modeling of the information source defined by the sequence of read position descriptors is in general characterized by a reduced entropy particularly for sequencing processes generating high coverage results.
For example,
With high coverages (>50×) most of the descriptors of the position vector will present very high occurrences of low values such as 0 and 1 and other small integers.
Reverse Complement Descriptor Layer
Each read of the read pairs produced by sequencing technologies can be originated from either genome strands of the sequenced organic sample. However, only one of the two strands is used as reference sequence.
When the strand 1 is used as reference sequence, read 2 can be encoded as reverse complement of the corresponding fragment on strand 1. This is shown in
In case of coupled reads, four are the possible combinations of direct and reverse complement mate pairs. This is shown in
The same encoding is used for the reverse complement information of reads belonging to classes N, M, P and I.
In order to enable selective access to the different data classes, the reverse complement information of reads belonging to the four classes are encoded in different layers as depicted in Table 2.
Pairing Information Descriptor Layer
The pairing descriptor is stored in the pair layer. Such layer stores descriptors encoding the information needed to reconstruct the originating reads pairs when the employed sequencing technology produces reads by pairs.
Although at the date of the disclosure of the invention the vast majority of sequencing data is generated by using a technology generating paired reads, it is not the case of all technologies. This is the reason for which the presence of this layer is not necessary to reconstruct all sequencing data information if the sequencing technology of the genomic data considered does not generate paired reads information.
Definitions:
The pair descriptor layer is the vector of pairing errors calculated as number of reads to be skipped to reach the mate pair of the first read of a pair with respect to the defined decoding pairing distance.
The same descriptors are used for the pairing information of reads belonging to classes N, M, P and I. In order to enable the selective access to the different data classes, the pairing information of reads belonging to the four classes are encoded in different layer as depicted in.
Pairing Information in Case of Reads Mapped on Different Reference Sequences
In the process of mapping sequence reads on a reference sequence it is not uncommon to have the first read in a pair mapped on one reference sequence (e.g. chromosome 1) and the second on a different reference sequence (e.g. chromosome 4). In this case the pairing information described above has to be integrated by additional information related to the reference sequence used to map one of the reads. This is achieved by coding
1. A reserved value (flag) indicating that the pair is mapped on two different sequences (different values indicate if read1 or read2 are mapped on the sequence that is not currently encoded)
2. An unique reference identifier referring to the reference identifiers encoded in the main header structure as described Table 1.
3. The third element contains the mapping information on the reference identified at point 2 and expressed as offset with respect to the last encoded position.
The
In
1) One special reserved value is encoded as pairing distance (in this case 0xffffff)
2) A second descriptor provides a reference ID as listed in the main header (in this case 4)
3) The third element contains the mapping information on the concerned reference (170).
Mismatch Descriptors for Class N Reads
Class N includes all reads in which only mismatches constituted by “N” are present at the place of an A, C, G or T base call. All other bases of the read perfectly match the reference sequence.
the positions of “N” in read 1 are coded as
In the n mis layer, the coding of each reads pair is terminated by a special “separator” symbol.
Descriptors Coding Substitutions (Mismatches or SNPs), Insertions and Deletions
A substitution is defined as the presence, in a mapped read, of a different nucleotide base with respect to the one that is present in the reference sequence at the same position.
Source model 1: Substitutions as Positions and Types
Substitutions Positions Descriptors
A substitution position is calculated like the values of the n mis layer, i.e.
In read 1 substitutions are encoded
In the snpp layer, the coding of each reads pair is terminated by a special “separator” symbol.
Substitutions Types Descriptors
For class M (and I as described in the next sections), mismatches are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read {A, C, G, T, N, Z}. For example if the aligned read presents a C instead of a T which is present at the same position in the reference, the mismatch index will be denoted as “4”. The decoding process reads the encoded syntax element, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a “2” received for a position where a G is present in the reference will be decoded as “N”.
In case of adoption of the IUPAC ambiguity codes the substitution mechanism results to be exactly the same however the substitution vector is extended as: S={A, C, G, T, N, Z, M, R, W, S, Y, K, V, H, D, B}.
Some examples of substitutions encoding when IUPAC ambiguity codes are adopted are provided in
A further example of substitution indexes is provided in
Cooding of Insertions and Deletions
For class I, mismatches and deletions are coded by an indexes (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read: {A, C, G, T, N, Z}. For example if the aligned read presents a C instead of a T present at the same position in the reference, the mismatch index will be “4”. In case the read presents a deletion where a A is present in the reference, the coded symbol will be “5”. The decoding process reads the coded syntax element, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a “3” received for a position where a G is present in the reference will be decoded as “Z”.
Inserts are coded as 6, 7, 8, 9, 10 respectively for inserted A, C, G, T, N.
In this case the insertion codes need to have different values, namely 16, 17, 18, 19, 20 in case the substitution vector has 16 elements. The mechanism is illustrated in
Source Model 2: One Layer Per Substitution Type and Indels
For some data statistics a different coding model from the one described in the previous section can be developed for substitutions and indels resulting into a source with lower entropy. Such coding model is an alternative to the techniques described above for mismatches only and for mismatches and indels.
In this case one data layer is defined for each possible substitution symbol (5 without IUPAC codes, 16 with IUPAC codes), plus one layer for deletions and 4 more layers for insertions. For simplicity of the explanation, but not as a limitation for the application of the model, the following description will focus on the case where no IUPAC codes are supported.
Encoding Additional Signaling Flags
Each data class introduced above (P, M, N, I) may require the encoding of additional information on the nature of the encoded reads. This information may be related for example to the sequencing experiment (e.g. indicating a probability of duplication of one read) or can express some characteristic of the read mapping (e.g. first or second in pair). In the context of this invention this information is encoded in a separate layer for each data class. The main advantage of such approach is the possibility to selectively access this information only in case of need and only in the required reference sequence region. Other examples of the use of such flags are:
Adaptation of the Reference Sequences
The mismatches encoded for classes N, M and I can be used to create “modified references” to be used to re-encode reads in the N, M or I layer (with respect to the first reference sequence, R0) as p reads with respect to the “adapted” genome R1. For example if we denote with r_in̂M the ith read of class M containing mismatches with respect to the reference genome n, then after “adaptation” we could have r_in̂M=r_(i(n+1))̂P with A(Refn)=Refn+1 where A is the transformation from reference sequence n to reference sequence n+1.
RS2=A(RS1)
If the expression of transformation A which goes from RS1 to RS2 requires less bits of the expression of the mismatches present in the M reads, this encoding method results in a smaller information entropy and therefore better compression.
Source Models, Entropy Coders and Coding Modes
For each layer of the genomic data structure disclosed in this invention different coding algorithms may be adopted according to the specific features of the data or metadata carried by the layer and its statistical properties. The “coding algorithm” has to be intended as the association of a specific “source model” of the descriptor with a specific “entropy coder”. The specific “source model” can be specified and selected to obtain the most efficient coding of the data in terms of minimization of the source entropy. The selection of the entropy coder can be driven by coding efficiency considerations and/or probability distribution features and associated implementation issues. Each selection of a specific coding algorithm will be referred to as “coding mode” applied to an entire “layer”.
Each “source model” associated to a coding mode is characterized by:
Further Advantages
This classification permits the implementation of efficient coding modes exploiting the lower information source entropy characterizing by modelling the sequences of syntax elements by single separate data sources (e.g. distance, position, etc.).
Another advantage of the invention is the possibility to access only the subset of type of data of interest. For example one of the most important application in genomics consists in finding the differences of a genomic sample with respect to a reference (SNV) or a population (SNP). Today such type of analysis requires the processing of the complete sequence reads whereas by adopting the data representation disclosed by the invention the mismatches are already isolated into one to three data classes only (depending on the interest in considering N codes and indels).
A further advantage is the possibility of performing efficient transcoding from data and metadata compressed with reference to a specific “reference sequence” to another “reference sequence” when a new “reference sequence” is published or when re-mapping is performed on the already mapped data (e.g. using a different mapping algorithm).
Class decoders 219 are able to reconstruct the original genomic sequences by leveraging the information on the original reference sequences carried by one or more genomic streams. In case the reference sequences are not transported by the genomic streams they must be available at the decoding side and accessible by the class decoders.
The inventive techniques herewith disclosed may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these may be stored on a computer medium and executed by a hardware processing unit. The hardware processing unit may comprise one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuitry.
The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets and similar devices.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2016/074307 | 10/11/2016 | WO | 00 |