This disclosure provides a novel method of representation of genome sequencing data which reduces the utilized storage space and improves access performance by providing new functionality that are not available with known prior art methods of representation.
An appropriate representation of genome sequencing data is fundamental to enable efficient genomic analysis applications such as genome variants calling and all other analysis performed with various purposes by processing the sequencing data and metadata.
Human genome sequencing has become affordable by the emergence of high-throughput low cost sequencing technologies. Such opportunity opens new perspectives in several fields ranging from the diagnosis and treatment of cancer to the identification of genetic illnesses, from pathogen surveillance for the identification of antibodies to the creation of new vaccines, drugs and the customization of personalized treatments.
Hospitals, genomics data analysis providers, bioinformaticians and large biological data storage centers are looking for affordable, fast, reliable and interconnected genomic information processing solutions which would enable scaling genomic medicine to a world-wide scale. Since one of the bottleneck in the sequencing process has become data storage, methods for representing genome sequencing data in a compressed form are increasingly investigated. The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
A more sophisticated approach to genomic data compression that is less used, but more efficient than BAM is CRAM. CRAM provides more efficient compression for the adoption of differential encoding with respect to a reference (it partially exploits the data source redundancy), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data.
These approaches generate poor compression ratios and data structures that are difficult to navigate and manipulate once compressed. Downstream analysis can be very slow due to the necessity of handling large and rigid data structures even to perform simple operation or to access selected regions of the genomic dataset. CRAM relies on the concept of the CRAM record. Each CRAM record represents a single mapped or unmapped reads by coding all the elements necessary to reconstruct it.
CRAM presents the following drawbacks and limitations that are solved and overcome by the invention described in this document:
1. CRAM does not support data indexing and random access to data subsets sharing specific features. Data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it is implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded (i.e. compressed) bit stream.
2. CRAM is built by core data blocks that can contain any type of mapped reads (perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as “indels”)). There is no notion of data classification and grouping of reads in classes according to the result of mapping with respect to a reference sequence. This means that all data need to be inspected even if only reads with specific features are searched. Such limitation is solved by the invention by classifying and partitioning data in classes before coding.
3. CRAM is based on the concept of encapsulating each read into a “CRAM record”. This implies the need to inspect each complete “record” when reads characterized by specific biological features (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) are searched.
Conversely, in the present invention there is the notion of data classes coded separately in separated information blocks and there is no notion of record encapsulating each read. This enables more efficient access to set of reads with specific biological characteristics (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) without the need of decoding each (block of) read(s) to inspect its features.
4. In a CRAM record each record field is associated to a specific flag and each flag must always have the same meaning as there is no notion of context since each CRAM record can contain any different type of data. This coding mechanism introduces redundant information and prevents the usage of efficient context based entropy coding.
Instead in the present invention, there is no notion of flag denoting data because this is intrinsically defined by the information “block” the data belongs to. This implies a largely reduced number of symbols to be used and a consequent reduction of the information source entropy which results into a more efficient compression. Such improvement is possible because the use of different “blocks” enables the encoder to reuse the same symbol across each block with different meanings according to the context. In CRAM each flag must always have the same meaning as there is no notion of contexts and each CRAM record can contain any type of data.
5. In CRAM substitutions, insertions and deletions are represented by using different descriptors, option that increases the size of the information source alphabet and yields a higher source entropy. Conversely, the approach of the disclosed invention uses a single alphabet and encoding for substitutions, insertions and deletions. This makes the encoding and decoding process simpler and produces a lower entropy source model which coding yields bitstreams characterized by high compression performance.
The present invention aims at compressing genomic sequences by classifying and partitioning sequencing data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are directly enabled in the compressed domain.
One of the aspects of the presented approach is the definition of classes of data and metadata structured in different blocks and encoded separately. The more relevant improvements of such approach with respect to existing methods consist in:
1. the increase of compression performance due to the reduction of the information source entropy constituted by providing an efficient source model for each class of data or metadata;
2. the possibility of performing selective accesses to portions of the compressed data and metadata for any further processing purpose directly in the compressed domain;
3. the possibility to incrementally (i.e. without the need of decoding and re-encoding) update compressed data and metadata with new sequencing data and/or metadata and/or new analysis results associated to specific sets of sequencing reads.
The features of the claims below solve the problem of existing prior art solutions by providing
A method for encoding genome sequence data, said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of:
aligning said reads to one or more reference sequences thereby creating aligned reads, classifying said aligned reads according to specified matching rules with said one or more reference sequences, thereby creating classes of aligned reads, encoding said classified aligned reads as a multiplicity of blocks of descriptors, wherein encoding said classified aligned reads as a multiplicity of blocks of descriptors comprises selecting said descriptors according to said classes of aligned reads, structuring said blocks of descriptors with header information thereby creating successive Access Units.
In another aspect the coding method further comprises further classifying said reads that do not satisfy said specified matching rules into a class of unmapped reads constructing a set of reference sequences using at least some unmapped reads aligning said class of unmapped reads to the set of constructed reference sequences encoding said classified aligned reads as a multiplicity of blocks of descriptors, encoding said set of constructed reference sequences structuring said blocks of descriptors and said encoded reference sequences with header information thereby creating successive Access Units.
In another aspect the coding method further comprises identifying genomic reads without any mismatch in the reference sequence as first “Class P”
In another aspect the coding method further comprises identifying genomic reads as a second “Class N” when mismatches are only found in the positions where the sequencing machine was not able to call any “base” and the number of mismatches in each read does not exceed a given threshold.
In another aspect the coding method further comprises identifying genomic reads as a third “Class M” when mismatches are found in the positions where the sequencing machine was not able to call any “base”, named “n type” mismatches, and/or it called a different “base” than the reference sequence, named “s type” mismatches, and the number of mismatches does not exceed given thresholds for the number of mismatches of “n type”, of “s type” and a threshold obtained from a given function (f(n,s)) calculated on the number of “n type” and “s type” mismatches.
In another aspect the coding method further comprises identifying genomic reads as a fourth “Class I” when they can possibly have the same type of mismatches of “Class M”, and in addition at least one mismatch of type: “insertion” (“i type”) “deletion” (“d type”) soft clips (“c type”), and wherein the number of mismatches for each type does not exceed the corresponding given threshold and a threshold provided by a given function (w(n,s,i,d,c)) calculated on the number of “n type”, “s type”, “i type”, “d type” and “c type” mismatches.
In another aspect the coding method further comprises identifying genomic reads as a fifth “Class U” as comprising all reads that do not find any classification in the Classes P, N, M, I, as previously defined.
In another aspect the coding method further comprises that the reads of the genomic sequence to be encoded are paired.
In another aspect the coding method further comprises that said classifying further comprises identifying genomic reads as a sixth “Class HM” as comprising all reads pairs where one read belong to Class P, N, M or I and the other read belong to “Class U”.
In another aspect the coding method further comprises the steps of identifying if the two mate reads are classified in the same class (each of: P, N, M, I, U), then assigning the pair to the same identified class,
Identifying if the two mate reads are classified in different classes, and in case none of them belongs to the “Class U”, then assigning the pair of reads to the class with the highest priority defined according to the following expression:
P<N<M<I
in which “Class P” has the lowest priority and “Class I” has the highest priority;
identifying if only one of the two mate reads has been classified as belonging to “Class U” and classifying the pair of reads as belonging to the “Class HM” sequences.
In another aspect the coding method further comprises that each class of reads N, M, I of reads N, M, I is further partitioned into two or more subclasses (296, 297, 298) according to a vector of thresholds (292, 293, 294) defined respectively for each class N, M and I, by the number of “n type” mismatches (292), the function f(n,s) (293) and the function w(n,s,i,d,c) (294).
identifying if the two mate reads are classified in the same subclass, then assigning the pair to the same sub-class identifying if the two mate reads are classified into sub-classes of different Classes, then assigning the pair to the subclass belonging to the Class of higher priority according to the following expression:
N<M<I
where N has the lowest priority and I has the highest priority;
identifying if the two mate reads are classified in the same class, and such class is N or M or I, but in different sub-classes, then assigning the pair to the sub-class with the highest priority according to the following expressions:
N
1
<N
2
< . . . <N
k
M
1
<M
2
< . . . <M
j
I
1
<I
2
< . . . <I
h
where the highest index has the highest priority.
In another aspect the information on the mapping position of each read is encoded by means of a pos descriptor block.
In another aspect the information on the strandedness (i.e. the DNA strand the read was sequences from) of each read is encoded by means of a rcomp descriptor block.
In another aspect the pairing information of paired-end reads is encoded by means of a pair descriptor block.
In another aspect the additional alignment information such as if the read is mapped in proper pair, it fails platform/vendor quality checks, it is a PCR or optical duplicate or it is a supplementary alignment is encoded by means of a flags descriptor block.
In another aspect the information on unknown bases is encoded by means of a nmis descriptor block.
In another aspect the information on the position of substitutions is encoded by means of a snpp descriptor block.
In another aspect the information on the type of substitutions is encoded by means of a specific snpt descriptor block.
In another aspect the information on the position of mismatches of type substitutions, insertions or deletions is encoded by means of a indp descriptor block.
In another aspect the information on the type of mismatches such as substitutions, insertions or deletions is encoded by means of a indt descriptor block.
In another aspect the information on clipped bases of a mapped read is encoded by means of a indc descriptor block.
In another aspect the information on unmapped reads is encoded by means of a ureads descriptor block.
In another aspect the information on the type of reference sequence used for encoding is encoded by means of a rtype descriptor block.
In another aspect the information on multiple alignments of the mapped reads is encoded by means of a mmap descriptor block.
In another aspect the information on spliced alignments and multiple alignments of the same read is encoded by means of a msar descriptor and a mmap descriptor block.
In another aspect the information on read alignment scores is encoded by means of a mscore descriptor block.
In another aspect the information on the groups reads belong to is encoded by means of a specific rgroup descriptor block.
In another aspect the coding method further comprises that said blocks of descriptors comprise a master index table, containing one section for each Class and sub-class of aligned reads, said section comprising the mapping positions on said one or more reference sequences of the first read of each Access Units of each Class or sub-class of data; jointly coding said master index table and said access unit data.
In another aspect the coding method further comprises that said blocks of descriptors further comprise information related to the type of reference used (pre-existing or constructed) and the segments of the read that do not match on the reference sequence.
In another aspect the coding method further comprises that said reference sequences are first transformed into different reference sequences by applying substitutions, insertions, deletions and clipping, then the encoding of said classified aligned reads as a multiplicity of blocks of descriptors refers to the transformed reference sequences.
In another aspect the coding method further comprises that the same transformation is applied to the reference sequences for all classes of data.
In another aspect the coding method further comprises that different transformations are applied to the reference sequences per each class of data.
In another aspect the coding method further comprises that the reference sequences transformations are encoded as blocks of descriptors and structured with header information thereby creating successive Access Units.
In another aspect the coding method further comprises that the encoding of said classified aligned reads and the related reference sequences transformations as multiplicity of blocks of descriptors comprises the step of associating a specific source model and a specific entropy coder to each descriptor block.
In another aspect the coding method further comprises that said entropy coder is one of a context adaptive arithmetic coder, a variable length coder or a golomb coder.
The present invention further provides a method for decoding encoded genomic data comprising the steps of:
parsing Access Units containing said encoded genomic data to extract multiple blocks of descriptors by employing header information decoding said multiplicity of blocks of descriptors to extract aligned reads according to specific matching rules defining their classification with respect to one or more reference sequences.
In another aspect the decoding method further comprises the decoding of unmapped genomic reads.
In another aspect the decoding method further comprises the decoding of classified genomic reads.
In another aspect the decoding method further comprises decoding a master index table containing one section for each class of reads and the associated relevant mapping positions.
In another aspect the decoding method further comprises decoding information related to the type of reference used: pre-existing, transformed or constructed.
In another aspect the decoding method further comprises decoding information related to one or more transformations to be applied to the pre-existing reference sequences.
In another aspect the decoding method further comprises genomic reads that are paired.
In another aspect the decoding method further comprises the case wherein said genomic data are entropy decoded.
The present invention further provides a genomic encoder (2010) for the compression of genome sequence data 209, said genome sequence data 209 comprising reads of sequences of nucleotides, said genomic encoder (2010) comprising:
an aligner unit (201), configured to align said reads to one or more reference sequences thereby creating aligned reads,
a constructed-reference generator unit (202), configured to produce constructed reference sequences,
a data classification unit (204), configured to classify said aligned reads according to specified matching rules with the one or more pre-existing reference sequences or constructed reference sequences thereby creating classes of aligned reads (208);
one or more blocks encoding units (205-207), configured to encode said classified aligned reads as blocks of descriptors by selecting said descriptors according to said classes of aligned reads; a multiplexer (2016) for multiplexing the compressed genomic data and metadata.
In another aspect the genomic encoder further comprises
a reference sequence transformation unit (2019) configured to transform the pre-existing references and data classes (208) into transformed data classes (2018).
In another aspect the genomic encoder further comprises a
data classification unit (204) contains encoders of data classes N, M and I configured with vectors of thresholds generating sub-classes of data classes N, M and I.
In another aspect the genomic encoder further comprises the feature that reference transformation unit (2019) applies the same reference transformation (300) for all classes and sub-classes of data.
In another aspect the genomic encoder further comprises the feature that the reference transformation decoder (2019) applies different reference transformations (301, 302, 303) for the different classes and sub-classes of data.
In another aspect the genomic encoder further comprises the features suitable for executing all the aspects of the previously mentioned coding methods.
The present invention further provides a genomic decoder (218) for the decompression of a compressed genomic stream (211) said genomic decoder (218) comprising:
a demultiplexer (210) for demultiplexing compressed genomic data and metadata parsing means (212-214) configured to parse said compressed genomic stream into genomic blocks of descriptors (215),
one or more block decoders (216-217), configured to decode the genomic blocks into classified reads of sequences of nucleotides (2111),
genomic data classes decoders (219) configured to selectively decode said classified reads of sequences of nucleotides on one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.
In another aspect the genomic decoder further comprises a reference transformation decoder (2113) configured to decode reference transformation descriptors (2112) and produce a transformed reference (2114) to be used by genomic data class decoders (219).
In another aspect the genomic decoder further comprises that the one or more reference sequences are stored in the compressed genome stream (211).
In another aspect the genomic decoder further comprises that the one or more reference sequences are provided to the decoder via an out of band mechanism.
In another aspect the genomic decoder further comprises that the one or more reference sequences are built at the decoder.
In another aspect the genomic decoder further comprises that the one or more reference sequences are transformed at the decoder by a reference transformation decoder (2113).
The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned coding methods.
The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned decoding methods.
The present invention further provides a support data storing genomic encoded according perform all the aspects of the previously mentioned coding methods.
The genomic or proteomic sequences referred to in this invention include, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences. Although the description herein is in considerable detail with respect to genomic information in the form of a nucleotide sequence, it will be understood that the methods and systems for compression can be implemented for other genomic or proteomic sequences as well, albeit with a few variations, as will be understood by a person skilled in the art.
Genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a. k. a. “bases”) represented by strings of letters from a defined vocabulary. The smallest vocabulary is represented by five symbols: {A, C, G, T, N} representing the 4 types of nucleotides present in DNA namely Adenine, Cytosine, Guanine, and Thymine. In RNA Thymine is replaced by Uracil (U). N indicates that the sequencing machine was not able to call any base and so the real nature of the position is undetermined. In case the IUPAC ambiguity codes are adopted by the sequencing machine, the alphabet used for the symbols is (A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or -).
The nucleotides sequences produced by sequencing machines are called “reads”. Sequence reads can be between a few dozens to several thousand nucleotides long. Some technologies produce sequence reads in “pairs” where one read is from one DNA strand and the second is from the other strand. In genome sequencing the term “coverage” is used to express the level of redundancy of the sequence data with respect to a “reference sequence”. For example, to reach a coverage of 30× on a human genome (3.2 billion bases long) a sequencing machine shall produce a total of 30×3.2 billion bases so that in average each position in the reference is “covered” 30 times.
Throughout this disclosure, a reference sequence is any sequence on which the nucleotides sequences produced by sequencing machines are aligned/mapped. One example of sequence could actually be a “reference genome”, a sequence assembled by scientists as a representative example of a species' set of genes. For example GRCh37, the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, N.Y. However, a reference sequence could also consist of a synthetic sequence conceived and constructed to merely improve the compressibility of the reads in view of their further processing. This is described in more details in section “Descriptors of Class U and construction of an “internal” references for unmapped reads of “Class U” and “Class HM“ ” and depicted in
Sequencing devices can introduce errors in the sequence reads such as:
The term “coverage” is used in literature to quantify the extent to which a reference genome or part thereof can be covered by the available sequence reads. Coverage is said to be:
This invention aims at defining a genomic information representation format in which the relevant information is efficiently accessible and transportable and the weight of the redundant information is reduced.
The main innovative aspects of the disclosed invention are the following.
1 Sequence reads are classified and partitioned into data classes according to the results of the alignment with respect to reference sequences. Such classification and partitioning enables the selective access to encoded data according to criteria related to the alignment results and to the matching accuracy.
2 The classified sequence reads and the associated metadata are represented by homogeneous blocks of descriptors to obtain distinct information sources characterized by a low information entropy.
3 The possibility of modeling each separated information source with distinct source model adapted to the statistical characteristics of each class and the possibility of changing the source model within each class of reads and within each descriptor block for each separately accessible data units (Access Units). The adoption of the appropriate context adaptive probability models and associated entropy coders according to the statistical properties of each source model.
4 The definition of correspondences and dependencies among the descriptors blocks to enable the selective access to the sequencing data and associated metadata without the need to decode all the descriptors blocks if not all information is required.
The coding of each sequence data class and associated metadata blocks with respect to “pre-existing” (also denoted as “external”) reference sequences or with respect to “transformed” reference sequences obtained by applying appropriate transformations to “pre-existing” reference sequences so as to reduce the entropy of the descriptors blocks information sources. Said descriptors represent the reads partitioned into the different data classes. Following any encoding of reads using the corresponding descriptors with reference to a “pre-existing” reference or “transformed” “pre-existing” reference sequence, the occurrence of the various mismatches can be used to define the appropriate transformations to the reference sequences in order to find a final coded representation with low entropy and achieve higher compression efficiency.
6 The construction of one or more reference sequences (also referred to as “internal” references to distinguish from the “pre-existing” also referred here as “external” reference sequences) used to encode the class of reads that present a degree of matching accuracy with respect to the pre-existing reference sequences not satisfying a set of constraints. Such constraints are set with the objective that the coding costs of representing in compressed form the class of reads aligned with respect to the “internal” reference sequences and the cost of representing the “internal” reference sequences themselves, is lower than encoding the unaligned class of reads verbatim, or using the “external” reference sequences without or with transformations.
In the following, each of the above aspects will be further described in details.
Classification of the Sequence Reads According to Matching Rules
The sequence reads generated by sequencing machines are classified by the disclosed invention into six different “classes” according to the matching results of the alignment with respect to one or more “pre-existing” reference sequences.
When aligning a DNA sequence of nucleotides with respect to a reference sequence the following cases can be identified:
Classification of Read Pairs According to Matching Rules
The classification specified in the previous section concerns single sequence reads. In the case of sequencing technologies that generates read in pairs (i.e. Illumina Inc.) in which two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider the classification of the entire pair to a single data class. A read that is coupled with another is said to be its “mate”.
If both paired reads belong to the same class the assignment to a class of the entire pair is obvious: the entire pair is assigned to the same class for any class (i.e. P, N, M, I, U). In the case the two reads belong to a different class, but none of them belongs to the “Class U”, then the entire pair is assigned to the class with the highest priority defined according to the following expression:
P<N<M<I
in which “Class P” has the lowest priority and “Class I” has the highest priority.
In case only one of the reads belongs to “Class U” and its mate to any of the Classes P, N, M, I a sixth class is defined as “Class HM” which stands for “Half Mapped”.
The definition of such specific class of reads is motivated by the fact that it is used for attempting to determine gaps or unknown regions existing in reference genomes (a.k.a. little known or unknown regions). Such regions are reconstructed by mapping pairs at the edges using the pair read that can be mapped on the known regions. The unmapped mate is then used to build the so called “contigs” of the unknown region as it is shown in
The table below summarizes the matching rules applied to reads in order to define the class of data each read belongs to. The rules are defined in the first five columns of the table in terms of presence or absence of type of mismatches (n, s, d, i and c type mismatches). The sixth column provide rules in terms of maximum threshold for each mismatch type and any function f(n,s) and w(n,s,d,i,c) of the possible mismatch types.
Matching Rules Partition of Sequence Read Data Classes N, M and I into Subclasses with Different Degrees of Matching Accuracy
The data classes of type N, M and I as defined in the previous sections can be further decomposed into an arbitrary number of distinct sub-classes with different degrees of matching accuracy. Such option is an important technical advantage in providing a finer granularity and as consequence a much more efficient selective access to each data class. As an example and not as a limitation, to partition the Class N into a number k of subclasses (Sub-Class N1, . . . , Sub-Class Nk) it is necessary to define a vector with the corresponding components MAXN1, MAXN2, . . . MAXN(k-1), MAXN(k), with the condition that MAXN1<MAXN2< . . . <MAXN(k-1)<MAXN and assign each read to the lowest ranked sub-class that satisfy the constrains specified in Table 1 when evaluated for each element of the vector. This is shown in
In the case of the classes of type M and I the same principle is applied by defining a vector with the same properties for MAXM and MAXTOT respectively and use each vector components as threshold for checking if the functions f(n,s) and w(n,s,d,l,c) satisfy the constraint. Like in the case of sub-classes of type N, the assignment is given to the lowest sub-class for which the constraint is satisfied. The number of sub-classes for each class type is independent and any combination of subdivisions is admissible. This is shown in
When two reads in a pair are classified in the same sub-class, then the pair belongs to the same sub-class.
When two reads in a pair are classified into sub-classes of different classes, then the pair belongs to the sub-class of the class of higher priority according to the following expression:
N<M<I
where N has the lowest priority and I has the highest priority.
When two reads belong to different sub-classes of one of classes N or M or I, then the pair belongs to the sub-class with the highest priority according to the following expressions:
N
1
<N
2
< . . . <N
k
M
1
<M
2
< . . . <M
1
I
1
<I
2
< . . . <I
h
where the highest index has the highest priority.
Transformations of the “External” Reference Sequences
The mismatches found for the reads classified in the classes N, M and I can be used to create “transformed” references to be used to compress more efficiently the read representation. Reads classified as belonging to the Classes N, M or I (with respect to the “pre-existing” (i.e. “external”) reference sequence denoted as RS0) can be coded with respect to the “transformed” reference sequence RS1 according to the occurrence of the actual mismatches with the “transformed” reference. For example if readMin belonging to Class M (denoted as the ith read of class M) containing mismatches with respect to the reference sequence RSn, then after “transformation” readMin=readPi(n+1) can be obtained with A(Refn)=Refn+1 where A is the transformation from reference sequence RSn to reference sequence RSn+1.
RS2=A(RS1)
When the representation of the transformation A which generates RS2 when applied to RS1 plus the representation of the reads versus RS2 corresponds to a lower entropy than the representation of the reads of class M versus RS1, it is advantageous to transmit the representation of the transformation A and the corresponding representation of the read versus RS2 because an higher compression of the data representation is achieved.
The coding of the transformation A for transmission in the compressed bitstream requires the definition of two additional descriptors as defined in the table below.
It has to be observed that, in some cases the transformation applied to the reference:
Definition of the Information Necessary to Represent Sequence Reads into Blocks of Descriptors
Once the classification of reads is completed with the definition of the Classes, further processing consists in defining a set of distinct descriptors which represent the remaining information enabling the reconstruction of the read sequence when represented as being mapped on a given reference sequence. The data structure of these descriptors requires the storage of global parameters and metadata to be used by the decoding engine. These data are structured in a Genomic Dataset Header described in the table below. A dataset is defined as the ensemble of coding elements needed to reconstruct the genomic information related to a single genomic sequencing run and all the following analysis. If the same genomic sample is sequenced twice in two distinct runs, the obtained data will be encoded in two distinct datasets.
A sequence read (i.e. a DNA segment) referred to a given reference sequence can be fully expressed by:
This classification creates groups of descriptors (descriptors) that can be used to univocally represent genome sequence reads. The table below summarizes the descriptors needed for each class of reads aligned with “external” (i.e. “pre-existing”) or “internal” (i.e. “constructed”)
Reads belonging to class P are characterized and can be perfectly reconstructed by only a position, a reverse complement information and an offset between mates in case they have been obtained by a sequencing technology yielding mated pairs, some flags and a read length. The next section further details how these descriptors are defined for classes P, N, M and I while for class U they are described in a following section
Class HM is applied to read pairs only and it is a special case for which one read belongs to class P, N, M or I and the other to class U.
Position Descriptor
In the position (pos) block only the mapping position of the first encoded read is stored as absolute value on the reference sequence. All the other position descriptors assume a value expressing the difference with respect to the previous position. Such modeling of the information source defined by the sequence of read position descriptors is in general characterized by a reduced entropy particularly for sequencing processes generating high coverage results.
For example,
Reverse Complement Descriptor
Each read of the read pairs produced by sequencing technologies can be originated from either genome strands of the sequenced organic sample. However, only one of the two strands is used as reference sequence.
Pairing Information Descriptor
The pairing descriptor is stored in the pair block. Such block stores descriptors encoding the information needed to reconstruct the originating reads pairs when the employed sequencing technology produces reads by pairs. Although at the date of the disclosure of the invention the vast majority of sequencing data is generated by using a technology generating paired reads, it is not the case of all technologies. This is the reason for which the presence of this block is not necessary to reconstruct all sequencing data information if the sequencing technology of the genomic data considered does not generate paired reads information.
The pair descriptor block is the vector of pairing errors calculated as number of reads to be skipped to reach the mate pair of the first read of a pair with respect to the defined decoding pairing distance.
The same descriptors are used for the pairing information of reads belonging to classes N, M, P and I. In order to enable the selective access to the different data classes, the pairing information of reads belonging to the four classes are encoded in different block as depicted in
Pairing Information in Case of Reads Mapped on Different Reference Sequences
In the process of mapping sequence reads on a reference sequence it is not uncommon to have the first read in a pair mapped on one reference sequence (e.g. chromosome 1) and the second on a different reference sequence (e.g. chromosome 4). In this case the pairing information described above has to be integrated by additional information related to the reference sequence used to map one of the reads. This is achieved by coding:
In
1) One special reserved value is encoded as pairing distance (in this case 0xffffff).
2) A second descriptor provides a reference ID as listed in the main header (in this case 4).
3) The third element contains the mapping information on the concerned reference (170).
Mismatch Descriptors for Class N Reads
Class N includes all reads in which only “n type” mismatches are present, at the place of an A, C, G or T base a N is found as called base. All other bases of the read perfectly match the reference sequence.
the positions of “N” in read 1 are coded as
the positions of “N” in read 2 are coded as
In the nmis block, the coding of each reads pair is terminated by a special “separator” symbol.
Descriptors Coding Substitutions (Mismatches or SNPs), Insertions and Deletions
A substitution is defined as the presence, in a mapped read, of a different nucleotide base with respect to the one that is present in the reference sequence at the same position.
Source Model 1: Substitutions as Positions and Types
Substitutions Positions Descriptors
A substitution position is calculated like the values of the nmis block, i.e.
In read 1 substitutions are encoded
In the snpp block, the coding of each reads pair is terminated by a special “separator” symbol.
Substitutions Types Descriptors
For class M (and I as described in the next sections), mismatches are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read {A, C, G, T, N, Z}. For example if the aligned read presents a C instead of a T which is present at the same position in the reference, the mismatch index will be denoted as “4”. The decoding process reads the encoded descriptor, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a “2” received for a position where a G is present in the reference will be decoded as “N”.
In case of adoption of the IUPAC ambiguity codes the substitution mechanism results to be exactly the same however the substitution vector is extended as: S={A, C, G, T, N, Z, M, R, W, S, Y, K, V, H, D, B}.
Some examples of substitutions encoding when IUPAC ambiguity codes are adopted are provided in
Encoding of Insertions and Deletions
For class I, mismatches and deletions are coded by an indexes (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read: {A, C, G, T, N, Z}. For example if the aligned read presents a C instead of a T present at the same position in the reference, the mismatch index will be “4”. In case the read presents a deletion where a “A” is present in the reference, the coded symbol will be “5”. The decoding process reads the coded descriptor, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a “3” received for a position where a G is present in the reference will be decoded as “Z”.
Inserts are coded as 6, 7, 8, 9, 10, respectively for inserted A, C, G, T, N.
Source Model 2: One Block Per Substitution Type and Indels
For some data statistics a different coding model from the one described in the previous section can be developed for substitutions and indels resulting into a source with lower entropy. Such coding model is an alternative to the techniques described above for mismatches only and for mismatches and indels.
In this case one data block is defined for each possible substitution symbol (5 without IUPAC codes, 16 with IUPAC codes), plus one block for deletions and 4 more blocks for insertions. For simplicity of the explanation, but not as a limitation for the application of the model, the following description will focus on the case where no IUPAC codes are supported.
Encoding of Additional Signaling Flags
Each data class introduced above (P, M, N, I) may require the encoding of additional information on the nature of the encoded reads. This information may be related for example to the sequencing experiment (e.g. indicating a probability of duplication of one read) or can express some characteristic of the read mapping (e.g. first or second in pair). In the context of this invention this information is encoded in a separate block for each data class. The main advantage of such approach is the possibility to selectively access this information only in case of need and only in the required reference sequence region. Other examples of the use of such flags are:
Descriptors for Class U and Construction of “Internal” References for Unmapped Reads of “Class U” and “Class HM”
In the case of the reads belonging to Class U or the unmapped pair of “Class HM” since they cannot be mapped to any “external” reference sequence satisfying the specified set of matching accuracy constraints for belonging to any of the classes P, N, M, or I, one or more “internal” reference sequences are “constructed” and used for the compressed representation of the reads belonging to these data classes.
Several approaches are possible to construct appropriate “internal” references such as for instance and not as limitation:
If the read being coded can be mapped on the “internal” reference satisfying the specified set of matching accuracy constraints, the information necessary to reconstruct the read after compression is coded using descriptors that can be of the following types:
The specific type of padding strategy will be signaled by special values in the indc block header
In case reads of class U present variable length, an additional descriptor rlen is used to store each read length.
This coding approach can be extended to support N start positions per read so that reads can be split over two or more reference positions. This can be particularly useful to encode reads generated by those sequencing technology (e.g. from Pacific Bioscience) producing very long reads (50K+ bases) which usually present repeated patterns generated by loops in the sequencing methodology. The same approach can be used as well to encode chimeric sequence reads defined as reads that align to two distinct portions of the genome with little or no overlap.
The approach described above can be clearly applied beyond the simple class U and could be applied to any block containing descriptors related to reads positions (pos blocks).
Alignment Score Descriptor
The mscore descriptor provides a score per alignment. In the context of this invention it is used to represent mapping/alignment score per read generated by genomic sequence reads aligners. The score is expressed using an exponent and fractional part. The number of bits used to represent the exponent and the fractional part are transmitted as configuration parameters. As an example, but not as a limitation, Table 2 shows how this is specified in IEEE RFC 754 for an 11-bits exponent and a 52-bits fractional part.
The score of each alignment can be represented by:
The base (radix) to be used for the calculation of scores is 10, therefore:
score=−1s×10E×M
Reads Groups
During the sequencing process different types of sequenced reads can be produced. As an example but not as a limitation types can be related to different sequenced samples, different experiments, different configuration of the sequencing machine. After sequencing and alignment this information is preserved, according to the disclosed invention, by means of a dedicated descriptor named rgroup. rgroup is a label associated to each encoded read and enables a decoding apparatus to partition the decoded reads in groups after decoding.
Descriptors for Multiple Alignments
The following descriptors are specified for the support of multiple alignments. In case of presence of spliced reads, this invention defines a global flag spliced_reads_flag to be set to 1.
mmap Descriptor
The mmap descriptor is used to signal on how many positions the read or the left-most read of a pair has been aligned. A Genomic Record containing multiple alignments is associated with one multi-byte mmap descriptor. The first two bytes of a mmap descriptor represent an unsigned integer N which refers to the read as a single segment (if no splices are present in the encoded dataset) or instead to all the segments into which the read has been spliced for the several possible alignments (if splices are present in the dataset). The value of N says how many values of the pos descriptor are coded for the template in this record. N is followed by one or more unsigned integers Mi as described below.
Multiple Alignments Strandedness
The rcomp descriptor described in this invention is used to specify the strandedness of each read alignment using the syntax specified in this invention.
Scores of Multiple Alignments
In case of multiple alignments one mscore as specified in this invention is assigned to each alignment.
Multiple Alignments without Splices
If no splices are present in the Access Unit, spliced_reads_flag is unset.
In paired-end sequencing, the mmap descriptor is composed of a 16-bit unsigned integer N followed by one or more 8-bit unsigned integers Mi, with i assuming values from 1 to the number of complete first (here, the left-most) read alignments. For each first read alignment, spliced or not, Mi is used to signal how many segments are used to align the second read (in this case, without splices, this is equal to the number of alignments), and then how many values of the pair descriptor are coded for that alignment of the first read.
The values of Mi shall be used to calculate P=Σi=1NMi which indicates the number of alignments of the second read.
A special value of Mi (Mi=0) indicates that the ith alignment of the left-most read is paired with an alignment of the right-most read which is already paired with a kth alignment of the left-most read with k<i (then there is no new alignment detected, which is consistent with the equation above).
As an example, in the simplest cases:
When Mi is 0, the associated value of pair shall link to an existing second read alignment; a syntax error will be raised otherwise and the alignment considered broken.
Example: if the first read has two mapping positions and the second read only one, N is 2, M1 is 1 and M2 is 0 as said earlier. If this is followed by another alternative secondary mapping for the entire template, N will be 3, and M3 will be 1.
39 illustrates the meaning of N, P and Mi in case of multiple alignments without splices and Error! Reference source not found. shows how the pos, pair and mmap descriptors are used to encode the multiple alignments information.
With respect to 40 the following applies:
Multiple Alignments with Splices
If the dataset is encoded with spliced reads, the msar descriptor enables representation of splices length and strandedness.
After having decoded the mmap and the msar descriptors, the decoder knows how many reads or read pairs have been encoded to represent the multiple mappings and how many segments are composing each read or read pair mapping. This is shown in
With reference to
With reference to
Alignment Score
The mscore descriptor allows signaling the mapping score of an alignment. In single-end sequencing it will have N1 values per template; in paired-end sequencing it will have a value for each alignment of the entire template (number of different alignments of the first read possibly+the number of further second read alignments, i.e. when Mi−1>0).
Number of scores=MAX(N1,N2)+M0
where M0 represent the total number of Mi=0.
In this invention more than one score value can be associated to each alignment. The number of alignments is signaled by a configuration parameter as_depth.
Descriptors for Multiple Alignments without Splices
Descriptors for Multiple Alignments with Splices
Table 4 shows the determination of the number of descriptors needed to represent multiple alignments in one Genomic Record in case of multiple alignments with splices.
Multiple Alignments on Different Sequences
It may happen that the alignment process finds alternative mappings to another reference sequence than the one where the primary mapping is positioned.
For read pairs that are uniquely aligned, a pair descriptor shall be used to represent the absolute read positions when there is for example a chimeric alignment with the mate on another chromosome. The pair descriptor shall be used to signal the reference and the position of the next record containing further alignments for the same template. The last record (e.g. the third if alternative mappings are coded in 3 different AUs) shall contain the reference and position of the first record.
In case one or more alignments for the left-most read in a pair are present on a different reference sequence than the one related to the currently encoded AU, then a reserved value is used for the pair descriptor. The reserved value is followed by the reference sequence identifier and the position of the left-most alignment among all those contained in the next AU (i.e. the first decoded value of the pos descriptor for that record).
Multiple Alignments with Insertions, Deletions, Unmapped Portions
When an alternative secondary mapping does not preserve the contiguity of the reference region where the sequence is aligned, it may be impossible to reconstruct the exact mapping generated by the aligner because the actual sequence (and then the descriptors related to mismatches such as substitutions or indels) is only coded for the primary alignment. The msar descriptor shall be used to represent how secondary alignments map on the reference sequence in case they contain indels and/or soft clips. If msar is represented by the special symbol “*” for a secondary alignment, the decoder will reconstruct the secondary alignment from the primary alignment and the secondary alignment mapping position.
msar Descriptor
The msar (Multiple Segments Alignment Record) descriptor supports spliced reads and alternative secondary alignments that contain indels or soft clips.
msar is intended to convey information on:
msar is used the syntax of the extended CIGAR string described below plus the additional symbol described in Table 5.
Extended Cigar Syntax
This section specifies an extended CIGAR (E-CIGAR) syntax for strings to be associated to sequences and related mismatches, indels, clipped bases and information on multiple alignments and spliced reads.
Edit operations described in this invention are listed in Table 6.
Source Models, Entropy Coders and Coding Modes
For each data class, sub-class and associated descriptor block of the genomic data structure disclosed in this invention different coding algorithms may be adopted according to the specific features of the data or metadata carried by each block and its statistical properties. The “coding algorithm” has to be intended as the association of a specific “source model” of the descriptor block with a specific “entropy coder”. The specific “source model” can be specified and selected to obtain the most efficient coding of the data in terms of minimization of the source entropy. The selection of the entropy coder can be driven by coding efficiency considerations and/or probability distribution features and associated implementation issues. Each selection of a specific “coding algorithm”, also referred to as “coding mode” can be applied to an entire “descriptor block” associated to a data class or sub-class for the entire data set, or different “coding modes” can be applied for each portion of descriptors partitioned into Access Units. Each “source model” associated to a coding mode is characterized by:
Further Advantages
The classification of sequence data into the defined data classes and sub-classes permits the implementation of efficient coding modes exploiting the lower information source entropy characterizing by modelling the sequences of descriptors by single separate data sources (e.g. distance, position, etc.).
Another advantage of the invention is the possibility to access only the subset of type of data of interest. For example one of the most important application in genomics consists in finding the differences of a genomic sample with respect to a reference (SNV) or a population (SNP). Today such type of analysis requires the processing of the complete sequence reads whereas by adopting the data representation disclosed by the invention the mismatches are already isolated into one to three data classes only (depending on the interest in considering also “n type” and “i type” mismatches).
A further advantage is the possibility of performing efficient transcoding from data and metadata compressed with reference to a specific “external” reference sequence to another different “external” reference sequence when new reference sequences are published or when re-mapping is performed on the already mapped data (e.g. using a different mapping algorithm) obtaining new alignments.
Class decoders 219 are able to reconstruct the original genomic sequences by leveraging the information on the original reference sequences carried by one or more genomic streams and the reference transformation descriptors 2112 carried in the encoded bitstream. In case the reference sequences are not transported by the genomic streams they must be available at the decoding side and accessible by the class decoders.
The inventive techniques herewith disclosed may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these may be stored on a computer medium and executed by a hardware processing unit. The hardware processing unit may comprise one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuitry. The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets and similar devices.
File Format: Selective Access to Regions of Genomic Data by Using the Master Index Table In order to support selective access to specific regions of the aligned data, the data structure described in this document implements an indexing tool called Master Index Table (MIT). This is a multi-dimensional array containing the loci at which specific reads map on the associated reference sequences. The values contained in the MIT are the mapping positions of the first read in each pos block so that non-sequential access to each Access Unit is supported. The MIT contains one section per each class of data (P, N, M, I, U and HM) and per each reference sequence. The MIT is contained in the Genomic Dataset Header of the encoded data.
The values contained in the MIT depicted in
For example, with reference to
Together with pointers to the block containing the data belonging to the four classes of genomic data described above, the MIT can be uses as an index of additional metadata and/or annotations added to the genomic data during its life cycle.
Local Index Table
Each genomic data block is prefixed with a data structure referred to as local header. The local header contains a unique identifier of the block, a vector of Access Units counters per each reference sequence, a Local Index Table (LIT) and optionally some block specific metadata. The LIT is a vector of pointers to the physical position of the data belonging to each Access Unit in the block payload.
In the previous example, in order to access region 150,000 to 250,000 of reads aligned on the reference sequence no. 2, the decoding application retrieved positions 3 and 4 from the MIT. These values shall be used by the decoding process to access the 3rd and 4th elements of the corresponding section of the LIT. In the example shown in
position of the data blocks belonging to the requested AU=data blocks belonging to AUs of reference 1 to be skipped+position retrieved using the MIT, i.e.
First block position: 5+3=8
Last block position: 5+4=9
The blocks of data retrieved using the indexing mechanism called Local Index Table, are part of the Access Units requested.
In an embodiment of this invention, the LIT can be integrated as a substructure of the MIT. The advantage of such approach is the speed of access to the indexed data in case of sequential parsing of the compressed file. If the LIT is integrated in the MIT in the file header, a decoding device would need to parse only a small portion of data to retrieve the requested compressed information in case of selective access. Another advantage is evident, to a person skilled in the art, in case of streaming on a network, when the indexing information contained in the MIT and LIT would be delivered among the first data blocks therefore enabling the receiving device to perform operations such as sorting and selective access before the entire data transfer is completed.
Access Units
The genomic data classified in data classes and structured in compressed or uncompressed blocks are organized into different Access Units.
Genomic Access Units (AU) are defined as sections of genome data (in a compressed or uncompressed form) that reconstructs nucleotide sequences and/or the relevant metadata, and/or sequence of DNA/RNA (e.g. the virtual reference) and/or annotation data generated by a genome sequencing machine and/or a genomic processing device or analysis application. An example of Access Unit is provided in
An Access Unit is a block of data that can be decoded either independently from other Access Units by using only globally available data (e.g. decoder configuration) or by using information contained in other Access Units.
Access Units are differentiated by:
Access units of any type can be further classified into different “categories”. Hereafter follows a non-exhaustive list of definition of different types of genomic Access Units:
Access Units of type 0 are ordered (e.g. numbered), but they do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming, multiplexing)
Access Units of type 1, 2, 3, 4, 5 and 6 do not need to be ordered and do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming).
Each Access unit can have a different number of packets in each block, but within an Access Unit all blocks have the same number of packets.
Each data packet can be identified by the combination of 3 identifiers X Y Z where:
Access Units of any type can be further classified and labelled in different “categories” according to different sequencing processes. For example, but not as a limitation, classification and labelling can take place when
Number | Date | Country | Kind |
---|---|---|---|
PCT/US17/17842 | Feb 2017 | US | national |
PCT/US17/41591 | Jul 2017 | US | national |
This application claims priority to and the benefit of Patent Applications PCT/US2017/017842 filed on Feb. 14, 2017, and PCT/US2017/041591 filed on Jul. 11, 2017.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2018/018092 | 2/14/2018 | WO | 00 |