This disclosure provides a novel method of representation for genome sequencing data aligned with respect to one or more reference sequences having multiple alignment coordinates or alignments which require the fragmentation of reads into smaller segments (also known as “spliced reads”). The disclosed representation reduces the utilized storage space and improves access performance by providing new functionality that are not available with known prior art methods of representation.
An appropriate representation of genome sequencing data is fundamental to enable efficient genomic analysis applications such as genome variants calling and all other analysis performed with various purposes by processing genome sequencing data and metadata.
Human genome sequencing has become affordable by the emergence of high-throughput low cost sequencing technologies. Such opportunity opens new perspectives in several fields ranging from the diagnosis and treatment of cancer to the identification of genetic illnesses, from pathogen surveillance for the identification of antibodies to the creation of new vaccines, drugs and the customization of personalized treatments.
Hospitals, genomic analysis providers, bioinformatics and large biological data storage centers are looking for affordable, fast, reliable and interconnected genomic information processing solutions which could enable scaling genomic medicine to a world-wide scale. Since one of the bottleneck in the sequencing process has become data storage, methods for representing genome sequencing data in a compressed form are increasingly investigated.
The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
Another important limitation of SAM is the lack of appropriate support of the representation of multiple alignments (also known as multiple mappings) associated to genomic sequence reads or read pairs. Genomic sequence reads alignment is a process consisting in reconstructing the genomic information of a sequenced sample from the sequence reads generated by Next Generation Sequencing technology. The reconstruction can be performed either without any prior knowledge of the originating genome, or using a pre-existing genome as a reference. The latter approach is known in the field as “reference-based alignment”. In reference-based alignment the genomic sequence reads generated from a sequenced sample are compared to a pre-existing reference sequence to find the region of the reference sequence that presents the smallest number of differences (if any) with respect to the sequence read. This process is called “aligning” or “mapping” the sequence reads against a reference sequence.
Due to the repetitive nature of some genomic regions, sequence reads can be aligned to several positions with the same degree of accuracy. For example the same sequence read can perfectly (i.e. without any mismatch) match two or more segments of the same length on the reference sequence. In this case the two or more alignments are considered to be equivalent and the sequence read is said to have “multiple alignments”. This case is illustrated in
In some cases, in order to find alignment position that meet pre-established matching criteria such as the maximum number of mismatches, sequence reads have to be split into two or more sub-segments. In this case the reads are called “spliced reads” and each sub-segment is called a “splice”. This case is illustrated in
The current specification of SAM does not support the representation of multiple alignments and splices using the eleven mandatory fields, but requires the use of optional fields that are used in different and thus inefficient ways by different implementations of tools used for sequence reads alignment. The invention described in this document provides a solution to the problem of both representing multiple alignments and spliced reads and preserving compression and accessibility efficiency.
A more sophisticated approach to genomic data compression that is less used, but more efficient than BAM in terms of compression is CRAM. CRAM provides more efficient compression for the adoption of differential encoding with respect to a reference (it partially exploits the data source redundancy), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data and appropriate representation of multiple alignments and spliced reads.
These approaches generate poor compression ratios and data structures that are difficult to navigate and manipulate once compressed. Downstream analysis can be very slow due to the necessity of handling large and rigid data structures even to perform simple operation or to access selected regions of the genomic dataset. CRAM relies on the concept of the CRAM record. Each CRAM record represents a single mapped or unmapped reads by coding all the elements necessary to reconstruct it.
CRAM presents the following drawbacks and limitations that are solved and removed by the invention described in this document:
1. CRAM does not support data indexing and random access to data subsets sharing specific features. Data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it is implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded (i.e. compressed) bit stream.
2. CRAM is built by core data blocks that can contain any type of mapped reads (perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as “indels”)). There is no notion of data classification and grouping of reads in classes according to the result of mapping with respect to a reference sequence. This means that all data need to be inspected even if only reads with specific features are searched. Such limitation is solved by the invention by classifying and partitioning data in classes before coding.
3. CRAM is based on the concept of encapsulating each read into a “CRAM record”. This implies the need to inspect each complete “record” when reads characterized by specific biological features (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) are searched.
Conversely, in the present invention there is the notion of data classes coded separately in separated information layers and there is no notion of record encapsulating each read. This enables more efficient access to set of reads with specific biological characteristics (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) without the need of decoding each (block of) read(s) to inspect its features.
4. In a CRAM record each record field is associated to a specific flag and each flag must always have the same meaning as there is no notion of context since each CRAM record can contain any different type of data. This coding mechanism introduces redundant information and prevents the usage of efficient context based entropy coding.
Instead in the present invention, there is no notion of flag denoting data because this is intrinsically defined by the information “layer” the data belongs to. This implies a largely reduced number of symbols to be used and a consequent reduction of the information source entropy which results into a more efficient compression. Such improvement is possible because the use of different “layers” enables the encoder to reuse the same symbol across each layer with different meanings according to the context. In CRAM each flag must always have the same meaning as there is no notion of contexts and each CRAM record can contain any type of data.
5. In CRAM substitutions, insertions and deletions are represented by using different syntax elements, option that increases the size of the information source alphabet and yields a higher source entropy. Conversely, the approach of the disclosed invention uses a single alphabet and encoding for substitutions, insertions and deletions. This makes the encoding and decoding process simpler and produces a lower entropy source model which coding yields bitstreams characterized by high compression performance.
6. CRAM lacks the support of an appropriate representation of multiple alignments and spliced reads for both single reads and paired-end reads.
Once mapped at one or more positions on a reference sequence, genomic sequence reads can either perfectly match the reference sequence segment they are mapped to or present some mismatches. Mismatches can be of type:
The present invention aims at compressing genomic sequences by classifying data according to the result of the sequence reads alignment process and partitioning sequencing data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are directly enabled in the compressed domain.
When genomic data are classified according to the result of the alignment process, the representation in terms of syntax elements disclosed in this invention enables more efficient entropy coding, selective access to data and incremental updates. Enhanced compression is due to the fact that the data are partitioned into independent data blocks with homogeneous statistical properties. When data are structured in such blocks which can be independently decompressed, selective access requires less computational power and bandwidth and incremental update is possible with the addition of new encoded data blocks without the need of re-encoding the entire dataset.
One of the aspects of the presented approach is the definition of classes of data and metadata structured in different layers and encoded separately. The more relevant improvements of such approach with respect to existing methods consist in:
1. the increase of compression performance due to the reduction of the information source entropy constituted by providing an efficient source model for each class of data or metadata;
2. the possibility of performing selective accesses to portions of the compressed data and metadata for any further processing purpose directly in the compressed domain;
3. the possibility to incrementally (i.e. without the need of decoding and re-encoding) update compressed data and metadata with new sequencing data and/or metadata and/or new analysis results associated to specific sets of sequencing reads.
The representation of genomic sequence data disclosed in this invention relies on the concept of “descriptors”. A descriptor is defined as an element of the syntax used to represent genomic sequence data to be compressed using entropy coders. The representation of the original genomic sequence data by means of different descriptors enables more efficient compression and enhanced selective access to data. More efficient compression is achieved by employing different entropy coders per each type of descriptor or sub-sets of descriptors sharing the same statistical properties. More efficient selective access is enabled by the definition of descriptors that enable partitioning the genomic information in sub-set of data according to the different biological meaning. Since each sub-set of data can be decoded independently of the other data, the required processing power is reduced and the decoding time is shorter.
The descriptors defined in this invention disclosure are structured in a multiplicity of “Descriptor Streams” which are blocks of compressed syntax element of the same type.
The features of the claims below solve the problem of existing prior art solutions by providing:
a method for encoding genome sequence data, said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of:
aligning said reads to one or more reference sequences thereby creating aligned reads,
classifying said aligned reads according to specified matching rules with said one or more reference sequences, thereby creating classes of aligned reads,
encoding said classified aligned reads as a multiplicity of streams of syntax elements,
wherein encoding said classified aligned reads as a multiplicity of streams of syntax elements comprises selecting said syntax elements according to said classes of aligned reads,
providing said streams of syntax elements with header information thereby creating successive data blocks in order to entropy code said genomic data blocks into separately accessible data units.
In another aspect the encoding method further comprises:
classifying said reads that do not satisfy said specified matching rules into a class of unmapped reads encoding said classified unmapped reads as a multiplicity of streams of syntax elements,
providing said streams of syntax elements and said encoded reference sequences with header information thereby creating successive Access Units.
In another aspect the encoding method further comprises that said classifying comprises identifying genomic reads having multiple alignment positions on the reference sequences used for alignment.
In another aspect the encoding method further comprises that said classifying comprises identifying genomic reads which need to be split into multiple segments named splices to satisfy the matching rules for alignment.
In another aspect the encoding method further comprises that the reads of the genomic sequence to be encoded are paired.
In another aspect the encoding method further comprises the steps of:
identifying the number of alignments of a read and representing this number with a specific syntax element,
identifying per each alignment the corresponding mapping position and representing each mapping position with a specific syntax element,
assigning an alignment score to each alignment to identify primary and secondary alignments,
identifying as primary alignment the alignment with the highest score,
identifying if any alignment is found on a different reference than the primary and representing this information using a specific descriptor,
identifying if any alignment does not preserve the a different contiguity on the reference sequence of the primary alignment and representing this information using a specific syntax element.
In another aspect the encoding method further comprises the steps of:
identifying reads which need to be split into two or more splices in order to be aligned on the reference sequence according to pre-defined matching rules defining the matching with said one or more reference sequences,
signaling the presence of spliced reads using a global configuration parameter,
representing the number of splices using a specific syntax element,
representing the length of each splice using a specific syntax element,
In another aspect the encoding method further comprises the steps of:
identifying the number of alignments of each read in a pair and representing this number with a specific syntax element,
identifying per each alignment of the left-most read in the pair the corresponding mapping position and representing each mapping position with a specific syntax element,
identifying per each alignment of the left-most read the associated alignments of the right-most read in the pair and representing the association with a specific syntax element,
assigning an alignment score to each pair of alignments to identify primary and secondary alignments,
identifying as primary alignment the pair of alignments with the highest score,
identifying if any alignment is found on a different reference than the primary and representing this information using a specific descriptor,
identifying if any alignment presents a different contiguity on the reference sequence than the primary alignment and representing this information using a specific syntax element.
In another aspect the encoding method further comprises the steps of:
identifying reads which need to be split into two or more splices in order to be aligned on the reference sequence according pre-defined matching rules,
signaling the presence of spliced reads using a global configuration parameter,
representing the number of splices of the left-most read in a pair using a specific syntax element,
representing the number of splices of the right-most read associated to each alignment of the left-most read with a vector of specific syntax elements,
representing the length of each splice using a specific syntax element.
In another aspect the encoding method further comprises that:
said streams of syntax elements comprise a genomic dataset header comprising,
a dataset group identifier used to uniquely identify each dataset group,
a genomic dataset identifier used to uniquely identify each dataset,
a brand identifier used to identify the data format specification the dataset complies with,
a minor version number used to identify the data format specification the dataset complies with,
the length of encoded genomic reads in nucleotides used to signal constant length reads,
a flag signaling the presence of paired end reads,
a flag signaling the presence of blocks headers,
a flag signaling the mode in which Access Units are stored on a storage medium in order to facilitate data access when decoding said Access Units,
the type of alphabet used to encode mismatches of sequence reads with respect to the reference sequences,
the number of reference sequences used to code the dataset,
a numeric identifier per each reference sequence used to uniquely identify each reference sequence,
a string identifier per each reference sequence used to uniquely identify each reference sequence,
the number of coded Access Units per reference sequence used to count the Access Units associated to each reference sequence,
the type of coded genomic data used to distinguish among aligned reads, unaligned reads,
unmapped reads and reference sequences,
the number of data classes coded in the dataset,
the number of descriptors used per each data class coded in the dataset used during the decoding process,
the total number of clusters used to index encoded unmapped reads,
the number of bits used to represent the integer values used to encode the cluster signatures used to decode encoded clusters signatures,
a flag signaling if all the cluster signatures have the same length in terms of number of nucleotides the length of the clusters signatures.
In another aspect the encoding method further comprises that:
said streams of syntax elements comprise a master index table, containing one section for each Class and sub-class of aligned reads, said section comprising,
the mapping positions on said one or more reference sequences of the primary alignment of the left-most read of each Access Units of each Class or sub-class of data,
the position on said one or more reference sequences of the right-most mapped base among all primary alignments of each Access Units of each Class or sub-class of data,
the offset in bytes of each block of coded syntax elements composing each Access Unit,
In another aspect the encoding method further comprises that said Master Index Table further contains the size of each coded descriptors stream.
In another aspect the encoding method further comprises that said Master Index Table further contains the size of each Access Unit.
In another aspect the encoding method further comprises that genomic reads have multiple alignments and said Master Index Table contains:
the mapping positions on said one or more reference sequences of the left-most alignment among all reads of each Access Units of each Class or sub-class of data,
the position on said one or more reference sequences of the right-most mapped base among all alignments of each Access Units of each Class or sub-class of data,
the offset in bytes of each block of coded syntax elements composing each Access Unit.
In another aspect the encoding method further comprises that said Access Units contain coded read pairs.
In another aspect the encoding method further comprises that said Master Index Table is jointly coded with said Access Unit data.
In another aspect the encoding method further comprises that said Genomic Dataset Header is jointly coded with said Access Unit data.
In another aspect the encoding method further comprises that said streams of syntax elements further comprise information related to the type of reference used (pre-existing or constructed) and the segments of the read that do not match on the reference sequence.
In another aspect the encoding method further comprises that said classified aligned reads as multiplicity of streams of syntax elements comprises the step of associating a specific source model and a specific entropy coder to each descriptor stream.
In another aspect the encoding method further comprises that said entropy coder is one of a context adaptive arithmetic coder, a variable length coder or a golomb coder.
A method for decoding encoded genomic data comprising the steps of:
parsing Access Units containing said encoded genomic data to extract multiple streams of syntax elements by employing header information,
decoding said multiplicity of streams of syntax elements to extract aligned reads according to specific matching rules defining their classification with respect to one or more reference sequences.
In another aspect the decoding method further comprises decoding unmapped genomic reads.
In another aspect the decoding method further comprises decoding a genomic dataset header containing global configuration parameters.
In another aspect the decoding method further comprises decoding a master index table containing one section for each class of reads and the associated relevant mapping positions and coded blocks offsets.
In another aspect the decoding method further comprises decoding information related to the type of reference used: pre-existing, transformed or constructed.
In another aspect the decoding method further comprises that said genomic reads are paired.
In another aspect the decoding method further comprises that said genomic data are entropy decoded.
In another aspect the decoding method further comprises the decoding of multiple alignments information comprising the steps of:
decoding the number of alignments of each read,
decoding the position of each alignment,
identifying the primary alignment by decoding the score associated to each alignment,
identifying if any secondary alignment has a different contiguity with respect to the reference sequence than the primary alignment by decoding the corresponding syntax elements.
In another aspect the decoding method further comprises the steps of:
Identifying if any coded read is split into two or more splices,
decoding the length of each splice,
decoding the mapping position of each splice.
In another aspect the decoding method further comprises that the coded genomic read are paired end reads and the steps of:
decoding the number of alignments of the right-most read associated with each alignment of the left-most read,
decoding the pairing information associating each alignment of the left-most read with one or more alignment of the right-most read.
In another aspect the decoding method further comprises that the coded genomic read are split into two or more splices and the steps of:
decoding the length of each coded splice,
decoding the mapping position of each splice.
The present invention further provides a genomic encoder (2810) for the compression of genome sequence data 289, said genome sequence data 289 comprising reads of sequences of nucleotides, said genomic encoder (2810) comprising:
an aligner unit (281), configured to align said reads to one or more reference sequences thereby creating aligned reads,
a constructed-reference generator unit (282), configured to produce constructed reference sequences,
a data classification unit (284), configured to classify said aligned reads according to specified matching rules with the one or more pre-existing reference sequences or constructed reference sequences thereby creating classes of aligned reads (288),
one or more descriptors stream encoding units (285-287), configured to encode said classified aligned reads as streams of syntax elements by selecting said syntax elements according to said classes of aligned reads,
a multiplexer (2816) for multiplexing the compressed genomic data and metadata.
In another aspect the genomic encoder further comprises:
A data classification unit (284) contains encoders of data classes N, M and I configured with vectors of thresholds generating sub-classes of data classes N, M and I.
In another aspect the genomic encoder further comprises the features suitable for executing all the aspects of the previously mentioned coding methods.
The present invention further provides a genomic decoder (298) for the decompression of a compressed genomic stream (291) said genomic decoder (298) comprising:
a demultiplexer (290) for demultiplexing compressed genomic data and metadata,
parsing means (292-294) configured to parse said compressed genomic stream into streams of syntax elements (295),
one or more syntax elements streams decoders (296-297), configured to decode the descriptors streams into classified reads of sequences of nucleotides (2911),
genomic data classes decoders (299) configured to selectively decode said classified reads of sequences of nucleotides on one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.
In another aspect the genomic decoder further comprises that the one or more reference sequences are stored in the compressed genome stream (291).
In another aspect the genomic decoder further comprises that the one or more reference sequences are provided to the decoder via an out of band mechanism.
In another aspect the genomic decoder further comprises that the one or more reference sequences are built at the decoder.
The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned coding methods.
The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned decoding methods.
The present invention further provides a support data storing genomic encoded according perform all the aspects of the previously mentioned coding methods.
The genomic or proteomic sequences referred to in this invention include, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences. Although the description herein is in considerable detail with respect to genomic information in the form of a nucleotide sequence, it will be understood that the methods and systems for compression can be implemented for other genomic or proteomic sequences as well, albeit with a few variations, as will be understood by a person skilled in the art.
Genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a.k.a. “bases”) represented by strings of letters from a defined vocabulary. The smallest vocabulary is represented by five symbols: {A, C, G, T, N} representing the 4 types of nucleotides present in DNA namely Adenine, Cytosine, Guanine, and Thymine. In RNA Thymine is replaced by Uracil (U). N indicates that the sequencing machine was not able to call any base and so the real nature of the nucleotide at that position is undetermined. In case the IUPAC ambiguity codes are adopted by the sequencing machine as vocabulary, the alphabet used for the symbols is composed of the following symbols: {A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or -}.
In the context of the present invention a “Genomic Dataset” is defined as any structured set of genomic data including, for example, genomic data of a living organism, one or more sequences and metadata generated by the genomic sequencing of a living organism or by any other step of genomic data processing performed on the original sequencing data.
The nucleotides sequences produced by sequencing machines are called “reads”. Sequence reads can be composed of a number of nucleotides ranging between a few dozens to several thousand. Some sequencing technologies produce sequence reads composed of “pairs” of which one read is originated from one DNA strand and the other is originated from the other strand. A read associated to another read in a sequencing process producing pairs is said to be its “mate”.
Throughout this disclosure, a reference sequence is a sequence of nucleotides associated to a mono-dimensional integer coordinate system for which each integer coordinate is associated to a single nucleotide. Coordinate values can only be equal or larger than zero. This coordinate system in the context of this invention is zero-based (i.e. the first nucleotide has coordinate 0 and it is said to be at position 0) and linearly increasing from left to right.
When mapping sequence reads on a reference sequence, said reference sequence is used as axis of a mono-dimensional coordinate system in which the left-most position is denoted as position 0. In a sequence read, mapped to a reference sequence, the nucleotide composing the read mapped at the reference sequence position identified by the smallest coordinate number is usually referred to as the “left-most” nucleotide, whereas the nucleotide composing the read mapped at the reference sequence position identified by the largest coordinate number is referred to as the “right-most” nucleotide. This is illustrated in
When a sequence read is mapped to a reference sequence, the coordinate of the left-most mapped base is said to represent the mapping position of the read on the reference sequence.
A reference genome is composed by one or more reference sequences and it is assembled by scientists as a representative example of a species' set of genes. For example GRCh37, the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, N.Y. However, a reference sequence could also consist of a synthetic sequence conceived and merely constructed to improve the compressibility of the reads in view of their further processing.
Throughout this disclosure, a genomic record is defined as a coded representation of
Throughout this disclosure, a genomic record position is defined as the position on the reference sequence of the left-most mapped base of the read or read pair coded in a genomic record. Mapped bases in a read or read pair record include:
A base present in the aligned read and not present in the reference sequence (a.k.a. insertion) and bases preserved by the alignment process, but not mapped on the reference sequence (a.k.a. soft clips) do not have mapping positions.
The read composing a read pair with a base mapping on the smallest coordinate on a reference is referred to, in this disclosure, as “Read 1” whereas its mate is referred to as “Read 2”.
The distance, expressed as number of nucleotides (or bases), separating two reads generated as a pair, by a sequencing machine using today technology, is unknown, and it is determined by mapping both reads composing the pair (i.e. minimizing appropriate matching functions) to a reference sequence.
Throughout this disclosure, a genomic record length is defined as the number of coordinate positions between the left-most mapped base coded in the record and the right-most mapped base coded in the record.
Throughout this disclosure, a pairing distance is defined as the number of coordinate positions between the left-most mapped base coded in the record and the left-most mapped base of Read 2 coded in the record. An example of pairing distance is shown in
Throughout this disclosure, in case of single reads the genomic record length (GRL) is calculated as the integer obtained by subtracting the mapping position of the left-most base from the mapping position of the right-most mapped base and adding “1”.
GRL=(right-most base position)−(left-most base position)+1
Throughout this disclosure in case of read pairs, the genomic record length (GRL) is calculated as the integer obtained by subtracting the mapping position of the left-most base of the read mapping at the smallest position on a reference sequence (Read 1) from the coordinate of the mapping position of the right-most base of its mate (Read 2) and adding “1”. Such definition of genomic record length is illustrated in
Throughout this disclosure, in case of genomic record encoding a reference sequence or a portion thereof, the genomic record length is defined as the number of nucleotides composing the encoded sequence.
Throughout this disclosure, a genomic range is defined as a contiguous segment of coordinates on a reference sequence defined by a start coordinate S and an end coordinate E such that S≤E. The start and end position of a genomic range are always included in the Range.
This invention aims at defining a new method enabling the efficient access to aligned genomic sequence reads mapped to any genomic region when the sequence reads are compressed by means of sets of descriptors included into a number of data blocks called Access Units.
Throughout this disclosure, an Access Unit (AU) is defined as a logical data structure containing a coded representation of genomic information or related metadata to facilitate the bit stream access and manipulation. It is the smallest data organization that can be decoded by a decoding device implementing the invention described in this disclosure.
According to the type of coded information, an AU can be decoded either independently of any other AU or using information contained in other AUs.
AUs can be of a multiplicity of types according to the nature of the coded data. An Access Unit contains either a reference sequence, or a portion thereof, or encoded reads or read pairs belonging to a single class of data. For example an Access Unit may contain the entire chromosome 1 of GRCh37, the Genome Reference Consortium human genome (build 37). Another Access Unit may contain the coded representation of nucleotides of chromosome 1 of GRCh37 that are located between coordinates 50,000 and 150,000. Another Access Unit may contain only reads or read pairs that perfectly map on the reference sequence without any mismatch. Another Access Unit may contain reads or read pairs that only contain “N” symbols as mismatches with respect to the reference sequence. Another Access Unit may contain reads or read pairs that contain any type of substitutions (e.g. one base present in the read or read pair is different from the base at the corresponding mapping position in the reference sequence). Another Access Unit may contain reads or read pairs that contain mismatches, insertions, deletions and soft clipped bases. Another Access Unit may contain only read or read pairs that do not map on the reference sequence. Another Access Unit may contain only read pairs in which one read is mapped and the other is not mapped on the reference sequence. Another type of Access Unit may contain only encoded segments of a reference genome composed by one or more reference sequences (for example chromosomes).
The essential characteristic of an Access Unit is that it contains in compressed form all elements needed to reconstruct the genomic information (sequence reads or read pairs, reference sequences), associated alignment information and metadata of reads or read pairs it represents. In other words, to fully reconstruct the reads, read pairs or reference sequence and associated information carried by an Access Unit it is only necessary to retrieve the Access Unit itself and, when needed, the Access Units containing the reference sequence it refers to.
In each Access Unit the descriptors listed in the next section representing the encoded read or read pairs are aggregated in separate data blocks—one per type—in order to exploit their homogeneous statistical properties when entropy coding is applied to compress them.
Each Access Unit contains the compressed sub-set of descriptors representing sequence reads or read pairs belonging to the same class mapped to a genomic region on a reference sequence. Such genomic region on the reference sequence is defined by a start coordinate (or start position) and an end coordinate (or end position).
An example of Access Unit is illustrated in
Descriptors are syntax elements representing part of the information necessary to reconstruct (i.e. decode) coded reference sequences, sequence reads and the associated mapping information or pairs of sequence reads and associated mapping information. Different types of descriptors are defined to express:
The complete list and the precise definition of each descriptor referred to in this disclosure, and used by this invention is provided in the following sections.
As mentioned above, a multiplicity of descriptors are used by this invention to represent in compressed form reference sequences, sequence reads or read pairs so that they can be fully reconstructed with their associated information. In case reads or read pairs are also classified and separated into different classes, according to the results of the mapping on reference sequences, and entropy coded into separate data blocks, different sub-sets of descriptors are used for representing each class or reads or read pairs. Therefore, an Access Unit contains only those entropy coded descriptors that are needed to represent either a reference sequences—or portions thereof—or reads or read pairs belonging to the same class. This is shown in
Throughout this disclosure, entropy coded descriptors of the same type are said to compose a Descriptor Stream.
Throughout this disclosure, an Access Unit Start Position is defined as the left-most Genomic Record Position among all Genomic Records contained in the Access Unit.
Throughout this disclosure, an Access Unit End Position is defined as the right-most mapped base position among all mapped bases of all Genomic Records contained in the Access Unit
Throughout this disclosure, the Access Unit Range is defined as the Genomic Range comprised between the AU Start Position and the right-most genomic record position among all genomic records contained in the Access Unit. Its value in number of positions can be calculated by subtracting the AU Start Position from the AU End Position and adding “1”.
Throughout this disclosure, the Access Unit Covered Region is defined as the Genomic Range comprised between the AU Start Position and the AU End Position.
Throughout this disclosure, an Access Unit is also said to cover the genomic region between its Start Position and its End Position.
Some genomic records coded in an AU can have mapping position at a distance from the AU end position that is smaller than the genomic record length. This implies that some bases belonging to the read or read pair coded in the genomic record are mapped in genomic regions covered by one of the following AUs. This is shown in
According to the definitions provided above two useful ways of building Access Units can be identified:
The “non-overlapping mode” is preferable in scenarios when genomic data are compressed and stored in memory as files as well as in streaming scenarios when stored files are streamed from one storage device to another since it enables the implementation of efficient selective access procedures. The “overlapping mode” supports streaming scenarios when portions of the genomic datasets become available for coding into Access Units and transmission before the entirety of the genomic sequence data is available at the transmitting device.
The main innovative aspects of the disclosed invention are the following:
The MIT enables efficient random access to reads or read pairs mapped to specified genomic regions by associating genomic regions on a reference sequence with the corresponding locations on a storage device of the Access Units containing a compressed representation of said reads or read pairs.
In the following, each of the above innovative aspects will be further described in full details.
The sequence reads generated by sequencing machines are classified by the disclosed invention into six different “classes” according to the matching results of the alignment with respect to one or more “pre-existing” reference sequences.
When aligning a DNA sequence of nucleotides with respect to a reference sequence the following cases can be identified:
The classification specified in the previous section concerns single sequence reads. In the case of sequencing technologies that generates read in pairs (i.e. Illumina Inc.) in which two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider the classification of the entire pair to a single data class. A read that is coupled with another is said to be its “mate”.
If both paired reads belong to the same class the assignment to a class of the entire pair is obvious: the entire pair is assigned to the same class for any class (i.e. P, N, M, I, U). In the case the two reads belong to a different class, but none of them belongs to the “Class U”, then the entire pair is assigned to the class with the highest priority defined according to the following expression:
P<N<M<I
in which “Class P” has the lowest priority and “Class I” has the highest priority.
In case only one of the reads belongs to “Class U” and its mate to any of the Classes P, N, M, I a sixth class is defined as “Class HM” which stands for “Half Mapped”.
The definition of such specific class of reads is motivated by the fact that it is used for attempting to determine gaps or unknown regions existing in reference genomes (a.k.a. little known or unknown regions). Such regions are reconstructed by mapping pairs at the edges using the pair read that can be mapped on the known regions. The unmapped mate is then used to build the so called “contigs” of the unknown region as it is shown in
The table below summarizes the matching rules applied to reads in order to define the class of data each read belongs to. The rules are defined in the first five columns of the table in terms of presence or absence of type of mismatches (n, s, d, i and c type mismatches). The sixth column provide rules in terms of maximum threshold for each mismatch type and any function f(n,$) and w(n,s,d,i,c) of the possible mismatch types.
A common element of efficient approaches to genomic sequence reads compression is the exploitation of the correlation of sequence data with respect to a reference sequence. Even if the somatic profile of the human population is extremely diversified, the actual portion of the number of nucleotides that differs from person to person is about only 0.1% of the total number of nucleotides composing an entire genome. Therefore, the specific genomic information characterizing each individual is very limited with respect to the entire information carried by an entire genome. When a pre-existing reference genome is available, be it for previous sequencing or as a published “average” consensus reference, the most efficient way to encode the actual information is to identify and encode only the differences with respect to the reference genome.
In order to do so with raw sequence reads in the form of FASTQ data, a preliminary pre-processing step of mapping on an available reference genome is performed. In case a reference genome is not available or if the bias introduced by the usage of a specific reference is not desirable, the construction of a new reference sequence by means of assembling the available sequence reads into longer sequences is a possible alternative.
When sequence reads have been mapped with respect to a pre-existing or to a constructed reference sequence, each sequence read can be fully represented by a number of elements denoted in this disclosure as “read descriptors” or simply “descriptors”.
For example, in the case of a sequence read that perfectly matches a segment of a reference sequence, the only sub-set of descriptors needed to represent the sequence read is composed by the coordinate of the mapping position on the reference (usually the coordinate of the mapping position of the left-most base of the sequence read), the length of the sequence read itself and the information indicating if the read is mapping on the direct or reverse DNA strand with respect to the reference sequence strand.
In case it is not possible to find any mapping position for which all the bases of the sequence read match all bases of the reference sequence, the mapping (or mappings) with the minimal number of mismatches are retained. In such case a different sub-set of descriptors is needed to also express the substitutions, the insertions, the deletions and the clipped bases that may occur in correspondence of the mapping position with the minimal or close to minimal number of mismatches. With such sub-set of descriptors the sequence reads can be reconstructed using the information carried by the descriptors and the information carried by the reference sequence.
The mapping process can also produce other types of information such as: multiple possible mapping positions and related scores, the quality of mapping, the specification of spliced reads, the mapping on two different references (usually chromosomes) of reads belonging to a pair, features of the sequencing process (e.g. PCR or optical duplicate). All such information requires specific additional descriptors that extend each sub-sets that then is compressed by applying for each sub-set of descriptors appropriate entropy coding algorithms.
The genome sequencing process can generate read duplicates (i.e. two or more exact copies of the same genomic sequence) due to:
Each read or read pair can therefore be uniquely represented by a specific sub-set of descriptors according to the results of the mapping process.
Commonly used approaches such as SAM and CRAM do not encode reads or read pairs according to the specific sub-set of descriptors needed to represent their mapping information. SAM and CRAM do not classify sequence reads into data classes according to the number and type of mismatches they contain with respect to the reference sequence they are mapped to. Furthermore, these formats do not code sequence reads separately into Access Units containing in compressed form only sequence reads belonging to a single data class. In the case of sequence reads generated in pairs, state of the art approaches do not code them as single elements partitioned into classes according to their mapping accuracy with respect to the reference sequence. Such state of the art approaches are characterized by the following limitations and drawbacks:
When coding read pairs by means of a single sub-set of descriptors, the following technical advantages are evident to a person skilled in the art:
In order to enable efficient selective access to specific portions of sequencing data and being able to transport them on a digital data network, the set of descriptors used to represent sequence reads aligned to a reference are structured in logically separate and independent data blocks called Access Units (AU). Each Access Unit contains only the compressed representation of a single data class, and can be decoded either independently of any other Access Units or using only Access Units carrying the coded representation of the reference sequence region used for mapping. This enables selective access and out-of-order transport capabilities.
In order to increase the compression efficiency this invention eliminates the need of specifying the “mapping reference identifier” descriptor for each read pair having both pairs mapped on the same reference sequence. Each Access Unit can contain only reads or pairs that map on the same reference. Using such solution the descriptor representing the reference sequence identifier needs to be encoded only once per each Access Unit or set of Access Units (and not repeated for each read as currently done in SAM/BAM formats).
The only exception of the rule expressed above is the case of read pairs having the two reads mapped on different reference sequences (e.g. chromosomes). In this case the pair is split and the two reads are coded as two separate genomic records and each encoded read contains the identifier of the reference sequence its mate is mapped to.
A person skilled in the art knows that classifying information in groups of elements with homogeneous statistical properties provides better compression performance with respect to the usage of a general purpose compressor (e.g. LZ type algorithm) applied to a heterogeneous set of data. As a consequence, when encoding genomic sequence reads in pairs by means of a specific sub-set of descriptors, higher compression is achieved thanks to the lower entropy characterizing each separate sub-set of descriptors and higher processing efficiency when reconstructing and retrieving read pairs.
This section introduces the descriptors specified to represent genomic sequence reads mapped on a reference sequence. The specific sub-set of descriptors used to represent each read or read pair depends on the result of the mapping versus a reference sequence (i.e. presence or absence of mismatches between the read, or read pair and the reference sequence).
A read or read pair position is defined as the mapping position on the reference sequence of the left-most base in the read or read pair. A descriptor of type “position” is necessary per each read or read pair. The value of the “position” descriptor represents:
A “position” descriptor is required to represent each encoded read or read pair.
In this invention disclosure such descriptor will be referred to as pos descriptor.
In case of read pairs, the descriptor that represents how each read is associated with its mate in the pair can be expressed by several syntax elements as it may represent:
In this invention disclosure such descriptor will be referred to as abspair descriptor. In case of mate mapped on a different reference sequence the descriptor identifying the reference sequence is referred to as refid
In the case of reads with variable length, a descriptor per read is used to represent the length expressed as the number of nucleotides composing the read. Obviously, a read length descriptor is required per each read only in the case of variable read length.
In this disclosure this descriptor is also referred to as rlen descriptor.
The DNA is composed by a double helix in which each strand is the complement of the other since Adenine (‘A’) only couples with Thymine (‘T’) and Cytosine (‘C’) only couples with Guanine (‘G’). Therefore, it is only necessary to represent one strand to know the nucleotide composition of the other. This is the reason for which reference sequences are always represented by a single sequence, and mapping tools are capable of finding mapping positions for reads belonging to both strands. If a read is mapped on the complementary strand of the DNA helix, it is said to be “reverse complemented”. A descriptor is necessary to carry such information and carries the information indicating if the original read is the reverse complement or not of the reference sequence it is mapped to.
A reverse complement descriptor is required per each read.
In this disclosure such descriptor is also referred to as rcomp descriptor.
Unknown Bases Positions During the sequencing process it may occur that the machine is unable to call any base at a given positions of a read or read composing a pair. Such event is identified by a special symbol ‘N’ at the corresponding read position. A descriptor identifying each occurrence of an ‘N’ symbol in a read is thus needed.
The descriptor may represent:
In this disclosure such descriptor is also referred to as nmis descriptor.
Sequence reads mapped on a reference sequence may present mismatches with respect to the reference sequence segment they are mapped to. These mismatches can be classified and are denoted as substitutions, deletions or insertions according to the following cases:
The representation of each mismatch type implies the use of three descriptors, one representing the mismatch position in the read or read pair (also referred to as mmpos), another representing the type of mismatch when only substitutions are present (also referred to as subtype) and another representing the type of mismatch when substitutions, insertions and deletions are present (also referred to as mmtype).
Genomic sequence reads mapped on a reference sequence may present at their edges portions of the sequence of nucleotides that do not match any or just a few of those present on the reference sequence at the mapping position. These sequences portions are called soft-clips and can be represented by a descriptor simply composed by the string of symbols representing the bases of the sequence portion.
Reads can admit only one or two soft clips, at the beginning of the read and/or at the end.
In this document such descriptor is also referred to as sclips descriptor.
Mapping flags are used to carry specific information relative to the alignment process such as:
In this document such descriptors are also referred to as flags descriptors.
In case a read is not mapped at any position of the reference sequences the read is classified as unmapped. In such case all unmapped reads are grouped according to some shared characteristics. This process is called “clustering”. The groups of reads sharing the same characteristic are called clusters. Throughout this invention disclosure the characteristic shared among sequence reads belonging to the same cluster is called cluster signature or signature.
A signature can be composed by any number of nucleotides from two to several thousands and signatures can have either constant length for all clusters or variable length. The alphabet of symbols that can belong to a signature depends on the specific genomic sample that has been sequenced to produce the sequence reads being processed. As an example, but not as a limitation, the following alphabets can be used:
The type of alphabet used to compute a cluster signature is identified by a parameter Alphabet_ID carried by a data structure called Genomic Dataset Header described in this disclosure.
Signatures of clusters belonging to the same genomic dataset can be of constant or variable length. A global parameter encoded in the genomic dataset header is used to signal if the signatures length is constant or variable. If the signature length is constant a second global parameter represents the length in symbols of the clusters signatures. This value is 0 in case of variable signatures length. Each cluster of unmapped reads is coded in one or more Access Units.
Encoding Reads or Read Pairs without Mismatches
In case a read or read pair perfectly maps on the reference sequence (i.e. without any mismatch) the following sub-set of descriptors is needed to reconstruct the read and associated mapping information:
In this invention such read or read pair is classified as belonging to Class P.
The position descriptor pos represents the position on the reference genome of the left-most mapping base of the read or read pair. It's use is depicted in
The reverse complement descriptor rcomp indicates whether a read is mapped on the direct or reverse strand of the reference sequence. The meaning and syntax of this descriptor are depicted in
In case of variable length reads a descriptor rlen encodes the read length.
The pair descriptor carries the information necessary to reconstruct the entire pair. The syntax of the descriptor is provided in
An example of the encoding of a read pair belonging to Class P is provided in
Encoding Read or Read Pairs with Mismatches Represented Only by Unknown Bases
In the case a read or read pair maps on the reference sequence, but contains at least one unknown base, the following sub-set of descriptors is needed to reconstruct the read and associated mapping information:
Descriptors already present in Class P sub-set have the same syntax and semantics. The additional descriptor nmis provides the position in a read (pair) of the bases called as “unknown” by the sequencing process (symbol ‘N’).
In this invention such read or read pair is classified as to belonging to Class N.
An example of the encoding of a read pair of Class N is provided in
Encoding Read or Read Pairs with Unknown Bases and Substitutions
In the case a read or a read pair maps on the reference sequence and presents at least one substitution, but no deletions or insertions, the following sub-set of descriptors is needed to reconstruct the read and associated mapping information:
Descriptors already present in class P sub-set have the same syntax and semantics. The additional descriptors used for such sequence read data class are mmpos to represent the position of substitutions and subtype to represent the type of substitution.
An example of the encoding of a read pair of this type is provided in
In this invention disclosure such read or read pair is said to belong to Class M.
Encoding Read or Read Pairs with at Least One Insertion, Deletion or Soft Clip
When a read or read pair maps on the reference sequence with at least one insertion deletion or soft clip, the following sub-set of descriptors is defined:
Descriptors already present in class M sub-set have the same syntax and semantics. The additional descriptors used in this case are mmpos to represent the position of substitutions, insertions and deletions, mmtype to represent the type of mismatch (substitutions, insertion or deletion) and sclips to represent the soft clipped bases.
In this invention disclosure such read or read pair is said to belong to Class I. An example of the encoding of a read pair belonging to Class I is provided in
Read Pairs in which Only One Read is Mapped to a Reference Sequence
In the case a read pair is composed by a mapped read (belonging to any of the class P, N, M or I) and an unmapped read, the pair is classified as belonging to a separate class called Class HM (Half Mapped).
The read mapped on the reference sequence can be of any of the classes described above (P, N, M and I) and will be encoded using the sub-set of descriptors already described for each class. The unmapped read will be encoded by compressing the string of symbols representing it using an appropriate entropy coder.
Reads belonging to Class U or the unmapped mate of read pairs belonging to Class HM cannot be mapped to any “existing” reference sequence satisfying the specified set of matching accuracy constraints. This invention discloses a method to construct one or more “internal” reference sequences to be used to align and compress reads belonging to these data classes.
Several approaches are possible to construct appropriate “internal” references such as for instance and not as limitation:
If the read being coded can be mapped on the “internal” reference satisfying the specified set of matching accuracy constraints, the information necessary to reconstruct the read after compression is coded using syntax elements that can be of the following types:
The specific type of padding strategy will be signaled by special values in the indc descriptor stream header
In case reads of class U present variable length, an additional descriptor rlen is used to store each read length.
This coding approach can be extended to support N start positions per read so that reads can be split over two or more reference positions. This can be particularly useful to encode reads generated by those sequencing technology (e.g. from Pacific Bioscience) producing very long reads (50K+ bases) which usually present repeated patterns generated by loops in the sequencing methodology. The same approach can be used as well to encode chimeric sequence reads defined as reads that align to two distinct portions of the genome with little or no overlap.
The approach described above can be clearly applied beyond the simple class U and could be applied to any stream containing syntax elements related to reads positions (pos streams).
Reads and Read Pairs with Multiple Mapping Positions
The case of a read or a reads pair being mapped to multiple coordinates of the reference sequence is supported by state of the art approaches like SAM or CRAM by replicating the encoded data records and by the ad-hoc additions of optional fields, with a consequent scattering in the data representation leading to an evident loss of compression efficiency due to the introduced redundancy. Moreover, some mapping configurations of paired-end reads whereby one read is paired to multiple mapping positions of the respective mate are supported only by user-defined optional fields. The same can be said for the representation of reads and read pairs which need to be split into two or more sub-sequences in order to find appropriate mapping on the reference sequence. The main results of such approaches is a proliferation of inefficient representations in terms of compression efficiency with also a major impact on data reusability and loss of information when passing from one partial representation to another.
The method described in this disclosure supports the representation of both multiple mapping positions and spliced reads and results much more efficient than existing solutions both in terms of compression and compressed data accessibility. Better compression is provided by the possibility of grouping descriptors with homogeneous statistical properties and different entropy coders with appropriate contexts. Better file accessibility is provided by the definition of indexing mechanisms that enable the decompression and retrieval of specific type of genomic data only without the need to decompress and access the entire coded information. For example the invention described in this disclosure provides indexing mechanism enabling the retrieval of sequence reads or reads pair which have multiple mapping position with or without spliced reads. This is currently not possible with state of the art genomic information representation formats such as SAM and CRAM.
When encoding read pairs with multiple alignments, SAM and CRAM lack the possibility to support the representation of multiple alignments of one read in a pair associated with a single alignment of the other read. A person skilled in the art knows that this is a frequent case in experiments such as RNA-seq and ChIP-seq and today SAM and CRAM have no way to support all the possible combinations of coupling among multiple alignments of reads in a pair. The solution proposed in this disclosure is able to support all the possible configuration today found in genomic data generated by High-Throughput Sequencing (HTS) machines.
When mapping genomic sequence reads to a reference sequence the following outcomes are possible:
Case 1 simply requires that the uniqueness of the alignment is encoded.
Case 2 need to acknowledge that a “primary” or a “secondary” alignment does not exists since all mappings are equally probable. The only additional information to be encoded is the vector of all mapping positions.
Case 3 requires that all the mapping positions are encoded as an array of encoded reads. All required descriptors disclosed in this invention have to be replicated per each mapping position when needed. Different mapping positions can present different levels of errors (substitutions, indels, clipped bases) with respect to the reference.
A spliced read is defined as a sequence read that needs to be split into two or more sub-reads in order to find suitable mapping regions on a reference sequence. In this case, the distance among the sub-reads (called “junction”) is usually too large to be considered a deletion. The mapping of a spliced read can refer to the direct or reverse strand of the reference sequence, therefore this information, called “absolute strandedness” has to be preserved and encoded.
The encoding of spliced reads requires the preservation of the mapping position of each splice, which has to be considered as a variable length read, encoded in a genomic dataset containing constant length reads only.
In the following description the term template is used as in the SAM specification and identifies a nucleotides sequence part of which is sequenced on a sequencing machine or assembled from raw sequences. According to the sequencing technology used, the sequencing of a template can produce either a single sequence of nucleotides (one read) or two sequences which are said to be “paired”. In this context, a segment is defined as a contiguous sequence or subsequence.
When encoding reads or read pairs with multiple alignment position, the information to be conveyed by the encoded data is the following:
In this invention disclosure, the coding of multiple mapping positions and spliced reads is supported by:
If no splices are present in the dataset, the global sr flag is unset and the splen descriptor is not used. In paired-end sequencing, the mmap descriptor is structured into one value N followed by one or more numbers (Mi), with i going from 1 to the number of complete first (the leftmost) read alignments (N1). If no splices are present for that read, then N1=N. For each first read alignment, spliced or not, one value Mi is used to signal how many segments are used to align the second read (in this case equal to the number of alignments), and then how many values of the pair descriptor are coded for that alignment of the first read.
The values of Mi shall be used to calculate P=Σi=1N Mi which indicates the number of alignments of the second read.
This method is illustrated in
A special value of Mi=0 indicates that the ith alignment of the left-most read is paired with an alignment of the right-most read that is already paired with a kth alignment of the left-most read with k<i (then there is no new alignment detected, which is consistent with the equation above).
As an example, in the simplest cases:
When Mi is equal to 0 the associated value of the pair descriptor points to an existing second read alignment.
Multiple Alignments with Splices
In case of spliced reads, the splen descriptor is used. It is composed for each record of an array of N+P values. The first N values indicate the length of each aligned segment of the single read or the first (e.g. the leftmost) read of a pair. The following P values indicate the length of each aligned segment of the second read, in case of paired-end sequencing. P is calculated as calculate P=Σi=1N Mi where each value of Mi applies to each individual alignment of the first read to compose an alignment of the entire template.
The first N values of the splen descriptor for a record enable the calculation of N1, which represents the number of alignments of the first read. If N1=N no splices are present for the first read.
The following P values of the splen descriptor for a record enable the calculation of N2, which signals the number of alignments of the second read. If N2=P no splices are present for the second read.
The mmap and the splen descriptors defined enable the unique identification of how many reads or read pairs represent the multiple mappings and how many segments are composing each read or read pair mapping. This is shown in
The mmscore descriptor allows signaling the mapping score of an alignment. In single-end sequencing it has N1 values per template; in paired-end sequencing it has a value for each alignment of the entire template. In other words every pair composed by one alignment of read 1 and one alignment of read 2 can have an associated score. In case of paired end reads the number of total scores is calculated as
N.scores=MAX(N1,N2)+M0
where N1 is the number of total alignments of read 1, N2 is the number of total alignments for read 2 and the number of Mi values that are equal to 0.
In case of single reads the number of scores equals to N1
Multiple Alignments without Splices Descriptors
The table below summarizes the semantic and effect of the use of mmap and mmscore descriptors defined in this invention disclosure in case of multiple alignments without spliced reads.
Multiple Alignments with Splices Descriptors
The table below summarizes the semantic and effect of the use of mmap and mmscore descriptors defined in this invention disclosure in case of multiple alignments with spliced reads.
It may happen that the alignment process finds alternative mappings to another reference sequence than the one where the primary mapping is positioned.
In this case it is of paramount importance, in terms of applications, to maintain a fast (in terms of random access complexity) link between the two or more, necessarily different, Access Units where the several alternative mappings for a template are coded.
For read pairs that are uniquely aligned, this invention disclosure defines a descriptor named pair used to represent a chimeric alignment where the two reads in a pair are mapped on different chromosomes. This descriptor can be used to signal the reference and the position of the next record containing further alignments for the same template. This is shown in
In case one or more alignments for the leftmost read in a pair are present on a different reference sequence than the one related to the currently encoded AU, a reserved value of the pair descriptor shall be used (not the same as the one used for alignments present to another reference in case of unique alignment). The reserved value shall be followed by the reference and position of the leftmost alignment among all those contained in the next AU (i.e. the first decoded value of the pos descriptor for that record).
Multiple Alignments with Insertions, Deletions, Unmapped Portions
In some cases multiple alignments can present different configurations of matching and mismatching bases, insertions, deletions and soft clips. For example it is not infrequent that while a primary alignment only has matching or mismatching bases and it is therefore mapped as a contiguous sequence of nucleotides, secondary alignments may present insertions, deletions, soft clips or splices. Throughout this disclosure, when a mapped read does not contain insertions, deletions or soft clipped bases is said to have mapping contiguity. This invention disclosure defines a third descriptor signaling whether the secondary alignment belongs to a data class which preserves the same mapping contiguity of the primary alignment (like if it were P, M or N) or not (I, U, spliced). This descriptor, mmsc (for multiple mapping sub-class) is in principle only a flag per alignment. When mmsc is set, it is followed by the verbatim representation of the SAM cigar string generated by the aligner extended with an additional symbol “U” representing unmapped nucleotides and followed by the string of unmapped nucleotides.
A reference sequence is commonly represented as a string of symbols representing the nucleotides that can be found in the corresponding biological sample. In the case of DNA the nucleotides are four and represented by the symbols A, C, G and T. In the case of RNA the T is replaced by U. A fifth symbol is added to denote a coordinate in the sequence where the sequencing device was not able to determine the type of nucleotides according to the confidence required by the experiment. In this invention disclosure a reference sequence can be encoded either entirely in one Access Unit or partitioned into two or more sub-sequences.
The descriptor used to represent reference sequences or sub-sequences to be entropy coded is the verbatim representation of the sequence or sub-sequence in terms of the allowed symbols of the respective alphabet.
For each data class, sub-class and associated descriptor stream of the genomic data structure disclosed in this invention, different coding algorithms may be adopted according to the specific features of the data or metadata carried by each stream and its statistical properties. The “coding algorithm” has to be intended as the association of a specific “source model” of the descriptor stream with a specific “entropy coder”. The specific “source model” can be specified and selected to obtain the most efficient coding of the data in terms of minimization of the source entropy. The selection of the entropy coder can be driven by coding efficiency considerations and/or probability distribution features and associated implementation issues. Each selection of a specific “coding algorithm”, also referred to as “coding mode” can be applied to an entire “descriptor stream” associated to a data class or sub-class for the entire data set, or different “coding modes” can be applied for each portion of descriptors partitioned into Access Units.
Each “source model” associated to a coding mode is characterized by:
The classification of sequence data into the defined data classes and sub-classes permits the implementation of efficient coding modes exploiting the lower information source entropy characterizing by modelling the sequences of syntax elements by single separate data sources (e.g. distance, position, etc.).
Another advantage of the invention is the possibility to access only the subset of type of data of interest. For example one of the most important application in genomics consists in finding the differences of a genomic sample with respect to a reference (SNV) or a population (SNP). Today such type of analysis requires the processing of the complete sequence reads whereas by adopting the data representation disclosed by the invention the mismatches are already isolated into one to three data classes only (depending on the interest in considering also “n type” and “i type” mismatches). A further advantage is the possibility of performing efficient transcoding from data and metadata compressed with reference to a specific “external” reference sequence to another different “external” reference sequence when new reference sequences are published or when re-mapping is performed on the already mapped data (e.g. using a different mapping algorithm) obtaining new alignments.
Class decoders 299 are able to reconstruct the original genomic sequences by leveraging the information on the original reference sequences carried by one or more genomic bitstreams. In case the reference sequences are not transported by the genomic streams they must be available at the decoding side and accessible by the class decoders.
The inventive techniques herewith disclosed may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these may be stored on a computer medium and executed by a hardware processing unit. The hardware processing unit may comprise one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuitry.
The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets and similar devices.
Throughout this invention disclosure, the genomic data classified in data classes and structured in compressed or uncompressed layers are organized into different Access Units as defined above. Access Units are differentiated by:
Access units of any type can be further classified into different “categories”.
Hereafter follows a non-exhaustive list of definition of different types of genomic Access Units:
Access Units of type 0 are ordered (e.g. numbered), but they do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming, multiplexing).
Access Units of type 1, 2, 3, 4, 5 and 6 do not need to be ordered and do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming).
Each Access unit can have a different number of packets in each block, but within an Access Unit all blocks have the same number of packets.
Each data packet can be identified by the combination of 3 identifiers X Y Z where:
Access Units of any type can be further classified and labelled in different “categories” according to different sequencing processes. For example, but not as a limitation, classification and labelling can take place when
When storing coded descriptors on a storage medium, two are the approaches described in this invention disclosure:
When AUC is applied encoded data blocks belonging to the same Access Unit (but different Descriptors Streams) are stored in contiguous areas of the storage medium.
The AUC mode can be implemented in two different ways in terms of data storage:
The AUC/CC mode method is more efficient when accessing data of a single class. The AUC/GRC mode is more efficient when accessing data of any class mapping to the same genomic region. The invention described in this disclosure and the related syntax supports all modes DSC, AUC/CC and AUC/GRC methods and leaves the encoder with the freedom to select any mode according to the desired selective access performance. If AUC/CC mode or AUC/GRC mode is used is signaled by a flag named CC_Mode_Flag carried by the Genomic Dataset Header as listed in Table 2.
When DSC is applied blocks belonging to the same Descriptors Stream are stored in contiguous areas of the storage medium. Genomic data are in fact organized per Descriptors Streams (composed by one or more Blocks), representing homogeneous data in terms of entropy encoding.
The storage alternative used in a genomic dataset among the coding methods is signaled by flags referred to as Block_Header_Flag stored in the Genomic Dataset Header as listed in Table 2.
The difference between the AUC and DSC modes is illustrated in
In order to support selective access to specific regions of the aligned data, two data structures are described in this disclosure: a Genomic Dataset Header carrying global parameters and the indexing tool called Master Index Table (MIT) used during the encoding and decoding process. The syntax of the Genomic Dataset Header is provided in Table 2 and the syntax of the Master index table is provided in Table 3.
The Genomic Dataset Header is a data structure carrying global parameters used by the encoder and the decoder to manipulate the encoded genomic information.
The information carried by the Genomic Dataset Header includes:
The syntax and semantics of each element of the Genomic Dataset Header are listed in Table 2 below.
An indexing tool called Master Index Table (MIT) is disclosed in this invention.
The Master Index Table (MIT) is a data structure based on multi-dimensional array containing:
The last section of the MIT contains two alternative sections that are used according to the presence of headers prepended to each coded Block of descriptors. If Block headers are present (Block_Header_Flag set) the MIT contains the size in byte of each Descriptor Stream. If Block headers are not present (Block_Header_Flag unset) the MIT contains the size in bytes of each Access Unit. The alternative between the two coding methods is signaled by the flag called Block_Header_Flag in Table 2.
In case of presence of multiple alignments, the MIT introduced above is replicated to provide an indexing tool which takes into consideration the multiple alignments of reads or reads pairs coded into the Access Unit. The extended Master Index Table contains
Number | Date | Country | Kind |
---|---|---|---|
PCT/EP2016/074297 | Oct 2016 | EP | regional |
PCT/EP2016/074301 | Oct 2016 | EP | regional |
PCT/EP2016/074307 | Oct 2016 | EP | regional |
PCT/EP2016/074311 | Oct 2016 | EP | regional |
PCT/US2017/017841 | Feb 2017 | US | national |
PCT/US2017/017842 | Feb 2017 | US | national |
This application claims priority to and the benefit of Patent Applications PCT/EP2016/074311, PCT/EP2016/074301, PCT/EP2016/074307, PCT/EP2016/074297, PCT/US2017/17842, PCT/US2017/17841.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US17/41591 | 7/11/2017 | WO | 00 |