The present invention relates to the field of data compression of MPEG-G.
The Moving Picture Experts Group (MPEG) is a working group of data compression experts that was formed by ISO and IEC to set standards for audio and video compression and transmission.
This group has been developing standards for efficient video compression since the early 1990s.
The technology of MPEG essentially consists in reducing the entropy of the video and audio source data such that higher compression ratios can be achieved for efficient storage and transmission.
Given the extensive data compression expertise within the MPEG expert groups, it was decided to develop a standard for the compression of genomic information to overcome the limitations of the solutions present in the art (e.g. CRAM and BAM file formats).
Therefore, even if MPEG-G relates to the compression of genomic data, the main idea of exploitation of data redundancies is taken from the field of video and audio compression that is the closest technical field to the present application.
This invention in fact applies syntax elements construction for genomic data in a similar manner as the syntax elements are applied to the compression of video and audio data in MPEG.
Given, however, that genomic data are quite different from audio and video data, the data classification and the syntax elements are different from those used in the MPEG video and audio standards: the redundancies present in the genomic data have to be exploited, and these are different from those of multimedia data.
The present invention therefore deals with the compression of genomic data in an efficient manner in order to obtain a file of reduced size that can also be randomly accessed in the compressed domain.
The present invention builds onto the encoding and decoding methods, systems and computer programs disclosed in the patent applications WO 2018/068827A1, WO 2018/068828A1, WO 2018/068829A1, WO 2018/068830A1, whose disclosures related to entropy coding of genomic data may be essential for the understanding of some aspects of the present invention; the disclosure of the aforementioned documents is therefore considered as incorporated by reference in the present invention.
This disclosure provides a novel method of representation of annotations and metadata associated to genome sequencing data which reduces the utilized storage space, provides a single syntax for several metadata formats and improves data access performance by providing new indexing functionality which is not available with known prior art methods of representation.
The method disclosed in this invention provides higher compression ratios for genome sequencing data and associated annotations by:
The advantage of compressing non-indexed descriptors separately from indexed textual descriptors is that these two classes of data, once separately grouped, show a lower entropy than when they are coded together; therefore, higher compression ratios can be achieved.
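As an illustration, and not as a limitation, the entropy argument above can be checked numerically. The sketch below is not the coding scheme of this disclosure: it only compares the zero-order (empirical) entropy of two hypothetical toy descriptor streams when pooled into one stream and when grouped separately. The names `entropy_bits`, `numeric` and `textual` and the sample values are assumptions made for the example.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Total zero-order empirical entropy of a symbol sequence, in bits."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

# Hypothetical toy streams: numeric descriptors and textual descriptors.
numeric = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
textual = list("geneA;geneB;geneA;")

# One pooled stream versus two separately grouped streams.
mixed_bits = entropy_bits(numeric + textual)
separate_bits = entropy_bits(numeric) + entropy_bits(textual)

print(f"mixed: {mixed_bits:.1f} bits, separate: {separate_bits:.1f} bits")
```

Because entropy is a concave function of the symbol distribution, the separately grouped total can never exceed the pooled total under this zero-order model. An actual entropy coder with richer context modelling may behave differently in detail.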
By using compressed full-text string indexing algorithms, the method described in this invention eliminates the need to have both a compressed payload of genomic information and an index of said information to support selective access, thereby reaching better compression ratios. A compressed full-text string index is at the same time an index and the compressed information, and can be used both to perform selective access and to retrieve the desired information by decompression. This invention overcomes the need to have both an index and a compressed payload as currently required by existing solutions in the art.
The method also allows concepts related to genomic annotation which were previously unrelated to be hierarchically described and stored in compressed form. This makes it possible to encode relations between such concepts that could not be described previously, thus allowing novel ways of describing and interchanging data.
Genomic or proteomic information generated by DNA, RNA, or protein sequencing machines is transformed, during the different stages of data processing, to produce heterogeneous data. In prior art solutions, these data are currently stored in computer files having different and unrelated structures. This information is therefore quite difficult to archive, transfer and process.
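As an illustration, and not as a limitation, the following sketch shows how a single structure can serve both as an index and as the stored text, using an FM-index, one of the string indexing transformations named later in this disclosure. It is a simplified teaching example, not the implementation of this disclosure: the occurrence table is kept uncompressed for clarity, the suffix sort is naive, and the function names `build_fm_index`, `count` and `recover_text` are assumptions made for the example.

```python
def build_fm_index(text):
    """Build a toy FM-index (BWT, C table, occurrence table) for text."""
    text += "\0"  # sentinel; assumes "\0" does not occur in the input
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(bwt))
    # first[c]: number of symbols in the text strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return bwt, first, occ

def count(pattern, fm):
    """Selective access: count occurrences of pattern by backward search."""
    bwt, first, occ = fm
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

def recover_text(fm):
    """Retrieval: rebuild the original text from the index alone."""
    bwt, first, occ = fm
    row, out = 0, []  # row 0 is the rotation that starts with the sentinel
    for _ in range(len(bwt) - 1):
        c = bwt[row]
        out.append(c)
        row = first[c] + occ[c][row]  # last-to-first mapping
    return "".join(reversed(out))

fm = build_fm_index("ACGTACGTGA")
print(count("ACG", fm), recover_text(fm))
```

The same `fm` object answers the pattern query and yields the original string, so no separate copy of the text needs to be kept alongside the index.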
The genomic or proteomic sequences referred to in this invention include, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences.
Sequence alignment refers to the process of arranging sequence reads by finding regions of similarity that may be a consequence of functional, structural, or evolutionary relationships among the sequences. When the alignment is performed with reference to a pre-existing nucleotide sequence referred to as “reference sequence”, the process is called “mapping”. Prior art solutions store such information in “SAM”, “BAM” or “CRAM” files. The process of performing sequence alignment is also referred to as “aligning”.
The concept of aligning sequences to reconstruct a partial or complete genome is depicted in
There exists a clear need to provide an appropriate representation of genomic sequencing data and metadata (Genomic File Format) by organizing and partitioning the data so that the compression of data and metadata is maximized, and so that several functionalities, such as selective access, support for incremental updates and other data handling functionalities useful at the different stages of the genome data life cycle, are efficiently enabled.
Moreover, when genome sequencing data generated by high throughput sequencing machines is analyzed by processing pipelines and analysts, annotations of different regions of the genome, expressing a number of diverse properties, are generated and currently represented by heterogeneous textual formats. Even though different types of generated results and annotations are conceptually related to each other and ideally need to be jointly accessed and used, the current solutions used in the art are such that these metadata are in the form of independent and separated text files and separated from the coded data related to the genomic reads. These formats do not support any type of linkage between the elements of one file with the elements of other files which are conceptually linked and thus may share a common biological meaning.
In the best case, such lack of explicit connection implies that processing and using genomic data and annotation information requires time-consuming and inefficient parsing of possibly large text files when searching for specific information and associated metadata. In the worst case, the fact that it is not possible to describe connections hampers the development of effective bioinformatics workflows and databases for downstream applications such as biomedical research or personalized medicine.
For example, RNA-sequencing reads, aligned onto a gene (which is typically composed of a set of intervals on a reference genome), need to be counted in order to measure the degree of expression of the gene in the biological condition used for the experiment. Different biological conditions (producing different sets of reads generated by different experiments) are usually compared in the context of specific experiments aiming at finding paths linking genotypes to phenotypes. The process of generating and aggregating information related to single reads and their alignments to a reference genome into results with a more general genetic and biological meaning is referred to as “secondary analysis”.
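As an illustration, and not as a limitation, the read counting step described above can be sketched as follows. The gene is modelled as a set of intervals on the reference and each aligned read as a single interval; the half-open coordinate convention, the function names and the sample coordinates are assumptions made for the example, and real pipelines apply further rules (strand, spliced alignments, multi-mapping reads) that are omitted here.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

def count_reads_on_gene(gene_intervals, read_alignments):
    """Count the reads that overlap at least one interval of the gene."""
    return sum(
        1
        for r_start, r_end in read_alignments
        if any(overlaps(r_start, r_end, g_start, g_end)
               for g_start, g_end in gene_intervals)
    )

# Hypothetical gene made of two intervals, and five aligned reads.
gene = [(100, 200), (300, 400)]
reads = [(90, 110), (150, 180), (210, 290), (390, 450), (500, 550)]

expression_count = count_reads_on_gene(gene, reads)
print(expression_count)
```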
Different types of annotations (meta-information) generated by secondary analysis using genome sequencing reads, can be conceptually associated to the genome sequencing reads aligned to one or more intervals of the genomic sequences used as references.
A genomic interval can be uniquely identified by specifying a sequence of nucleotides in the reference assembly (i.e. a chromosome in a genome, a gene, set of contiguous bases, a single base, ...), the molecule strand, which can be forward or reverse, and a start and an end position specifying the range of bases (a.k.a. nucleotides) included in the interval.
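As an illustration, and not as a limitation, the elements identifying a genomic interval can be gathered in a small record type. The class name `GenomicInterval`, its field names and the 0-based half-open coordinate convention are assumptions made for the example; they are not the syntax defined by this disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenomicInterval:
    sequence_id: str  # sequence in the reference assembly, e.g. a chromosome
    strand: str       # "+" (forward) or "-" (reverse)
    start: int        # first base of the interval (assumed 0-based)
    end: int          # one past the last base (assumed half-open)

    def __post_init__(self):
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        if not 0 <= self.start < self.end:
            raise ValueError("interval must contain at least one base")

    def length(self):
        """Number of bases included in the interval."""
        return self.end - self.start

    def overlaps(self, other):
        """True if both intervals share a base on the same sequence (strand ignored)."""
        return (self.sequence_id == other.sequence_id
                and self.start < other.end
                and other.start < self.end)

gene = GenomicInterval("chr1", "+", 1000, 5000)
single_base = GenomicInterval("chr1", "+", 1200, 1201)
print(gene.length(), single_base.length(), gene.overlaps(single_base))
```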
Features associated with a genome interval such as variants, the number of aligned reads at a given position (also denoted as “coverage”), portions of the genome binding to proteins, nature and position of genes and regions associated to specific genetic functions can be uniquely identified and associated to genomic intervals. An interval can be as short as a single base, or it can span several thousand nucleotides or more.
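As an illustration, and not as a limitation, the “coverage” feature mentioned above can be computed from the positions of the aligned reads. The sketch assumes reads given as half-open (start, end) pairs on a single reference sequence; the function name `coverage` and the sample values are assumptions made for the example.

```python
def coverage(read_alignments, region_start, region_end):
    """Number of aligned reads covering each position of [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in read_alignments:
        # Only the part of the read that falls inside the region is counted.
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth

# Two hypothetical reads overlapping at position 2.
depth = coverage([(0, 3), (2, 5)], 0, 6)
print(depth)
```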
A complex analysis of genome sequencing data can comprise a large number of integrated experiments. A different sequencing-derived protocol usually characterizes each experiment and is used to sample a different function or compartment of the cell. The results produced by primary analysis (i.e. alignment of the reads with respect to a reference) and secondary analysis (i.e. integration and statistical studies performed on the results of the alignment) in each experiment can be visualized in graphical form using software applications called genome browsers, enabling one-dimensional navigation of the genome along the positions of nucleotides. The information resulting from secondary analysis associated to each position in the genome or to each interval is usually visualized in the form of different plots (or “tracks”) per sequencing experiment, representing the presence and structure of transcripts, sequence variants in an individual or a population, coverage of sequencing reads, and intensity of protein binding to each position of the genome.
State of the art genome annotation formats produced by analysis tools represent all the aforementioned results - also referred to as “features” - using a number of heterogeneous and independently defined and maintained formats. Such formats are usually characterized by poor and inconsistent syntaxes and semantics, which generate a proliferation of slightly different and incompatible file formats for each type of analysis result. The drawback of all the currently existing solutions is that the scientists working on integrative analysis of genomic data are forced to systematically transcode the different formats by using complex concatenations of text-processing tools and programs when sets of experiments need to be jointly accessed and studied. Such proliferation of different formats results in poor interoperability and reproducibility of results across different groups of scientists using even only slightly different representations and associated semantics.
The formats most used to represent genome annotations generated by genome sequencing data analysis and used in the art are:
In order to solve the above problems of the existing prior art, the subject-matter of claims 1, 9, 12, 14 and 16 is proposed. Advantageous modifications are indicated in the dependent claims.
More specifically, the present disclosure provides a computer-implemented method for the encoding, storage and/or transmission of a representation of genome sequencing data in a genomic file format comprising annotation data associated with said genome sequencing data, said genome sequencing data comprising reads of sequences of nucleotides, said method comprising the steps of:
Preferably, the method further comprises jointly coding said access units of first sort, of second sort and said MAI.
The method may further comprise a step of storing or transmitting the encoded genome sequencing data on or to a computer-readable storage medium; or making the encoded genome sequencing data available to a user in any other way known in the art, e.g. by transmitting the genome sequencing data over a data network or another data infrastructure.
In the context of this disclosure, descriptors may be implemented, e.g., as genomic annotation descriptors as defined in the detailed description below.
It is further preferable that said access units of the second sort containing genomic annotation data further comprise information data identifying a genomic interval, wherein said genomic interval identifies a sequence of nucleotides in the one or more reference sequences such that the annotation data contained in the access units of the second sort are associated with the related encoded reads of the genomic sequence contained in access units of the first sort containing genome sequencing data.
According to a (further) preferred embodiment, the encoding of said annotation data and indexing data comprises the steps of:
It is further preferred that said first string transformation method comprises the steps of:
According to a (further) preferred embodiment, the string indexing transformation method is one of string pattern matching, suffix arrays, FM-indexes, hash tables.
Preferably, said at least one second transformation method is one of: differential coding, run-length coding, bytes separation, and entropy coders such as CABAC, Huffman Coding, arithmetic coding, range coding.
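As an illustration, and not as a limitation, two of the second transformation methods listed above, differential coding and run-length coding, can be sketched as follows. The function names and the sample positions are assumptions made for the example; the entropy coding stage that would follow is not shown.

```python
def delta_encode(values):
    """Differential coding: store each value as the difference from the previous one."""
    previous, out = 0, []
    for v in values:
        out.append(v - previous)
        previous = v
    return out

def delta_decode(deltas):
    """Inverse of delta_encode: accumulate the differences."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

def rle_encode(values):
    """Run-length coding: collapse repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Inverse of rle_encode: expand each run."""
    return [v for v, n in runs for _ in range(n)]

# Hypothetical sorted positions, as might appear in an annotation descriptor.
positions = [1000, 1010, 1020, 1030, 1030, 1040]
deltas = delta_encode(positions)
runs = rle_encode(deltas)
print(deltas, runs)
```

Sorted positions tend to produce small, repetitive differences, which is why the two transformations are shown chained before any entropy coder would be applied.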
According to a (further) preferred embodiment, said master annotation index contains in its header the number of AU types and the number of indexes for each AU type.
Further preferably, the above-described method further comprises coding of classified unaligned reads.
The object of the invention is further solved by a method for the decoding and extraction of sequences of nucleotides and genomic annotations data encoded according to the method described above, said method comprising the steps of:
Preferably, said method further comprises decoding information data related to a genomic interval, wherein said genomic interval identifies a sequence of nucleotides in the one or more reference sequences such that the annotation data are associated with the related encoded reads of the genomic sequence.
It is further preferred that the method further comprises decoding the data encoded according to the method for the storage or transmission of a representation of genome sequencing data in a genomic file format comprising annotation data associated with said genome sequencing data described above.
According to a further aspect of the present disclosure, a genomic encoder for the compression of genome sequence data in a genomic file format comprising annotation data associated with said genome sequencing data is proposed, wherein said genome sequence data comprises reads of sequences of nucleotides, and wherein said encoder comprises:
Preferably, the encoder comprises means for jointly coding said access units of first sort, of second sort and said MAI.
According to a (further) preferred embodiment, the genomic encoder comprises encoding means for performing the steps of the encoding method described above.
The present disclosure further refers to a genomic decoder apparatus for the decoding of sequences of nucleotides and genomic annotations data encoded by the encoder described above, said decoder comprising:
Preferably, the genomic decoder further comprises decoding means for performing the steps of the decoding method described above.
According to a further aspect of the present disclosure, a computer-readable medium is proposed, the computer-readable medium comprising instructions that when executed by at least one processor, cause the at least one processor to perform the method described above.
In this disclosure the following terms and expressions are used:
The data structure further disclosed by this invention relies on the concepts of:
A Genomic Data Stream is a packetized version of a Genomic Data Layer where the encoded genomic data is carried as payload of Genomic Data Packets including additional service data in a header. See
A Genomic Data Multiplex is defined as a sequence of Genomic Access Units used to convey genomic data related to one or more processes of genomic sequencing, analysis or processing.
Important aspects of the disclosed solution are:
The encoding scheme of the genomic reads is represented in the encoder of
The sequence reads generated by sequencing machines are classified by the disclosed invention into five different “classes” according to the results of the alignment with respect to one or more reference sequences. Said classes are defined based on matching with/mapping on the reference genome according to the presence of substitutions, insertions, deletions and clipped bases with said one or more reference sequences.
When aligning a DNA sequence of nucleotides with respect to a reference sequence, five results are possible:
Unmapped reads can be assembled into a single sequence using de-novo assembly algorithms. Once the new sequence has been created unmapped reads can be further mapped with respect to it and be classified in one of the four classes P, N, M and I.
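As an illustration, and not as a limitation, a classification of this kind can be sketched as a simple decision function. The semantics assigned to each class below are assumptions made for the example, inferred from the criteria named above (substitutions, insertions, deletions and clipped bases): “P” for a perfect match, “N” for mismatches consisting only of undetermined bases, “M” for substitutions, “I” for insertions, deletions or clipped bases, and “U” as a hypothetical label for unmapped reads. The function name and its parameters are likewise hypothetical.

```python
def classify_read(mapped, mismatched_bases=(), has_indel_or_clip=False):
    """Return an assumed class label for one read given a summary of its alignment.

    mapped            -- whether the read aligned to a reference sequence
    mismatched_bases  -- read bases that differ from the reference
    has_indel_or_clip -- whether the alignment has insertions, deletions or clipped bases
    """
    if not mapped:
        return "U"  # hypothetical label for unmapped reads
    if has_indel_or_clip:
        return "I"
    if any(base != "N" for base in mismatched_bases):
        return "M"
    if mismatched_bases:
        return "N"  # every mismatch is an undetermined base
    return "P"

labels = [
    classify_read(True),
    classify_read(True, mismatched_bases="N"),
    classify_read(True, mismatched_bases="NA"),
    classify_read(True, mismatched_bases="A", has_indel_or_clip=True),
    classify_read(False),
]
print(labels)
```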
Once the classification of reads is completed with the definition of the classes, further processing consists in defining a set of distinct syntax elements which represent the remaining information enabling the reconstruction of the DNA read sequence when represented as being mapped on a given reference sequence. A DNA segment referred to a given reference sequence can be fully expressed by:
This classification creates groups of descriptors (syntax elements) that can be used to univocally represent genome sequence reads.
For each layer of the genomic data structure disclosed in this invention, different coding algorithms may be employed according to the specific features of the data or metadata carried by the layer and its statistical properties. The “coding algorithm” is to be understood as the association of a specific “source model” of the descriptor with a specific “entropy coder”. The specific “source model” can be specified and selected to obtain the most efficient coding of the data in terms of minimization of the source entropy. The selection of the entropy coder can be driven by coding efficiency considerations and/or probability distribution features and associated implementation issues. Each selection of a specific coding algorithm will be referred to as a “coding mode” applied to an entire “layer” or to all “data blocks” contained in an access unit. Each “source model” associated to a coding mode is characterized by:
For each data layer the source model adopted in one access unit is independent from the source model used by other access units for the same data layer. This enables each access unit to use the most efficient source model for each data layer in terms of minimization of the entropy.
Genomic annotations, browsers tracks, variant information, gene expression matrices and other annotations referred to in this invention are associated with, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences. Although the description herein is in considerable detail with respect to annotations to a reference genome in the form of a nucleotide sequence, it will be understood that the methods and systems for compression can be implemented for annotations of other genomic or proteomic sequences as well, albeit with a few variations, as will be understood by a person skilled in the art.
Genomic functional annotations are defined as notes added by way of explanation or commentary to identified locations of genes and coding or non-coding regions in a genome to describe what is the function of those genes and their transcripts.
Genomic variants (or variations) describe the difference between a genomic sample and a reference genome. Variants are usually classified as small-scale (such as substitutions, insertions and deletions) and large-scale (a.k.a. structural variations) (such as copy number variations and chromosomal rearrangements).
Genome browser tracks are plots associated to aligned genome sequencing reads visualized in genome browsers. Each point in the plot corresponds to one position in the reference genome and expresses information associated to said position. Typical information represented as browser tracks is the presence and structure of transcripts, sequence variants in an individual or a population, coverage of sequencing reads, intensity of protein binding to each position of the genome, etc.
Gene expression matrices are two-dimensional arrays where rows represent genomic features (usually genes or transcripts), columns represent various samples such as tissues or experimental conditions, and each entry counts the number of times the gene is expressed in the particular sample (this count is also known as the “expression level” of the particular gene).
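As an illustration, and not as a limitation, a gene expression matrix can be sketched as a two-dimensional array with labelled rows and columns. The gene names, sample names, counts and the helper `expression_level` are hypothetical values chosen for the example.

```python
# Rows: genomic features (genes); columns: samples (tissues or conditions).
genes = ["geneA", "geneB", "geneC"]
samples = ["liver", "brain"]
counts = [
    [12, 0],   # geneA
    [3, 7],    # geneB
    [0, 25],   # geneC
]

def expression_level(gene, sample):
    """Look up the expression level of one gene in one sample."""
    return counts[genes.index(gene)][samples.index(sample)]

def expressed_genes(sample):
    """List the genes with a non-zero expression level in the given sample."""
    column = samples.index(sample)
    return [g for g, row in zip(genes, counts) if row[column] > 0]

print(expression_level("geneC", "brain"), expressed_genes("liver"))
```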
Contact matrices are produced by Hi-C experiments and each i,j entry measures the intensity of the physical interaction between two genome regions i and j at the DNA level. At the lowest granularity, i and j denote two positions on the genome represented as a single sequence of all concatenated chromosomes.
To date, the classes of annotation data listed above are represented using different and incompatible textual formats usually compressed using general purpose text compressors such as gzip, bzip2 etc. In most cases, analysis programs process this information by first uncompressing the entire file and then parsing the decoded text to look for, and if present, to extract, the required piece of information. It is rather frequent for each of the formats used for each category of data to be independently, and sometimes drastically, modified by different users or groups of users to generate several “variations” or “dialects” of the same format. This fact generates serious interoperability problems and the need to “sanitize” each file format variation before being able to exchange data.
Another limitation of current formats is the lack of support for establishing links among different types of annotation data when represented in compressed form. For example, associating a set of variants to a given gene requires to:
State of the art formats have the drawback that each type of data is stored in a different file. This is inefficient insofar as data compression is concerned, and does not support any efficient process to perform a query on a compressed file. Retrieving all variants related to a given gene XYZ and possibly at the same time the expression of that gene in a set of samples cannot be done without decompressing all the files concerned in their entirety and parsing all their content. The described process of associating variants to a gene today can only be achieved by combining several inefficient operations of data decompression, parsing and processing, and by describing relations between the different features by means of novel ad-hoc formats which are not currently available or standardized.
As an example, but not as a limitation, the method disclosed in this document addresses the drawbacks of current solutions when trying to determine variants of clinical relevance with a variant calling pipeline, and visualise the results in a way which allows clinicians to easily inspect and validate results. The goal is to use genome re-sequencing to identify variants which can be related to the manifestation of a disease or a particular phenotype of interest. Variants are determined by first aligning genome sequencing reads to a reference genome and subsequently using the alignment information at all positions, accumulated for all reads (“pileup”), to call genomic variants, such as Single Nucleotide Polymorphisms (SNPs), through a suitable variant calling program. Variant calling is a complex operation requiring complex pipelines of tools performing sophisticated processing. False positive or false negative results can arise due to a number of technical problems, such as fluctuations in coverage or the variant being located in a repetitive genome region. Due to these problems, in clinical setups the variants of potential clinical significance are usually validated manually by a human operator before being included in a medical report. However, data processing and validation requires the access and correlation of a number of information elements (genome sequence, genome annotation, reads alignment, sequencing coverage, sequencing pileup in the regions flanking the variant), each one typically stored in a separate file and represented using a different file format. In particular, it is not possible with current technologies to explicitly state relations such as “this set of sequencing reads, aligned to this range of positions in the genome (i.e. interval), supports this variant, which is contained in this genomic feature” as the different entities (aligned reads, variants, genomic features) are represented in separate and different files.
Today this result can only be achieved by:
These various steps may require very long times, depending on the sizes of the parsed textual files, which can range from several gigabytes up to hundreds of gigabytes.
The present invention aims at addressing these limitations by providing:
In this example of variant calling in a clinical setup, data processing and visualization are accomplished by encoding two distinct compressed data structures (that may or may not be contained in the same file) linked by a bidirectional indexing mechanism. Said data structures contain:
In particular, the encoded information is contained in a hierarchical structure, as the one described in the present disclosure, linking:
Current state of the art technologies allow the representation of the different sources of information needed for genomic data annotation and variant calling separately (aligned reads with SAM/BAM/CRAM files, genome annotations with GTF/GFF3 files, variants with VCF/BCF files, plus various indexing file formats required to implement range searches). They do not support explicit representation of bi-directional relations between different entities. Moreover, a software analysis workflow (or “pipeline”) performing variant calling needs to operate on different file formats depending on the analysis stage, rather than on a single data structure as provided by the present disclosure. It is possible to represent different sources of information as a single genome browser, but that requires the manipulation of a number of different file formats, and there is no way to specify to the genome browser that features belonging to different files are correlated.
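As an illustration, and not as a limitation, an explicit bi-directional relation between entities can be sketched with a pair of lookup tables that are always updated together. This is an in-memory toy model, not the compressed indexing mechanism of this disclosure; the class `BidirectionalIndex`, its methods and the identifiers `read_1`, `read_7`, `variant_A` and `gene_XYZ` are hypothetical.

```python
from collections import defaultdict

class BidirectionalIndex:
    """Toy relation that can be traversed in both directions."""

    def __init__(self):
        self._forward = defaultdict(set)
        self._reverse = defaultdict(set)

    def link(self, source, target):
        # Both directions are recorded in a single step.
        self._forward[source].add(target)
        self._reverse[target].add(source)

    def targets_of(self, source):
        return set(self._forward.get(source, ()))

    def sources_of(self, target):
        return set(self._reverse.get(target, ()))

# "This set of reads supports this variant, which is contained in this feature."
reads_to_variant = BidirectionalIndex()
variant_to_feature = BidirectionalIndex()
for read_id in ("read_1", "read_7"):
    reads_to_variant.link(read_id, "variant_A")
variant_to_feature.link("variant_A", "gene_XYZ")

def reads_supporting_feature(feature):
    """Walk the relations backwards: feature -> variants -> supporting reads."""
    reads = set()
    for variant in variant_to_feature.sources_of(feature):
        reads |= reads_to_variant.sources_of(variant)
    return reads

print(sorted(reads_supporting_feature("gene_XYZ")))
```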
In an embodiment, this invention presents important technical advantages for the use case of variant calling analysis as described in the text below.
The advantages of the present method with respect to state of the art solutions in terms of efficient data retrieval for variant calling analysis are the following.
State of the art technologies support the representation of the different pieces of information needed for the described use cases by using different data structures and formats (aligned reads with SAM/BAM/CRAM file formats, genome annotations with GTF/GFF3 file formats, variants with VCF/BCF file formats, plus various types of independent indexing file formats used to implement only range searches). These state of the art technologies do not support the explicit representation and linkage of relations between different pieces of information. A pipeline performing variant calling needs to operate on different file formats depending on the analysis stage, rather than on a single compressed data structure selectively accessible as proposed in the present approach. Employing current state of the art technology it is possible to feed a genome browser with the different pieces of genomic information, but this requires a complex pre-processing stage consisting of the manipulation and parsing of a number of different file formats in non-compressed form. Moreover, there is no way of specifying to the genome browser for appropriate display, the correlation between annotations, biological features and sequencing data.
As an example, but not as a limitation, the method disclosed in this document addresses the drawbacks of existing solutions when trying to compile large databases of genomic variants. The scenario is similar to the one considered in the previous case, i.e. a setup where researchers or clinicians are trying to validate and collect genomic variants based on sequencing techniques. However, we now assume that said researchers or clinicians are interested in cataloguing a large number of variants (ideally all the variants in each genome) for a potentially very large number of individuals (one could think about initiatives trying to cover an increasing portion of the population, with the final goal of covering it in its entirety). In this example, one would first perform variant calling and generally follow the analysis steps described in the previous use case; the process would then be repeated for all samples. After that, the researcher would usually query information about the results of data analysis, such as “How many individuals possess this specific variant?”, or “Is this variant supported consistently in all the individuals considered?”, or “How many people in the sample have any of the variants contained in a given dataset of clinically relevant variants? And what is the list of such variants for each individual?” At the moment, there are ways of storing the list of variants, typically as VCF/BCF files; however, the sizes of such population-level files are very large, which makes querying them technically challenging, and only very limited querying capabilities (i.e., retrieving variants in a specified genomic interval) are possible.
The advantages of the present method with respect to state-of-the-art solutions are the following ones:
While storing large databases is possible by means of currently available formats such as VCF/BCF, the process is complex due to the complexity of the formats and the resulting files are relatively bulky due to the use of generic compression methods and because different sources of information are mixed together in the same record, making compression less efficient. In addition, formats such as VCF/BCF are not designed with complex queries in mind - it is only possible to query them by genomic range, in order to retrieve all the variants present in a genomic interval. Further filtering, such as selecting variants depending on whether they are present in some specified individual, must be performed separately. Finally, as described in the previous use case, there is no capability to cross information about genomic variants with other sources of information, such as lists of supporting sequencing reads or lists of functional genomic features.
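As an illustration, and not as a limitation, the population-level queries quoted in this use case can be sketched with an inverted index from each variant to the individuals carrying it. This is an in-memory toy model, not the compressed representation of this disclosure; the variant identifiers, the individual names and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical variant calls per individual.
calls = {
    "individual_1": {"chr1:100:A>G", "chr2:50:C>T"},
    "individual_2": {"chr1:100:A>G"},
    "individual_3": set(),
}

# Inverted index: variant -> set of individuals carrying it.
carriers = defaultdict(set)
for individual, variants in calls.items():
    for variant in variants:
        carriers[variant].add(individual)

def carrier_count(variant):
    """How many individuals possess this specific variant?"""
    return len(carriers.get(variant, ()))

def present_in_all(variant):
    """Is this variant present in all the individuals considered?"""
    return carriers.get(variant, set()) == set(calls)

def relevant_variants_per_individual(panel):
    """Which variants of a given panel does each individual carry?"""
    return {
        individual: sorted(variants & panel)
        for individual, variants in calls.items()
        if variants & panel
    }

panel = {"chr2:50:C>T", "chr3:7:G>A"}
print(carrier_count("chr1:100:A>G"), relevant_variants_per_individual(panel))
```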
As an example, but not as a limitation, the method disclosed in this document addresses the drawbacks and inefficiencies of current solutions when trying to determine biological mechanisms through which particular phenotypes originate. This is achieved by coding in the same compressed data structure several pieces of information (for instance, a number of “omics” sequencing-based experiments). The identification of complex molecular mechanisms requires the combination of a number of experimental techniques, each one probing a different cell compartment (for instance, ChIP-seq experiments investigating chromatin structure, bisulfite-sequencing experiments determining genome methylation, and RNA-seq experiments determining how transcription is regulated).
Molecular mechanisms underlying genotypes are determined by analysing the interaction and correlation between patterns occurring concurrently in different cell compartments when the same biological condition is sequenced. Chromatin markers are determined as peaks in ChIP-seq tracks, which are obtained by accumulating alignments to the reference genome; methylation patterns are obtained by special alignment pipelines able to process BS-seq data, as bisulfite treatment generates reads with modified bases whose sequence is not present in the original genome; RNA-sequencing data is processed by ad-hoc alignment pipelines able to perform spliced alignments, as the cell machinery derives RNA sequences by chaining together one or more blocks of genomic sequences (“exons”) and discarding the sequences occurring between blocks (“introns”), which gives rise to sequences which are not present in the original genome; and so on, depending on the specific “omics” experiment being considered.
The data generated by each “omics” experiment usually require a complex analysis pipeline, each one tailored to the type of sequences generated by the specific biological protocol employed (ChIP-seq, BS-seq, RNA-seq, etc.). Each pipeline usually requires a variety of types of data (genome sequence, genome annotation, sequencing reads, read alignments, sequencing coverage, sequencing pileup), each one typically stored in a different file and represented using a different file format, to be considered and correlated. In particular, it is not possible with current technologies to explicitly state relations such as “in a given biological condition this set of sequencing reads, aligned to this range of positions in the genome, supports this ChIP-seq peak, which is correlated with particular patterns of RNA expression and genomic/histone methylation” as the different entities (aligned reads, ChIP-seq peaks, methylation patterns, genomic features, different biological conditions) are represented separately in different files.
In an embodiment, genomic data processing and visualisation are improved by means of the present invention by presenting in the same compressed data structure:
In particular, the joint compressed data structure contains a hierarchical organization, as the one described in the present disclosure, linking:
The advantages of the present method with respect to existing solutions in terms of efficient data retrieval for correlating information coming from several “omics” experiments are listed below.
Existing technologies allow users to represent the different sources of information needed for this use case separately (aligned reads with SAM/BAM/CRAM files, genome annotations with GTF/GFF3 files, ChIP-seq peaks, RNA expression levels and other “omics” features with other file types, plus various indexing file formats required to implement range searches). They do not support the explicit representation of relations between different entities. A pipeline performing analysis of each kind of “omics” data needs to operate on different file formats depending on the analysis stage, rather than on a single compressed data structure as proposed in the present approach. It is possible to present different sources of information in a single genome browser, but that requires the manipulation of a number of different file formats, and there is no way of describing to the genome browser that features belonging to different files are correlated.
With reference to WO 2018/068827A1, WO 2018/068828A1 and WO 2018/068830A1 throughout this disclosure, an Access Unit (AU) is defined as a logical data structure containing a coded representation of genomic information to facilitate bitstream access and manipulation. It is the smallest data organization that can be decoded by a decoding device implementing the invention described in this disclosure. An Access Unit is characterized by header information and a payload of compressed data structured as a sequence of blocks, each one possibly compressed using a different compression scheme.
The invention described in this document introduces new Access Unit types containing genomic annotation data such as genomic features, functional annotations, browser tracks, genomic variants, gene expression information, contact matrices and genotype data.
In the context of this disclosure the following definitions apply:
A collection of one or more genomic datasets is called dataset group.
read class: ISO/IEC 23092, WO 2018/068827A1, WO 2018/068828A1, WO 2018/068830A1 and WO2018152143A1 specify how genome sequence reads are classified and encoded according to the result of the alignment of said reads on a reference genome. According to the type and number of mapping errors, each read or read pair is assigned to a different class.
AU class: each AU contains reads belonging to a single class.
Annotation data type: in the context of this disclosure, annotation data types characterize the set of genomic annotation information included in one of these categories: genomic features, functional annotations, browser tracks, genomic variants, gene expression information, contact matrices, genotype data, genomic samples information.
In the context of this disclosure, genomic annotation descriptors are syntax elements representing part of the information (and also elements of a syntax structure of a file format and/or a bitstream) necessary to reconstruct (i.e. decode) coded reference sequences, sequence reads, associated mapping information, annotations, browser tracks, genomic variants, gene expression information, contact matrices and other annotations associated with genome sequencing data. The genomic annotation descriptors which are common to all annotation data types disclosed in this invention are listed in Table 1.
Other descriptors specific to each annotation data type are disclosed in the syntax and semantics table devoted to each annotation data type.
Textual descriptors are those represented as strings of characters, while numeric descriptors are those represented by numerical values.
Genomic annotation descriptors can be of three types:
According to the method disclosed in this invention, genomic annotations, browser tracks, genomic variants, gene expression information, contact matrices and other annotation data types associated with genome sequencing data are coded using a sub-set of the descriptors listed in Table 1, which are then entropy coded using a multiplicity of entropy coders according to the specific statistical properties of each descriptor. This means that different types of descriptors are grouped together and coded with different entropy coders, thereby attaining higher compression. Blocks of compressed descriptors with homogeneous statistical properties are structured in Access Units, which represent the smallest coded representation of one or more genomic features that can be manipulated by a device implementing the invention described in this disclosure.
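As an example, but not as a limitation, the grouping of descriptors by type before compression can be sketched as follows. This is a minimal illustration and not the coding scheme of the disclosure: the descriptor names (`pos`, `name`, `description`), the pairing of descriptors with codecs and the function names are all assumptions, and general-purpose compressors from the Python standard library stand in for the entropy coders.

```python
import bz2
import lzma
import zlib

# Hypothetical pairing of descriptor names with codecs (an assumption for
# illustration; the disclosure does not prescribe these pairings).
CODECS = {
    "pos": (zlib.compress, zlib.decompress),
    "name": (bz2.compress, bz2.decompress),
    "description": (lzma.compress, lzma.decompress),
}


def encode_streams(records):
    """Group values by descriptor, then compress each group with its own codec."""
    streams = {}
    for name, (compress, _) in CODECS.items():
        # Values of one descriptor are joined into one homogeneous byte string.
        joined = "\n".join(str(rec[name]) for rec in records).encode("utf-8")
        streams[name] = compress(joined)
    return streams


def decode_streams(streams):
    """Decompress each descriptor stream and reassemble the records.

    All values come back as strings; values must not contain newlines.
    """
    columns = {}
    for name, (_, decompress) in CODECS.items():
        columns[name] = decompress(streams[name]).decode("utf-8").split("\n")
    n = len(next(iter(columns.values())))
    return [{name: columns[name][i] for name in CODECS} for i in range(n)]
```

Keeping values of the same descriptor together gives each coder an input with more uniform statistics than a record-by-record layout, which is the property the paragraph above relies on.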
Genomic annotation descriptors are organized in blocks and streams as defined below.
A block is defined as a data unit composed of a header and a payload; the payload is composed of portions of compressed descriptors of the same type.
A descriptor stream is defined as a sequence of encoded descriptor blocks used to decode a descriptor of a specific Data Class.
This disclosure specifies a genomic information representation format in which the relevant information is efficiently compressed to be easily accessible, transportable, storable and browsable and for which the weight of any redundant information is reduced.
The main innovative aspects of the disclosed invention are the following.
In the following, each of the above aspects will be further described in detail.
Data on genomic variants is encoded using the common descriptors introduced above and the specific descriptors listed below.
Data on functional annotation describes genes and their content - spliced transcripts, with their biological function, in terms of their constituent exons; and information about the transcripts, such as, whenever applicable, their decomposition into UTRs, start and stop codon, and coding sequence. It is encoded using the common descriptors introduced above and the specific descriptors listed below.
sizeof() is a function which returns the number of bits necessary to represent each attribute value according to the type_ID defined in the attribute type.
Data for a track represents a numerical value associated with each position in the genome - a typical example would be the coverage of sequencing reads at each position as produced by an RNA- or ChIP-sequencing experiment. Data can be provided at different pre-computed zoom levels, which is desirable when the information is being displayed in a genome browser. Data is encoded using the common descriptors introduced above and the specific descriptors listed below.
Genotype information data expresses the set of genomic variants present at each position of the genome for an individual or a population of individuals. It is encoded using the common descriptors introduced above and the specific descriptors listed below.
Information on samples describes meta-information on specific biological samples on which the sequencing experiment has been conducted, such as collection date and location, sequencing date, etc. Sample information data is encoded using the specific descriptors listed below.
Information on expression associates some genomic range (typically corresponding to a gene, a transcript or another feature in the genome) with one or more numerical values - each value would correspond to a biological condition that has been tested during a separate experiment. Expression data is encoded using the specific descriptors listed below.
Contact information data is encoded using the specific descriptors listed below.
The present invention introduces a compressed representation of annotation data associated with genome sequencing data in the form of a bitstream syntax described below. The syntax is described in terms of the concatenation of data structures composed by elements characterized by a data type.
In the following description the following syntax notation is adopted.
The present disclosure extends the data structures specified in ISO/IEC 23092-1 in order to support the transport of coded genomic annotation in the bitstream syntax specified in ISO/IEC 23092-1.
The dataset group syntax is the same as the one specified in ISO/IEC 23092-1.
In ISO/IEC 23092-1 a dataset is a data structure containing a header, Master configuration parameters in a parameter set, an indexing structure and a collection of access units encoding genomic data. Dataset types are extended to carry genomic annotation data of different types specified by different “dataset-type” values.
This is a box describing the content of a dataset.
This data structure extends the reference data structure specified in ISO/IEC 23092 to support the bitstream syntax specified in this disclosure.
The present disclosure describes how to encode (i.e. compress) the annotation data portion composed of textual information elements associated with genome sequencing reads, other non-textual genomic annotations and sequences derived from the genome, so as to make the textual elements searchable in the compressed domain. Examples include:
Said information is compressed using a suitable data structure, such as, as an example and not as limitation, compressed string pattern matching data structures. Representatives of compressed string pattern matching data structures are, as examples and not as limitations, compressed suffix arrays, FM-indexes, and some categories of hash tables. Such (compressed) data structures are used to perform string pattern matching, and to carry in compressed form the textual portion of the annotational data being added to the compressed bitstream either in the file header or as a payload of an Access Unit. For clarity, in this disclosure all algorithms belonging to one of these data structure categories will be referred to as “string indexing algorithm”.
As an example, but not as a limitation, this disclosure describes how to encode the textual portion of the different annotation data types and the genomic reads by using a combination of compressed string indexing algorithms. Several families of string indexing algorithms exist, and each family can be parameterized by a number of parameters, which specify the balance between compression performance and querying speed. We use for compression a pre-determined set of compressed string indexing algorithms, each one specified by the choice of a compressed string indexing algorithm family and by a choice of parameters for that family. The set of algorithms is sorted by the compression level attained, and, depending on the desired trade-off between compression rate and querying speed, one specific algorithm can be selected when encoding. This choice is specified in the parameter set of the compressed bitstream.
As an example, but not as a limitation, the chosen compressed string indexing algorithm is separately or jointly applied to the concatenation of:
Applying a compressed string indexing algorithm to said information produces a compressed and indexed representation which can be queried for the presence of arbitrary substrings. In particular, a combination of exact substring searches can be used to perform inexact substring searches, for example searches that retrieve all occurrences of substrings with up to a specified number of deviations (mismatches/errors) from the specified pattern. This process enables querying for a piece of textual information the genomic annotations considered or produced during analysis and re-analysis of sequencing data, in a single query. This is possible if:
The following text and data structures describe an embodiment of this method for the indexing and search of genomic annotation data compressed and embedded in access units of a bitstream compliant with MPEG-G (ISO/IEC 23092).
The table below shows the textual information indexed and compressed using a string indexing algorithm per each genomic annotation type according to the method described in this document. For each Access Unit, textual descriptors of each type are concatenated using a string separator and record indexing information as shown in
This table describes the indexing criteria and indexing tools applied to Access Units for each genomic annotation data type.
The Master Annotation Index (MAI) is an indexing tool which provides for annotation data the indexing capabilities that the MIT defined in ISO/IEC 23092-1, WO 2018/068827A1, WO 2018/068828A1 and WO 2018/068830A1 provides for sequence reads.
Master Annotation Index Header
num_mai_AU_types is the number of AU types indexed by MAI. A value of 0 signals that no indexing is provided by the MAI.
mai_AU_type[i] is the i-th AU type indexed by the MAI. The array mai_AU_type[] shall contain unique values, that is, each AU type value can appear only once in the array mai_AU_type[].
num_mai_indexes[i] is the number of MAI indexes for the AU type mai_AU_type[i].
When encoding an Access Unit of each genomic annotation data type, textual descriptors belonging to data encoded in said access unit are concatenated and compressed using a compressed string indexing algorithm as defined in this disclosure.
The table below lists which strings are encoded in a MAI, for each data type. The specified list of strings determines the value numStrings that is required in some of the following description for MAI. numStrings is the number of textual fields per genomic annotation record indexed using the method described in this invention.
A String Index block is a portion of a Master Annotation Index that encodes one or more strings for each Record, for a variable number of Access Units each containing a variable number of Records. The Master String Index also allows string pattern matching queries to be performed on the original text and the matching strings to be retrieved.
The list of strings encoded within a String Index is referred to in the following as “compressed index”. The list of strings obtained by decoding a compressed index from a String Index is referred to in the following as “uncompressed index”.
The String Index provides the following functionalities:
Inputs to this process are:
The number of strings encoded for each record shall be the same for all records, and it shall correspond to the variable numStrings as specified in the description below.
The uncompressed index encoded within compressed_index contains a list of strings and the associated optional record indexes, ordered per Access Unit (following the same order of the Access Units in Table 4) and, for each Access Unit, per Record (following the same order of the Records within the Access Unit). The total number of strings in the uncompressed index is totNumRecords∗numStrings, where totNumRecords is the total number of records of all Access Units identified by au_id[], and numStrings is a counter of all strings compressed using said compressed indexing algorithm.
The uncompressed index specified as:
An example, with numStrings equal to 3, of the uncompressed index specified in this disclosure is provided in
record_index[i] (rec_idx) is an optional field whose presence is signaled by setting the most significant bit on all the bytes of record_index[i]. Setting the most significant bit also prevents false-positive results when searching for sub-strings, since all bytes in the string[i][j] field have the most significant bit unset, as specified in this disclosure for the string[i][j] element.
When record_index[i] is present and it is N bytes long, it represents a non-negative integer value as specified in the following expression:
where recordIndexValue[i] corresponds to the 0-based index, within the corresponding Access Unit, of the Record corresponding to string[i][] strings.
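As an example, but not as a limitation, one byte layout that satisfies the constraints stated above (every byte of record_index[i] has its most significant bit set, and the field represents a non-negative integer) can be sketched as follows. The layout is an assumption made for illustration only: each byte is taken to carry 7 payload bits, most significant group first, and the function names are hypothetical. The normative expression is the one given in this disclosure.

```python
def encode_record_index(value: int) -> bytes:
    """Encode a non-negative integer with the MSB set on every byte.

    Assumed layout: 7 payload bits per byte, most significant group first.
    """
    if value < 0:
        raise ValueError("record index must be non-negative")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    # Set the most significant bit on each byte so that no byte can be
    # confused with a string byte, whose most significant bit is unset.
    return bytes(0x80 | g for g in reversed(groups))


def decode_record_index(data: bytes) -> int:
    """Inverse of encode_record_index under the same assumed layout."""
    value = 0
    for byte in data:
        if not byte & 0x80:
            raise ValueError("byte without MSB set is not part of a record index")
        value = (value << 7) | (byte & 0x7F)
    return value
```

Because string bytes always have the most significant bit unset, a search pattern built from string bytes cannot match inside a record index encoded in this way.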
In the context of this disclosure, record_index[i] is referred to as “genomic annotation record index data”.
string[i][j] is the jth encoded string of the ith record. The strings shall be ordered per Access Unit (following the same order of the Access Units in Table 4) and, for each Access Unit, per Record (following the same order of the Records within the Access Unit)
string_terminator is a single byte equal to 0x0A (i.e. ‘\n’).
The positions within the uncompressed index of a given substring are searched with String Index as specified in the following pseudocode:
The String Index is decoded between a given start and end positions, inclusive, as specified in the following pseudocode:
Given a position within the uncompressed index, e.g. one position from the list of positions returned by SI_search_substrings () as specified in this disclosure, the corresponding whole string and its start position within the uncompressed index are decoded with the String Index as specified in the following pseudocode:
Given a position, within the uncompressed index, of a byte that belongs to a string encoded in the compressed index, e.g. one position from the list of positions returned by SI_search_substrings () as specified in this disclosure, the Access Unit ID of the Access Unit that contains the said string, the index of the Record that contains the said string, and the index of the said string within the said Record are decoded with the String Index as specified in the following pseudocode:
The position, within the uncompressed index, of the first string of a given Access Unit is retrieved with the String Index as specified in the following pseudocode:
The position, within the uncompressed index, of a string at a given index within a Record, where the Record is at a given index within a given Access Unit, is retrieved with the String Index as specified in the following pseudocode:
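As an example, but not as a limitation, the search and decoding operations listed above can be illustrated on the uncompressed index itself. The sketch below is not the pseudocode of this disclosure: it scans an in-memory byte string instead of querying a compressed index, it omits the optional record_index fields, and all function names other than the 0x0A terminator are hypothetical.

```python
TERMINATOR = b"\n"  # string_terminator, 0x0A


def build_uncompressed_index(access_units):
    """access_units: list of AUs; each AU is a list of Records; each Record
    is a list of numStrings strings. record_index fields are omitted here."""
    out = bytearray()
    for au in access_units:
        for record in au:
            for s in record:
                out += s.encode("ascii") + TERMINATOR
    return bytes(out)


def search_substrings(index, pattern):
    """Return every start position of pattern (bytes) in the index."""
    positions = []
    start = index.find(pattern)
    while start != -1:
        positions.append(start)
        start = index.find(pattern, start + 1)
    return positions


def decode_string_at(index, pos):
    """Return (start position, whole string) for the string containing pos."""
    start = index.rfind(TERMINATOR, 0, pos) + 1
    end = index.find(TERMINATOR, pos)
    return start, index[start:end].decode("ascii")


def locate(index, pos, records_per_au, num_strings):
    """Return (AU index, Record index within the AU, string index within the Record)."""
    string_no = index.count(TERMINATOR, 0, pos)  # strings completed before pos
    record_no, string_idx = divmod(string_no, num_strings)
    for au_idx, n_records in enumerate(records_per_au):
        if record_no < n_records:
            return au_idx, record_no, string_idx
        record_no -= n_records
    raise ValueError("position is outside the index")
```

In this simplified form the Access Unit is identified by its ordinal position, and the number of Records per Access Unit is passed in explicitly; a compressed index would answer the same queries without materialising the uncompressed text.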
According to the principle of this invention a string index is constructed from textual descriptors using a string transformation method as follows:
Numeric descriptors are represented as numerical values and textual descriptors are represented as strings of characters.
In order to compress the resulting string index, the result of the transformation is then further transformed using a compressed full-text string indexing algorithm such as compressed suffix arrays, FM-indexes, and some categories of hash tables.
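As an example, but not as a limitation, the principle by which an FM-index answers substring queries can be sketched as follows. This is a deliberately naive illustration, not the compressed data structure used by an implementation of this disclosure: the suffix array is built by plain sorting, the occurrence table is stored uncompressed, and the function names are hypothetical. It assumes the text does not contain the NUL character, which is used as a sentinel.

```python
from collections import Counter


def build_fm_index(text: str):
    """Build the tables needed for backward search over text."""
    text += "\0"  # sentinel, smaller than any other character
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    # first[ch] = number of characters in text strictly smaller than ch.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[ch][k] = number of occurrences of ch in bwt[:k].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for k, ch in enumerate(bwt):
        for c in occ:
            occ[c][k + 1] = occ[c][k] + (c == ch)
    return first, occ, len(bwt)


def count_occurrences(fm, pattern: str) -> int:
    """Count occurrences of a non-empty pattern using backward search."""
    first, occ, n = fm
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

The query touches only the two tables and never the original text, which is why the presence of a substring can be checked without decompressing the indexed strings. Practical FM-indexes replace the full occurrence table with compressed bitvectors or wavelet trees.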
Interleaving information related to genomic annotations with the positions of the genomic annotation records makes it possible to browse the compressed genomic annotation data according to criteria such as the presence of a string in a record or the genomic interval a genomic record is associated with. Said browsing is performed by specifying textual strings or substrings and retrieving all genomic annotation records containing said text as part of the coded annotation.
An example of an implementation of this construction method is provided in
The textual descriptors associated with each genomic annotation type described in this disclosure to build the string index as described above and in
By building the compressed string index as described above, it is possible to reconstruct the genomic annotation related to one string descriptor by following the process below.
The goal of this process is to decode all Access Units containing annotation data related to a string identifier specified by a user who is searching for example a variant name or description thereof, genomic feature name or description thereof or any other textual descriptor associated with a coded genomic annotation.
Search the desired name or description by calling the function SI_search_substrings () specified above. If the specified string “str” is present in the compressed index, this call returns one or more positions (named “pos” in this example) as specified in section “Search for substring positions with the String Index”. The Access Unit ID of the Access Unit that contains said string “str”, the index of the Record that contains said string “str”, and the index of said string within the said Record are decoded with the String Index described above in this disclosure as described in the following points:
This clause extends the Access Unit syntax specified in ISO/IEC 23092-1 with support of genomic annotations data type encoding.
AU Header
This method provides a unified approach over all the different annotation data types, regardless of their nature, and gives room for future indexing/filtering tools based on the presence of a specific attribute.
The information on variants is coded in the data structures described in this section, while the information on samples (e.g. genotyping) are coded in a separate dataset.
This structure in the parameters set contains Master parameters related to variant coding.
Records shall be sorted per ascending value of pos. Positions are then coded differentially. NB: ref_len, ref, alt_len, alt, q_int can be coded as “payload” in the unified record structure; info as attributes.
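As an example, but not as a limitation, differential coding of sorted positions can be sketched as follows. The function names are hypothetical, and the sketch assumes that the first position is coded as its difference from zero; the disclosure does not state the reference value used for the first record.

```python
def delta_encode(positions):
    """Replace each position by its difference from the previous one."""
    deltas, previous = [], 0
    for pos in positions:
        if pos < previous:
            raise ValueError("positions must be sorted in ascending order")
        deltas.append(pos - previous)
        previous = pos
    return deltas


def delta_decode(deltas):
    """Rebuild the positions by accumulating the differences."""
    positions, current = [], 0
    for d in deltas:
        current += d
        positions.append(current)
    return positions
```

Because the records are sorted, the differences are non-negative and typically much smaller than the absolute positions, which makes them cheaper to entropy code.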
Genomic annotation records for variants are coded using the common genomic annotation descriptors and the genomic annotation descriptors specific to variants as described in this disclosure.
Info values are compressed as attributes as described in this disclosure.
ref and alt information
This structure in the parameters set contains global configuration parameters related to the coding of functional annotations data types.
Genomic annotation records for functional annotations are coded using the common genomic annotation descriptors and the genomic annotation descriptors specific to functional annotations as described in this disclosure.
Compression of descriptors for annotations
This structure in the parameters set contains global parameters related to browser tracks coding.
Genomic annotation records for browser tracks are coded using the common genomic annotation descriptors and the genomic annotation descriptors specific to browser tracks as described in this disclosure.
Compression of descriptors of tracks
A dataset of type genotype contains coded information related to genotyping information of individuals or populations.
This structure in the parameters set contains global configuration parameters related to genotype information coding.
Genotype format fields
A = one value per alternate allele
R = one value for each possible allele including the reference
G = one value per genotype
Genomic annotation records for genotype information are coded using the common genomic annotation descriptors and the genomic annotation descriptors specific to genotype information as described in this disclosure.
All the information is compressed as attributes, as described in this disclosure. Special cases, such as GT and LD fields, are first split in subsequences identified by subsequenceID as described below.
This structure in the parameters set contains global configuration parameters related to the coding of information about samples.
Genomic annotation records for samples information are coded using the genomic annotation descriptors specific to samples information as described in this disclosure.
This dataset codes only the actual expression matrix. The features are stored in access units of type AU_ANNOTATION and the samples in access units of type AU_SAMPLE.
This structure in the parameters set contains global configuration parameters related to the coding of expression information.
format_ID identifies a format field present in the coded records. The semantics of each identifier is provided in Table 12.
Genomic annotation records for expression information are coded using the genomic annotation descriptors specific to expression information as described in this disclosure.
The compression strategy is the same as for the Genotype datasets: all the information is mapped into attributes and compressed, as described in the section titled “Compression of Attributes”. This allows more than one value to be stored for each element of the matrix, thus combining in a single record information such as counts, tpm, probabilities etc., with different types and semantics.
A special approach is used for sparse matrices, where, for each record, only the non-zero values are recorded, together with an array of the corresponding positions and the total number of entries.
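As an example, but not as a limitation, the sparse representation of one record can be sketched as follows. The field names are hypothetical, and the sketch assumes that “the total number of entries” denotes the full length of the row, zeros included, which is what is needed to rebuild the dense row.

```python
def to_sparse(row):
    """Keep only the non-zero values, their positions and the row length."""
    positions = [i for i, v in enumerate(row) if v != 0]
    values = [row[i] for i in positions]
    return {"num_entries": len(row), "positions": positions, "values": values}


def from_sparse(record):
    """Rebuild the dense row from its sparse representation."""
    row = [0] * record["num_entries"]
    for i, v in zip(record["positions"], record["values"]):
        row[i] = v
    return row
```

For expression or contact matrices in which most entries are zero, the three arrays are much shorter than the dense row, and the positions array, being sorted, lends itself to differential coding.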
Contact matrices (a.k.a. contact maps) are generated by Hi-C experiments and represent the spatial organization of a DNA molecule in the cell nucleus. The two dimensions are genomic positions. The contact matrix value at each coordinate represents a count of how many times the two positions in the nucleotide sequences have been measured to interact.
This structure in the parameters set contains global configuration parameters related to the coding of information on contact matrices.
format_ID identifies a format field present in the coded records. The semantics of each identifier is provided in Table 12.
Genomic annotation records for contact matrices are coded using the genomic annotation descriptors specific to contact information as described in this disclosure.
The compression strategy is the same as for the Expression information datasets.
Attributes
Attributes are compressed using as many subsequences as the number of attributes in the parameter set, plus one.
This section describes how structured values are represented in this disclosure.
This is a structure used to represent numerical values with their sizes in bits.
Type identifiers
Array identifiers
Data blocks are structures containing the compressed descriptors and encapsulated in Access Units. Each block contains descriptors of a single type, which is identified by an identifier contained in the block header.
The present invention removes a number of problems present when using state of the art technologies. In particular:
As an example but not as a limitation, we illustrate the concept by combining two different families of compressed suffix arrays. Family [1] uses bitvectors implemented as described in Raman, Rajeev, Venkatesh Raman, and S. Srinivasa Rao. 2002. “Succinct indexable dictionaries with applications to encoding k-ary trees and multisets.” In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2002), 233-242. Family [2] uses bitvectors implemented as described in Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi. Hybrid Compression of Bitvectors for the FM-Index. In Proc. 2014 Data Compression Conference (DCC 2014), IEEE Computer Society, 2014, pp. 302-311. As shown in the
The output of each entropy coder is fed to an Annotation Data Access Unit coder 23 to produce Annotation data Access Units 25. The Uncompressed Master Annotation Index 210, output of the descriptor string index transformation unit 26 is fed to an Annotation data indexing coder 28 to produce Master Annotation Index Data 29. One annotation data index is associated with one or more Annotation data Access Units.
The transformations applied by the descriptors transformation units 21 and 27 used in the encoding apparatus include:
The transformations applied by the annotation data indexing coder 28 include:
The advantage of applying said transformation to numerical descriptors is that it improves compression efficiency without loss of information, as is known to any person skilled in the art.
Coding of string descriptors is made more efficient by said transformation as the transformed representation is more efficiently browsable and searchable for sub-strings. Once the original text is transformed, the presence of sub-strings can be verified without decompressing the whole text.
A decoding apparatus implemented according to the principles of this disclosure extends the functionality of a decoding apparatus compliant with ISO/IEC 23092 as depicted in
A textual search apparatus implemented according to the principles of this disclosure extends the functionality of a decoding apparatus compliant with ISO/IEC 23092 as depicted in
The Master Index Table associates
A single query on a textual string “APOBEC” can retrieve all the associated annotations including the text “APOBEC” and associated coded sequence reads.
The Master Index Table associates
A single query on the genomic interval N can retrieve the coded sequence reads and all the associated annotations.
The inventive techniques herewith disclosed may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these may be stored on a computer medium and executed by a hardware processing unit. The hardware processing unit may comprise one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuitry.
The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets and similar devices.
Number | Date | Country | Kind |
---|---|---|---|
20169717.4 | Apr 2020 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/056766 | 3/17/2021 | WO |