The present invention relates to efficient storage and transfer of next generation sequencing data.
Tremendous progress over the past decade in the technology and adoption of next generation sequencing (NGS) has brought with it a rapid decline in cost, to the point where the price in 2015 of high-coverage sequencing of a whole human genome was $1,000. In parallel, scale has been growing quickly: the genomes of 228,000 individuals were sequenced in 2014. Global NGS capacity has doubled every 7 months in recent years and is projected to continue to double every 12 months in the near to medium-term future.
NGS is generating raw data at a rate that is projected to grow to 2-40 exabytes per year by 2025, eclipsing all other fields of science and technology. This raw data, while meant to undergo reduction by downstream processing, is nevertheless extensively shared and almost always archived. Its storage, transfer and management represent, therefore, a technological and economic challenge to the continued development of NGS. Data compression has proved to be an invaluable tool in many fields of technology, and it will play a key role in NGS.
The bulk of NGS data is stored in files conforming to one of a handful of de-facto standards. Reference is made to
FASTQ is the de-facto standard file format for storing the output data of NGS machines. FASTQ files are text-based and each machine output read is represented by four lines of text, as shown in
SAM (Sequence Alignment/Map) and BAM (Binary Alignment/Map) are the de-facto file formats used to store the output of short-read alignment programs such as BWA and Bowtie. SAM specifies a text format consisting of a header section, which is optional, followed by one or more alignment sections, each reporting on the alignment of one read, as shown in
Each alignment section consists of a line of text that represents the result of the alignment of one read, as shown in
An alignment section contains 11 mandatory fields that provide information on:
BAM is a compressed version of SAM. A BAM file is created by dividing a SAM file into blocks of up to 64 Kbytes, compressing each block to a gzip archive and concatenating the archives to create a single output file. In order to support random read operations into the BAM file, a companion BAM Index (BAI) file may also be created. For this purpose, the alignment sections in the SAM file must be ordered by position in the genome, and the BAI file contains a data structure that efficiently maps a position or a range within the genome to the offset in the BAM file of the relevant gzip block or blocks.
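The block-and-index scheme described above can be sketched as follows (an illustrative model only, not the actual BAM/BGZF and BAI formats, whose block framing and index layout are more elaborate). Each block is gzip-compressed independently, and a small index maps uncompressed positions to compressed offsets so that a single block can be decompressed for a random read:

```python
import gzip

def compress_blocks(data, block_size=64 * 1024):
    """Gzip each block separately and record, for each block, its
    uncompressed start position, compressed offset and compressed length."""
    index, out, pos = [], [], 0
    for start in range(0, len(data), block_size):
        comp = gzip.compress(data[start:start + block_size])
        index.append((start, pos, len(comp)))
        out.append(comp)
        pos += len(comp)
    return b"".join(out), index

def read_at(compressed, index, want):
    """Decompress only the block that covers uncompressed position `want`."""
    ustart, coff, clen = max(e for e in index if e[0] <= want)
    block = gzip.decompress(compressed[coff:coff + clen])
    return block[want - ustart:]
```

As in BAM/BAI, random access requires only the index lookup and the decompression of one block, rather than of the whole file.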
A number of algorithms form the basis of many lossless NGS data compression schemes.
One algorithm used for NGS data compression is word substitution. A field within a data format, also referred to as a symbol, may sometimes be longer than the number of bits strictly needed to encode its alphabet—the set of values that it can take. In such a case, a one-to-one mapping can be defined from each letter of the alphabet to a value of a shorter corresponding field that will be incorporated into the compressed format. E.g., FASTQ uses a byte to encode either one of the four DNA bases or an undefined readout (‘N’). This set of five letters can be encoded by just 3 bits. As 3 bits can actually represent 8 different letters, the compression ratio can be improved by using 7 bits to encode a triplet of bases, bringing efficiency up from 5/8 to 5³/2⁷ (125/128, or 98%).
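A minimal sketch of this triplet substitution for the five-letter base alphabet (function names are illustrative): each triplet maps to one of 5³ = 125 values, which fits in 7 bits.

```python
ALPHABET = "ACGTN"
CODE = {base: i for i, base in enumerate(ALPHABET)}

def pack_triplets(seq):
    """Encode each base triplet as a single value in 0..124 (7 bits)."""
    assert len(seq) % 3 == 0
    return [25 * CODE[seq[i]] + 5 * CODE[seq[i + 1]] + CODE[seq[i + 2]]
            for i in range(0, len(seq), 3)]

def unpack_triplets(vals):
    """Invert pack_triplets: recover the three base-5 digits of each value."""
    return "".join(ALPHABET[v // 25] + ALPHABET[v // 5 % 5] + ALPHABET[v % 5]
                   for v in vals)
```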
Another algorithm used for NGS data compression is probability-weighted encoding. The ratio of compression can be improved beyond what is achievable by word substitution in cases where the letters of a symbol's alphabet occur at unequal, but known, probabilities. Huffman coding maps symbols to variable-length code-words so that shorter code-words represent higher-probability letters and vice versa.
Reference is made to
A Huffman code is designed by building a binary tree, starting with a set of unconnected leaf nodes, each representing a letter of the symbol's alphabet. In the first step of the process, a new branch node is formed to serve as the parent of the two leaf nodes with the lowest combined probability. The newly created node is assigned the sum of the probabilities of its two child nodes. This process is repeated for the set of nodes at the roots of the sub-trees that are still unconnected, until they are all connected by the global root node. Starting now at the root, code-word prefixes are assigned to the branches of the tree by adding 0 or 1 to the prefix of the branch leading to the parent node. For the single branch node in
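The tree-building procedure described above can be sketched as follows (a simplified illustration; tie-breaking between equal-probability sub-trees is arbitrary):

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code from {letter: probability}: repeatedly merge the
    two lowest-probability sub-trees, then assign 0/1 prefixes from the root."""
    heap = [(p, i, letter) for i, (letter, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    n = len(heap)                      # tie-break counter for the heap
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, n, (left, right)))
        n += 1
    code = {}
    def assign(node, prefix):
        if isinstance(node, tuple):    # branch node: recurse into children
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                          # leaf node: record its code-word
            code[node] = prefix or "0"
    assign(heap[0][2], "")
    return code
```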
Modern sequencing machines generally assign a high quality score to the majority of decoded bases: for Illumina software, Version 1.8 and later, scores ‘A’ through ‘J’ (32 through 41 on the Phred scale) are, in most datasets, more common than ‘!’ through ‘@’. Huffman coding of quality scores improves compression ratio beyond that achievable by word substitution (which only takes advantage of the fact that the 42 quality score letters can be encoded by 5.4 instead of 8 bits).
The design of a Huffman code is restricted by the fact that the lengths of code-words are discrete; a Huffman code will only be optimal when all symbol probabilities are powers of 1/2. Encoding groups of symbols can improve efficiency, but at a cost of exponential increase in tree size. Arithmetic coding is an improvement over Huffman coding in this respect.
Arithmetic coding is based on the concept of dividing the interval [0, 1) into sub-intervals, one for each letter of the alphabet. The length of each interval is set to equal the probability of the letter it represents. Thus, for the example shown in
An arithmetic encoder is initialized with a state variable representing the interval [0, 1). The first symbol in the block to be compressed is read and the interval is narrowed to the sub-interval representing the symbol to be encoded, say [0.5, 0.75) for B. When the next symbol is read, the current interval is again narrowed to represent the relative sub-interval corresponding to the second symbol. Thus, if the second symbol is A then [0.5, 0.75) is narrowed down to [0.5, 0.625). This procedure is repeated for each successive symbol, yielding an ever narrower interval. At the end of the block of symbols, the output of the encoder is the number within the final interval that requires the least bits to encode. The number of output bits will be in inverse proportion to the size of the final interval and therefore to the joint probability of the letters in the block.
The arithmetic decoder too is provided with the mapping of letters to intervals. As shown in
Practical implementations of arithmetic coding are designed to produce interim outputs from the encoder as well as decoder. Arbitrarily long sequences of symbols can therefore be encoded and decoded with bounded memory requirements and delay.
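The interval-narrowing procedure can be illustrated with a floating-point toy model (practical coders, as noted above, use bounded-precision integer state with interim outputs; this sketch is only exact for short blocks). With letter probabilities of 0.5, 0.25 and 0.25 for A, B and C, encoding ‘B’ then ‘A’ narrows [0, 1) to [0.5, 0.75) and then to [0.5, 0.625), as in the example in the text:

```python
def intervals(probs):
    """Assign each letter a sub-interval of [0, 1) whose length
    equals the letter's probability."""
    table, lo = {}, 0.0
    for letter, p in probs.items():
        table[letter] = (lo, lo + p)
        lo += p
    return table

def encode(symbols, probs):
    table, lo, hi = intervals(probs), 0.0, 1.0
    for s in symbols:
        a, b = table[s]
        lo, hi = lo + (hi - lo) * a, lo + (hi - lo) * b
    return (lo + hi) / 2          # any number inside the final interval

def decode(x, probs, n):
    table, out = intervals(probs), []
    for _ in range(n):
        for letter, (a, b) in table.items():
            if a <= x < b:
                out.append(letter)
                x = (x - a) / (b - a)   # zoom into the letter's sub-interval
                break
    return "".join(out)
```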
It is well known that the DNA of most organisms has unequal proportions of the four bases (e.g., approximately 0.3, 0.3, 0.2 and 0.2 for A, T, C and G in the human genome). Arithmetic coding makes use of this information to compress the bases of FASTQ reads.
Yet another algorithm used for NGS data compression is context encoding. In many situations, the value of a symbol in a block of symbols to be compressed will be statistically correlated with the value of the preceding symbol or symbols. E.g., quality scores along a read will tend to show relatively small changes from one symbol to the next. Context encoding makes use of this information to improve compression ratio.
When processing a symbol, a Huffman or arithmetic encoder can be instructed to pick one out of a number of probability distributions (sets of letter likelihoods). If this choice is made based on information that is also available to the decoder, the encoder and decoder can be run in lockstep and ensure correct reconstruction of the original symbol. E.g., past symbols will already have been decoded by the decoder and can therefore form such a context for encoding.
Returning to the example of quality score lines in a FASTQ file, and keeping in mind the tendency of quality scores to change slowly over a read, a ‘J’ might most likely be followed by a ‘J’. Thus, encoding the symbol following a ‘J’ might best be done with a probability distribution that peaks at ‘J’, while symbols following other letters might use different distributions. As long as the encoder and decoder are provided with the same set of rules, they use the same probability distributions and thus stay in sync.
Refining the context, e.g., by using more than one past symbol, may improve the accuracy of the probability model. However, this comes at the expense of encoder and decoder memory. If no context is used, the probability distribution of quality scores must specify 42 entries. With one past quality score symbol as a context this increases to 42², continuing to grow exponentially with the number of context symbols.
Yet another algorithm used for NGS data compression is adaptive encoding. In most situations, symbol probability distributions are not known exactly. Continuing with the quality scores example, while a skewed distribution that favors high values is most likely to better represent the data than a uniform distribution, its shape will vary with sequencing machine model and sample characteristics. Arithmetic coding can be enhanced to adapt to symbol probabilities that are unknown a priori.
To operate adaptively, an arithmetic encoder keeps a table of running counts for each letter of the symbols' alphabet. When implementing context encoding adaptively, a separate such table is kept for each context (i.e., each possible set of context symbol values). Before encoding a symbol in a certain context, the current entries of that context's table are translated to a probability measure by dividing each by the sum of all table entries. These probabilities are used to encode the symbol as explained above, following which the count of the letter that was just encoded is increased. At start-up, the tables are initialized to a uniform (small) count that represents a uniform probability distribution or to another estimated distribution. A fixed deterministic rule is employed to perform periodic down-scaling of table entries to avoid overflow. The decoder maintains similar tables that are similarly initialized and normalized, and uses each decoded symbol to update the corresponding count. This ensures that the encoder and decoder operate in a coordinated fashion, adapting to the actual symbol probability distribution without requiring any side-information to be included in the compressed data.
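An illustrative sketch of such per-context adaptive count tables (the scale limit and uniform initialization are arbitrary choices for illustration, not those of any particular product):

```python
from collections import defaultdict

class AdaptiveModel:
    """One count table per context; counts start uniform and are
    deterministically halved near a limit, exactly as the decoder would."""
    def __init__(self, alphabet, scale_limit=1 << 16):
        self.scale_limit = scale_limit
        self.tables = defaultdict(lambda: {a: 1 for a in alphabet})

    def probabilities(self, context):
        """Translate the context's running counts into a probability measure."""
        table = self.tables[context]
        total = sum(table.values())
        return {a: c / total for a, c in table.items()}

    def update(self, context, symbol):
        """Count the symbol just coded; down-scale to avoid overflow."""
        table = self.tables[context]
        table[symbol] += 1
        if sum(table.values()) > self.scale_limit:
            for a in table:
                table[a] = max(1, table[a] // 2)
```

An encoder would call probabilities() before coding each symbol and update() after it; a decoder performing the same calls in the same order stays in lockstep without any side-information.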
Yet another algorithm used for NGS data compression is dictionary encoding. The ZIP file format, generated by a class of compatible lossless compression algorithms, is the best-known example of dictionary encoding. Its popularity stems from its fair performance on many types of data without requiring any characterization or prior information on that data.
As its name suggests, a dictionary encoder builds and maintains a dictionary of data sequences that it has previously encountered in the input data. At any point during the encoding process, the encoder will search for the longest dictionary entry that matches the next several symbols to be encoded. It will then, (i) encode these input symbols by the index into the dictionary of the matching entry; (ii) add a new entry into the dictionary created from the above matching entry concatenated with the immediately following symbol; and (iii) continue processing input data starting from the symbol immediately following the matching string. At initialization, the dictionary must contain an entry for each letter of the symbols' alphabet. A dictionary maintenance algorithm must run in parallel with encoding to keep its size under a pre-determined limit.
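The three steps above are those of the classic LZW scheme, which can be sketched as follows (byte-oriented for simplicity, with no dictionary size limit):

```python
def lzw_encode(data):
    """Emit, for each longest dictionary match, its index; then add the
    match concatenated with the following symbol as a new entry."""
    dictionary = {bytes([b]): b for b in range(256)}  # one entry per letter
    codes, current = [], b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decode(codes):
    """Rebuild the same dictionary from the code stream itself."""
    dictionary = {b: bytes([b]) for b in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        entry = dictionary.get(code, prev + prev[:1])  # code not yet in the
        out.append(entry)                              # dictionary: edge case
        dictionary[len(dictionary)] = prev + entry[:1]
        prev = entry
    return b"".join(out)
```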
E.g., a dictionary encoder processing the binary stream
FASTQ files are often compressed by ZIP-compatible software such as gzip, which typically achieves a compression ratio of 2.5-3.5. As with other types of data, tuning the compression algorithm to the specific characteristics of the data will improve compression.
G-SQZ pairs each FASTQ base with its corresponding quality score and uses Huffman coding to encode the combined two-part symbols. During a first pass over the file, the frequency distribution of all possible pair values is determined, serving as the basis for a Huffman code that is used during a second pass to encode the contents of read and quality score lines. Read identifiers are encoded separately, making use of recurring fields in adjacent identifiers.
KungFQ and FQC use an approach based on separately pre-processing the identifier, read and quality score lines; merging the three interim streams; and compressing the result with a ZIP-compatible encoder. Identifiers conforming to specific popular formats are delta encoded, but are left unchanged in other cases. Quality scores are encoded by run-length coding: long repetitions are encoded to the quality score value and number of repetitions. Bases are encoded by either triple-symbol word substitution or by run-length coding.
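The run-length coding of quality scores described above can be sketched as follows (a simplified model of this style of coding, not the actual KungFQ or FQC implementation): each run of repeated scores becomes a (score, repetition count) pair.

```python
def rle_encode(qualities):
    """Run-length coding: each run of a repeated quality score is
    encoded as (score, number of repetitions)."""
    runs = []
    for q in qualities:
        if runs and runs[-1][0] == q:
            runs[-1][1] += 1
        else:
            runs.append([q, 1])
    return [(q, n) for q, n in runs]

def rle_decode(runs):
    """Expand each (score, count) pair back into the original line."""
    return "".join(q * n for q, n in runs)
```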
DSRC2 implements a selection of compression schemes: bases are encoded by either word substitution, Huffman coding or arithmetic coding. Quality scores are encoded by one of: Huffman coding in the context of position within the read; Huffman coding in the context of the preceding symbol or run-length of symbols; or, arithmetic coding in the context of preceding symbols.
SCALCE attempts to identify redundancy that exists in overlapping reads that come from a high-coverage sample. For this purpose, it uses Locally Consistent Parsing (LCP) to pre-process reads prior to encoding: for each read, LCP identifies the longest substring (or substrings) that it shares with other reads. Reads are clustered based on shared strings and, within the cluster, ordered by the position of the shared string within the read. Finally, the reads are encoded in the above order by a ZIP-compatible encoder.
Quip performs de novo assembly (assembly without a reference genome) of reads into ‘contigs’, which are relatively long, contiguous sections of DNA. Reads are then encoded to their position within a contig. De novo assembly generally relies on de Bruijn graphs, and is memory-intensive. Quip uses a probabilistic data structure that is more efficient, at the cost of occasional miscalculations of k-mer counts. These result in failed assembly but only mean less efficient encoding of the few affected reads.
When a FASTQ file contains reads from a known organism, and a read (i) can be mapped to the right location in that organism's genome and (ii) contains a limited number of mutations or sequencing errors, then the read's base sequence can be efficiently encoded as the combination of its position within the reference genome and the set of mismatches between the read and the reference. This is called reference-based coding.
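A minimal sketch of reference-based coding (illustrative only; practical schemes must also handle insertions, deletions and clipping): a mapped read is reduced to its position plus a list of mismatching bases.

```python
def encode_read(read, reference, position):
    """Represent a mapped read as (position, [(offset, base), ...]),
    listing only the bases that differ from the reference."""
    mismatches = [(i, b) for i, b in enumerate(read)
                  if reference[position + i] != b]
    return position, mismatches

def decode_read(reference, position, mismatches, length):
    """Reconstruct the read by copying the reference and patching mismatches."""
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)
```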
SlimGene and samcomp read a SAM/BAM file that reports on the mapping of reads to the reference genome and encode a read's mapping position and mismatches to a combination of opcodes and offset values.
Fastqz and LW-FQZip include ‘lightweight’ mappers that attempt to find the location of each read in a reference genome. Mapping is based on creating an index of k-mers appearing in the reference genome. Each read is scanned base-by-base for a k-mer that is present in the index and, if a match is found, the rest of the read is compared against the reference genome to identify mismatches.
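The k-mer-index mapping approach can be sketched as follows (a simplified illustration; fastqz and LW-FQZip differ in their actual index structures and scanning rules):

```python
def build_index(reference, k=8):
    """Index every k-mer appearing in the reference by its positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def map_read(read, reference, index, k=8):
    """Scan the read for an indexed k-mer; on a hit, compare the whole read
    against the reference at the implied position and count mismatches."""
    best = None
    for i in range(len(read) - k + 1):
        for pos in index.get(read[i:i + k], []):
            start = pos - i
            if start < 0 or start + len(read) > len(reference):
                continue
            mism = sum(1 for j, b in enumerate(read)
                       if reference[start + j] != b)
            if best is None or mism < best[1]:
                best = (start, mism)
        if best and best[1] == 0:       # exact match found: stop scanning
            break
    return best                         # (position, mismatch count) or None
```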
To encode identifiers, unmapped reads and quality scores, fastqz uses ZPAQ, arithmetic coding software that includes a sophisticated toolbox of context modeling. ZPAQ is a bit-by-bit encoder, further slowed by its complex context modeling algorithm. To speed up encoding and decoding, fastqz includes a pre-processing step that marks up repeating prefixes in identifiers and pre-encodes runs of quality-score values.
As with FASTQ, specially tuned algorithms improve BAM compression ratio beyond gzip's.
SAMZIP separately optimizes the encoding of each alignment section tag, using combinations of delta, Huffman and run-length coding. Quality scores are run-length encoded.
NGC assumes availability of the reference genome and only explicitly encodes base mismatches. This is done by traversing read sequences ‘vertically’, i.e., in order of position in the reference genome and then read alignment position (i.e. starting position of the read). The mismatches are run-length encoded.
DeeZ encodes quality scores with an arithmetic coder. As to read sequences, DeeZ assumes that the majority of differences in bases between a read and its mapping locus on the reference genome are due to mutations (rather than sequencing errors) and are therefore shared with other reads mapped to the same locus. DeeZ obtains the ‘consensus’ of the reads mapped to a specific locus and encodes the differences between the consensus contigs and the reference genome only once.
CRAM defines a set of encoding algorithms that can be used to encode different SAM fields. The set includes beta (word substitution), run-length defined by either count or stop value, Huffman, Elias gamma (exponential), sub-exponential (linear and exponential sub-ranges), and Golomb or Golomb-Rice coding. In addition, CRAM allows the use of external data for rANS, a range-coder variant of Asymmetric Numeral System coding.
ADAM is one of several schemes that reformat a set of BAM files as a data structure that resembles a columnar database. This speeds up searches across multiple files for reads aligned to a given area in the genome. The columnar arrangement means that similar fields in a file and across files are now adjacent, providing an opportunity for more efficient representation. ADAM and similar schemes, however, are not compatible with BAM, requiring a re-write of file operations in all relevant applications.
Lossless compression is an invaluable tool in addressing the challenges inherent in the sheer volume of NGS data.
Various embodiments of the present invention losslessly compress FASTQ files at high compression ratios and fast processing rates, and are effective in application-transparent infrastructure products.
Identifier, read sequence and quality score lines are compressed by separately tuned algorithms, the outputs of which are combined into a single compressed file.
Identifier compression is completely general, while optimized for the most common format variations. Identifiers are tokenized and tokens are analyzed for common patterns such as repeats and constant increments. The result is then arithmetically encoded.
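By way of illustration, tokenization and delta analysis of identifiers might look as follows (a hypothetical sketch; the token classes and field names are illustrative, not the encoder of the present invention):

```python
import re

def tokenize(identifier):
    """Split an identifier into alternating numeric and non-numeric tokens."""
    return re.findall(r"\d+|\D+", identifier)

def delta_fields(prev_id, cur_id):
    """Classify each token against the previous identifier: repeats and
    constant numeric increments can be encoded in very few bits."""
    out = []
    for p, c in zip(tokenize(prev_id), tokenize(cur_id)):
        if c == p:
            out.append(("same",))
        elif c.isdigit() and p.isdigit():
            out.append(("delta", int(c) - int(p)))
        else:
            out.append(("literal", c))
    return out
```

The resulting token classes would then be fed to an arithmetic coder, as stated above.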
An internal, fast mapper maps read sequences to a reference genome. The mapper is orders of magnitude faster than BWA and Bowtie while achieving a 95% typical success rate for real-life data. For samples that do not come from a known organism, as well as for unmapped reads, the encoder uses a multi-variable, DNA-sequence optimized arithmetic coder.
Quality scores are encoded by an adaptive arithmetic coder that uses a complex, multi-variable context comprising, among others, preceding scores, position in the read and specific characteristics of different sequencing machine technologies.
Embodiments of the present invention perform encoding and decoding in streaming mode. This makes it possible for a storage server, a cloud server and a transporter to pipeline encoding/decoding with file serving, reducing to a minimum start-up delay and application response time.
Alternative embodiments of the present invention losslessly compress BAM files, drastically reducing their size for storage and transport. These embodiments are of advantage for infrastructure products that remain transparent to applications, by serving them with native format unmodified BAM files.
Separately tuned algorithms are used to encode the various BAM tag-value fields.
The redundancy that exists between reads and alignment/mapping tags is efficiently eliminated by reference-based coding, without requiring the presence of the reference genome that was originally used in creating the BAM file.
Quality scores are encoded with a multi-variable context, adaptive arithmetic encoder. The present invention resolves the inherent conflict between, on the one hand, the large encoding block size needed to train complex adaptive encoders and, on the other, the small block size needed for efficient random access for file read.
There is thus provided in accordance with an embodiment of the present invention a computer appliance for storage, transfer and compression of next generation sequence (NGS) data, including a front-end interface communicating with a client computer via a first storage access protocol, a back-end interface communicating with a storage system via a second storage access protocol, a compressor receiving native NGS data from an application running on the client computer via the front-end interface, adding a compressed form of the native NGS data into a portion of an encoded data file or data object, and storing the portion of the encoded data file or data object in the storage system via the back-end interface, and a decompressor receiving a portion of an encoded data file or data object from the storage system via the back-end interface, decompressing the portion of the encoded data file or data object to generate therefrom native NGS data, and transmitting the native NGS data to the client via the front-end interface, for use by the application running on the client.
There is additionally provided in accordance with an embodiment of the present invention a non-transitory computer readable medium storing instructions, which, when executed by a processor of a computer appliance, cause the processor to, in response to receiving a write request from an application running on a client computer, obtain native NGS data from the application, read a portion of an encoded data file or data object from a storage system, modify the portion of the encoded data file or data object, including adding a compressed form of the native NGS data into the portion of the encoded data file or data object, and transmit the modified portion of the encoded data file or data object for storage in the storage system, and, in response to receiving a read request from the application, read a portion of an encoded data file or data object from the storage system, decompress the portion of the encoded data file or data object and generate therefrom native NGS data, and transmit the native NGS data to the application.
The present invention will be more fully understood and appreciated from the following detailed description, taken in conjunction with the drawings in which:
For reference to the figures, the following index of elements and their numerals is provided. Similarly numbered elements represent elements of the same type, but they need not be identical elements.
Elements numbered in the 1000's and 2000's are operations of flow charts.
The following definitions are employed throughout the specification.
APPENDIX A is a pseudo-code listing for the processing threads of
In accordance with embodiments of the present invention, systems and methods are provided for storage, transfer and compression of next generation sequencing (NGS) data.
Reference is made to
Use of cache enables “storage tiering”. If cache 400 has faster access time than storage system 300, e.g., through use of solid-state disk drives, then reads from cache 400 are faster than reads from storage system 300 due to their not requiring decompression, and due to the fast storage medium.
Although not shown in
Appliance 100 includes a front-end interface 120 communicating with client computer 200 using a first storage access protocol, such as inter alia a file access protocol or an object store protocol. In some embodiments of the present invention, front-end interface 120 is a network file system (NFS) interface. Appliance 100 also includes a back-end interface 130 communicating with storage system 300 using a second storage access protocol, such as inter alia a file access protocol or an object store protocol. In some embodiments of the present invention, back-end interface 130 is a Swift object store interface. The first and second storage access protocols may be the same or different protocols.
Appliance 100 also includes a compressor 140 and a decompressor 150. Appliance 100 provides several services to client computer 200, including customized compression of NGS data in a manner that is transparent to NGS application 220. Specifically, using appliance 100 as an intermediary, NGS application 220 processes native NGS data, whereas storage system 300 stores encoded NGS data files or data objects.
Compressor 140 is programmed to receive native NGS data from application 220 via front-end interface 120, to add a compressed form of the native NGS data into a portion of an encoded data file or data object, and to store the portion of the encoded data file or data object in storage system 300 via back-end interface 130. Decompressor 150 is programmed to receive a portion of an encoded data file or data object from storage system 300 via back-end interface 130, to decompress the portion of the encoded data file or data object to generate therefrom native NGS data, and to transmit the native NGS data to client 200 via front-end interface 120, for use by application 220.
It will be appreciated by those skilled in the art that there are many variations to the architecture of
In a cloud-based embodiment of the present invention, appliance 100 is a virtual appliance that runs in a cloud environment such as that provided by Amazon Technologies, Inc. of Seattle, Wash.; client computer 200 is a virtual computer such as that provided as an Elastic Compute Cloud (EC2) instance by Amazon; storage system 300 is a cloud storage system such as Simple Storage Service (S3) provided by Amazon; and cache 400 is a cloud-based storage system such as an Elastic Block Storage (EBS) service provided by Amazon. In a cloud-based embodiment of the present invention, front-end interface 120 communicates with client computer 200 over a virtual private cloud (VPC) local area network (LAN) using a first storage access protocol, such as inter alia a file access protocol or an object store protocol. In some embodiments of the present invention, front-end interface 120 is a Network File System (NFS) interface. In a cloud-based embodiment of the present invention, back-end interface 130 communicates with cloud storage system 300 using a second storage access protocol. In some embodiments of the present invention, back-end interface 130 is an Amazon S3 interface. Appliance 100 presents virtual computer 200 with an NFS server interface that allows unmodified Portable Operating System Interface (POSIX)-compliant application 220 to read and write native NGS files.
In accordance with an embodiment of the present invention, appliance 100 is a compressed file server that manages a FASTQ-optimized and BAM-optimized, compressed file system on any 3rd party NAS 300.
In accordance with an embodiment of the present invention, appliance 100 operates as a ‘bump-in-the-wire’ on the file services connection between user applications 220 and NAS 300. Appliance 100 may use an NFS back-end interface 130 to manage a compressed file system using dedicated storage capacity on the NAS, and an NFS front-end interface 120 that provides file system services to applications. FASTQ files and BAM files are stored on NAS 300 in compressed format, while user applications read and write the same files in their native format, with appliance 100 performing on-the-fly compression and de-compression. File types other than FASTQ and BAM pass through appliance 100 in their native, unmodified format.
Embodiments of the present invention support NFS version 3. MOUNT procedures are supported over UDP and TCP, and other NFS procedures are supported over TCP.
Provisioning appliance 100 involves (i) configuring the NAS to export to appliance 100 a sub-tree of the file system that is managed by appliance 100; and (ii) configuring client computer 200 to mount the same sub-tree from appliance 100.
The NAS-resident compressed file system is POSIX compliant, and may be architected, provisioned, backed-up and otherwise managed using existing generic tools and best-practices.
Appliance 100 acts as a remote procedure call (RPC) proxy for NFS traffic associated with client computer access to the relevant part of the NAS. Front-end interface 120 terminates NFS-carrying TCP connections, or receives the associated UDP packets from client computer 200. Appliance 100 parses RPC calls and NFS commands to determine which NFS commands may be passed through essentially unmodified and which involve FASTQ or BAM file operations and therefore require modification. Modified or unmodified NFS commands are then forwarded to the NAS 300 through back-end interface 130. The same general procedure is applied to NFS replies in the reverse direction.
Most NFS commands, such as inter alia file operations on non-FASTQ and non-BAM files, pass through appliance 100 with no modification to any of the NFS fields. Changes made to such commands include: (i) at the RPC level, calls are renumbered by overwriting the XID field, with the original values restored in the replies; and (ii) calls to and replies from NAS 300 are carried over a different TCP connection or UDP packet, which have appliance 100 as their source.
Some NFS commands require additional NFS-level changes to the command or to the reply. E.g., a READDIR reply reporting on a directory that contains compressed FASTQ or BAM files is edited to show uncompressed files.
FASTQ or BAM file read or write requests trigger operation of compressor 140 and decompressor 150, which use back-end interface 130 to access NAS 300 in order to compress or decompress the relevant file before servicing the request, as further explained hereinbelow.
In order that appliance 100 be transparent to NGS application 220, appliance 100 must receive native NGS file commands from application 220, and issue commands to storage system 300, adapting them to account for storage system 300 storing encoded files and not native NGS files, receive responses from storage system 300, adapt them, and send native file responses to application 220. Details of adapting specific NGS commands are described hereinbelow with reference to
Front-end interface 120 intercepts NFS READ commands from a FASTQ or BAM file and queues them. Front-end interface 120 notifies compressor 140 and decompressor 150 of the FASTQ or BAM file that is being read, and indicates the range of data in the file that has been requested so far. Decompressor 150 uses back-end interface 130 to read the compressed FASTQ or BAM file, to decompress the data, and to write the result to a native-format FASTQ or BAM file that resides on cache 400 and is one-to-one associated with the compressed file. As decompression progresses, decompressor 150 periodically communicates to front-end interface 120 the range of data that has been decompressed thus far. As READ commands continue to arrive, front-end interface 120 updates decompressor 150 as to new data that is requested. In parallel, when uncompressed file data becomes available for serving queued commands, front-end interface 120 forwards these commands to cache 400 and relays the replies to client computer 200.
READ commands and replies relating to non-FASTQ and non-BAM files are passed transparently by appliance 100.
NFS CREATE and WRITE commands of a FASTQ or BAM file are relayed by front-end interface 120 to cache 400 as commands on a native-format FASTQ or BAM file that is one-to-one associated with the intended compressed file. The end of the file is detected by a time-out since the last WRITE command. On time-out, front-end interface 120 notifies compressor 140, which compresses the cached file to a stable, compressed file in storage system 300.
CREATE and WRITE commands and replies to and from non-FASTQ and non-BAM files are passed transparently by appliance 100.
Uncompressed FASTQ and BAM files—either the input to compressor 140 or the output of decompressor 150—are cached in their native format and used to service subsequent READ commands. Appliance 100 runs a cache management process that keeps the total size of cached files below a configurable limit by deleting least recently accessed files.
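The least-recently-accessed eviction policy may be sketched as follows. This is an illustrative Python sketch: the actual process deletes files from cache 400, and all names here are assumptions:

```python
class NativeFileCache:
    """Keeps the total size of cached native-format files below a
    configurable limit by evicting least recently accessed files."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.files = {}  # handle -> (size_bytes, last_accessed)

    def touch(self, handle, size, when):
        """Record a file's size and its most recent access time."""
        self.files[handle] = (size, when)

    def total_size(self):
        return sum(size for size, _ in self.files.values())

    def evict(self):
        """Delete least recently accessed files until under the limit."""
        evicted = []
        while self.total_size() > self.limit:
            oldest = min(self.files, key=lambda h: self.files[h][1])
            del self.files[oldest]
            evicted.append(oldest)
        return evicted
```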
Reference is made to
Proxy process 111 runs a permanent server thread 116 that listens on the NFS server and mount ports. Server thread 116 creates a new connection thread 117 to handle each incoming client computer connection. Connection threads 117 implement the RPC proxy functions for their associated connections, including communication with client computer 200 and storage system 300 over the incoming and outgoing TCP connections, respectively.
When triggered by client computer NFS commands such as READ or WRITE, requests for file compression or decompression are sent from a connection thread 117, through server thread 116, to compression/decompression process 112.
Compression/decompression process 112 has a permanent master thread 118 that accepts, from server thread 116, requests for file compression and decompression. When receiving a compression or decompression request, master thread 118 creates a new compression/decompression thread 119 that performs compression or decompression of the file.
Within proxy process 111, server thread 116 and connection threads 117 use POSIX message queues to pass information related to operations on FASTQ and BAM files, such as, inter alia, read request data ranges and decompression progress reports.
Inter-process communication between proxy process 111 and compression/decompression process 112 is implemented by proprietary messages passed over Linux FIFOs. When a new FASTQ or BAM file compression or decompression process needs to be started, server thread 116 of proxy process 111 sends an appropriate message to master thread 118 of compression/decompression process 112, which creates a new compression/decompression thread 119 to perform the requested task. From that point on, messages related to compression or decompression are exchanged directly between server thread 116 and the relevant compression/decompression thread 119.
Server thread 116 consolidates READ requests on a per-file basis and communicates the result to a FIFO read by the relevant compression/decompression thread 119. (In distinction, file WRITE only involves one initial request, which triggers creation of compression thread 119.) Indications such as READ progress and WRITE completion are communicated from compression/decompression thread 119 to the server thread's FIFO and from there to all connection threads 117 that have pending requests related to that file.
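The per-file consolidation of READ requests may be illustrated as merging overlapping or adjacent byte ranges. The following Python sketch is an assumption about the form the consolidation takes, not a quotation of the implementation:

```python
def consolidate_ranges(ranges):
    """Merge overlapping or adjacent (offset, length) READ requests into a
    minimal set of non-overlapping ranges, sorted by offset."""
    intervals = sorted((off, off + length) for off, length in ranges)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:  # overlaps or touches previous
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(start, end - start) for start, end in merged]
```

For example, requests for bytes (0, 10), (5, 10) and (30, 5) consolidate into (0, 15) and (30, 5), which is what would be communicated to the relevant compression/decompression thread 119.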
Appliance 100 creates and manages an NFS compliant, compressed file system stored on NAS 300 and cache 400. The compressed file system mirrors the directory structure of the original, uncompressed file system, using the original directory names. Non-FASTQ and non-BAM files are stored in the compressed file system at their original location, in their original name and in their original, unmodified format.
A FASTQ or BAM file is represented in the compressed file system by two files, as follows.
Appliance 100 indexes files and directories based on their NFS file handle. This makes it possible for file and directory identification to be permanent throughout their lifetime, independent of name changes or moves.
Each encoded FASTQ and BAM file has a state associated therewith, from among the five states ‘writing’, ‘compressing’, ‘compressed’, ‘decompressing’, ‘uncompressed’. Reference is made to
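The five-state lifecycle may be sketched as follows. The transition set below is inferred from the surrounding description, and is therefore an assumption; the authoritative transition diagram is the one in the referenced figure:

```python
# Allowed state transitions, inferred from the lifecycle described in the text.
TRANSITIONS = {
    "writing":       {"compressing"},    # WRITE timeout elapses
    "compressing":   {"compressed"},     # compression completes
    "compressed":    {"decompressing"},  # READ requests unavailable data
    "decompressing": {"uncompressed"},   # decompression completes
    "uncompressed":  {"compressed"},     # cached native copy is evicted
}

def transition(state, new_state):
    """Validate and apply a state change for an encoded FASTQ/BAM file."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```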
A FASTQ and BAM table data structure is used to track the state and attributes of cached FASTQ and BAM files. Table lookup is done by the file's NFS handle, through a hash. The table stores the following file information:
Appliance 100 manages files and directories based on their NFS file handle. File and directory path names are useful for administrative purposes. To support these path names, appliance 100 maintains a tree data structure that mirrors the directory and file structure of the NAS-resident file system it manages. Each node in the tree contains the following information:
The tree data structure provides information for all files in the system, including non-FASTQ files. Appliance 100 maintains a look-up table data structure that maps an NFS file handle, through a hash, to a node in the tree data structure.
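Such a handle-keyed look-up, through a hash, may be sketched as follows. This is illustrative only; the choice of hash function and the class and method names are assumptions:

```python
import hashlib

class HandleIndex:
    """Maps an opaque NFS file handle, through a hash, to a node in the
    directory tree data structure."""
    def __init__(self):
        self.table = {}

    @staticmethod
    def _key(handle):
        # NFS handles are opaque byte strings; hash them to a fixed-size key.
        return hashlib.sha1(handle).hexdigest()

    def insert(self, handle, node):
        self.table[self._key(handle)] = node

    def lookup(self, handle):
        return self.table.get(self._key(handle))
```

Because the key is derived from the handle rather than from a path name, the mapping survives file renames and moves, as noted above.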
To maintain file system integrity across system reboots, a non-volatile copy of the FASTQ or BAM table is stored in a NAS-resident file. The table is loaded from the file at boot, and any changes to it during system operation are committed to stable storage as journal entries appended to the file. Following the next system boot, and once all journal entries have been incorporated into the memory-resident table, the updated table is written back to permanent storage as a new version of the file.
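The journaled persistence scheme may be sketched as follows. This simplified sketch treats the entire file as a sequence of journal entries; the actual implementation stores a base copy of the table followed by appended journal entries, and the JSON encoding and field names here are assumptions:

```python
import json

def append_journal_entry(path, entry):
    """Commit a table change to stable storage as an appended journal line."""
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def load_table(path):
    """Rebuild the table at boot by replaying journal entries in order."""
    table = {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("deleted"):
                table.pop(entry["handle"], None)
            else:
                table[entry["handle"]] = entry
    return table
```

After replay, the in-memory table reflects all committed changes, and can be written back as a new base version of the file, as described above.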
The tree data structure and associated look-up table are used for administrative purposes. As such, they are not kept in permanent storage. It is the responsibility of connection threads 117 to update the data structure as they process NFS commands such as LOOKUP and READDIR. Following system boot, an internal NFS client traverses the compressed file system tree, generating in the process NFS commands that, through processing by the respective connection thread 117, drive initial population of the tree.
Reference is made to APPENDIX A, which is a pseudo-code listing of thread processing for server thread 116 (lines 84-118), connection thread 117 (lines 1-83), master thread 118 (lines 119-130) and compression/decompression thread 119 (lines 131-150), in accordance with an embodiment of the present invention.
As indicated at line 5 of APPENDIX A, once created by server thread 116 to serve an NFS client computer connection, a connection thread 117 accepts inputs from (i) the two TCP sockets associated with client computer 200 and storage system 300 connections, and (ii) the message queue used to receive messages from server thread 116.
As indicated at lines 6 and 7 of APPENDIX A, RPC calls, carrying NFS commands, received from client computer 200 are renumbered by overwriting their XID field. As indicated at line 44 and 45 of APPENDIX A, the field is then restored to its original value in replies forwarded back to client computer 200. (Without renumbering, RPC calls arriving from different client computers that carry the same XID could be interpreted by the NFS server as retransmissions of the same call, since they arrive from the same IP address.)
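The XID overwriting may be sketched as follows. The sketch operates on a single RPC message whose XID occupies the first four bytes, big-endian; a real proxy must additionally handle RPC record marking on the TCP stream, which this illustration omits:

```python
import struct

def rewrite_xid(rpc_message, new_xid):
    """Overwrite the XID (first 4 bytes, big-endian) of an RPC message.
    Returns the rewritten message and the original XID, which the proxy
    retains so it can restore the XID in the matching reply."""
    (old_xid,) = struct.unpack(">I", rpc_message[:4])
    return struct.pack(">I", new_xid) + rpc_message[4:], old_xid
```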
Processing of NFS commands and replies received from client computer 200 and storage system 300 connection sockets, respectively, depends on the type of command and type of file addressed, as follows.
As indicated at lines 8-11 of APPENDIX A, for MOUNT commands, the path to the root of the file tree to be mounted is validated against the original file system export as configured into appliance 100. If correct, the command is forwarded to NAS 300. As indicated at lines 46-49 of APPENDIX A, the reply is forwarded to client 200 after recording the file handle of the root of the tree. The handle is used to build the tree data structure.
As indicated at lines 12-14 and 50-52 of APPENDIX A, general file system commands such as FSSTAT and FSINFO and the replies associated with them are passed unmodified by appliance 100.
For NFS commands addressing files, appliance 100 determines through the file name or file handle whether or not it is a FASTQ or BAM file. For commands that address the file by name, e.g., CREATE, the classification is based on the file name's suffix, '.fq', '.fastq' or '.bam' vs. other. For commands that specify the file by handle, the file type is identified by searching the FASTQ or BAM table to determine whether or not the file is listed in it.
As indicated at lines 15-20 and 53-55 of APPENDIX A, commands addressing non-FASTQ or non-BAM files, and the replies associated with them, are passed unmodified by appliance 100. In the process, file and directory names specified by CREATE, RENAME, REMOVE, LOOKUP, READDIR or READDIRPLUS commands are used to update the tree data structure and associated look-up table.
As indicated at lines 21-32 and 56-69 of APPENDIX A, commands addressing directories or FASTQ or BAM files are processed and potentially modified by appliance 100 as follows.
GETATTR (lines 22 and 57 of APPENDIX A): When addressing a FASTQ or BAM file, the command is forwarded to NAS 300, which responds with the attributes of the cached FASTQ or BAM file. Appliance 100 modifies the reply to show: (i) the original uncompressed file size; and (ii) the true last-modified time stamp, disregarding decompression, which re-writes the cached FASTQ or BAM file.
SETATTR (lines 23 and 58 of APPENDIX A): Setting the file size attribute of a FASTQ or BAM file, or any attribute of a compressed file, is not permitted. Commands for other changes are forwarded to NAS 300 unmodified and, if reported in the reply as accepted, are forwarded to client 200 and used to update the FASTQ or BAM table.
LOOKUP (lines 24 and 59 of APPENDIX A): The command is forwarded to NAS 300 and the reply back to client 200. The information in the reply is used to update the tree data structure.
ACCESS, READLINK, SYMLINK (lines 25-27 and 60-62 of APPENDIX A): The command and reply are passed unmodified.
MKDIR (lines 28 and 63 of APPENDIX A): The command and reply are passed unmodified. If the operation is successful, the information in the reply is used to add an entry to the tree data structure.
REMOVE (lines 29 and 64 of APPENDIX A): The command and reply are passed unmodified. If the operation is successful, a remove message is sent to the server thread, serving as a trigger to cleanup.
RMDIR (lines 30 and 65 of APPENDIX A): The command and reply are passed unmodified. If the operation is successful, the associated entry is removed from the tree data structure.
RENAME (lines 31 and 66 of APPENDIX A): The command and reply are passed unmodified. If the operation is successful, the information in the reply is used to update the tree data structure.
READDIR, READDIRPLUS (lines 32, 33 and 67-69 of APPENDIX A): The command is passed unmodified. The reply is modified (i) to show access permissions of 000 for compressed files; and (ii) to show the original uncompressed size and true last-modified timestamp for FASTQ or BAM files.
As indicated at lines 34-43 and 70-76 of APPENDIX A, NFS commands related to the creation, writing to and reading from FASTQ or BAM files involve the following process.
CREATE (lines 34, 70 and 71 of APPENDIX A): Appliance 100 tracks the amount of free space available in cache 400 for uncompressed cached files. If no space is available, the CREATE command is rejected with an out-of-space error code. Otherwise, the command and reply are passed unmodified. If the operation is successful, an entry is added to each of the data structures and a CREATE message is sent to server thread 116.
WRITE, COMMIT (lines 35, 36, 72 and 73 of APPENDIX A): These commands are accepted only when the file is in the ‘writing’ state. In the ‘writing’ state, the commands and replies are passed unmodified and the last-modified timestamp field in the FASTQ or BAM table is updated to support start of compression on timeout.
READ (lines 37-43 of APPENDIX A): When the file is in the ‘writing’, ‘compressing’ or ‘uncompressed’ state, READ commands are passed unmodified. Otherwise, when the file is in the ‘compressed’ or ‘decompressing’ state, connection thread 117 compares the range of file data that is requested by the READ command with the range of data that is available for reading as appears in the FASTQ or BAM table entry for the file. If the requested data is available, the command is forwarded to NAS 300. Otherwise, the command is queued and a decompress message for the file is sent to server thread 116, specifying the new data range requested for reading from the file.
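The availability check and queueing performed by connection thread 117 may be sketched as follows. The names are illustrative, and modeling decompression progress as a single linear "available up to" offset is an assumption about how the available data range is tracked:

```python
class ReadGate:
    """Decides whether a READ can be forwarded immediately or must wait
    for decompression to make the requested byte range available."""
    def __init__(self):
        self.available_up_to = 0  # bytes decompressed so far
        self.pending = []         # queued (offset, length) READ requests

    def on_read(self, offset, length):
        """Forward if the requested range is already available, else queue."""
        if offset + length <= self.available_up_to:
            return "forward"
        self.pending.append((offset, length))
        return "queued"  # caller also sends a decompress message upstream

    def on_progress(self, available_up_to):
        """Release queued READs whose data has become available."""
        self.available_up_to = available_up_to
        ready = [r for r in self.pending if r[0] + r[1] <= available_up_to]
        self.pending = [r for r in self.pending if r not in ready]
        return ready
```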
Connection thread 117 receives, through the POSIX message queue, reports from server thread 116 on the progress of file decompression. Upon receiving such a report, connection thread 117 scans the READ command queue and releases for forwarding to the NFS server requests for data that has become available.
As indicated at line 74 of APPENDIX A, READ replies are forwarded back to client 200. The last-accessed time stamp in the FASTQ or BAM table is updated with the access time to support cache management.
As indicated at line 80 of APPENDIX A, connection thread 117 also accepts ABORT messages from server thread 116, signaling that the file being read has been deleted by another connection thread 117. When an ABORT message is received, connection thread 117 replies to all queued READ commands with an error reply.
Server thread 116 maintains four lists of files that are currently in one of the four transitional states: ‘writing’, ‘compressing’, ‘decompressing’ and ‘uncompressed’. (The stable state is ‘compressed’.) In the ‘decompressing’ list, the entry for each file includes the list of connection threads 117, identified by their message queues, which are waiting to read data from the file.
Server thread 116 receives inputs from (i) the message queue that accepts messages from connection threads 117, and (ii) the FIFO that receives messages from threads 119 of the compression/decompression process 112.
As indicated at lines 89-100 of APPENDIX A, server thread 116 accepts three types of messages from connection threads 117, as follows.
As indicated at lines 101-107 of APPENDIX A, while the ‘writing’ list is not empty, server thread 116 uses an idle timer to periodically scan the list and determine for which files, if any, a suitable time period has elapsed since the last WRITE, and compression should therefore start. For each such file, a COMPRESS request is sent to compression/decompression process 112, and the file is moved to the ‘compressing’ list.
Server thread 116 accepts four types of messages from compression/decompression process 112 through the FIFO, as follows.
As indicated at line 123 of APPENDIX A, master thread 118 of compression/decompression process 112 waits for messages arriving at its input FIFO from server thread 116. As indicated at lines 124-126 of APPENDIX A, two types of messages are accepted; namely, COMPRESS request and DECOMPRESS request. A COMPRESS request specifies the NFS file handle of the cached FASTQ or BAM file. A DECOMPRESS request specifies the full pathname of the compressed FASTQ or BAM file as well as the range of data that should be decompressed in the file.
As indicated at lines 125-127 of APPENDIX A, when a COMPRESS or DECOMPRESS request is received, master thread 118 creates a compression/decompression thread 119 to perform compression/decompression of the file. The newly created thread 119 is provided with an identity, handle or name, of the file to be processed as well as, in the case of decompression, the range of data that should be decompressed in the file.
As indicated at lines 134-140 of APPENDIX A, a compression/decompression thread 119 that was created for compression of a file creates the compressed file and then sends a COMPRESS START report to server thread 116 of proxy process 111. The message includes the name of the thread's input FIFO. When the thread 119 completes compression of the file, it sends a COMPRESS END report to server thread 116.
As indicated at lines 141-149 of APPENDIX A, a compression/decompression thread 119 that was created for decompression of a file sends a DECOMPRESS START report to server thread 116 of proxy process 111. The message includes the name of the thread's input FIFO. In the course of decompression, the compression/decompression thread 119 sends periodic DECOMPRESS reports to server thread 116, reporting on new file data that has been decompressed and is therefore available for reading. A flag in the report indicates when thread 119 has reached the end of the file.
During decompression, compression/decompression thread 119 accepts from server thread 116 DECOMPRESS requests that specify more data to decompress.
Reference is made to
At operation 1005, the system receives a FILE WRITE command from an NGS application, such as inter alia NGS application 220 of
At operation 1035 the system adds the native NGS data specified at operation 1005 to the thus-decompressed last portion. At operation 1040 the system marks the beginning of the last portion of the cached copy as the start of new data. At operation 1045 the system deletes the last portion of the encoded file or data object.
At operation 1050 the system decompresses the native NGS data that was received at operation 1005. Operation 1050 is optional, and is only performed if the native NGS data is in a native format that is compressed, such as the BAM format, and not for native formats that are uncompressed, such as FASTQ. For BAM files, operation 1050 decompresses the ZIP block levels of the native files. At operation 1055 the system compresses the decompressed portion to a temporary buffer. At operation 1060 the system appends the contents of the buffer to the encoded data file or data object in the cache. At operation 1065 the system writes the encoded data file or data object in the cache to the storage system. At operation 1070 the system receives an acknowledgement from the storage system. At operation 1075 the system sends a file write acknowledgement to the NGS application.
If, at decision 1020, the system decides that the last portion of the encoded data file or data object is full, then processing advances directly to operation 1050. If, at decision 1025, the system decides that a copy of the last portion of the encoded file resides in the cache, then processing advances directly to operation 1040.
Reference is made to
At this stage all required portion(s) are in the cache. If the system determines at decision 1130 that all of the required portion(s) are in cache, then the method advances directly to operation 1180. At operation 1180 the system reads the requested native NGS data from the cache. At operation 1190 the system sends the requested native NGS data to the NGS application.
Reference is made to
Reference is made to
Reference is made to
Reference is made to
Reference is made to
Reference is made to
Reference is made to
Reference is made to
Reference is made to
Reference is made to
Although
Referring back to
The input file may contain consecutive, complete FASTQ records. A FASTQ record has four fields, each occupying one line of text. The fields appear in the following order: (1) identifier, (2) read, (3) +identifier, and (4) quality scores. The lines are terminated by a line-feed character, without ‘carriage return’. Records are similarly separated by a ‘line feed’.
Identifiers start with an '@' character, followed by up to 255 printable ASCII characters. Compression is optimized for identifiers that are composed of tokens separated by non-alphanumerical characters such as ' ' (space) and ':' (colon). Each token is either alphanumerical or numerical.
Read lines contain a combination of the base-representing characters ‘A’, ‘C’, ‘G’, ‘T’ and ‘N’ (unidentified). Any of the characters in a read may be either lower case or upper case. Preferably, at decompression all bases are converted to upper case. Read lines may be up to 4095 bases long. Read lines may vary in length.
The third line of the FASTQ record consists of a ‘+’ character, optionally followed by an identifier. The identifier, if present, must be identical to the identifier in the first line.
The quality score line must be equal in length to the read line.
Quality score lines contain ASCII characters with a numerical value that is greater than or equal to 33 (the ‘!’ character) and smaller than or equal to 74 (the ‘J’ character).
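Taken together, the constraints above can be expressed as a record validator. The following Python sketch is illustrative only and not part of the specification:

```python
def validate_fastq_record(lines):
    """Check one four-line FASTQ record against the stated constraints.
    Returns None if valid, else a string naming the first violation."""
    if len(lines) != 4:
        return "record must have exactly four lines"
    ident, read, plus, qual = lines
    # Identifier: '@' plus up to 255 printable ASCII characters.
    if not ident.startswith("@") or len(ident) > 256:
        return "bad identifier line"
    # Read: 1..4095 characters from A, C, G, T, N (either case).
    if not (0 < len(read) <= 4095) or set(read.upper()) - set("ACGTN"):
        return "bad read line"
    # Third line: '+', optionally followed by the same identifier.
    if not plus.startswith("+") or (len(plus) > 1 and plus[1:] != ident[1:]):
        return "bad '+' line"
    # Quality scores: same length as the read, characters '!' (33) to 'J' (74).
    if len(qual) != len(read) or any(not (33 <= ord(c) <= 74) for c in qual):
        return "bad quality line"
    return None
```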
The compression of reads is assisted by a reference genome. Appliance 100 may include a human genome (hg19) as reference.
Each input FASTQ file is compressed to a single compressed output file.
The output file is a binary file. The output file starts with a header that consists of a sequence of type-length-value fields that provide the following information.
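Whatever the specific header fields, a type-length-value encoding can be sketched as follows. The field widths chosen here (2-byte type, 4-byte length, big-endian) are assumptions for illustration and may differ from the actual format:

```python
import struct

def encode_tlv(fields):
    """Encode (type, value-bytes) pairs as type(2B), length(4B), value."""
    out = b""
    for ftype, value in fields:
        out += struct.pack(">HI", ftype, len(value)) + value
    return out

def decode_tlv(data):
    """Decode a concatenation of type-length-value fields."""
    fields, pos = [], 0
    while pos < len(data):
        ftype, length = struct.unpack_from(">HI", data, pos)
        pos += 6
        fields.append((ftype, data[pos:pos + length]))
        pos += length
    return fields
```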
Compressor 140 receives as input a native NGS data file or data object from NGS application 220, and generates as output a portion of an encoded data file for storage in data storage 320. Decompressor 150 receives as input a portion of an encoded data file from data storage 320 and generates as output a native NGS data file or data object.
Reference is made to
Operations 2250 and 2260 are performed by a data decompressor, such as decompressor 150 of
Reference is made to
Operations 2360 and 2370 are performed by a data decompressor, such as decompressor 150 of
Reference is made to
If at decision 2415 or at decision 2425 the compressor decides that the mapping is successful, then at operation 2435 the compressor encodes the read by location within the reference genome and by differences in the read vis-à-vis the reference genome, and the compressed data is written to an encoded data file or data object.
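Encoding a read by location and differences can be sketched as follows. This illustration handles base substitutions only; a real encoder must also represent insertions, deletions and clipped bases:

```python
def encode_read(read, reference, position):
    """Encode a mapped read as its position in the reference plus a list of
    (offset, base) substitutions relative to the reference."""
    segment = reference[position:position + len(read)]
    diffs = [(i, b) for i, (b, r) in enumerate(zip(read, segment)) if b != r]
    return position, len(read), diffs

def decode_read(reference, position, length, diffs):
    """Reconstruct the read from the reference and the recorded differences."""
    bases = list(reference[position:position + length])
    for i, b in diffs:
        bases[i] = b
    return "".join(bases)
```

When the read matches the reference well, the diff list is short, so storing (position, length, diffs) is far more compact than storing the bases themselves.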
Operations 2440-2455 are performed by a data decompressor, such as decompressor 150 of
If, at decision 2440, the decompressor decides that the read was not compressed using the reference genome, then at operation 2455 the decompressor decompresses the read without use of the reference genome, and outputs the decompressed data to the native NGS file or data object.
Reference is made to
Operation 2530 is performed by a data decompressor, such as decompressor 150 of
Reference is made to
At operation 2620 the compressor re-orders the portions so as to improve the compression ratio that can be obtained for one or more of the portions. At operation 2630 the compressor compresses each portion and writes the compressed data to an encoded data file or data object.
Operations 2640-2670 are performed by a data decompressor, such as decompressor 150 of
Reference is made to
Operations 2750-2770 are performed by a data decompressor, such as decompressor 150 of
Reference is made to
Operations 2850-2890 are performed by a data decompressor, such as decompressor 150 of
Reference is made to
Operations 2910-2940 are performed by a data compressor, such as compressor 140 of
Operation 2950 is performed by a data decompressor, such as decompressor 150 of
The appliance of
Reference is made to
In embodiments of the present invention, appliance 500 has the specifications provided in TABLE I hereinbelow. Front panel 520 includes the interfaces provided in TABLE II hereinbelow. Back panel 540 includes the interfaces provided in TABLE III hereinbelow.
Appliance 500 has four network interfaces, as follows: two 100 Mbps/1 Gbps/10 Gbps Ethernet SFP+ interfaces on the back panel named eth0 and eth1, respectively; and
In a typical installation
Reference is made to
E.g., the two files defining eth0 and its alias eth0:0 may be as follows:
The Privileged Address for eth0
Appliance 500 may be configured to mount the NAS compressed file system by adding the following line to /etc/fstab:
The appliance 500 proxy may be configured using a configuration file similar to the example provided below. Lines starting with ‘#’ are comments. Other lines consist of space-separated name-value pairs.
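A parser for such a configuration file can be sketched as follows; the parameter names in the test are hypothetical, since the actual parameter set is given in the example configuration referenced below:

```python
def parse_config(text):
    """Parse the proxy configuration: lines starting with '#' are comments,
    other non-empty lines are space-separated name-value pairs."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        config[name] = value.strip()
    return config
```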
Reference is made to
The file list in
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made to the specific exemplary embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
This application is a national phase entry of international application PCT/IL2016/050455, entitled STORAGE, TRANSFER AND COMPRESSION OF NEXT GENERATION SEQUENCING DATA, filed on May 2, 2016 by inventors Dan Sade, Shai Lubliner, Arie Keshet, Eran Segal and Itay Sela. PCT/IL2016/050455 claims benefit of U.S. Provisional Application No. 62/164,611, entitled COMPRESSION OF GENOMICS FILES, and filed on May 21, 2015 by Shai Lubliner, Arie Keshet and Eran Segal, the contents of which are hereby incorporated herein in their entirety. PCT/IL2016/050455 claims benefit of U.S. Provisional Application No. 62/164,651, entitled STORAGE OF COMPRESSED GENOMICS FILES, and filed on May 21, 2015 by inventors Danny Sade and Arie Keshet, the contents of which are hereby incorporated herein in their entirety.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/IL2016/050455 | 5/2/2016 | WO | 00
Number | Date | Country
---|---|---
62164651 | May 2015 | US
62164611 | May 2015 | US