The present invention is in the field of data storage technologies and is particularly related to molecular data storage systems and methods, such as DNA based data storage.
In recent years, various DNA based data storage systems have been developed. Such systems are advantageous because of the remarkable data density and long-term stability of DNA. The first demonstrations of DNA based data storage, on a megabyte scale, were reported in 2012 in two independent studies [1], [2]. In a recent work, the Shannon information capacity of DNA was demonstrated, using fountain code error correction, to be ~1.57 bits per synthesized position [3].
References considered to be relevant as background to the presently disclosed subject matter are listed below:
Acknowledgement of the above references herein is not to be inferred as meaning that these are in any way relevant to the patentability of the presently disclosed subject matter.
There is a need in the art for a novel approach to molecular based data storage techniques, e.g. DNA-based storage systems, with improved data storage capacity/density.
Indeed, current DNA synthesis and sequencing technologies process large numbers of nominally identical molecules in parallel [4], [5], which leads to significant information redundancy that is inherent in current DNA based storage schemes.
The present invention utilizes a composite letter alphabet approach, in which each letter (composite letter) is defined by a predetermined mixture of molecular bases (i.e. mixture of one or more basic molecular building blocks which are used in the data storage or tagging/marking/labeling agent, e.g. mixture of DNA base types) and thereby leverages this information redundancy and enables higher information capacity. The invention is based on the inventors' understanding of the mathematical properties of composite DNA letters and how this can be used for a coding scheme to be used in composite DNA based storage systems. The principles of the invention are not limited to DNA-based storage systems, but are relevant for any molecular-sequence-based storage system and can be implemented by utilizing the composite letter alphabet approach with any suitable types of molecular bases to form a molecular-sequence-based storage system. Storage should also be understood to include using the synthesized material for tagging/marking and labeling purposes. To this end, the material composition (the population of molecules) used for storing data, according to the present invention, may also be used as a tagging/marking/labeling agent, namely as a data carrying marking/identification composition which may be included in, or used as, a tag, marker or label.
For clarity, it should be noted that the terms bases and monomers are used herein interchangeably to designate the basic molecular units/building-blocks used in the molecular-sequence-based storage system of the present invention. Sequences of such basic molecular units/building-blocks (e.g. sequences of monomers), which are also referred to herein interchangeably and without loss of generality as polymers, are used for storing information in the molecular-sequence-based storage system. In this connection it should be noted that data is encoded in the data storage by the arrangement of monomers (i.e. basic molecular building-blocks/units) in the molecular sequences/strands. To this end, it should be understood that the molecular strand/sequence as used herein designates a molecule formed as, or at least including in its data storage section, a chain or string of the predetermined basic molecular building-blocks which are used by the storage system of the present invention for storing data. Accordingly, the following phrases may be used herein interchangeably to designate the molecular sequences/strands whose populations are used according to the present invention for storing data: molecular sequences, molecular strands, molecular chains, molecular strings. Additionally, these molecular sequences/strands are also sometimes referred to herein as polymers, and it should be understood that the term polymer used in this connection is not limited to a repeating sequence of building-blocks/monomers, but designates any sequence of building-blocks/monomers; in fact, the order of the building-block types in the sequence may directly or statistically encode information. To this end, the term polymers used herein should be interpreted broadly as designating any sequence of basic molecular units/building-blocks (monomers) and not only repeated/periodic sequences.
Accordingly, the term monomers should be interpreted broadly as designating basic molecular units/building blocks which are not necessarily repeated in any periodicity in the molecular strands/sequences (e.g. polymers) of the data storage.
Some embodiments of the storage system of the invention are more specifically described/exemplified here with respect to molecular sequences/strands (e.g. hereinafter also polymers) constructed of the A, G, C, T DNA nucleotides, which serve here as the basic molecular units/building-blocks (e.g. hereinafter for short monomers) of the molecular-sequences/polymers of the data storage system.
Other embodiments of the storage system of the invention are more specifically described/exemplified here with respect to molecular strands/sequences (e.g. hereinafter referred to as polymers) constructed of basic molecular units/building-blocks (e.g. monomers), which may be formed as short oligomers (i.e. oligos) of preselected compositions of basic units (e.g. nucleotides or other basic units/elements/bases). For instance such oligos of preselected compositions may be characterized by certain predetermined number/length and order of the basic units, which may be for instance A, G, C, T nucleotides. For instance, the basic building-blocks/monomers may be preselected oligos of length 3 such as: A-G-T; A-T-C; G-A-C; G-C-T; T-G-A; T-C-A; C-A-T; and C-A-G.
It should be understood that the invention is not limited to these specific sets of basic building block types (e.g. monomer types) nor to DNA/RNA chemistry or nucleotides, and can be implemented with building blocks/monomers constructed or formed with other types of chemical moieties (e.g. other types of, possibly synthetic, “nucleotides” as well as motif types such as trinucleotides, dinucleotides, pentanucleotides, 20-mers). In this regard, it should be noted that in order to enable efficient writing (synthesis) and reading (sequencing) of a data encoding population of molecular sequences/strands, the basic molecular building-blocks which are used by the technique of the present invention are relatively short k-mers (oligomers) having a limited number of bases/basic-units/elements, typically no more than a few tens of basic units, preferably ranging from a single basic unit up to not more than 20 bases/basic units/elements (e.g. k-mers in which k is within the range from 1 to 20), and even more preferably having only a few bases/basic-units (i.e. fewer than ten), such as 3 basic units. This is because longer oligonucleotides lead to longer, unnecessary sequencing and complicate the synthesis process and its stability. To this end, in some implementations it may be most convenient to implement the present invention with basic molecular building-blocks formed as single-mers (such as the A, G, C, T nucleotides/nucleobases of DNA, the A, G, C, U nucleotides/nucleobases of RNA, or other chemistry, e.g. of synthetic polymers). Alternatively, in some embodiments improved resilience and error correction can be achieved with basic molecular building-blocks constructed as short multi-mers, preferably with not more than 20 bases/basic units/elements (in this case the use of short tri-mers, i.e. having 3 bases, e.g. trinucleotides, may be preferred for practical reasons).
In this regard, a conceptual distinction should be clarified between the term basic molecular building-blocks (also indicated as building-blocks), which is used herein to designate the basic building-block unit used to encode data in the molecular data storage of the present invention, and the terms bases/basic-units/elements, which are used herein merely to exemplify that the basic molecular building-blocks used by the technique of the present invention may be predetermined types of short k-mers/oligos formed with a predetermined number of bases/basic-units/elements. In this connection, it should also be noted that according to the present invention all the basic molecular building-blocks used (i.e. all the types of basic molecular building-blocks used for encoding data in a given population of a given molecular data storage system constructed according to the invention) have the same number of bases/basic-units/elements, which, as stated above, is small and is preferably between 1 and 20 bases, and even more preferably only a few bases, e.g. 3.
The inventors performed a molecular proof of concept implementation demonstrating the feasibility of data storage systems based on the composite alphabet approach. Performance parameters obtained via analysis of small scale experimental results, in which the DNA nucleotides were used as the basic building blocks for a composite alphabet of 15 letters, demonstrated that the system of the invention can achieve an information capacity of ~4.3 bits per synthesized position using the DNA building blocks. This presents a significant improvement as compared to state of the art molecular-based, as well as magnetic media, storage systems. It should also be understood that this result was achieved utilizing an alphabet of only 15 letters, and that higher data densities may be achieved with composite alphabets of higher resolutions (a larger alphabet having more letters).
The composite alphabet approach can for example be incorporated into and combined with existing DNA based storage and tagging/labeling schemes. Due to significant DNA synthesis vs. sequencing cost differences this leads to substantial potential cost reduction.
To this end, as clarified in more detail below, a composite letter is a representation of a position in a sequence that constitutes a mixture of one or more types of molecular building-blocks used in molecular data storage. For instance, in DNA based molecular data storage, according to the present invention, the composite letter is a composite DNA letter σ defined by a pre-determined ratio (also termed herein probability-vector or frequency-vector) σ=(PA,PC,PG,PT) of the standard monomer/base types of DNA (A, C, G and T DNA nucleotides). Writing a composite DNA letter at a given position of a DNA sequence is equivalent to producing (synthesizing) multiple copies (oligonucleotides) of the sequence so that in this given position the different DNA nucleotides are distributed across the synthesized copies according to the probability-vector (frequency-vector) of the respective letter σ.
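By way of a non-limiting illustration only, writing a composite letter can be simulated in software as follows. This is a minimal sketch and not the synthesis chemistry itself: the function name `synthesize_population`, the example sequence and the number of copies are hypothetical, and each position of each copy is assumed to be drawn independently from the probability vector of the letter at that position.

```python
import random

BASES = "ACGT"  # order matches the vector (PA, PC, PG, PT)

def synthesize_population(composite_seq, copies, rng):
    """Simulate `copies` strands of a composite sequence.

    composite_seq is a list of probability vectors, one per position.
    Assumption: positions and copies are sampled independently.
    """
    strands = []
    for _ in range(copies):
        strand = "".join(
            rng.choices(BASES, weights=sigma, k=1)[0] for sigma in composite_seq
        )
        strands.append(strand)
    return strands

# Hypothetical three-position sequence: a simple letter A, then two
# composite letters (50% A / 50% C, and 50% G / 50% T).
rng = random.Random(0)
seq = [(1, 0, 0, 0), (0.5, 0.5, 0, 0), (0, 0, 0.5, 0.5)]
population = synthesize_population(seq, copies=1000, rng=rng)
print(population[:5])
```

In this sketch no single strand carries the composite letters; they are only recoverable from the statistics of the population as a whole.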
The present invention provides a novel technique (system and methods) for synthesizing populations of molecular strands/sequences, each population being designed/configured for representing/encoding a sequence of composite letters by which data is encoded (e.g. a single sequence). In this connection, the phrase sequence of composite letters by which the data is encoded, according to the technique of the present invention, should be understood as any sequence that essentially includes composite letters, and may or may not include simple letters (according to the definition of these terms below). For that matter the alphabet used for encoding the data may include only composite letters, or may also include simple letters in addition to the composite ones.
To this end, a distinction should be made between the sequence of composite letters which is represented by each population of the molecules, and the molecular strands/sequences of the population which includes respective chains/strings of the basic molecular building blocks. It should be understood that the sequence of composite letters is determined-by/associated-with the statistics of the types of basic building-blocks arranged in the molecular strands/sequences of the population. In other words each molecular strand/sequence of basic building-blocks does not by itself designate the sequence of letters, but each letter in the sequence of letters is determined-by/associated-with the statistics of the types of basic building blocks at a corresponding position in the strands of the plurality of the molecular strand/sequence.
The synthesis of each such population may be implemented by utilizing various technologies of molecular strand/sequence synthesis (e.g. [5]) while implementing suitable modifications/adaptations to such technologies, as described in detail below, in order to enable production of populations of molecular strands/sequences that define sequences of composite letters by which the data is encoded.
Reading a composite letter may be achieved by the following steps: (a) sequencing of multiple independent molecules of a population representing the certain composite letter sequence; (b) per each position in the sequenced independent molecules, determining the occurrence, e.g. the probability/frequency of occurrence, of each type of basic molecular building block in that position in the set of multiple independent molecules of the population which has been sequenced; and (c) inferring the encoded composite letter in each such position by matching the observed occurrences, e.g. the probabilities/frequencies of occurrences of the different basic molecular building block in that position, with the original ratio or composition of basic building blocks (i.e. the occurrence vectors or the probability/frequency vectors) which define the composite letters of the alphabet, and inferring the encoded composite letter in that position by determining such a match (e.g. best match). Accordingly by repeating (b) and (c) for all valid positions in the multiple independent molecules from a given population, a sequence of encoded composite letters in the population is determined (here the term valid positions designates the positions in the molecules of the population in which data is presumably encoded).
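Steps (b) and (c) above can be sketched, for illustration only, as follows. The three-letter alphabet, the letter names and the example reads are hypothetical, and the best match is taken here, as one possible choice, to be the letter with the smallest L1 distance to the observed frequencies.

```python
from collections import Counter

BASES = "ACGT"

# Hypothetical alphabet: each letter is a vector (pA, pC, pG, pT).
ALPHABET = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "M": (0.5, 0.5, 0.0, 0.0),  # composite letter: equal mix of A and C
}

def observed_frequencies(reads, position):
    """Step (b): frequency of each base type at one position."""
    counts = Counter(read[position] for read in reads)
    return tuple(counts[b] / len(reads) for b in BASES)

def infer_letter(freqs):
    """Step (c): alphabet letter whose vector is closest (L1) to freqs."""
    return min(
        ALPHABET,
        key=lambda name: sum(abs(p - x) for p, x in zip(ALPHABET[name], freqs)),
    )

def read_sequence(reads):
    """Repeat (b) and (c) for every position of equal-length reads."""
    return [
        infer_letter(observed_frequencies(reads, k)) for k in range(len(reads[0]))
    ]

# Eight hypothetical reads: position 0 is always A, position 1 is half A, half C.
reads = ["AA", "AC", "AC", "AA", "AC", "AA", "AA", "AC"]
decoded = read_sequence(reads)
print(decoded)
```

Here all reads are assumed to be aligned and of equal length, so every position counts as a valid position.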
In some implementations, the sequencing itself may be implemented utilizing conventional sequencing technologies (e.g. standard DNA sequencing as in [4]). A novel technique of the invention provides for inferring the original letter encoded in the molecular data storage, based on an observed vector indicative of whether basic building-blocks of certain types occur at the position along the strands of the molecular population at which the letter is encoded (the vector being in that case an occurrence vector). Preferably, in some embodiments, the observed vector indicates the probability/frequency (frequency of occurrence) of the basic building-blocks of different types at that position along the strands of the molecular population at which the letter is encoded.
To this end, the use/introduction of composite letters to the molecular/DNA based data storage extends the available alphabet (i.e. beyond the number of different types of basic molecular building-blocks), and thus allows the coding of longer messages within a fixed synthesized molecule length.
To correctly read a message coded using composite DNA letters, one needs to infer the original composite letter in every position of the sequence from the observed reads. The sequencing readout (i.e. the observed sequencing reads) is the product of a complex process, consisting of DNA synthesis, long term storage, sampling, and DNA sequencing. While each step introduces different errors and biases, the most significant parameters that affect the readout are the sampling of molecules to be sequenced, and the sequencing depth. The process can be exemplified by a single model in which the readout result is a multinomial random variable:
$$X^{(N)}(\sigma, wErr, dErr, sErr, iErr) \sim \mathrm{Mult}\left(N, \left(p_A(\sigma), p_C(\sigma), p_T(\sigma), p_G(\sigma)\right)\right)$$
The parameters of the distribution are the designed input letter σ, the sequencing depth N, and the errors introduced in the synthesis (wErr), storage (dErr), sequencing (sErr), and inference (iErr) steps of the process.
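The multinomial readout model can be illustrated with the following minimal sketch. The way the errors change the base probabilities is not specified above, so a single hypothetical parameter `eps` is assumed here, which mixes the designed vector with a uniform distribution; the function names are likewise illustrative.

```python
import random
from collections import Counter

def effective_probs(sigma, eps):
    """Assumed error model: with probability eps a read shows a uniformly
    random base type instead of one drawn from the designed letter."""
    z = len(sigma)
    return [(1 - eps) * p + eps / z for p in sigma]

def sample_readout(sigma, depth, eps, rng):
    """Draw one multinomial readout: counts of each base type in `depth` reads."""
    probs = effective_probs(sigma, eps)
    draws = rng.choices(range(len(sigma)), weights=probs, k=depth)
    counts = Counter(draws)
    return [counts[i] for i in range(len(sigma))]

# Hypothetical composite letter (equal mix of the first two base types),
# read at depth N=100 with a 1% assumed error rate.
rng = random.Random(0)
readout = sample_readout((0.5, 0.5, 0.0, 0.0), depth=100, eps=0.01, rng=rng)
print(readout, sum(readout))
```

Dividing the counts by the depth gives the observed frequencies that are compared against the alphabet during inference.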
The sequencing readout probabilities/frequencies (also referred to herein below as observed probability vectors) will most likely not exactly match any letter from the original alphabet. Inference of the original letter is performed by converting the readout to a vector of base frequencies (also referred to hereinbelow as probability vector) and comparing it to the base frequencies of the candidate letters in the composite alphabet.
The comparison can be done, for example, using the Kullback-Leibler divergence (KL) or the Lp norm, such as L1 norm. To assess the performance of the inference step, the inventors developed a simulation model and analyzed the inference rate of the two inference methods on various composite alphabets. In some implementations, the KL divergence, which corresponds to a maximum likelihood estimator, was found to be advantageous.
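The two comparison measures can be sketched as follows, for illustration only. The alphabet and letter names are hypothetical. The KL divergence is computed here in the direction KL(X‖σ), which is the direction that coincides with the multinomial maximum likelihood choice, and a small floor `eps` on the design probabilities is an added assumption to keep the value finite when a base with zero design probability is observed.

```python
import math

def l1_distance(sigma, x):
    """L1 norm between a design vector sigma and an observed vector x."""
    return sum(abs(p - q) for p, q in zip(sigma, x))

def kl_divergence(sigma, x, eps=1e-9):
    """KL(x || sigma); sigma entries are floored at eps (an assumption)."""
    total = 0.0
    for p, q in zip(sigma, x):
        if q > 0:
            total += q * math.log(q / max(p, eps))
    return total

def infer(x, alphabet, divergence):
    """Return the name of the letter minimizing divergence(sigma, x)."""
    return min(alphabet, key=lambda name: divergence(alphabet[name], x))

# Hypothetical alphabet over (pA, pC, pG, pT).
ALPHABET = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "A3C1": (0.75, 0.25, 0.0, 0.0),
    "A1C1": (0.5, 0.5, 0.0, 0.0),
}

x = (0.9, 0.1, 0.0, 0.0)  # observed frequencies at one position
print(infer(x, ALPHABET, l1_distance))
print(infer(x, ALPHABET, kl_divergence))
```

For this observation the two measures disagree: the L1 norm selects the simple letter A, while the KL divergence penalizes the observed C reads, which have zero design probability under A, and selects the 3:1 composite letter.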
It should be understood that the composite alphabet approach of the present invention can generally be combined with other coding schemes, thus providing an even greater benefit. To demonstrate this, the DNA fountain code system [3] was modified to support sequences of composite alphabet letters (in this case, an alphabet of composite letters based on DNA monomers/nucleotides was used), thus creating a composite DNA fountain system. The inventors successfully encoded the same data file of 2,116,608 bytes used in [3] by using composite DNA fountains of resolutions R=2,4,6,8,10 while keeping all other parameters similar. Then, the inventors simulated the synthesis and sequencing of the designed composite DNA sequences and decoded the original message by using the composite DNA fountain decoding pipeline developed for this purpose by the inventors. A ~3.7-fold increase was evident in the estimated data density per synthesized position for resolution R=10. In practical terms, this implies a reduction in the number of required DNA oligonucleotides from 72,000 (reported in [3]) to 19,300.
DNA based storage systems are limited by chemical constraints of DNA synthesis, storage and sequencing. Conventional techniques deal with these limitations either by employing strict encoding schemes [1], [2], [6] or by using complex coding methodology such as DNA fountains to handle sequence dropout [3]. However, conventional coding schemes/methodologies result in lowering the data storage density/capacity of the DNA based storage.
This is solved according to the technique of the present invention by the use of composite alphabet letters constructed and defined by the occurrence (e.g. frequency/probability of occurrence) of different types of basic building-blocks/monomers in each position of the molecular-strands/polymers/DNA-strands of the data storage system (or of a molecular population within it), by which a composite letter is encoded. Employing a composite alphabet inherently generates balanced molecular strands/sequences, without prevalent biases such as too many Gs or long homopolymers—consecutive occurrences of the same letter (in the context of balanced DNA molecules), resulting from the combinatorial space associated with every designed composite sequence. While unwanted sequences might unavoidably be part of the synthesized molecules, the inherent independence of the different positions renders them negligible, representing an extra benefit of the composite alphabet approach.
Thus according to a first embodiment (embodiment 1) and broad aspect of the present invention, there is provided a data storage system including at least one population of molecular sequences (e.g. molecular strands) defining at least one respective data-block encoding data in the data storage system. The molecular sequences/strands are formed with strings/chains of basic molecular building-blocks including a number Z of different types of basic molecular building-blocks {En}|n=1 to Z, by which data of the data-block is encoded. The data of the data-block is encoded as a sequence S=(π1, π2, . . . , πk . . . , πK-1, πK) of encoded letters {πk} associated with an alphabet Σ whereby the encoded letters {πk} are encoded by the types of basic molecular building-blocks appearing at k respective locations along storage segments of the molecular sequences of the at least one population. The data storage system is characterized in that the alphabet Σ has a size M that is strictly greater than the number Z of different types of basic molecular building-blocks used in the at least one population (M>Z). Each alphabet letter σm in the alphabet Σ≡{σm}|m=1 to M is associated with a vector {Pmn}|n=1 to Z whereby Pmn is indicative of occurrences of basic molecular building-block En of type n in the alphabet letter σm. Accordingly, each encoded letter πk, which is encoded at that location k in the storage segments of molecular sequences of the data-block, can be mapped (by using the vectors {Pmn} of the alphabet letters {σm}) to a corresponding alphabet letter σm, by determining a match between the occurrence of basic molecular building-blocks of different types at the locations k of the molecular sequences of said population, with the vector {Pmn}|n=1 to Z associated with the alphabet letter σm.
Certain embodiments (embodiment 2) of the present invention incorporate the features of the above described embodiment 1 and further include the following: the vector {Pmn}|n=1 to Z is a probability vector defining the alphabet letter σm. In such embodiments Pmn indicates a probability that a basic molecular building-block En of type n, 1≤n≤Z, appears at the location k of the storage segment of a molecular strand of said at least one population in case the letter πk which is encoded at that location k corresponds to the alphabet letter σm.
Certain embodiments (embodiment 3) of the present invention incorporate the features of the above described embodiment 2 and further include an alphabet whose size |Σ| (the number of letters) is given by
whereby Z is the number of distinct types of said basic molecular building-blocks and R is a resolution parameter indicative of an identifiable resolution of the probability (resolution of the identifiable content percentage) at which basic molecular building-blocks of a certain type appear in each location k along the storage segments of the plurality of molecular sequences of the population. In this regard the resolution R may be for example defined as one over the minimum absolute difference between probabilities Pmn of the basic molecular building-blocks in the definition of the letters {σm}|m=1 to M of the alphabet Σ, such that R≡1/Minn,m1,m2 [(Abs(Pm1n−Pm2n))] for any type of basic molecular building-block indexed 1≤n≤Z, and any pair of distinct letters m1≠m2.
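The example definition of the resolution R can be illustrated with the following minimal sketch. The alphabet shown is hypothetical, and the minimum is assumed here to be taken over non-zero differences only, since two distinct letters may share the same probability for a given building-block type.

```python
from fractions import Fraction
from itertools import combinations

def resolution(alphabet):
    """R = 1 / (smallest non-zero |Pm1n - Pm2n|) over all types n and
    all pairs of distinct letters m1, m2. Exact arithmetic via Fraction."""
    diffs = set()
    for v1, v2 in combinations(alphabet.values(), 2):
        for p1, p2 in zip(v1, v2):
            d = abs(Fraction(p1) - Fraction(p2))
            if d > 0:  # assumption: zero differences are ignored
                diffs.add(d)
    return 1 / min(diffs)

# Hypothetical alphabet over Z=4 types with probabilities in steps of 1/2.
half = Fraction(1, 2)
ALPHABET = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "M": (half, half, 0, 0),
}

print(resolution(ALPHABET))
```

For this alphabet the smallest non-zero difference is 1/2, so the sketch yields R=2.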
Certain embodiments (embodiment 4) of the data storage system of the present invention incorporate the features of the above described embodiments 2 or 3 and are adapted to being read with N fold nominal sequencing depth or higher. Each encoded letter πk that is being read from the position k is represented by an observed probability vector Xk={xk(En)/N}|n=1 to Z whereby xk(En) is the number of times the basic molecular building-block of type En was read in the location k out of the N fold sequencing depth. The observed probability vector Xk is thus indicative of an observed probability that the basic molecular building-blocks of type En|n=1 to Z appear in the location k.
In certain embodiments (embodiment 5) of the data storage system of the present invention which incorporate the features of embodiment 4, the resolution R of the alphabet is a function of the sequencing depth N by which reading the information stored in the data storage system is intended (there is a positive correlation between R and N). The resolution R of the alphabet may also be a function of the Inference Error iErr which is indicative of a desired probability of wrongly associating the observed probability vector Xk to one of the alphabet letters σk being read from the location k with a negative correlation between R and iErr.
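The dependence of the inference error on the sequencing depth can be illustrated by a small Monte Carlo sketch. This is an illustration only: it uses a hypothetical alphabet over Z=2 building-block types with resolution R=4, models sampling noise alone (no synthesis, storage or sequencing errors), and infers letters by the smallest L1 distance.

```python
import random

# Hypothetical alphabet: (p1, p2) in steps of 1/4, i.e. resolution R=4.
ALPHABET = [(i / 4, 1 - i / 4) for i in range(5)]

def infer(freqs):
    """Letter with the smallest L1 distance to the observed frequencies."""
    return min(ALPHABET, key=lambda s: sum(abs(p - x) for p, x in zip(s, freqs)))

def estimate_inference_error(sigma, depth, trials, rng):
    """Fraction of simulated readouts at the given depth that are
    inferred as a letter other than the designed letter sigma."""
    errors = 0
    for _ in range(trials):
        hits = sum(rng.random() < sigma[0] for _ in range(depth))
        freqs = (hits / depth, 1 - hits / depth)
        if infer(freqs) != sigma:
            errors += 1
    return errors / trials

rng = random.Random(0)
sigma = (0.5, 0.5)
low_depth_error = estimate_inference_error(sigma, depth=4, trials=500, rng=rng)
high_depth_error = estimate_inference_error(sigma, depth=200, trials=500, rng=rng)
print(low_depth_error, high_depth_error)
```

In this sketch the estimated error at the larger depth is much lower than at the smaller depth, consistent with the statement that a finer resolution R calls for a larger N when a given iErr is to be maintained.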
To this end, it should be noted that in some cases the Inference Error iErr may be composed of:
In certain embodiments (embodiment 6) of the data storage system of the present invention, which incorporate the features of above indicated embodiment 4 or 5, the mapping between an observed probability vector Xk at the location k and the inferred alphabet letter πk is performed by determining an alphabet letter σk satisfying a minimum divergence from the observed probability vector Xk, σk=ArgMin[{σm}|m=1 to M|D (σm, Xk)], where D is a divergence function.
In this connection, the divergence function D (σm, Xk) may be for instance an Lp distance function. Alternatively or additionally, the divergence function D (σm, Xk) may be a Euclidean distance D (σm, Xk)=∥σm−Xk∥. Alternatively or additionally, the divergence function D (σm, Xk) may be the Kullback-Leibler divergence D (σm, Xk)=KL (σm, Xk).
In certain embodiments (embodiment 7) of the data storage system of the present invention, which incorporate the features of any of the above indicated embodiments 1 to 6, the data storage system may include a plurality of populations of the molecular strands/sequences defining a respective plurality of data-blocks encoding data in the data storage system.
In certain embodiments (embodiment 8) of the data storage system of the present invention, which incorporate the features of embodiment 7, each molecular sequence of the molecular sequences includes a population identification segment which includes an identifying sequence of molecular building-blocks. The identifying sequence is indicative of the population with which the respective molecular sequence is associated, and is different in molecular sequences associated with different ones of the plurality of populations. In this regard, typically, although not necessarily, the molecular building-blocks of the identifying sequence may be for example selected from the same Z types of basic molecular building-blocks used in the data storing segment of the molecular strands.
In certain embodiments (embodiment 9) of the data storage system of the present invention, which incorporate the features of embodiment 8, a difference between identifying sequences that are used in population identification segments of different respective populations exceeds a predetermined threshold. In this regard the threshold may be for example measured by a certain predetermined distance metric of strings, such as an edit distance metric between strings.
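A check of this kind can be sketched as follows, by way of example only. The edit distance is taken here to be the standard Levenshtein distance, and the identifier strings, the threshold value and the function names are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-unit insertions,
    deletions and substitutions turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def identifiers_well_separated(identifiers, threshold):
    """True if every pair of identifying sequences differs by an edit
    distance strictly greater than the threshold."""
    return all(
        edit_distance(a, b) > threshold for a, b in combinations(identifiers, 2)
    )

# Hypothetical identifying sequences for three populations.
ids = ["AAAA", "CCCC", "GGGG"]
print(identifiers_well_separated(ids, threshold=3))
```

A larger threshold makes it less likely that a small number of synthesis or sequencing errors in the identification segment would cause a strand to be attributed to the wrong population.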
In certain embodiments (embodiment 10) of the data storage system of the present invention, which incorporate the features of embodiment 8 or 9, the molecular sequences of one or more of said plurality of populations are contained together in a common region. To this end, molecular sequences associated with the same population can be exclusively selected by utilizing binding molecules configured and operable for selectively binding to the population identification segment of the molecular sequences associated with that same population.
In certain embodiments (embodiment 11) of the data storage system of the present invention, which incorporate the features of any of the above indicated embodiments 7 to 10, the data storage system includes a structure defining a plurality of distinct regions at which molecular sequences of different respective populations reside, respectively. For example, the molecular sequences of different respective populations may reside exclusively and respectively, at said distinct regions (this may be important particularly in cases where the molecular sequences have no population identification segment or other means to identify to which population they belong).
In certain embodiments (embodiment 12) of the data storage system of the present invention, which incorporate the features of any of the above indicated embodiments 1 to 11, the types of basic molecular building-blocks include A, C, G, and T nucleotides and/or chemical modifications thereof (e.g. modifications such as methylation). For instance the types of the basic molecular building-blocks/monomers may be constituted by the A, C, G, and T nucleotides (and/or the chemical modifications thereof). Alternatively the types of the basic molecular building-blocks/monomers may be constituted by the A, C, G, and T nucleotides (and/or their chemical modifications), plus an additional one or more types of basic molecular building-blocks.
In certain embodiments (embodiment 13) of the data storage system of the present invention, which incorporate the features of any of the above indicated embodiments 1 to 12, the types of basic molecular building-blocks are predetermined oligomers (oligos) of the same length. More specifically, preferably basic molecular building-blocks are predetermined oligos of short length (e.g. short k-mers) whose length is in the range of 1 to 20 bases (e.g. specifically, they are nucleotide triplets, quadruplets).
In certain embodiments (embodiment 14) of the data storage system of the present invention, which incorporate the features of the above indicated embodiment 13, the length of the predetermined oligos is larger than 1 (e.g. specifically, they may be formed as doublets or triplets or quadruplets of bases/nucleotides).
Another broad aspect (embodiment 15) of the present invention is a method for storing data. The method includes:
Typically for example, in case the component Pmn indicates that the basic molecular building-block En occurs in the alphabet letter σm, this means that if the encoded letter πk at the location k matches the alphabet letter σm, then the basic molecular building-block En should occur at least in a certain minimal plurality of the molecular sequences of the population in order for it to have (be counted as having) statistical significance during reading.
In certain embodiments (embodiment 16) of the method for storing data according to the present invention, which incorporate the features of the above indicated embodiment 15, the vector {Pmn} defining each alphabet letter σm in the alphabet Σ is a probability vector σm≡{Pmn}|n=1 to Z (or equivalently a frequency vector). To this end Pmn is indicative of the probability (or equivalently indicative of the frequency which is proportional to the inverse of the probability) of the appearance of a basic molecular building-block of type n at location k in the molecular sequences of the population in case the encoded letter πk at that location k corresponds to the alphabet letter σm.
Yet another broad aspect (embodiment 17) of the present invention is a method for reading data stored in a molecular data storage system (e.g. one which is configured according to the present invention). The method includes:
As indicated above, also according to this aspect of the invention, the size of the alphabet Σ is greater than the number Z of the different types of basic molecular building-blocks, |Σ|>Z. The letters {σm} of the alphabet Σ include composite letters, each composite letter being defined by a vector σm≡{Pmn}|n=1 to Z which includes two or more non-zero probabilities Pmn of two or more different respective types of the basic molecular building-blocks. Accordingly, said associating includes, per each of the locations k=1 to K, mapping the observed probability vector Xk at the location k to one letter σk∈Σ≡{σm}|m=1 to M of the alphabet Σ. This may be achieved by determining the alphabet letter σk whose vector satisfies a minimum divergence from the observed probability vector Xk, σk=ArgMin[{σm}|m=1 to M|D (σm, Xk)], where D is a divergence function. Accordingly the method provides for determining a sequence S″={σ″k}|k=1 to K of letters of the alphabet Σ, wherein the sequence is inferred from the molecular data storage system and is indicative of the data stored by the data-block.
In certain embodiments (embodiment 18) of the method for reading data according to the present invention, which incorporate the features of the above indicated embodiment 17, the vector {Pmn}m defining each alphabet letter σm in the alphabet Σ is a probability vector and whereby Pmn designates the probability (e.g. or equivalently frequency or its inverse) of the appearance of basic molecular building-block of type n at location k in the molecular sequences of the population, in case the encoded letter πk at that location k, corresponds to the alphabet letter σm.
In certain embodiments (embodiment 19) of the method for reading data according to the present invention, which incorporate the features of the above indicated embodiment 18, the size |Σ| of the alphabet is given by
whereby Z is the number of distinct types of basic molecular building-blocks, and R is a resolution parameter indicative of an identifiable resolution of the probability (resolution of the content percentages) at which basic molecular building-blocks of different types appear in each location of the monomer strings.
The resolution R may be for example defined as one over the minimal absolute difference between the probabilities Pmn of the same basic molecular building-block type appearing in the definitions of two letters {σm}m=1 to M of the alphabet Σ, such that R≡1/Minn,m1,m2 [Abs(Pm1n−Pm2n)] for any type of basic molecular building-block indexed 1≤n≤Z, and any pair of distinct letters m1≠m2.
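The following Python sketch illustrates how a resolution parameter could be computed from a given set of letter vectors. It is an illustration under stated assumptions: the function name resolution is hypothetical, and components that coincide between two letters (a zero difference) are skipped, so that the minimum is taken over non-zero differences only; the text above does not state how coinciding components are to be treated.

```python
from itertools import combinations
from typing import Sequence


def resolution(alphabet: Sequence[Sequence[float]]) -> float:
    """R = 1 / (smallest non-zero |P_m1n - P_m2n| over all n and letter pairs)."""
    diffs = [
        abs(p1 - p2)
        for v1, v2 in combinations(alphabet, 2)  # all pairs m1 != m2
        for p1, p2 in zip(v1, v2)                # same building-block type n
        if abs(p1 - p2) > 1e-12                  # assumption: skip equal components
    ]
    if not diffs:
        raise ValueError("alphabet needs at least two distinct letters")
    return 1.0 / min(diffs)


# Hypothetical letters over (A, C, G, T) with probabilities in steps of 1/4.
letters = [
    (1.0, 0.0, 0.0, 0.0),
    (0.75, 0.25, 0.0, 0.0),
    (0.5, 0.5, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.5),
]
print(resolution(letters))  # prints 4.0
```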
In certain embodiments (embodiment 20) of the method for reading data according to the present invention, which incorporate the features of the above indicated embodiment 19, the sequencing depth N is set/adjusted as a function of the resolution parameter R of the data storage system.
In certain embodiments (embodiment 21) of the method for reading data according to the present invention, which incorporate the features of any of the above indicated embodiments 17 to 20, the divergence function D (σm, Xk) may be for instance an Lp distance function. Alternatively or additionally the divergence function D (σm, Xk) may be a Euclidean distance D (σm, Xk)=∥σm−Xk∥. Alternatively or additionally, the divergence function D (σm, Xk) may be the Kullback-Leibler divergence D (σm, Xk)=KL(σm, Xk).
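The three divergence functions named above may be sketched in Python as follows. The function names are hypothetical. For the Kullback-Leibler divergence the argument order KL(σm ‖ Xk) is assumed, and a small floor eps is applied to observed components equal to zero so that the logarithm stays finite; both choices are assumptions of this sketch rather than requirements of the text.

```python
import math
from typing import Sequence


def lp_distance(p: Sequence[float], x: Sequence[float], order: float = 1.0) -> float:
    """L_p distance between a letter vector p and an observed vector x."""
    return sum(abs(a - b) ** order for a, b in zip(p, x)) ** (1.0 / order)


def euclidean_distance(p: Sequence[float], x: Sequence[float]) -> float:
    """Euclidean distance, i.e. the L_p distance with p = 2."""
    return lp_distance(p, x, order=2.0)


def kl_divergence(p: Sequence[float], x: Sequence[float], eps: float = 1e-9) -> float:
    """KL(p || x) = sum_n p_n * log(p_n / x_n); terms with p_n == 0 contribute 0."""
    total = 0.0
    for a, b in zip(p, x):
        if a > 0.0:
            total += a * math.log(a / max(b, eps))  # eps: assumed smoothing floor
    return total


sigma = (0.5, 0.5, 0.0, 0.0)   # composite letter vector
x_k = (0.5, 0.25, 0.25, 0.0)   # observed probability vector
print(lp_distance(sigma, x_k), euclidean_distance(sigma, x_k), kl_divergence(sigma, x_k))
```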
In a further broad aspect (embodiment 22) of the present invention there is provided a data reader system adapted to read data stored in a molecular data storage system. The data reader system includes:
In certain embodiments (embodiment 23) of the data reader system of the present invention, which incorporate the features of the above indicated embodiment 22, the data reader system is configured and operable for implementing the method according to any one of the above indicated embodiments 17 to 21.
In certain embodiments (embodiment 24) of the data reader system of the present invention, which incorporate the features of the above indicated embodiment 22 or 23, the sequencing control module is adapted to apply sequencing of N fold nominal sequencing depth to the population of molecular sequences formed with a number Z of different types of basic molecular building-blocks {En}|n=1 to Z and to determine, per each location k out of 1 to K locations of a storage segment of the molecular sequences of the population, an observed probability vector Xk={xk(En)/N′}|n=1 to Z, whereby xk(En) is the number of times, out of an N′ fold actual sequencing depth obtained for the population, at which a basic molecular building-block of type En was found in the location k. The data inference processing module is adapted for associating each observed probability vector Xk with one of the alphabet letters {σm} of an alphabet Σ≡{σm}|m=1 to M.
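The computation of the observed probability vectors Xk from sequencing reads may be illustrated by the following Python sketch. It assumes, for simplicity, that the reads of one population are already aligned, of equal length, and given as strings over the building-block types "ACGT"; the function name observed_vectors is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def observed_vectors(reads: Sequence[str], types: str = "ACGT") -> List[List[float]]:
    """For each location k return X_k = [x_k(E_n) / N'] over the types E_n.

    N' is the actual sequencing depth, taken here as the number of reads.
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned and of equal length")  # assumption
    n_actual = len(reads)
    vectors = []
    for k in range(length):
        counts = Counter(read[k] for read in reads)  # x_k(E_n) for every type
        vectors.append([counts[t] / n_actual for t in types])
    return vectors


reads = ["AC", "AC", "CC", "CC"]  # N' = 4 reads, K = 2 locations
print(observed_vectors(reads))  # prints [[0.5, 0.5, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
```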
In yet a further broad aspect (embodiment 25) of the present invention, there is provided a method for fabricating (i.e. manufacturing) a molecular data storage system. The method includes:
In certain embodiments (embodiment 26) of the method for fabricating the molecular data storage system according to embodiment 25, the depositing includes:
In certain embodiments (embodiment 27), the method for fabricating the molecular data storage system according to embodiment 25 or 26 further includes that the region of the support plate comprises cleavable molecules adapted to bind with said basic molecular building-blocks. Accordingly, the basic molecular building-blocks of the first composition that is first deposited on that region are bound to the cleavable molecules.
Some embodiments (embodiment 28) of the method for fabricating the molecular data storage system, incorporate the features of the above indicated embodiment 27 or 26 and further include harvesting the population of molecules from the respective region by cleaving the cleavable molecules.
In some embodiments (embodiment 29) of the method for fabricating the molecular data storage system, which incorporate the features of any one of the above indicated embodiments 25 to 28, the synthesizing of the population of molecular sequences includes synthesizing similar population identification segments in all molecular sequences of the population. The population identification segment of each molecular sequence is indicative of the population with which the molecular sequence is associated and is different in molecular sequences of different populations. For example, the population identification segment of a molecular sequence may include a synthesized identifying sequence of the basic molecular building-blocks, which is indicative of the respective population with which the molecular sequence is associated.
In some embodiments (embodiment 30) of the method for fabricating the molecular data storage system, which incorporate the features of the above indicated embodiment 29, the difference between identifying sequences that are used in population identification segments of different respective populations, exceeds 3 in an edit distance metric.
In yet another broad aspect (embodiment 31) of the present invention, there is provided a molecular data storage fabrication/manufacturing system that is configured and operable for fabricating a molecular data storage structure according to the present invention. The fabrication system includes:
It should be understood that the terms control unit/module and/or controller used herein may pertain to any type of control systems, digital or analogue, which may be implemented by any suitable circuitry and/or by software/firmware instructions executable by suitable computerized systems/circuits and/or by suitable analogue and/or digital hardware.
In some embodiments (embodiment 32) of the molecular data storage fabrication system incorporating the features of embodiment 31 above, the control unit is configured and operable for implementing operations (b) and (c) of the method of embodiment 25 above.
In some embodiments (embodiment 33) of the molecular data storage fabrication system incorporating the features of embodiment 31 or 32 above, the system is configured and operable for implementing the method of any one of the above indicated embodiments 25 to 30.
In some embodiments (embodiment 34) of the molecular data storage fabrication system incorporating the features of any one of embodiments 31 to 33 above, the one or more containers include L>Z containers. Z containers out of said L containers are adapted for separately containing different types of the Z types of basic molecular building-blocks {En}|n=1 to Z. The remaining L-Z containers are one or more mixture containers, and are adapted for containing one or more mixtures of two or more of said Z types of basic molecular building-blocks.
In some embodiments (embodiment 35) of the molecular data storage fabrication system, which incorporate the features of embodiment 34, the letters {σm}|m=1 to M of the alphabet Σ further include up to Z simple letters, whose probability vectors Pmn include only one probability having a non-zero value. The fabrication head is fluidly connected to the L containers of the respective basic molecular building-blocks and is configured and operable for synthesizing each of the simple letters by controlled deposition of a volume of basic molecular building-blocks obtained from a selected one of the L containers, which contains the basic molecular building-blocks of the type associated with that simple letter.
Some embodiments (embodiment 36) of the molecular data storage fabrication system, which incorporate the features of embodiment 34 or 35, also include a mixer module for preparing said one or more mixtures of two or more of the Z types of basic building block molecules. The mixer module is configured and operable for processing the probability vector of a composite letter, and preparing, in at least one of the one or more mixture containers, a mixture of basic building block molecules of the Z types with concentration ratios matching the probability vector of the composite letter. Some embodiments (embodiment 37) of the molecular data storage fabrication system, which incorporate the features of embodiment 36, include one mixture container in which the mixer module is configured and operable for preparing, on demand, different mixtures associated with different respective composite letters.
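As a non-limiting illustration of how a mixer module might translate the probability vector of a composite letter into dispensing volumes, consider the following Python sketch. The function name mixture_volumes, the volume units, and the optional stock concentrations are assumptions of the example; chemical effects such as differing coupling efficiencies of the building-block types are ignored.

```python
from typing import List, Optional, Sequence


def mixture_volumes(
    probability_vector: Sequence[float],
    total_volume: float,
    stock_concentrations: Optional[Sequence[float]] = None,
) -> List[float]:
    """Volumes to draw from each of the Z single-type containers.

    The molar amounts in the mixture are proportional to the probability
    vector. Equal stock concentrations are assumed when none are given.
    """
    z = len(probability_vector)
    stocks = list(stock_concentrations) if stock_concentrations is not None else [1.0] * z
    if len(stocks) != z or any(c <= 0 for c in stocks):
        raise ValueError("one positive stock concentration per type is required")
    if any(p < 0 for p in probability_vector) or abs(sum(probability_vector) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to one")
    weights = [p / c for p, c in zip(probability_vector, stocks)]
    scale = total_volume / sum(weights)
    return [w * scale for w in weights]


# Composite letter with 50% A, 25% C, 25% G, 0% T; 100 volume units in total.
print(mixture_volumes((0.5, 0.25, 0.25, 0.0), 100.0))  # prints [50.0, 25.0, 25.0, 0.0]
```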
Some embodiments (embodiment 38) of the molecular data storage fabrication system, which incorporates the features of embodiment 36 or 37, include a plurality of mixture containers for containing different mixtures associated with different respective composite letters.
In some embodiments (embodiment 39) of the molecular data storage fabrication system, which incorporates the features of any one of the above indicated embodiments 31 to 38, the basic molecular building-blocks contained in the containers are “blocked” from one end thereof so as to prevent their binding to one another. The fabrication head is configured and operable for carrying out the following after deposition of basic molecular building-blocks of each letter at said region:
In some embodiments (embodiment 40) of the molecular data storage fabrication system, which incorporates the features of any one of the above indicated embodiments 31 to 39, the fabrication head is configured and operable for depositing cleavable molecules at the region at which synthesizing is to be performed (prior to said synthesizing operation).
Some embodiments (embodiment 41) of the molecular data storage fabrication system incorporating the features of embodiment 40 also include a harvesting module configured and operable for harvesting the population of molecules from that region by cleaving the cleavable molecules.
In some embodiments (embodiment 42) of the molecular data storage fabrication system, which incorporate the features of any one of the above indicated embodiments 31 to 41, the control unit is adapted for operating the fabrication head for synthesizing a similar identification segment for all molecules of the same population. The similar identification segment may for example include an identifying sequence of the Z types of basic building block molecules.
In some embodiments (embodiment 43) of the molecular data storage fabrication system, which incorporate the features of any one of the above indicated embodiments 31 to 42, the control unit is configured and operable for operating the fabrication head to synthesize a plurality of population of molecular sequences encoding data of a plurality of respective data blocks, at different spatially separated respective regions.
According to a further broad aspect of the present invention (embodiment 44) there is provided a molecular label (e.g. identifier, tag, marker) including a data storage system configured according to any one of the above indicated embodiments 1 to 14. The molecular label includes at least one data-block that is respectively encoded by at least one population of molecular sequences. For example, the at least one data-block may define a unique label data segment that is indicative of an entity that is to be tagged using the population(s) of molecular sequences of the data storage.
With properly selected (suitable) molecular building blocks, the molecular label of the present invention may include tagging mixtures (populations) that can be used for tagging food, 3D printed products, synthetically manufactured implantation organs, building materials, airplane parts, and other mechanical parts.
In an embodiment (embodiment 45) of the molecular label incorporating the features of embodiment 44, the molecular sequences in the data encoding population(s) may be protected and/or encapsulated against degradation, by any suitable technique as known in the art, for instance as described in U.S. Pat. No. 9,850,531.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawings will be provided by the Office upon request and payment of the necessary fee.
In order to better understand the subject matter that is disclosed herein and to exemplify how it may be carried out in practice, embodiments will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:
Reference is made to
One of the data blocks, data-block 110.1 of the data storage system 100 will now be described in more detail. The data-block 110.1 includes the population 112 of molecular strands/sequences PMs, by which the data stored by the data block is encoded.
Generally, the molecular strands/sequences PMs include strings of basic molecular building-blocks formed with a number Z of different types of the basic molecular building-blocks {En}n=1 to Z (where En is indicative of a type of the basic molecular building-block and n is an index running from 1 to Z for the different types participating in the data storage). The data of the data-block 110.1 is encoded by the sequences of basic molecular building-blocks in the molecular strands/sequences PMs of the data-block 110.1. In some implementations the data of the data-block 110.1 is encoded in an ordered sequence S=(π1, π2, . . . , πk . . . , πK-1, πK) of letters {πk} encoded in the population 112 of molecular strands/sequences PMs. The encoded letters {πk} are generally associated with, or belong to, an alphabet Σ that is used for encoding the data.
The encoded letters {πk} are encoded by the order of the Z types of basic molecular building-blocks {En}|n=1 to Z arranged at least in parts of the molecular strings/strands/sequences PMs of the population 112. Nonetheless, according to the technique of the present invention, the size M=|Σ| of the alphabet Σ (namely the number of distinct letters therein) is greater than the number Z of different types of basic molecular building-blocks that are used/included in the molecular strands/sequences PMs, (M>Z).
This is achieved by exploiting the redundancy of molecular strands/sequences PMs in the population 112, to define the letters in the alphabet Σ in statistical terms indicating probabilities of existence of each of the Z types of basic molecular building-blocks in the letter. In this manner, the number M of different letters which are defined in the alphabet Σ may be higher than the number Z of basic molecular building-block types.
In other words, according to the present invention, a letter σm in the alphabet Σ≡{σm}|m=1 to M can be represented (or is defined) by a probability vector σm≡{Pmn}n=1 to Z. The probability Pmn indicates the probability that a basic molecular building-block of type En (n being the index of the type running from 1 to Z (1≤n≤Z)) appears at a certain location (e.g. indexed k) along the molecular strands/sequences PMs of a population 112 in case the respective letter σm is encoded in that location.
Generally, when considering the letters definition in terms of the probability vector σm≡{Pmn} then the sum of probabilities of each letter's σm definition should equal one, Σn=1 to Z (Pmn)=1. However, although defining the letters in terms of probabilities may be convenient, it should be understood that alternatively or additionally, the composite letters may be equivalently defined by a frequency vector indicative of the respective frequencies/concentrations (Cmn) at which each type of basic molecular building-block, indexed n, appears in the letter σm (m being the index of the letter). In such an equivalent definition, the sum of the frequencies/concentration may not necessarily be equal to one. To this end, the probability vector may be considered as a normalized version of the frequencies/concentrations (Cmn).
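The normalization of a frequency/concentration vector (Cmn) into the corresponding probability vector may be illustrated by the following short Python sketch; the function name to_probability_vector is hypothetical.

```python
from typing import List, Sequence


def to_probability_vector(concentrations: Sequence[float]) -> List[float]:
    """Normalize concentrations C_mn so that the resulting P_mn sum to one."""
    if any(c < 0 for c in concentrations):
        raise ValueError("concentrations must be non-negative")
    total = sum(concentrations)
    if total <= 0:
        raise ValueError("at least one concentration must be positive")
    return [c / total for c in concentrations]


# A letter specified as 3 parts A to 1 part C (no G, no T).
print(to_probability_vector([3, 1, 0, 0]))  # prints [0.75, 0.25, 0.0, 0.0]
```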
Conventional molecular storage techniques (e.g. such as disclosed in [1]-[3], [6]), encode the data using an alphabet whose size is equal to or smaller than the number of types of monomers/building blocks of the molecular sequences. In other words, in such conventional techniques there is one-to-one correspondence between the alphabet letters and the types of monomers.
Indeed, this type of alphabet letters, as used in conventional techniques, which correspond exclusively to a single type of monomer, may also optionally, but not necessarily, be used in the technique of the present invention and are referred to in the following as Simple Letters.
In the notation used in the present application, where an alphabet letter σm is designated by the probability vector σm≡{Pmn}|n=1 to Z, the letter σm may be regarded as a Simple Letter if its probability vector {Pmn} includes only one probability having a non-zero expected value, e.g. Pmn=1 only for the index n corresponding to one certain type of basic molecular building-block n=z′ and Pmn=0 for all other indices n≠z′.
However, as indicated above, according to the present invention, the number M of letters in the alphabet Σ is greater than the number Z of building-block types and there is no one-to-one correspondence between letters and building-block types. This is achieved by utilizing letters which are referred to herein as Composite Letters. In Composite Letters the probability vectors {Pmn} include two or more non zero probabilities, i.e., Pmn>0 for indices n corresponding to at least two types of basic molecular building-blocks n=z′, n=z″, z′≠z″. In other words, a composite letter may be considered as any letter σm corresponding to a vector, which is not a simple letter vector, or yet, alternatively, a composite letter σm may be defined as a letter whose probability vector components Pmn<1 for all building-block types 1≤n≤Z, m being the fixed index of the letter.
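The distinction between Simple Letters and Composite Letters may be expressed compactly in code. The Python sketch below uses a hypothetical function classify_letter and an assumed numerical tolerance tol for deciding whether a probability is non-zero.

```python
from typing import Sequence


def classify_letter(vector: Sequence[float], tol: float = 1e-9) -> str:
    """Return 'simple' for one non-zero probability, 'composite' for two or more."""
    nonzero = sum(1 for p in vector if p > tol)
    if nonzero == 0:
        raise ValueError("a letter needs at least one non-zero probability")
    return "simple" if nonzero == 1 else "composite"


print(classify_letter((1.0, 0.0, 0.0, 0.0)))    # prints "simple"
print(classify_letter((0.5, 0.25, 0.25, 0.0)))  # prints "composite"
```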
Thus, according to the present invention, the alphabet Σ includes: (i) up to Z simple letters (where Z is the number of different types of basic building block molecules participating in the encoding of data in the molecular population(s) of the system); and (ii) one or more composite letters whose probability vectors are Pmn.
Turning now to
Except for letters σ1, σ21, σ31 and σ35, which are simple letters, the rest of the 35 letters in the exemplified alphabet Σ are composite letters, whose probability vectors include probabilities that two or more different types of the A, C, G, and T monomers (nucleotides) appear in the respective position in which the letter is encoded in the molecular strings/strands/sequences PMs of the population 112.
A resolution R of the alphabet Σ and/or of its individual letters, is an important parameter by which the size M=|Σ| (number of letters) of an alphabet Σ constructed from a given number Z of building-block types {En}n=1 to Z, may be determined. The resolution parameter R is defined as one over the minimally allowed absolute difference between probabilities of the same building-block type appearing in the definition of two arbitrary letters in the alphabet Σ. Namely, R≡1/Minn,m1,m2 [Abs(Pm1n−Pm2n)] for any type n of the Z building-block types 1≤n≤Z and any pair of distinct letters m1≠m2; 1≤m1,m2≤M. In other words, the reciprocal 1/R of the resolution parameter represents the step between the distinct values that each probability component Pmn in the vectors defining the alphabet letters {σm} can take for a given type n. As will be described below, the resolution parameter can actually be determined based on various error rates expected during the writing (synthesizing) and reading (sequencing) of the data storage 100, degradation related errors, the error correction codes included in the data, and acceptable error probabilities.
Considering a given resolution parameter R, the maximal size/number-of-letters in the alphabet Σ is given by
whereby R is the resolution parameter and Z is the number of distinct types of basic molecular building-blocks. This is because the number of possibilities of different letters is in this case equivalent to the combinatorial number of unordered combinations with repetitions for selecting (in the vectors defining the letters) a total number R of monomers from the Z types. To this end, for the alphabet Σ exemplified in
considering that the number of building-block types is Z=4.
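By way of non-limiting illustration, the counting argument above can be checked numerically. The Python sketch below assumes that each letter is a vector of Z probabilities which are integer multiples of 1/R and sum to one, enumerates all such vectors, and compares their number with the binomial coefficient C(R+Z−1, Z−1), the standard count of unordered selections with repetition of R items from Z types. This closed form and the value R=4 are inferred here from the description (35 letters, Z=4) and are not quoted from it; the function name enumerate_alphabet is hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb
from typing import List, Tuple


def enumerate_alphabet(z: int, r: int) -> List[Tuple[Fraction, ...]]:
    """All probability vectors over z types whose entries are multiples of 1/r."""
    letters = set()
    # Each unordered selection of r building-blocks (with repetition) from the
    # z types yields one letter: P_n = (count of type n in the selection) / r.
    for pick in combinations_with_replacement(range(z), r):
        letters.add(tuple(Fraction(pick.count(n), r) for n in range(z)))
    return sorted(letters, reverse=True)


letters = enumerate_alphabet(z=4, r=4)
simple = [v for v in letters if max(v) == 1]
print(len(letters), comb(4 + 4 - 1, 4 - 1), len(simple))  # prints 35 35 4
```

Under these assumptions, four of the 35 letters are simple letters and the remaining 31 are composite letters.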
It should be understood that the types of molecular strands/sequences PMs used in the data storage system 100, and the building-blocks types used therein, may differ from implementation to implementation of the system depending on various prerequisites required from the data storage system. For instance, as exemplified above and in the following, the molecular strands/sequences PMs may be bio-polymers, such as nucleic acid, DNA or RNA, which are poly-nucleotide molecules constructed with Adenine, Cytosine, Guanine, and Thymine nucleotides (A,C,G,T) as building-blocks/monomers (DNA), or with Adenine, Cytosine, Guanine, and Uracil nucleotides (A,C,G,U) as building-blocks/monomers (RNA). In other instances, the molecular strands/sequences PMs may include other polymers types, bio-polymers or not, with any number Z>1 of monomer/building-block types as permitted by the chemistry of the type of polymers used. To this end, data storage system 100 of the present invention may be implemented with the building-block types including or consisting of the A, C, G, and T nucleotides, and/or the A, C, G, and U nucleotides, or with these nucleotides plus additional one or more building-block types, or with different sets of basic molecular building-block types, being e.g. bio-type monomers and/or other, e.g. synthetic[7], monomers.
Turning back to
It should be noted that the phrases molecular strand, molecular sequence as well as polymer molecule, are used herein to indicate molecules composed of at least one chain of many building-blocks (i.e. being the basic subunits of the molecule, which are referred to herein as monomers). In the molecular strand/sequence, the basic molecular building-blocks/monomers are arranged in a chain/string, which may be a simple linear chain (with no branches), or a branched chain which includes one or more branch points at which the chain/string of building-blocks/monomers is split into several strings. In any case, for clarity, each molecular strand/sequence is considered herein to include a chain/string/sequence of building-blocks/monomers. It should be also understood that the term section, used herein in relation to a part of the monomer string/chain, should not be considered necessarily as a continuous section of the string/chain, but may be considered to be a set of predetermined locations {k}, adjacent or not, along the chain/string of monomers of the molecular strands/sequences, which serve a designated purpose. For instance, the data encoding sections 115, are sections which indicate how monomer/building-blocks constituents (in different locations {k} thereof) are used to encode the data stored by the system 100. Such sections 115, as well as other sections (e.g. 114 and 116) are illustrated for clarity in the figure as continuous, however, it should be understood that they are not necessarily continuous, but merely represent sets of predetermined locations along each of the molecular strands/sequences PMs of the population 112.
Table 1 in
For example, the encoded letter π1 in
It should be understood that the example of
It should be noted that in some embodiments of the present invention the data storage system 100 may be configured and operable for storing large amounts of data and may include a large number of data blocks (populations).
Alternatively or additionally, in some embodiments the data storage system 100 may be configured and operable for use as a molecular mark/label or tag (e.g. marker/tag) which can be applied on or within an object which is to be marked/labeled, and/or optionally embedded within the material constituting the object, for labeling the object and for enabling its identification or verification. In this case the data storage system 100 may include at least one data-block (e.g. as few as one population of molecular sequences), by which the marking data indicative of the molecular mark is encoded. In some embodiments the molecular tag or label further includes, in addition to the data storage system 100, also additional constituent materials selected/designed for embedding and/or binding the molecular mark on an object in a designated way. The additional constituent materials may include for instance material that encapsulates the coding material and protects it against degradation as is described in U.S. Pat. No. 9,850,531. It should be emphasized that this invention provides for using composite encoding within such tagging systems, enabling more tagging flexibility.
As also shown in
where L is the PM length (e.g. in the order of 50 to 1000 building-blocks as said above) can be stored by each such population. Thus, typically, in most cases, a plurality of such populations/data-blocks 110 are included in the data storage.
Indeed, in some implementations, the data storage is configured such that the different populations (112) of molecules, which are associated with different data-blocks 110, reside at different physical regions/places, and can thus be distinguishable based on their region. For instance, the populations may be stored in separate regions of a matrix/plate carrier or on different containers, such that molecules of different populations (112) can be separately read/sequenced from the different locations.
Alternatively, or additionally, as shown in
As shown, in the present example of
It should be noted that in some embodiments, e.g. particularly in case where the molecular strands/sequences are composed of A,C,G,T monomers, the identification segments can be located at the so called 5p-end of the molecules, or at the so called 3p-end of the molecules, or, generally, they may also be located anywhere else along the monomer/building-block strings/sequences of the molecules. In some particular implementations/embodiments of the invention, it may be preferable to locate the identification segments on the 5p-end of the synthesized molecules. This is because the quality of synthesized polymer tends to be higher at the 5p-end of the molecule.
Table 3 in
It should be noted that in some embodiments of the present invention the molecular strands/sequences PMs of different populations/data-blocks 110 are configured/synthesized such that the identifying sequences ID-SEG (114) which identify different ones of the populations/data-blocks 110 differ from one another by a difference exceeding a certain predetermined threshold. More specifically, in some embodiments of the present invention, the molecular data storage 100 may be configured such that each two different identification sequences/segments of building-block/monomers which are used for identifying molecular strands/sequences of different populations/data-blocks differ from one another by at least a certain predetermined distance threshold measured on a certain preselected distance metric of strings. For example, the certain distance metric of strings used may be the so called edit distance (as generally known in the art), and the minimal threshold edit distance between different identification sequences/segments may be, in some cases, at least 3 edit operations measured in the edit distance metric. Using the certain minimal distance (e.g. 3) may be preferable because the mapping of the letters in the population to the composite alphabet depends on identifying every molecule as a member of the correct population.
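The minimal edit distance requirement between identification sequences may be verified, for example, as in the following Python sketch. It uses the standard Levenshtein (insertion, deletion, substitution) dynamic program; the function names, the example identifiers, and the default threshold of 3 edit operations are illustrative choices.

```python
from itertools import combinations
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def ids_well_separated(ids: Sequence[str], min_distance: int = 3) -> bool:
    """True if every pair of identifying sequences is at least min_distance apart."""
    return all(edit_distance(a, b) >= min_distance for a, b in combinations(ids, 2))


print(ids_well_separated(["AAAAAA", "CCCCCC", "GGGGGG"]))  # prints True
print(ids_well_separated(["ACGTAC", "ACGTAA"]))            # prints False
```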
Turning now together to
In the molecular data storage systems type A, 100A, shown in
In the molecular data storage systems type A, 100A, shown in
In the molecular data storage systems type B, 100B, shown in
The general molecular data storage system 100 shown in
Reference is now made to
In 210, data of at least one data-block (e.g. 110.1) to be stored by the system is provided. The data is designated to be encoded by a respective population (e.g. 112) of molecular strands/sequences PMs that are formed with a number Z of different building-block types {En}n=1 to Z. In 220, the data of the data-block 110.1 is processed for presenting it as a data sequence S=(σ1, σ2, . . . , σk . . . , σK-1, σK) of letters of the alphabet Σ≡{σm}|m=1 to M which is used according to the present invention, as described above (namely the alphabet Σ having the size M≡|Σ|>Z and/or the alphabet Σ including the composite letters defined above). To this end, utilizing such an alphabet Σ, the sequence representing the data is generally shorter than the sequence that would be required by conventional techniques, whose alphabet uses the types of monomers themselves as letters. This is because the numeral base of the alphabet Σ of the present invention is its size M, which is greater than the numeral base of a conventional alphabet, whose size is the number Z of monomer types used in the molecular data storage.
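The shortening of the data sequence obtained with a larger numeral base may be illustrated by the following Python sketch, which writes the same payload once in base Z=4 and once in base M=35. The direct base conversion shown here is only one simple way of presenting data as a sequence of alphabet letters and is an assumption of the example (in particular, no error correction coding is included); the function name to_base and the payload are hypothetical.

```python
from typing import List


def to_base(value: int, base: int) -> List[int]:
    """Digits of a non-negative integer in the given base, most significant first."""
    if value < 0 or base < 2:
        raise ValueError("value must be >= 0 and base must be >= 2")
    digits = []
    while True:
        value, digit = divmod(value, base)
        digits.append(digit)
        if value == 0:
            break
    return digits[::-1]


payload = int.from_bytes(b"DNA", "big")  # example data of a data-block
simple_seq = to_base(payload, 4)         # conventional alphabet, Z = 4 letters
composite_seq = to_base(payload, 35)     # composite alphabet, M = 35 letters
print(len(simple_seq), len(composite_seq))  # prints 12 5
```

Each digit of composite_seq would index one letter σm of the alphabet Σ, so the same payload occupies 5 locations instead of 12.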
In 230, the data sequence S=(σ1, σ2, . . . , σk . . . , σK-1, σK) of the data-block 110.1 is encoded as a sequence S′=(π1, π2, . . . , πk . . . , πK-1, πK) of encoded letters in the population 112 of molecular strands/sequences PMs formed with the types {En}n=1 to Z of basic molecular building-blocks. To achieve this, optionally in 232 the alphabet Σ (e.g. such as that represented in
Optionally, as shown in 240, each of the molecular strands/sequences PMs of the population 112 includes an identifying sequence of building-blocks (e.g. selected from the Z types). The identifying sequence may serve as a unique identifier of the data-block whose data is encoded in the population 112.
Optionally, as shown in 250, a plurality of data blocks with respective data are encoded/formed by repeating 210 to 240 to provide a respective plurality of populations that encode the corresponding data of the plurality of data blocks. As indicated in 252, the different populations may be located at different regions so that molecules of different populations can be distinguished. Alternatively or additionally, as shown in 254, molecular strands/sequences of different populations include different identification segments/sections identifying their respective population, and distinguishing between the molecules of different populations.
Reference is now made to
The data reader system 300 includes a sequencing control module 310 (hereinafter also referred to as sequencing controller) configured and operable for connecting-to/communicating-with a sequencing system 340 (which may or may not be part of the system 300), and data inferencing module 320. The inferencing module 320 may include an alphabet data provider module 322 configured and operable for obtaining data indicative of an alphabet Σ which is used for encoding data in a molecular storage system of the invention (e.g. the alphabet data may be such as that exemplified in
To this end, based on the alphabet Σ, the data reader system 300 (e.g. the sequencing controller 310) may determine a required nominal sequencing depth N to sequence the population of molecules of the molecular data storage system 100. In this regard it should be noted that one property of the alphabet which can be used to determine the required nominal sequencing depth N is the resolution parameter R of the alphabet Σ. More specifically, the nominal sequencing depth N required for reliable reading/inference of the data stored in the molecular data storage is a function of the resolution parameter R of the alphabet Σ (whereby higher resolution means smaller statistical distance/difference between the definitions of different letters in the alphabet, which thus requires a higher sequencing depth, namely deeper reading, in order to obtain and reliably infer the letters encoded in the population of molecular strands/sequences). The required nominal sequencing depth N is also a function of the Inference Error probability, iErr, being the probability of associating a synthesized letter πk that is sequenced from the population 112 of molecules with a wrong letter of the alphabet Σ. In this regard it should be understood that the synthesized letter πk is actually synthesized/written in location k of the plurality of molecular strands/sequences in the population 112 by the existence of building-blocks of the Z different types with amounts/concentrations (probabilities of existence) {Ck(En)}|n=1 to Z which correspond to the probability vector {Pk,n} of the alphabet letter σk that should have been written in the location k. Indeed, the population 112 may include on the order of O~10^5 to 10^8 molecular strands/sequences by which the synthesized letter πk is written. However, when sequencing the population 112, only on the order of N molecules are sequenced (N being the sequencing depth).
This may lead to the inference error iErr, which is the statistical error associated with the selection/examination of only N out of the O molecules in the population. Specifically, there may be a discrepancy between the synthesized letter πk, which is written in the population 112 in the form of the amount/concentration vector {C(En)}k|n=1 to Z of the Z types of basic building-block molecules in the location k of the population, and the observed probability vector {Xk} with which the Z types of building blocks (monomers) appear in the location k, because the observed probability vector {Xk} is determined based only on the N molecular strands/sequences that are sequenced out of the O molecules of the population 112 (N being the sequencing depth). Thus, a higher sequencing depth N reduces the Inference Error probability, iErr, for a given resolution R of the data storage system 100. To this end, the required sequencing depth is given as a function of the resolution R of the alphabet of the data storage system and of the acceptable Inference Error probability, as follows: N=F(R, iErr).
Accordingly, in some embodiments the alphabet data provider 322 may be adapted to provide the sequencing controller with data indicative of the resolution property of the alphabet used in the data storage system 100 that is to be sequenced, and the sequencing controller 310 may include a sequencing depth controller 312 that is adapted to utilize input/reference data indicative of an acceptable inference error iErr by which the encoded data should be determined, and to utilize the relation N=F(R, iErr) indicated above in order to determine a sequencing depth N by which to operate the sequencing system 340 for sequencing the data block 110.1 of the data storage system 100.
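By way of a non-limiting illustration only, the relation N=F(R, iErr) may be sketched numerically. The specification does not give a closed form for F, so the following is an assumption-laden example rather than the claimed relation: it assumes only Z=2 building-block types, an alphabet whose letters are the probabilities j/R for j=0 to R, independent sampling of N strands (a binomial model), and nearest-letter decoding. The function names binary_alphabet, inference_error, worst_case_error and required_depth are hypothetical.

```python
from math import comb


def binary_alphabet(resolution):
    """Assumed alphabet over Z = 2 types: P(E1) = j/R for j = 0..R."""
    return [j / resolution for j in range(resolution + 1)]


def inference_error(p_true, letters, depth):
    """Exact probability that nearest-letter decoding of `depth` reads is wrong.

    Assumes each read shows type E1 independently with probability p_true.
    """
    error = 0.0
    for count in range(depth + 1):
        observed = count / depth
        # Ties are resolved in favour of the first (lowest) letter.
        inferred = min(letters, key=lambda p: abs(p - observed))
        if inferred != p_true:
            error += (comb(depth, count)
                      * p_true ** count
                      * (1 - p_true) ** (depth - count))
    return error


def worst_case_error(resolution, depth):
    """Largest inference error over all letters of the assumed alphabet."""
    letters = binary_alphabet(resolution)
    return max(inference_error(p, letters, depth) for p in letters)


def required_depth(resolution, max_error, depth_limit=1000):
    """First depth N <= depth_limit whose worst-case error is <= max_error."""
    for depth in range(1, depth_limit + 1):
        if worst_case_error(resolution, depth) <= max_error:
            return depth
    return None  # no depth within the limit meets the target


if __name__ == "__main__":
    for r in (2, 4):
        print("R =", r, "-> N =", required_depth(r, 0.05))
```

Under these assumptions a higher resolution R never yields a smaller required depth than a lower one for the same acceptable iErr, in line with the qualitative statement above. Because of ties and the discreteness of the counts, the error in this toy model is not strictly decreasing in N, so the sketch reports only the first depth that meets the target.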
In this regard, it should be noted that the inferencing error is not the only error that may introduce discrepancy between the input data sequence S, which is written in data storage system 100, and the data sequence S″ being thereafter read from the data storage system 100. Turning now to
Considering the above sources of error, a reliability factor RL of correct reading of the data storage (namely, RL being the probability that the letter σ″k decoded/inferred from the location k of the data storage corresponds to the letter σk of the sequence S that was intended to be written in the location k) is given by: RL=(1−wErr)*(1−dErr)*(1−sErr)*(1−iErr).
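As a brief worked illustration of the above product, the following sketch evaluates RL for given error probabilities. The function name reliability_factor and the example values are hypothetical; treating the four error sources as independent is simply what the product form of the formula implies.

```python
def reliability_factor(w_err, d_err, s_err, i_err):
    """RL = (1 - wErr) * (1 - dErr) * (1 - sErr) * (1 - iErr)."""
    for name, value in (("wErr", w_err), ("dErr", d_err),
                        ("sErr", s_err), ("iErr", i_err)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(name + " must be a probability in [0, 1]")
    return (1 - w_err) * (1 - d_err) * (1 - s_err) * (1 - i_err)


if __name__ == "__main__":
    # Hypothetical example: each error source contributes a 1% error rate.
    print(reliability_factor(0.01, 0.01, 0.01, 0.01))
```

For instance, with each of the four error probabilities equal to 1%, the reliability factor is 0.99 to the fourth power, i.e. about 0.9606 per location.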
Accordingly, turning back to
Alternatively, or additionally, the system may include a sequencing depth controller 312 of the sequencing controller 310 which may be adapted to operate the sequencing system 340 to sequence a population 112 of molecular strands/sequences PMs with a predetermined nominal sequencing depth N.
Thus, the sequencing controller 310 is adapted to operate the sequencing system 340 to sequence a population 112 of molecular strands/sequences PMs with nominal sequencing depth N.
In this regard it should be noted that in various embodiments of the present invention the sequencing controller 310 may be adapted to operate the sequencing system 340 to sequence the plurality of populations (data blocks) of the data storage system 100, and the resulted sequenced data of the plurality of data blocks may be provided (e.g. from the sequencing system 340) to the Sequence Data Provider module 328 of the data inferencing module 320. In turn, the data inferencing module 320 may include a data-block selector module 326, configured and operable for selecting the one or more data blocks (e.g. 110.1) of the data storage system 100 whose data are to be determined/inferred, and extracting, from the sequenced data (sequencing results) which are received by the Sequence Data Provider module 328, the relevant sequencing data of the data of the selected one or more data blocks (e.g. 110.1). To this end, in this case the sequencing controller 310 operates the sequencing system 340 to sequence all/a plurality of data blocks in the data storage system 100, and extraction of the sequenced data of the relevant data block is performed after the sequencing.
Alternatively or additionally, in some embodiments the sequencing controller 310 may be adapted to operate the sequencing system 340 to sequence the population(s) of only the selected one data block (or more than one data block) of the data storage system 100. The sequencing controller 310 may include a data-block selector module 316 that is configured and operable for selecting the data block (or the plurality thereof) which needs to be sequenced. This may be based on input data indicative of the required blocks. In turn, the sequencing system 340 operates to discriminate between (e.g. exclusively sequence) the molecular strands/sequences of the selected data block/population, whereby such discrimination may be based on the region/location at which the molecular strands/sequences of the selected data block/population are located in the data storage system 100 (i.e. considering that this location may be exclusive to the selected population) or by utilizing specifically selected binding molecules which are configured and operable to selectively bind to a unique identification segment associated with molecules belonging to the selected population. It should be understood that this technique can only be operated with populations whose molecules include respective identification segments, and only in case the sequencing system 340 includes (or can synthesize “on the fly”) one or more collections of binding molecules, where binding molecules of each collection are adapted to exclusively bind to a respective population (to the identification segment thereof).
Thus, in that case, upon receiving operational instructions of the selected data block from the data-block selector module 316, the sequencing system 340 utilizes the designated region of the selected data-block/population, and/or utilizes/synthesizes binding molecules capable of binding to the identification segment of the selected data-block/population, to extract/sequence the molecules of the selected data-block separately and provide the sequenced data/results to the Sequence Data Provider module 328.
In turn, regardless of whether data-block selector module 316 and/or data-block selector module 326 is used, the sequencing data/results corresponding to the data segments of the population of molecules in the selected data blocks are provided (separately per each respective data block) to the mapping module 324 of the data inferencing module 320.
To this end, the population 112 of molecular strands/sequences PMs of the selected data-block, e.g. 110.1, is sequenced with an N″ fold sequencing depth (or higher); it is noted that the actual sequencing depth N″ is not necessarily controlled/controllable a priori, and may somewhat deviate from the intended/requested sequencing depth N by which the sequencing system 340 was operated. The sequencing of the selected data-block, e.g. 110.1, yields a series/sequence of observed probability vectors {Xk}, including an observed probability vector Xk per each location k in the data segments of the molecular strands/sequences PMs of the population of the selected data-block 110.1. As indicated above, the observed probability vector Xk of each location k is generally indicative of the normalized number/amount of building-blocks/monomers of each type En found in the location k of the N″ molecular strands/sequences, which were sequenced from the population 112: Xk={C″k(En)/N″} where the index n of the monomer types runs from n=1 to Z (namely to cover all possible types of participating building-blocks).
The mapping module 324 is configured and operable to map/associate each observed probability vector Xk of location k in the sequencing results of the data-block, with the corresponding alphabet letter σ″k being read/inferred from that location. This should be determined per each location k out of 1 to K locations of the storage segments of the molecular strands/sequences of the population 112.
Indeed the alphabet letters {σ″k} being read from the respective locations k should generally belong to the alphabet Σ: σ″k∈Σ≡{σm}|m=1 to M. As indicated above, each letter σm of the alphabet is defined by a respective probability vector σm≡{Pmn}|n=1 to Z. Accordingly, the mapping module 324 may be adapted to determine the alphabet letter σ″k∈Σ whose divergence from the observed probability vector Xk is minimized: σ″k=ArgMin[{σm}m=1 to M|D (σm, Xk)], where D is a divergence function. This provides for mapping each observed probability vector Xk to its respective inferred alphabet letter σ″k.
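The relation Xk={C″k(En)/N″} may be illustrated by the following minimal sketch, which counts, per location k, how many of the N″ sequenced strands show each building-block type and normalizes by N″. The sketch assumes DNA (Z=4, types A, C, G, T), reads of equal length that are already aligned and stripped of any identification segment, and no insertions/deletions; the name observed_vectors is hypothetical.

```python
from collections import Counter

BUILDING_BLOCKS = "ACGT"  # assumption: Z = 4 DNA base types E1..E4


def observed_vectors(reads, types=BUILDING_BLOCKS):
    """Return one observed probability vector Xk per location k.

    `reads` holds the data segments of the N'' sequenced strands,
    all assumed to be aligned and of identical length K.
    """
    if not reads:
        raise ValueError("at least one sequenced strand is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    depth = len(reads)  # the actual sequencing depth N''
    vectors = []
    for k in range(length):
        counts = Counter(read[k] for read in reads)  # C''k(En)
        vectors.append([counts[t] / depth for t in types])
    return vectors


if __name__ == "__main__":
    for k, x_k in enumerate(observed_vectors(["AC", "AC", "GC", "AT"]), 1):
        print(k, x_k)
```

In this toy input of four strands, location 1 shows A three times and G once, so its observed vector over (A, C, G, T) is (0.75, 0, 0.25, 0).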
In some embodiments the divergence function D which is used by the mapping module 324 for mapping the observed probability vector Xk to the alphabet letter σ″k∈Σ is an Lp distance function defined over a so-called Lp space. In this regard, as generally known, Lp spaces (also sometimes called Lebesgue spaces) are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces. In some examples, the Lp space is a Euclidean space and thus the divergence function D which is used for the mapping is the Euclidean distance/norm D (σm, Xk)=∥σm−Xk∥. The dimensionality of the space may be the number Z of building-block types, which is actually the size Z of the probability vectors defining the letters σm of the alphabet Σ and of the observed probability vectors Xk.
Alternatively or additionally, in some embodiments the mapping module 324 is configured and operable for utilizing the Kullback-Leibler (KL) divergence as the divergence function D (σm, Xk) by which the mapping is performed. To this end D (σm, Xk)=KL (σm, Xk). The Kullback-Leibler divergence, also known as relative entropy, is a measure of the divergence of one probability distribution from a second, expected probability distribution. A Kullback-Leibler divergence of 0 indicates that the two distributions are identical, while larger values indicate that the two distributions differ increasingly, such that an observation generated by the first distribution is increasingly unlikely under the second distribution.
In this regard, the inventors of the present invention have noted that using the Kullback-Leibler divergence for mapping an observed probability vector Xk to the inferred alphabet letter σ″k∈Σ may be advantageous, and in some cases yields superior results (better/higher reliability factor RL) and particularly may result in reduced mapping/inference errors. This is because minimizing the KL divergence selects the letter whose probability distribution has the maximum likelihood of generating the observed frequencies Xk.
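The minimum-divergence mapping with a Euclidean distance may be sketched as follows. This is an illustration only: the example alphabet (four simple letters and a single composite letter, labeled here "M", defined as an equal mixture of A and C) and the names euclidean and infer_letter are hypothetical, and the vectors are ordered as (A, C, G, T).

```python
import math


def euclidean(letter, observed):
    """Euclidean (L2) distance between two length-Z probability vectors."""
    return math.sqrt(sum((p - x) ** 2 for p, x in zip(letter, observed)))


def infer_letter(observed, alphabet, divergence=euclidean):
    """Return the name of the letter minimizing D(letter, observed)."""
    return min(alphabet, key=lambda name: divergence(alphabet[name], observed))


# Hypothetical example alphabet over Z = 4 types, ordered (A, C, G, T).
EXAMPLE_ALPHABET = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "G": (0.0, 0.0, 1.0, 0.0),
    "T": (0.0, 0.0, 0.0, 1.0),
    "M": (0.5, 0.5, 0.0, 0.0),  # composite letter: equal mixture of A and C
}

if __name__ == "__main__":
    print(infer_letter((0.55, 0.45, 0.0, 0.0), EXAMPLE_ALPHABET))
    print(infer_letter((0.90, 0.10, 0.0, 0.0), EXAMPLE_ALPHABET))
```

The divergence is passed as a parameter so that another function D may be substituted without changing the mapping step itself.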
For instance:
where m runs over the types of basic molecular building blocks used. To this end, when using the error-aware multinomial model, the KL approach is equivalent to a Maximum-Likelihood mapping. Since the KL measure is highly sensitive to letters on the edges of the simplex, this approach may be implemented using a variation of the composite alphabet in which zero entries in the probability vectors are replaced with some non-zero small value ϵ>0.
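The KL-based mapping with the ϵ replacement of zero entries may be illustrated by the following sketch. Several choices here are assumptions rather than statements of the specification: the divergence is taken in the direction from the observed vector Xk to the letter (the direction for which minimization coincides with maximum likelihood under a plain multinomial sampling model, without a sequencing-error term), the ϵ-adjusted vectors are renormalized to sum to 1, and ϵ=0.001 is an arbitrary value. The names smooth, kl_divergence, infer_letter_kl and log_likelihood are hypothetical.

```python
import math


def smooth(letter, eps=1e-3):
    """Replace zero entries by a small eps > 0 and renormalize (assumed)."""
    raised = [max(p, eps) for p in letter]
    total = sum(raised)
    return [p / total for p in raised]


def kl_divergence(observed, letter):
    """KL(observed || letter); terms with observed probability 0 contribute 0."""
    return sum(x * math.log(x / p) for x, p in zip(observed, letter) if x > 0)


def infer_letter_kl(observed, alphabet, eps=1e-3):
    """Name of the letter whose smoothed vector minimizes the KL divergence."""
    return min(alphabet,
               key=lambda name: kl_divergence(observed,
                                              smooth(alphabet[name], eps)))


def log_likelihood(counts, letter):
    """Multinomial log-likelihood of the counts, up to a letter-independent term."""
    return sum(c * math.log(p) for c, p in zip(counts, letter) if c > 0)


if __name__ == "__main__":
    alphabet = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "M": (0.5, 0.5, 0, 0)}
    counts = (11, 8, 1, 0)            # hypothetical read counts, N'' = 20
    observed = [c / sum(counts) for c in counts]
    by_kl = infer_letter_kl(observed, alphabet)
    by_ml = max(alphabet,
                key=lambda name: log_likelihood(counts, smooth(alphabet[name])))
    print(by_kl, by_ml)  # both criteria select the same letter
```

Without the ϵ replacement, the single G read in this example would give every listed letter an infinite divergence, which is the edge-of-simplex sensitivity noted above.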
Nonetheless, in some other embodiments, the Lp distance function(s), e.g. Euclidean distance, may be used.
Thus, utilizing the selected/a priori-set divergence function, the mapping module 324 maps/associates each of the observed probability vectors {Xk} to respective inferred letters, thus determining an inferred/read sequence S″={σ″k} which, subjected to the reliability factor RL, is similar to the original sequence S={σk} that was encoded/synthesized in the respective data-block, e.g. 110.1, of the data storage module 100.
Reference is now made together to
In 410, a molecular data storage system 100 including at least one data-block encoding data, e.g. 110.1, is provided. The at least one data-block 110.1 is formed by at least one respective population 112 of molecular strands/sequences PMs, which are formed with strings representing chains of building-blocks including a number Z of different types of building-blocks. The data of the data-block 110.1 is encoded in sequence S′=(π1, π2, . . . , πk . . . , πK-1, πK) (e.g. ordered) of encoded letters {πk} belonging to the alphabet Σ, whereby the identity of each encoded letter πk∈Σ is indicated by the types of building-blocks existing at certain respective locations corresponding to k along the building-block strings of the molecular strands/sequences of the population 112.
Optionally, in 420 (which may be carried out prior to sequencing of molecules of the population 112), the molecules of the certain population 112 may be distinguished (e.g. separated and/or identified), from molecules of other populations, if such exist. In case there is only one population/data-block, this operation is trivial, as shown in optional 422. Alternatively or additionally, in case the data storage system 100 is configured such that the molecules of the certain population 112 reside separately from other populations, location based sequencing 424 of the molecules may be performed only at the region of the population 112 thereby not sequencing (distinguishing from) molecules of other populations. Yet, alternatively or additionally, in case the molecules of the certain population 112 include population identification segments ID-SEG uniquely identifying the certain population 112, specific binding to these population identification segments 426 may be carried out in order to distinguish (exclusively extract) molecular strands/sequences of the certain population 112, for further sequencing. In this regard, it should be noted, and also as indicated above, that optionally the difference between the population identification segments that are indicative-of/associated-with different populations is made sufficiently large, such that the binding is substantially exclusive to population identification segments of the certain population 112, and the same binding molecule designed for identification segments ID-SEG of the certain population 112 does not happen to bind, by chance, to a somewhat different identification segment ID-SEG of other populations (since the “distance”/difference between such identification segments is sufficiently large).
In 430, sequencing of N fold nominal sequencing depth is performed to the molecular strands/sequences of the data storage system 100, or just to the molecular strands/sequences of the specific/certain population 112 (depending on implementation and/or possibly on whether 420 was performed, or 440 is to be performed). To this end, optional 440 is performed in cases where there is more than one population 112 (more than one data block 110) in the data storage system 100 and 420 was not/could not be performed in order to distinguish the population 112 of the respective data block of interest, e.g. 110.1, from other populations 112 (from other data-blocks). In this case 440 may be carried out to identify from the sequencing results, the sequenced molecules whose respective identification segments (if such exist) match the identification segment ID-SEG of molecules of the certain population of interest 112.
Thus, finally after 430 is performed, and optionally also 420 and/or 440, sequencing results with sequencing depth N″ being about the nominal sequencing depth N are obtained for the molecules of the certain population of interest 112. The sequencing results include data indicative of the data storage segments DATA-SEG of the sequenced molecules of the population 112.
Accordingly, in 450, the data storage segments of the sequenced molecular strands/sequences PMs of the certain population 112 are processed to determine per each location k (of the K locations in the data storage segments DATA-SEG of the sequenced molecular strands/sequences), an observed probability vector Xk indicative of observed relative amounts of the Z types of building-blocks {En} in the location k. This may be achieved by counting, per location k, how many times each of the Z building-block types appears in that location in the number N″ of sequenced molecules of the population 112 (e.g. and then normalizing by division by N″ to get a probability value).
In this regard it should be noted that observed probability vectors are not obtained for ID segments, since the ID segments do not encode letters of the alphabet Σ (which includes composite letters), but, on the contrary, in each location in the ID segments of all the molecular strands/sequences of the population, the same building-block type should exist as defined by the id of the population (a molecule with different monomer types is not associated with the same population).
In 460 the letters in the read sequence S″ are inferred by associating each observed probability vector Xk with an inferred letter σ″k of an alphabet Σ≡{σm}|m=1 to M. The inferred sequence S″ of alphabet letters read from the molecular data storage is thus determined as follows: S″=(σ″1, σ″2, . . . , σ″k . . . , σ″K-1, σ″K). The inferred sequence S″ should generally correspond to the encoded sequence S′=(π1, π2, . . . , πk . . . , πK-1, πK) in the certain population 112, up to errors which may be associated with the sequencing errors sErr and inference/mapping errors iErr indicated above.
As indicated in the figure, optionally 461 is conducted for mapping each observed probability vector Xk to the respective inferred alphabet letter σ″k∈Σ, by determining the alphabet letter σ″k∈Σ that satisfies a minimum divergence D from the observed probability vector Xk: σ″k=ArgMin[{σm}m=1 to M|D (σm, Xk)]. As indicated above, optionally the divergence function used, D (σm, Xk), is an Lp distance function, such as Euclidean distance. Alternatively, D (σm, Xk) is KL divergence.
As indicated in optional 463, according to some embodiments of the present invention the alphabet Σ that is used for reading/inferring the encoded data may have a size M≡|Σ| that is greater than the number Z of the different building-block types (M>Z). This presents a significant advantage as it provides a higher numerical basis M of the data encoding (as compared to a numerical basis Z in case no composite letters are used), and thus higher data density may be encoded and read from the same population of molecules.
As indicated in optional 464, according to some embodiments of the present invention each letter σm of the alphabet Σ is defined by a probability vector σm≡{Pmn}|n=1 to Z indicative of relative amounts of the Z building-block types {En}|n=1 to Z. This actually provides/exemplifies a mechanism for defining an alphabet Σ of size M≡|Σ| that is greater than the number Z of the different types (M>Z). Accordingly this presents a significant advantage in terms of the high data density that may be encoded and read from the population of molecules 112.
To this end, as indicated in optional 466 the alphabet Σ may include at least one composite letter σm1, whose probability vector {Pm1n} includes two or more non-zero probabilities. Also, as indicated in optional 467, the alphabet Σ may include one or more simple letters σm2 whose probability vector {Pm2n} includes only one non-zero probability (i.e. indicating a non-zero probability for only a single building-block type). Accordingly, typically, the number of simple letters is equal to the number Z of different building-block types.
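The distinction between simple and composite letters, and the resulting alphabet size M>Z, may be illustrated by the following sketch. The construction used here is an assumption made for illustration, not necessarily the construction of any particular embodiment: the alphabet is taken to contain every probability vector over Z types whose entries are integer multiples of 1/R, R being a resolution parameter. The names resolution_alphabet, is_simple and is_composite are hypothetical.

```python
from itertools import product
from math import comb, log2


def resolution_alphabet(num_types, resolution):
    """All length-Z probability vectors with entries j/R (assumed construction)."""
    letters = []
    for numerators in product(range(resolution + 1), repeat=num_types):
        if sum(numerators) == resolution:
            letters.append(tuple(n / resolution for n in numerators))
    return letters


def is_simple(letter):
    """A simple letter has exactly one non-zero probability."""
    return sum(1 for p in letter if p > 0) == 1


def is_composite(letter):
    """A composite letter has two or more non-zero probabilities."""
    return sum(1 for p in letter if p > 0) >= 2


if __name__ == "__main__":
    z, r = 4, 2
    alphabet = resolution_alphabet(z, r)
    simple = [s for s in alphabet if is_simple(s)]
    composite = [s for s in alphabet if is_composite(s)]
    print(len(alphabet), len(simple), len(composite))
    print(log2(len(alphabet)), "bits per location versus", log2(z))
```

With Z=4 and R=2 this assumed construction gives M=10 letters, of which 4 are simple and 6 are composite, so that each location can carry log2(10) bits (before any error correction overhead) rather than log2(4)=2 bits.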
Reference is now made to
According to some embodiments, the molecular data storage fabrication system 700 includes module 710 including at least L building-block containers, whereby the number L of containers is greater than the number Z of building-block types (e.g. monomers/oligos), which are used for fabricating the molecular strands/sequences PMs of the molecular data storage system 100. The molecular data storage fabrication system 700 also includes a molecular strand/sequence fabrication head 720 that is fluidly connected to the L building-block containers 710. The fabrication head 720 is configured and operable for selectable and controllable deposition of a volume of building-blocks, which are contained in a selected one of the L building-block containers. In this sense, the fabrication head 720 may be configured and operable as a monomer/building-block printing jet head capable of injecting building-blocks from a selected container according to instructions provided to the fabrication head 720 from the fabrication control system/unit 730, which is also a part of the system 700.
According to some embodiments of the present invention, Z containers, 712, out of the L building-block containers, which are marked in the figure by CNR-1 to CNR-Z are adapted for separately containing different ones of the Z different types of building-blocks {En}|n=1 to Z. The remaining L-Z container(s), 714, which are marked in the figure CNR-MX and optionally also an additional one or more containers up to CNR-MXn, are monomer mixture containers, adapted for containing one or more different mixtures, each composed of a mixture of two or more of the Z types of basic molecular building-blocks.
According to some embodiments of the present invention the fabrication control unit 730 is configured and operable to operate the fabrication head for fabricating the molecular data storage system 100. To this end, the fabrication control unit 730 may include a Data Block Provider 734 configured and operable for receiving/providing at least one block of data (sequence S) that is to be encoded in the molecular data storage system 100. According to some embodiments of the invention, the data of the data block is encoded by “printing”/synthesizing a population of molecular strands/sequences at a region designated for the data block, on a support substrate/plate 750.
The fabrication control unit 730 may also include an alphabet Data Provider 732 which is adapted to provide (e.g. receive and/or retrieve from a reference data storage, e.g. local or remote memory) data indicative of an alphabet Σ, which is to be used for encoding the block of data on the designated location of the support substrate 750. As indicated above, according to some embodiments of the present invention, the alphabet Σ is of size greater than the number Z of different building-block types: |Σ|≡M>Z. To this end, each of the letters {σm}|m=1 to M in the alphabet Σ may be defined by a respective probability vector σm={Pmn}|n=1 to Z that is indicative of expected probabilities {Pmn} that basic building blocks (monomers/oligos) of respective types {En} are synthesized at a designated location (k) along the molecular strings/strands/sequences of the population, at which the letter σm is encoded.
Accordingly, the numerical basis for encoding the block of data (sequence S) provided by 734 is the size M of the alphabet Σ provided by 732. To this end, the fabrication control unit 730 may also include a Data-Block Coder 736 adapted to process the received block of data (sequence S) to present it as a sequence of letters of the alphabet Σ with the numerical basis M>Z. To this end the block of data is coded by a sequence S of letters {σk}|k=1 to K belonging to the alphabet.
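One possible way of presenting a block of data in the numerical basis M is sketched below, purely as an illustration. It treats the block as one large integer and writes it in base M, each digit being the index of a letter of the alphabet Σ; actual embodiments may instead use a different coding (e.g. including error correction code). The names encode_block and decode_block are hypothetical, and the byte length of the block is assumed to be known to the reader so that leading zero bytes can be restored.

```python
def encode_block(data, base):
    """Present a block of bytes as a sequence of letter indices in base M."""
    if base < 2:
        raise ValueError("the alphabet must contain at least two letters")
    value = int.from_bytes(data, "big")
    digits = []
    while value:
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits[::-1] or [0]


def decode_block(digits, base, length):
    """Recover the original bytes; `length` is the known block size in bytes."""
    value = 0
    for digit in digits:
        if not 0 <= digit < base:
            raise ValueError("digit outside the alphabet")
        value = value * base + digit
    return value.to_bytes(length, "big")


if __name__ == "__main__":
    block = b"DNA"
    letters = encode_block(block, 10)   # hypothetical alphabet of size M = 10
    print(letters)
    print(decode_block(letters, 10, len(block)))
```

With M=10 the three-byte block in the example is presented as seven letter indices, whereas a basis of Z=4 (simple letters only) would require twelve locations for the same three bytes.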
The fabrication control unit 730 includes a Synthesizing Controller 738 adapted for synthesizing a population 112 of molecular strands/sequences encoding the data block at the designated region on the support substrate 750. Synthesizing Controller 738 is configured for preparing operational instructions of operating the fabrication head 720, to sequentially deposit volumes/amounts of building-block types/mixtures from the containers 710, whereby the sequence of deposited building-block types/mixtures corresponds to the sequence of letters {πk}|k=1 to K in blocks of data. Simple letters are synthesized by depositing a volume of a respective monomer type obtained from the Z containers; and composite letters are synthesized by depositing a volume of a mixture of building-block types with concentrations matching the probability vector of the composite letter, obtained from one of said one or more mixture containers.
Thus, as indicated above, the alphabet Σ may include up to Z simple letters whose probability vectors {Pn}m include only one probability having non-zero expected value, and one or more composite letters whose probability vectors include two or more probabilities having non-zero expected value. The Z containers, 712, which are marked in the figure by CNR-1 to CNR-Z, are each adapted for storing/containing basic molecular building-blocks of a single type. Accordingly, for fabricating a simple letter σm, the fabrication head 720 draws the respective type of building-blocks (the only one having non-zero probability in the probability vector {Pn}m of the simple letter) from the corresponding one of the Z containers 712 in which the respective building-block type is contained. For fabricating a composite letter σm, the fabrication head 720 draws a respective mixture of the types of building-blocks whose probabilities are non-zero in the probability vector {Pn}m of the composite letter from one of the L-Z containers, 714, whereby the concentrations/amounts {Cn} of the different types {En} of building-blocks in the mixture correspond to the probability vector {Pn}m of the letter. As indicated above/below, such mixtures (e.g. mixtures corresponding to each composite letter) may be a priori prepared and contained in one of the building-block mixture containers (e.g. L-Z mixture containers CNR-MX to CNR-MXn may be included, one for carrying a building-block mixture per each composite letter in the alphabet Σ). To this end, in such embodiments, the Synthesizing Controller 738 may be adapted for operating the fabrication head 720 for drawing the building-block types/mixtures of the respective simple/composite letters from the respective containers according to the respective letters that need to be synthesized.
Alternatively or additionally, the corresponding mixture for the composite letter σm may be prepared “on the fly”, e.g. on demand at the time each composite letter should be printed/synthesized. In this case as few as only a single mixture container CNR-MX may be included in 714, and the system may include a building-block mixer 715 (also referred to hereinafter as mixer), that is fluidly connected to the Z containers 712 of the respective building-block types, and adapted for drawing/mixing controlled amounts of the Z building-block types from the Z containers 712 for preparing in the mixture container CNR-MX, a controlled mixture of the different types {En} of building-blocks with respective concentrations/amounts {Cn} corresponding to the probability vector {Pn}m of the composite letter σm that should be encoded. To this end, in such embodiments, the Synthesizing Controller 738 may be adapted for operating the mixer 715 for preparing, on demand, different mixtures of basic molecular building-blocks, which are associated with different respective composite letters that need to be synthesized. Also, the controller 738 may be adapted for operating the fabrication head 720 for synthesizing a simple letter by drawing the corresponding building-block type from the respective one of the Z containers 712 and depositing it on the respective location in the substrate 750, and synthesizing a composite letter by drawing the corresponding mixture prepared in the mixture container CNR-MX, or from another one of the mixture containers 714 if such are included in the system, and depositing it on the respective location in the substrate 750.
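The selection between the single-type containers, a priori prepared mixture containers, and on-demand mixing may be sketched as a simple decision per letter. This is a schematic illustration only: the instruction names "draw" and "mix_then_draw", the function names deposition_step and synthesis_plan, and the container label "CNR-MX2" are hypothetical and do not represent an actual control interface of the fabrication head 720 or the mixer 715.

```python
def deposition_step(letter, premixed=None):
    """Return an (action, container) pair for one letter's probability vector.

    `premixed` optionally maps probability vectors of composite letters to
    the labels of containers holding a priori prepared mixtures.
    """
    premixed = premixed or {}
    nonzero = [n for n, p in enumerate(letter) if p > 0]
    if not nonzero:
        raise ValueError("a letter needs at least one non-zero probability")
    if len(nonzero) == 1:
        # Simple letter: draw the single type from its own container CNR-n.
        return ("draw", "CNR-%d" % (nonzero[0] + 1))
    key = tuple(letter)
    if key in premixed:
        # Composite letter with an a priori prepared mixture container.
        return ("draw", premixed[key])
    # Composite letter prepared on demand by the mixer in container CNR-MX.
    return ("mix_then_draw", "CNR-MX")


def synthesis_plan(sequence, premixed=None):
    """One deposition step per letter of the sequence, in order."""
    return [deposition_step(letter, premixed) for letter in sequence]


if __name__ == "__main__":
    sequence = [(1, 0, 0, 0), (0.5, 0.5, 0, 0), (0, 0, 0.25, 0.75)]
    premixed = {(0.5, 0.5, 0, 0): "CNR-MX2"}  # hypothetical container label
    for step in synthesis_plan(sequence, premixed):
        print(step)
```

In the example, the first letter is drawn from CNR-1, the second from the hypothetical premixed container, and the third, for which no premixed container is listed, is mixed on demand.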
As may be appreciated by those versed in the art, the fabrication head 720 may be configured similar to conventional molecular strands/sequences fabrication heads used for controlled synthesis of molecular strands/sequences. For instance see [5]. Also, according to some embodiments of the present invention, the types of basic building blocks (monomers/oligos) contained in the containers 710 are “blocked” (i.e. capped/protected; e.g. such as described in [5]) from one end thereof, in order to prevent their binding to one another. Accordingly, in some embodiments of the present invention the system 700 (e.g. the fabrication head 720) is configured and operable for carrying out the following after each deposition, at the designated region, of a volume of basic building blocks corresponding to each of the letters of the sequence S:
Additionally, in some embodiments, the fabrication head 720 is configured and operable for depositing cleavable molecules at the designated region at which the population of the molecules should be synthesized. This is typically performed prior to the synthesizing. The system may also include a harvesting module 727 configured and operable for harvesting the population of molecules 112 from the designated region (e.g. by cleaving the cleavable molecules). The control unit may be adapted to operate the fabrication head 720 for depositing the cleavable molecules on the designated region of the substrate 750, prior to synthesis of the population of molecular strands/sequences. Then, synthesis of the population of molecular strands/sequences on the designated region such that they are bonded to the cleavable molecules is performed; then, after synthesis is completed, operating the harvesting module 727 for harvesting the population of molecules 112. Cleavage of molecules from surfaces that support the synthesis is described in the literature (Ref [5]—Leproust et al NAR 2010).
As indicated above, in some embodiments the molecular strands/sequences of the population should include similar identification segments (e.g. typically but not necessarily similar to all molecules of the population) whereby the identification segment includes an identifying sequence of the Z building-blocks/monomer types. Accordingly, the control unit 730 may be adapted for operating the fabrication head 720 for synthesizing the identification segment for all molecules of the population. This is achieved by drawing the building-block types from the Z building-block containers 712, while not utilizing the mixture containers 714 or the building-blocks mixture (since only simple letters should be included in the identification segment).
In some embodiments the molecular data storage fabrication system 700 is configured and operable for fabricating different populations corresponding to different data-blocks 110 at different respective regions of the substrate 750. To this end the system may include a fabrication head position actuator 725 connectable to the fabrication head 720. The control unit 730 may be adapted for operating the fabrication head position actuator 725 for actuating/moving the fabrication head 720 to various designated regions on the substrate 750 and operating the fabrication head 720 to fabricate at each region a population of molecules corresponding to one of the plurality of data blocks. This provides for synthesizing a plurality of populations of molecular strands/sequences encoding data of a plurality of respective data blocks, at different spatially separated respective regions of the substrate 750.
It should be noted that in some embodiments, e.g. where harvesting is not performed, the molecular storage system 100 may actually be support plate/substrate 750 with the one or more populations of molecules thereon that were synthesized at the different regions thereof. Each population is associated with a respective data-block. Alternatively or additionally, in some embodiments e.g. where harvesting is performed, the harvested populations may be placed in separate containers/containing-regions, or in a common container in case the molecules of each population can be exclusively identified by an ID segment included therein. In this case the molecular storage system 100 is actually implemented by the separate containers and/or the common container with the populations of molecules therein.
In various embodiments the molecular data storage fabrication system 700 may be configured and operable for implementing the method 600 illustrated in
According to various embodiments of the present invention, the method 600 includes the following: In 610, a support substrate/plate 750 is provided with one or more spatially separated regions at which one or more respective populations of molecular strands/sequences can be synthesized. The synthesizing may be performed with Z different types of basic building blocks (monomers/oligos).
In 620 one or more blocks of data, which are to be respectively encoded by one or more respective populations of molecular strands/sequences, are provided. As indicated above, the one or more respective populations of molecular strands/sequences are to be respectively synthesized at the one or more spatially separated regions of the support substrate/plate 750. Generally, the one or more blocks of data are coded by a sequence of letters {σk}|k=1 to K of an alphabet Σ≡{σm}|m=1 to M of size |Σ|=M, each letter σm of the alphabet Σ being defined by a probability vector {Pmn}|n=1 to Z. In this regard, it should be understood that, considering a certain predetermined inference error rate, the blocks of data may include an error correction code (such as Reed-Solomon codes), usable for correcting errors in the read data. For example, the inference rate may be 90% or even above (98% or 99%), and the error correction code in the data blocks themselves may be used after the data was inferred in order to correct residual data errors, which were not corrected/overcome by the distance function.
In 630, a population of molecular strands/sequences is synthesized per each block of data, at a respective region of the one or more regions of the support plate. The molecular strands/sequences of the population are synthesized with building-block strings formed with a number Z<M of different types of building-blocks {En}|n=1 to Z (whereby M is the number of letters in the alphabet, and is actually the numerical basis by which the block of data is encoded). To this end the synthesizing of the population of molecular strands/sequences at the respective region includes synthesizing the sequences of letters {σk}|k=1 to K corresponding to the data of the data-block. Synthesizing each letter may be carried out by depositing a composition of building-blocks {En}|n=1 to Z of the Z different types with relative concentrations {C(En)}|n=1 to Z corresponding to the probability vector {Pk,n}|n=1 to Z of the respective letter σk.
To this end, the depositing may optionally include:
632—Providing a volume of a composition of building-blocks with said relative concentrations. Optionally, the building-blocks provided in the composition, are “blocked” from one end to prevent their binding to one another. Optionally, the volume of the composition of building-blocks is acquired from a pre-prepared mixture having the desired relative concentrations corresponding to the letter σk which is to be synthesized. Alternatively, the volume with the desired concentrations is prepared in-situ (e.g. in real time per each synthesized letter σk).
634—Depositing/placing the composition of building-blocks at the respective region to thereby enable binding at least some of the building-blocks in the composition to molecules at that respective region. Optionally, after the depositing, the respective region is washed to remove un-bonded building-blocks of the deposited composition.
Then, optionally, an un-blocking treatment may be applied to the deposited building-blocks that are bound to the molecules at the respective region, in order to “un-block” those building-blocks so that they can bind to other building-blocks that are to be deposited when synthesizing the successive letter.
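The effect of depositing, per letter, a composition whose relative concentrations follow the letter's probability vector can be illustrated by a small simulation. The following is a minimal sketch only, under stated assumptions: Z=4 DNA bases, each strand in the region independently binds one building-block per letter with probability equal to its relative concentration, and synthesis errors, blocking chemistry and washing are ignored. The names `BASES` and `synthesize_population` are hypothetical and are not part of the disclosed system.

```python
import random
from collections import Counter

BASES = "ACGT"  # assumption: Z = 4 building-block types (DNA bases)

def synthesize_population(letters, n_molecules, seed=0):
    """Simulate a population of strands for a sequence of composite letters.

    letters: list of probability vectors (one weight per base in BASES).
    Each strand draws one base per position, weighted by the letter's vector.
    """
    rng = random.Random(seed)
    strands = []
    for _ in range(n_molecules):
        strand = "".join(
            rng.choices(BASES, weights=vector, k=1)[0] for vector in letters
        )
        strands.append(strand)
    return strands

# Three letters: a simple letter (pure A) and two composite letters.
letters = [(1, 0, 0, 0), (0, 0.6, 0.4, 0), (0.5, 0, 0, 0.5)]
population = synthesize_population(letters, n_molecules=1000)

# Base frequencies observed at the second position approximate (0, 0.6, 0.4, 0).
second_position = Counter(strand[1] for strand in population)
print({base: second_position[base] / len(population) for base in BASES})
```

In this model the information of a composite letter is carried by the population as a whole, not by any single strand, which is why the same sequence is synthesized over many molecules at the region.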
According to some embodiments, the method further includes 640 for synthesizing, in the molecular strands/sequences of the population, a population identification segment indicative of the population and including an identifying sequence of the Z types of building-blocks. To this end, in some examples a difference between identifying sequences that are used in population identification segments of different respective populations may exceed a certain threshold of edit distance (e.g. edit distance of 2, 3 or higher—lower edit distance may be used for more accurate synthesis).
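The edit-distance requirement on identifying sequences can be checked programmatically. The sketch below is illustrative: it uses the standard Levenshtein distance (insertions, deletions and substitutions), and treats the threshold as a minimum pairwise distance, which is one possible reading of the text. The function names and the example identifiers are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def ids_are_separated(ids, min_distance):
    """True if every pair of identifying sequences is at least min_distance apart."""
    return all(edit_distance(a, b) >= min_distance for a, b in combinations(ids, 2))

# Hypothetical identifying sequences for three populations.
population_ids = ["AACCGG", "GGTTAA", "CTCTCT"]
print(ids_are_separated(population_ids, min_distance=3))
```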
Also, optionally, according to some embodiments, the molecular strands/sequences are bonded to cleavable molecules that were a-priori residing at the respective region. Accordingly, in optional 650 the population of molecules may be harvested by cleaving the cleavable molecules. To this end, in some cases the support plate 750 includes cleavable molecules adapted to bind with said building-blocks, such that building-blocks of the composition, which are first deposited on said region, are bound to the cleavable molecules.
Section A in the figure shows a conventional DNA based storage scheme. The binary message is encoded to DNA by mapping every 2 bits (depicted by the red separating lines) to a DNA base or synthesized position (i), the designed DNA sequence is then synthesized and sequenced (e.g. typically by a noisy procedure that introduces some errors) (ii). The sequencing output is then used to infer the DNA composition at every position (iii). Decoding of the original message is done assuming the use of an error correcting code over the binary message (iv).
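The conventional 2-bits-per-base mapping of step (i) can be sketched as follows. The specific assignment of bit pairs to bases (00→A, 01→C, 10→G, 11→T) is an assumption made for illustration; the text only states that every 2 bits map to one DNA base. Error correction (step iv) is omitted.

```python
# Assumed bit-pair to base assignment (illustrative, not taken from the source).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Map every 2 bits of the message to one DNA base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(sequence: str) -> bytes:
    """Inverse mapping: 4 bases back to one byte."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

dna = encode(b"DNA")
print(dna, decode(dna))
```

With this scheme each synthesized position carries at most 2 bits, which is the baseline against which the composite alphabet is compared.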
Section B in the figure shows the same message encoded using a composite DNA alphabet of resolution R=10. Accordingly, mapping is carried out every 8 bits (depicted by the blue separating lines) of the binary message, to a single composite DNA position/letter. Using sufficiently deep sequencing (e.g. of N=50 or N=100, or even lower N=10) allows correct identification of the original composite letters (the position marked by an asterisk is exemplified in section C in the figure) and decoding of the message, also including an error correction mechanism.
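The figure of 8 bits per composite position at resolution 10 can be checked by counting letters. The sketch below assumes that an alphabet of resolution k over four bases consists of all base ratios that are integer multiples of 1/k and sum to 1; this definition is an assumption here, and the function names are hypothetical. Under that assumption the alphabet size is the number of ways to split k units among 4 bases, C(k+3, 3).

```python
from math import comb, floor, log2

def composite_alphabet(resolution, n_bases=4):
    """All probability vectors whose entries are multiples of 1/resolution."""
    def compositions(total, parts):
        # Yield all tuples of `parts` non-negative integers summing to `total`.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest
    return [
        tuple(count / resolution for count in counts)
        for counts in compositions(resolution, n_bases)
    ]

def bits_per_letter(resolution, n_bases=4):
    """Whole bits that can be mapped to one letter of the alphabet."""
    size = comb(resolution + n_bases - 1, n_bases - 1)
    return floor(log2(size))

alphabet = composite_alphabet(10)
print(len(alphabet), bits_per_letter(10))
```

For resolution 10 this gives 286 letters, enough to map 8 whole bits (256 values) to each position, compared with 2 bits for the four standard bases (resolution 1).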
Section C in the figure exemplifies an inference step of a given DNA position. The observed frequencies/concentrations of the nucleotides are used to infer the source/original letter, σ=(0,0.6,0.4,0), as the closest composite letter, based on the KL divergence.
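The inference step can be sketched as a nearest-letter search under the KL divergence. This is an illustrative sketch, not the disclosed implementation: the direction of the divergence (observed frequencies relative to the candidate letter) and the small smoothing constant `eps`, added so that an occasional erroneous read of a base absent from the letter does not yield an infinite divergence, are both assumptions. The candidate list and read counts are made-up example values.

```python
from math import log

def kl_divergence(observed, letter, eps=1e-3):
    """D(observed || letter), with the letter smoothed by eps (assumption)."""
    smoothed = [(q + eps) / (1 + eps * len(letter)) for q in letter]
    return sum(p * log(p / q) for p, q in zip(observed, smoothed) if p > 0)

def infer_letter(counts, alphabet):
    """Return the composite letter closest to the observed base counts."""
    total = sum(counts)
    observed = [c / total for c in counts]
    return min(alphabet, key=lambda letter: kl_divergence(observed, letter))

# Hypothetical candidate letters, as (A, C, G, T) ratios.
candidates = [
    (0, 1, 0, 0),
    (0, 0.7, 0.3, 0),
    (0, 0.6, 0.4, 0),
    (0, 0.5, 0.5, 0),
]

# Example read counts at one position (N=50), including one erroneous A read.
counts = (1, 30, 19, 0)
print(infer_letter(counts, candidates))
```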
The feasibility of the composite DNA letters was demonstrated by fabricating, by the inventors, a complete molecular/polymeric data storage system (DNA based in this implementation) encoding a message of 38 bytes using four composite alphabets of different resolutions. The message was encoded with information/data densities of about 4.3 bits per synthesized position. The composite DNA sequences were concatenated to flanking standard DNA sequences (not composite) containing a barcode (constituting a data segment), a unique molecular identifier (UMI) region (constituting an identification segment) and PCR templates used for constructing Illumina sequencing adapters. The designed DNA oligos (of length 99 bases) were synthesized using commercial technology (IDT, Leuven). The synthesized DNA was amplified using PCR, pooled together and sequenced using Illumina Mi-Seq. The reads were then analyzed to decode the original message.
Then the minimal sequencing depth required to correctly decode the message, for each one of the four composite alphabets, was examined. As expected, extending the alphabet by using higher resolutions requires deeper sequencing. In all four alphabets that were tested, a fully successful decoding was observed with sequencing depths as small as N=100 (while a near-perfect decoding was obtained with even smaller sequencing depths, as little as N=50).
The inventors have encoded a short input message (“DNA STORAGE ROCKS!”) using an encoding pipeline such as that disclosed in Method 200 described above, and more specifically utilized the encoding pipeline including the following steps:
The populations of encoded composite DNA sequences, for each of the above four composite alphabet configurations, were inserted into a synthetic construct containing amplification primer templates, a unique molecular identifier (UMI) and a barcode to obtain a total oligo length of 99 bases. The four designed oligonucleotides were then commercially synthesized, amplified using PCR primers from the Illumina small RNA sequencing kit, and sequenced using an Illumina Mi-Seq.
Sequencing was performed to read the encoded data. 5,421,556 50 bp paired-end reads were obtained for the four different samples. The read pairs were merged to generate 4,855,676 reads, 95% of which had a designed length of 52 bases. Then the reads were split into four different samples using the barcode value (ID-SEG value), yielding about 25% of the reads per each sample alphabet.
Next, the original message was decoded using a decoding pipeline such as that described with reference to method 400 above. More specifically in this case the message decoding/reading included/consisted of the following steps:
For each alphabet sample, the ability to decode the entire message (including the repetition introduced to equalize oligo length), and also only the first occurrence of the original encoded message text, was tested. To test for the required sequencing depth for each sample alphabet representing a specific resolution, different numbers of reads were sub-sampled from the total set of sequenced reads, and the decoding process was repeated for each such sub-sample. The sampling process was repeated 100 times for each sampling rate, and the inference rates and the overall decoding outcome for each sample were recorded.
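The sub-sampling experiment can be sketched on simulated reads. The code below is an illustration under assumptions, not the inventors' pipeline: reads are simulated without sequencing errors, letters are inferred by a smoothed KL divergence as a stand-in for the distance function, and the small alphabet, the pool size and all names are hypothetical. For each depth it draws 100 random sub-samples and records the average fraction of correctly inferred positions.

```python
import random
from math import log

BASES = "ACGT"

def kl(observed, letter, eps=1e-3):
    """Smoothed D(observed || letter); eps is an assumed smoothing constant."""
    smoothed = [(q + eps) / (1 + eps * len(letter)) for q in letter]
    return sum(p * log(p / q) for p, q in zip(observed, smoothed) if p > 0)

def infer(reads, alphabet):
    """Infer one letter per position from a list of equal-length reads."""
    inferred = []
    for pos in range(len(reads[0])):
        column = [read[pos] for read in reads]
        freqs = [column.count(base) / len(column) for base in BASES]
        inferred.append(min(alphabet, key=lambda letter: kl(freqs, letter)))
    return inferred

def inference_rates(pool, truth, alphabet, depths, trials=100, seed=0):
    """Mean fraction of correctly inferred positions for each sampled depth."""
    rng = random.Random(seed)
    results = {}
    for depth in depths:
        rates = []
        for _ in range(trials):
            sample = rng.sample(pool, depth)  # sub-sample without replacement
            guess = infer(sample, alphabet)
            rates.append(sum(g == t for g, t in zip(guess, truth)) / len(truth))
        results[depth] = sum(rates) / trials
    return results

# Hypothetical alphabet: four pure letters and four 50/50 mixtures.
alphabet = [
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
    (0.5, 0.5, 0, 0), (0, 0.5, 0.5, 0), (0, 0, 0.5, 0.5), (0.5, 0, 0, 0.5),
]
truth = [alphabet[0], alphabet[4], alphabet[5], alphabet[3]]

# Simulated error-free read pool for the designed sequence.
pool_rng = random.Random(1)
pool = [
    "".join(pool_rng.choices(BASES, weights=letter, k=1)[0] for letter in truth)
    for _ in range(2000)
]

rates = inference_rates(pool, truth, alphabet, depths=[5, 20, 100])
print(rates)
```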
The inventors designed a synthetic composite DNA oligo using the same overall design with the following alterations:
The 145 composite bases consisted of all the possible pairs of composite letters. This oligo design was constructed as a de Bruijn sequence using the following methodology. A balanced circular de Bruijn sequence over an alphabet of 12 letters composed of the eleven composite letters (15 IUPAC letters minus the four standard bases) plus one extra letter was constructed. The occurrences of the extra letter were then replaced by the standard DNA bases in a cyclic manner.
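A construction of this kind can be sketched with the standard recursive algorithm for de Bruijn sequences (concatenation of Lyndon words). The sketch is illustrative and does not claim to reproduce the inventors' exact sequence or their balancing criterion: the letter order, the placeholder symbol `X` for the extra letter and the function name are assumptions. A circular sequence of order 2 over 12 letters has 144 positions; repeating the first letter at the end gives a linear sequence of 145 letters containing every ordered pair once.

```python
def de_bruijn(k, n):
    """Circular de Bruijn sequence of order n over the symbols 0..k-1."""
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence

# Eleven composite IUPAC letters plus a placeholder "X" for the extra letter.
LETTERS = list("RYSWKMBDHVN") + ["X"]

circular = [LETTERS[i] for i in de_bruijn(len(LETTERS), 2)]
linear = circular + circular[:1]  # 145 letters, every ordered pair appears once

# Replace occurrences of the extra letter by the standard bases, cyclically.
standard = "ACGT"
design, seen = [], 0
for letter in linear:
    if letter == "X":
        design.append(standard[seen % len(standard)])
        seen += 1
    else:
        design.append(letter)
design = "".join(design)
print(len(design), design)
```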
This 192 base oligo (de Bruijn+primers) was then synthesized, processed and sequenced using similar procedures to the above with the following differences:
As a result 1,086,991 150 bp paired-end reads were obtained. The read pairs were merged to generate 1,017,813 reads, 90% of which had the designed length of 145 bases. Then a similar pipeline to the one described above was used to calculate inference rates for each position in the sequence and to investigate the properties of the error rates.
The results are described in
The inventors altered the DNA fountain code [5] to support composite DNA sequences, creating what is referred to herein as a composite DNA fountain system:
To test the feasibility of the suggested composite DNA fountain system, the inventors encoded the same message file of 2,116,608 bytes used in [3] using composite DNA in resolutions k=2,4,6,8,10, and simulated reads of different depths for each of the composite resolutions, and examined the minimal depth required to successfully decode the original binary message.
Thus the present invention provides novel systems and methods introducing the use of a composite molecular/monomer alphabet (e.g. composite DNA) to leverage properties of molecular based data storage and attain higher density storage systems. Composite DNA/molecular alphabet schemes can be combined with other approaches to increase capacity and fidelity of molecular/DNA based storage systems. For instance, as will be appreciated by those versed in the art, without departing from the present invention, the composite alphabet scheme of the present invention can be combined with orthogonal base pair techniques such as disclosed in [7], efficient coding techniques such as disclosed in [3], [8], [9] and/or with random access approaches such as disclosed in [6], [9], [10].
Number | Date | Country |
---|---|---|
62674114 | May 2018 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/IL2019/050572 | May 2019 | US |
Child | 17101824 | | US |