The present application claims priority from Australian Provisional Patent Application No 2020903611 filed on 6 Oct. 2020, the contents of which are incorporated herein by reference in their entirety.
This disclosure relates to creating oligonucleotide sequences to represent digital data.
Counterfeiting and piracy have increased substantially over the last two decades, with counterfeit and pirated products found in almost every country across the globe and in virtually all sectors of the economy. Estimates of the levels of counterfeiting and the value of such products vary. However, the value of global trade in counterfeit and pirated products in 2013 was estimated at $461 billion (OECD and EUIPO, 2016, Trade in Counterfeit and Pirated Goods: Mapping the Economic Impact). For example, counterfeit drugs are responsible for one million deaths and cost the industry $200 billion each year. Recent studies estimate that 10% of drugs sold each year are counterfeit, a number that is anticipated to increase with the rise of online pharmacies and 3D-printed medicines. The rapidly expanding medicinal and recreational cannabis markets are also particularly exposed to counterfeiters who may produce compositionally similar but substandard products with basic equipment.
One way to address these challenges may be by labelling products with encoded DNA tags. However, this often requires raw signal data to be first base-called into DNA code, i.e. A, C, G, T. The conversion of raw signal data to base-called data is computationally expensive and not compatible with laptop and smart phone sequencing devices such as the Oxford Nanopore MinION or SmidgION.
A method for creating an oligonucleotide sequence to represent digital data comprises:
The electric sensor may comprise a nanopore.
The method may further comprise determining the first set by selecting the multiple oligonucleotide sequences from multiple candidate sequences.
Selecting the multiple oligonucleotide sequences from multiple candidate sequences may be based on a distance between a first candidate sequence and a second candidate sequence. Determining the first set may comprise calculating the distance between a first simulated electric time-domain signal from the first candidate sequence and a second simulated electric time-domain signal from the second candidate sequence. Calculating the distance may comprise calculating an error of matching the first simulated electric time-domain signal to the second simulated electric time-domain signal subject to a time domain transformation that minimises the error. Calculating the distance may be based on dynamic time warping or correlation optimised warping.
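The distance described above can be illustrated with a minimal dynamic time warping sketch. It assumes each simulated signal is a plain list of current levels and uses the absolute difference as the local error; the function name `dtw_distance` and the example values are illustrative only and are not taken from this disclosure.

```python
def dtw_distance(signal_a, signal_b):
    """Accumulated absolute error of the best monotonic time alignment."""
    inf = float("inf")
    n, m = len(signal_a), len(signal_b)
    # acc[i][j] holds the minimal error of aligning signal_a[:i] with signal_b[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local_error = abs(signal_a[i - 1] - signal_b[j - 1])
            # Allowed moves: advance in a only, in b only, or in both.
            acc[i][j] = local_error + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Illustrative signals: "stretched" dwells longer on the first level than "base".
base = [1.0, 2.0, 2.0, 3.0]
stretched = [1.0, 1.0, 2.0, 3.0]
reversed_shape = [3.0, 2.0, 1.0, 1.0]

same_shape_cost = dtw_distance(base, stretched)
different_shape_cost = dtw_distance(base, reversed_shape)
```

Because the warping absorbs differences in dwell time, two signals with the same sequence of levels but different timing score a low cost, while signals with a different shape do not.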
Determining the first set may comprise performing a Trellis search across different combinations of nucleotides.
The method may further comprise inserting a spacer sequence between each two of the multiple oligonucleotide sequences. The spacer sequence may be of sufficient length to generate, for a second oligonucleotide sequence from the first set, a predictable interference from the spacer sequence and not a preceding first oligonucleotide sequence.
The one or more nucleotides present in the electric sensor at any one point in time may comprise a number f of nucleotides present in the electric sensor at any one point in time, and the spacer sequence may be of length ks with f≤ks≤2f.
The spacer sequence may comprise one or more of:
The method may further comprise selecting the spacer sequence from a second set of spacer sequences comprising more than one spacer sequence to encode further digital data.
The method may further comprise repeating the method to create more than one oligonucleotide molecule comprising spacer sequences between oligonucleotide sequences, the spacer sequences being selected to create an index between the oligonucleotide molecules.
The method may further comprise repeating the method to create more than one oligonucleotide molecule comprising spacer sequences between oligonucleotide sequences, the spacer sequences being selected to obfuscate data encoded in the oligonucleotide molecules.
The method may further comprise decoding the digital data from the single oligonucleotide molecule. Decoding may comprise capturing an electrical time-domain signal indicative of an electric characteristic of one or more nucleotides present in an electric sensor at any one point in time as the single oligonucleotide molecule passes through the sensor; and identifying the multiple oligonucleotide sequences from the first set in the captured electrical time-domain signal.
Identifying the multiple oligonucleotide sequences from the first set may comprise matching the captured electrical time-domain signal against simulated electrical time-domain signals associated with the multiple oligonucleotide sequences in the first set.
Decoding may further comprise:
Decoding may be based on dynamic time warping or correlation optimised warping between each split and the multiple oligonucleotide sequences in the first set.
The method may further comprise synthesising the molecule; and adding the molecule to a product for verification of the product.
Verification of the product may comprise decoding the digital data from the molecule; and performing a cryptographic operation in relation to the digital data and verifying the product based on verification data.
Software, when executed by a computer, causes the computer to perform the above method.
A computer system for creating an oligonucleotide sequence to represent digital data comprises:
An oligonucleotide molecule represents digital data, wherein the molecule comprises multiple oligonucleotide sequences combined into the molecule, wherein the multiple oligonucleotide sequences are configured to generate an electric time-domain signal from one oligonucleotide sequence that is distinguishable from the electric time-domain signal from another oligonucleotide sequence, the electric time-domain signal being indicative of an electric characteristic of one or more nucleotides present in an electric sensor at any one point in time.
The multiple oligonucleotide sequences combined into the molecule include two or more of the sequences provided in one of the following sets of nucleotide sequences:
A kit for verifying a product's identity comprises one or more of the above oligonucleotide molecules.
A method for manufacturing an identifiable product comprises:
The method may further comprise:
A method of verifying a product's identity, the method comprising:
The method may further comprise determining a hash value of the decoded digital data, and comparing the hash value to a predetermined value for the product to verify the product's identity.
An identifiable product comprises:
The digital data may be associated with a first hash value and the first hash value allows comparing a second hash value of a result from decoding the digital data to the first hash value to verify the product's identity.
The product may further comprise a package containing the product, wherein the first hash value is incorporated onto the package.
In the above method, the above software, the above computer system, the above oligonucleotide molecule, the above kit, or the above identifiable product, the first set of multiple oligonucleotide sequences consists of:
Optional features disclosed in relation to one of the aspects of method, computer system, molecule, product, software and others, are equally optional features to the other aspects.
As set out above, there is a need for methods and systems against counterfeiting and piracy. One solution is to add oligonucleotides to products, components, constituents of mixtures etc. Information encoded into these oligonucleotides can be used to verify the producer of the product. More particularly, the producer generates digital data, such as a secret based on cryptographic algorithms including hash or encryption algorithms. The digital data is then encoded into an oligonucleotide sequence and a corresponding molecule is synthesised and added to the product. A customer, receiver or processor of the product can extract the molecule and decode the digital data encoded thereon. The customer, receiver or processor can then verify the product, such as by performing corresponding cryptographic algorithms and comparing the result to the decoded digital data.
In one example of addressing challenges to supply chain monitoring, an alphanumeric identifier may be encoded into a synthetic oligonucleotide using the approaches disclosed herein. Either the alphanumeric codeword, or the oligonucleotide sequence, or a combination of both, or a combination of both plus some padding text, may be passed through a cryptographic hash function that generates a hash value. Because hash functions are deterministic and computationally infeasible to reverse engineer, the alphanumeric hash value of the oligonucleotide may be displayed publicly on a package, for example, as a string of alphanumeric characters or as a data matrix or QR code. The encoded oligonucleotide is added to (mixed into or affixed to) a product or ingredient, thereby giving the product or ingredient a unique oligonucleotide ‘fingerprint’. The hash value representation of the oligonucleotide in the product or ingredient may be displayed on the product packaging, thereby creating an immutable link between the product and packaging.
This approach may also be used for multiple ingredients in a product, where the unique ingredient hash values are concatenated together and hashed again to form a binary tree of hashes (analogous to a blockchain). At the point where a final product is made or assembled, the final product batch hash value is a representation of all of the ingredient hash values in the final product. If desired, the batch hash value may then be hashed with a counter or time stamp to generate a unique hash value for individual packages from the same batch. The resulting unique package hash value may be considered analogous to a serial number, but with the security advantage that the package hash value (displayed as a QR or data matrix code) is immutably linked to ingredients in the product, rather than being an arbitrary number. The unpackaged product may be verified by recovering, sequencing, decoding, and hashing the oligonucleotide tags in the product, and either looking up product information associated with the resulting hash value/s in a database, or cross-validating the oligonucleotide derived hash value/s with the package hash value. Further examples can be found in PCT publication WO 2020/028955 entitled “SYSTEMS AND METHODS FOR IDENTIFYING A PRODUCTS IDENTITY”, which is incorporated herein by reference.
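The tree of ingredient hashes and the per-package hash can be sketched as follows. This is a minimal illustration that assumes SHA-256, pairwise concatenation of hexadecimal digests, an odd node carried up unchanged, and a `|` separator before the counter; none of these choices is prescribed by this disclosure, and the function names and tag sequences are hypothetical.

```python
import hashlib


def sha256_hex(text):
    """Hexadecimal SHA-256 digest of a text string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def batch_hash(ingredient_hashes):
    """Concatenate hashes pairwise and re-hash until one batch hash remains."""
    level = list(ingredient_hashes)
    if not level:
        raise ValueError("at least one ingredient hash is required")
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # Assumption: an unpaired hash is carried up to the next level unchanged.
            next_level.append(sha256_hex("".join(pair)) if len(pair) == 2 else pair[0])
        level = next_level
    return level[0]


def package_hash(batch, counter):
    """Hash the batch hash with a counter to obtain a per-package value."""
    return sha256_hex(f"{batch}|{counter}")


# Hypothetical ingredient tag sequences, for illustration only.
ingredient_tags = ["ACGTACGTAC", "TTGACCGTAA", "GGCATCAGTC"]
leaves = [sha256_hex(tag) for tag in ingredient_tags]
root = batch_hash(leaves)
first_package = package_hash(root, 1)
second_package = package_hash(root, 2)
```

Changing any one ingredient tag changes the batch hash, and each counter value yields a different package hash from the same batch.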
In one example, the hash argument may comprise a product code or manufacturing code or simply a random number that is not associated with any particular identifying functionality. A computer calculates a first hash value of the hash argument. The hash value is calculated by a hash function which can take a range of different forms depending on the security requirements of the overall system. For example, a hash value may be calculated by multiplicative hashing where the overall number of different sequences is limited and therefore collision is unlikely. In other examples, more sophisticated functions, such as MD5 or preferably, SHA-2 or SHA-3 can be used. Since these sophisticated functions are highly optimised, the computational burden is minimal and therefore, there is little downside to using a hash function that is more sophisticated than required by this particular application.
After, before, or during calculating the hash value, the oligonucleotide sequence is determined to encode the hash argument, that is, the plain text before hashing. The sequence is then used to synthesise a molecule using known techniques and added to the product. This may involve mixing the synthesised (chemical form) of the molecule into the product. The product may then pass through a supply chain to reach a recipient, such as the end customer or an intermediate manufacturer or quality control agent.
It is now desired that the recipient can verify the identity of the product. Therefore, the recipient sequences a second oligonucleotide sequence from the product, where it is unknown whether that sequence is the same as the sequence of the molecule added by the original (or ‘upstream’) manufacturer. To verify this, the recipient can decode the digital data encoded in the molecule, calculate a second hash value of the sequenced molecule, and compare the second hash value to the first hash value to verify the product's identity. If the second hash value is identical to the first hash value, the product's identity is verified. If the hashes are different, the product's identity is not verified.
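The comparison of the first and second hash values may be sketched as below. The example assumes SHA-256 and assumes the decoded digital data is available as bytes; the function names and the example payload are hypothetical.

```python
import hashlib
import hmac


def hash_of(decoded_data):
    """Hexadecimal SHA-256 digest of the decoded digital data (bytes)."""
    return hashlib.sha256(decoded_data).hexdigest()


def verify_product(decoded_data, first_hash):
    """Return True when the hash of the decoded data equals the published hash."""
    second_hash = hash_of(decoded_data)
    # compare_digest is a constant-time comparison from the standard library.
    return hmac.compare_digest(second_hash, first_hash)


# Hypothetical hash argument chosen by the upstream manufacturer.
published_hash = hash_of(b"LOT-0001")
genuine = verify_product(b"LOT-0001", published_hash)
counterfeit = verify_product(b"LOT-0002", published_hash)
```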
The hash value may also be calculated based on additional data that may be a product identifier, entity identifier of the handling entity at that point, shared secret, public key, time stamp, counter, or product-unique product identifier that is unique to that particular individual “instance” of the product. This additional data may either be concatenated with the oligonucleotide sequence before the hash is calculated or the hash of the oligonucleotide sequence may be concatenated with the additional information and another hash calculated on the result. The important aspect is that any minor change in the additional data leads to a completely different hash and it is practically impossible to change the additional data such that the hash stays the same or to determine the additional data from the hash alone.
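The two ways of combining the additional data with the oligonucleotide sequence may be sketched as follows, assuming SHA-256 and a `|` separator. Both the separator and the function names are assumptions made for illustration.

```python
import hashlib


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_of_concatenation(sequence, additional_data):
    """Option 1: concatenate the sequence and additional data, then hash once."""
    return sha256_hex(sequence + "|" + additional_data)


def hash_of_hash_and_data(sequence, additional_data):
    """Option 2: hash the sequence, append the additional data, then hash again."""
    return sha256_hex(sha256_hex(sequence) + "|" + additional_data)


# Hypothetical sequence and time stamps, for illustration only.
tag = "ACGTTGCAAGCT"
option_1 = hash_of_concatenation(tag, "2020-10-06T00:00:00Z")
option_1_changed = hash_of_concatenation(tag, "2020-10-06T00:00:01Z")
option_2 = hash_of_hash_and_data(tag, "2020-10-06T00:00:00Z")
```

In the second option the party adding the additional data only needs the hash of the sequence, not the sequence itself.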
A package identification technology (PI) is any technology that is displayed on a package for the purpose of identifying a product. Package identification technologies may include, but are not limited to: inks, dyes, holograms, bar codes, QR codes, RFID, silicon dioxide encoded particles, product spectral image data, and IoT devices. The PI may display a hash value at any node of a manufacturing process or supply chain.
The use of hashing functions permits a safe and secure link between the molecule tags in the product, and the product packaging.
Palm oil. Palm oil is used in a wide range of products including food products, cosmetics, cleaning products and pharmaceuticals. Palm oil production is also linked to deforestation, biodiversity loss and poor work conditions. The disclosed technology may be integrated with existing certification schemes (e.g. RSPO) so that the origin of palm oil can be traced back to a sustainably certified manufacturer from the end product alone.
Pharmaceuticals. Counterfeit pharmaceuticals are responsible for one million deaths and cost the industry $100B each year. Incidents of drug counterfeiting are increasing with the rise of online pharmacies. Additionally, in many developing and transition economies, medications are sold as unpackaged individual tablets or doses. The capacity to recover supply chain information from an individual tablet alone could address the massive human and economic cost of fake pharmaceuticals.
Cannabis products. The cosmetic and medicinal cannabis industry is highly exposed to counterfeiting from backyard and recreational growers. Fake products present serious concerns as the active compound content in cannabis (THC, CBD) may vary widely in plants that are grown under different conditions and across different plant strains. Fake medicinal products that have not been subjected to stringent quality control steps, and contain sub-therapeutic cannabinoid levels, may lack therapeutic efficacy. Additionally, in some countries such as the USA, products must be grown, manufactured, and sold within state boundaries for tax purposes. The ease with which products may cross state boundaries could result in the loss of billions of dollars in tax revenue. The disclosed invention offers a means to track material from the ‘plant to product’, as well as mark various mixing and quality control steps along the manufacturing/supply chain. This information can be recovered from the unpackaged end product alone, and thereby address the problems highlighted above.
Illicit drug precursors (e.g. methamphetamine). The disclosed technology may be used to trace back the chain of custody of products that are misused. For example, legal ingredients used as precursors for the manufacture of illicit drugs, such as methamphetamine, may be traced to the last legitimate node in a supply chain from a drug sample alone. This capability may be useful for pinpointing fraudulent or leaking nodes in a supply chain, and gathering intelligence on how narcotics networks operate.
Kosher and Halal. Kosher and Halal products cannot be identified by the end product alone (there is no test for Kosher and Halal). The disclosed technology may be used to verify and track products from certified Kosher and Halal producers, and thereby address widespread counterfeiting problems in the industry.
Milk products. Counterfeit milk products are frequently detected in Asian markets, and have resulted in the hospitalisation of more than 50,000 infants from melamine poisoning since 2008. The capacity to recover and verify all supply chain information, from the milk product alone, could address this problem.
Ammunition. Recent advances in firearms technology have exacerbated the already difficult task of detecting illicit arms and ammunition transfers. In 2012, firearms were responsible for 41% of non-conflict homicides worldwide, with approximately 57% of these incidents remaining unsolved. In 2016, President Obama and the American Medical Association declared gun violence a public health concern, which is estimated to cost the US economy $229 billion each year, even more than the cost of obesity. The advent of modular, polymer, and 3D printed guns has also brought new challenges for firearms tracing and registration. The capacity to label and trace oligonucleotide tagged ammunition to the bullet entry wound has been demonstrated previously. The innovation disclosed offers a way to track and trace crime via labelled ammunition.
Other applications. The disclosed technology may be used to track and trace many other products including, but not limited to: wine, cosmetics, precious stones, chemicals, fertilizers, bank notes, casino chips, and luxury items.
It is noted that some examples herein relate to the use of DNA, but other types of oligonucleotide sequences, such as RNA or DNA/RNA hybrids with five different nucleotides or bases, can be used to represent digital data.
In Nanopore sequencing as in
The f bases inside the pore at a given time constitute the ‘state’ of the pore, and each state should produce a unique current level. Even the durations of these levels should be state-dependent. What makes basecalling much more difficult is that the level and duration of the current are affected by a number of factors other than the state, such as base stacking in the pore or the upstream functioning of the motor protein. The effects of these factors, and even the full set of factors that can have an effect, are not completely known. Thus, the current signal can sometimes look quite ‘random’, and the signals for a particular DNA string, measured using the same device but at different times, could look quite different from one another. This stochastic nature of signals presents a significant challenge to basecalling DNA or RNA using nanopore technology.
This disclosure provides a bypass of the basecaller, and operates directly on the ‘raw’ current signal measured by the Nanopore device, which is also referred to as a ‘soft decision decoding’ system. An additional advantage of such an approach is that the current signal, or the ‘soft data’, contains more information than the ‘hard’ output of a basecaller, which can be used to increase reliability.
Computer receives a time-domain electric signal from read-out electronics 103 and decodes digital information that has been encoded in the DNA string 120. In that sense, processor 111 executes program code installed on non-volatile program memory 112, which causes processor 111 to perform the methods disclosed herein, such as methods for decoding data or methods for encoding data, such as method 200 in
When method 200 is performed by processor 111, processor 111 selects 201 from a first set of multiple oligonucleotide sequences one oligonucleotide sequence for each of multiple parts of the data. That is, there is a set of sequences (later referred to as ‘symbols’) and symbols are selected to represent parts of the data. For example, a part of the data may be a byte with 8 bits or a part of different length. The multiple oligonucleotide sequences (‘symbols’) are configured to generate an electric time-domain signal from one oligonucleotide sequence that is distinguishable from the electric time-domain signal from another oligonucleotide sequence. For example, and as detailed below, the signals may have a maximum or above-threshold distance as calculated by dynamic time warping. As set out above, the electric time-domain signal is indicative of an electric characteristic of one or more nucleotides present in an electric sensor 101 at any one point in time.
Processor 111 combines 202 the one oligonucleotide sequence for each of multiple parts of the data, that is, the selected symbols, into a single oligonucleotide sequence that represents a single oligonucleotide molecule 120 to encode the digital data.
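The selecting and combining steps may be sketched as follows. The four 8-nt symbols and the spacer shown here are hypothetical placeholders and are not the sequences of this disclosure; the sketch also assumes the alphabet size is a power of two so that each symbol carries a whole number of bits, and that a spacer is placed between each two symbols.

```python
def encode(data, symbols, spacer=""):
    """Map each part of the data to one symbol and join the symbols with a spacer."""
    bits_per_symbol = (len(symbols) - 1).bit_length()
    if len(symbols) < 2 or 2 ** bits_per_symbol != len(symbols):
        raise ValueError("alphabet size must be a power of two, at least 2")
    bitstring = "".join(f"{byte:08b}" for byte in data)
    # Assumption: pad with zero bits so the data splits into whole parts.
    remainder = len(bitstring) % bits_per_symbol
    if remainder:
        bitstring += "0" * (bits_per_symbol - remainder)
    parts = [bitstring[i:i + bits_per_symbol]
             for i in range(0, len(bitstring), bits_per_symbol)]
    selected = [symbols[int(part, 2)] for part in parts]   # selecting step
    return spacer.join(selected)                            # combining step


# Hypothetical alphabet (2 bits per symbol) and spacer, for illustration only.
example_symbols = ["ACGTACGT", "CAGTCAGT", "GTCAGTCA", "TGACTGAC"]
example_spacer = "TTTTTTTT"
encoded = encode(b"\x1b", example_symbols, example_spacer)
```

The byte 0x1b is 00011011 in binary, so its four 2-bit parts select the four symbols in order.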
The method may then further comprise synthesising the molecule and adding it to a product. The digital data encoded into the molecule is calculated such that it, once decoded, can be used to verify the product.
Consider a system where data is encoded at the base-level, and a soft decoder is applied on the current signal measured. We denote the length of the DNA string after encoding with b bases. If f bases fit inside the pore at any one point in time, the current signal recorded may include up to b−f+1 different states. As the encoder is operating on bases, the decoder also requires base-level data. For a soft decoder, this means (b−f+1) probability vectors, one for each state. The i′th such vector would contain the probabilities of the i′th state being each possible set of f bases, or f-mer. Preferably, the decoder should be able to process these probability vectors and produce a reliable output.
This disclosure provides an alphabet for soft decision encoding. Each ‘letter’ of this alphabet AD of size |AD|, referred to as a ‘symbol’, is matched to a uniquely identifiable current signal di(t), which is produced by a short corresponding base sequence, Di. Information is represented using this ‘encoding’ alphabet, to which redundancy can also be added. For storing data, each letter is replaced with its short base sequence. Also, in-between each pair of such sequences, a short polynucleotide ‘spacer sequence’ Si is added from the alphabet AS of size |AS|. When the final sequence is synthesized and read by the Nanopore device, the current signal contains the signals from the encoding alphabet di(t), separated by the almost flat signals si(t) produced by the polynucleotide spacer sequences, or in some cases distinctive ‘spikey’ signals. In the examples given in this disclosure, a range of spacer sequences were tested. The decoder ‘extracted’ the signals from the alphabet and proceeded to decode information in the codeword. We refer to these extracted signals as signals ‘received’ by the decoder.
In decoding, each received signal is compared to all the reference signals in the alphabet of data symbols AD and spacers AS. Rather than using probabilistic approaches, the dynamic time warping (DTW) or correlation optimised warping (COW) cost between a reference signal and a received signal is used as the decoding metric. For each received signal, a vector of DTW costs is computed, and the decoder operates on these. The output of the decoder is a valid vector with the lowest overall DTW cost (computed as the sum of costs of each received signal). It should be noted that the encoding-decoding system here has no knowledge of bases; it only uses an alphabet composed of different current signatures di(t) and si(t).
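A minimal sketch of this decoding metric is given below. It assumes signals are lists of current levels, uses an absolute-difference DTW cost, and assumes the set of valid codewords is small enough to enumerate; the reference signals, symbol names and codeword list are hypothetical.

```python
def dtw_cost(received, reference):
    """DTW cost with absolute-difference local error."""
    inf = float("inf")
    n, m = len(received), len(reference)
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(received[i - 1] - reference[j - 1])
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def cost_vectors(received_signals, references):
    """One vector of DTW costs per received signal, keyed by symbol name."""
    return [{name: dtw_cost(signal, ref) for name, ref in references.items()}
            for signal in received_signals]


def decode(received_signals, references, valid_codewords):
    """Return the valid codeword with the lowest summed DTW cost."""
    costs = cost_vectors(received_signals, references)
    candidates = [cw for cw in valid_codewords if len(cw) == len(received_signals)]
    return min(candidates, key=lambda cw: sum(costs[i][s] for i, s in enumerate(cw)))


# Hypothetical reference signatures and noisy received signals.
references = {"D0": [1.0, 2.0, 3.0, 2.0], "D1": [3.0, 1.0, 3.0, 1.0]}
received = [[1.1, 2.0, 2.1, 2.9, 2.1], [2.9, 1.2, 1.0, 3.1, 0.9]]
valid = [("D0", "D0"), ("D0", "D1"), ("D1", "D1")]
decoded = decode(received, references, valid)
```

The decoder never refers to bases; it only compares current signatures, which mirrors the point made above.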
Another concern in DNA data storage is the presence of the complementary strand. Single stranded sequences of DNA (ssDNA) that undergo amplification generate a complementary strand and become double-stranded DNA (dsDNA), and it is possible (about 50% of the time) that the current signal measured is for that strand. To circumvent this difficulty, this disclosure investigates multiple approaches:
In order to compute the reference signals for the short base sequences, we used the squiggle function available in ‘Scrappie’ (available from https://github.com/nanoporetech/scrappie). Using this software, it is possible to obtain an ‘average’ signal for any base sequence, which we call the ‘signature’ of the sequence. To compute the reference signals for the short base sequences, some ‘training’ is performed beforehand. In one methodology for doing this, DNA sequences containing symbol sequences from AD separated by spacer sequences from AS are synthesized and then read using a Nanopore device. A clustering algorithm is run on the set of raw current signals. To decide the DNA sequence of each resulting cluster, a basecaller is used. Sequences that match the majority of signals in the basecalled cluster are taken as the sequence of that cluster. Reference signals are computed by averaging all the signals in the cluster, using DTW Barycenter Averaging.
In the first iteration of the disclosed encoding system, we tested codewords that were simply constructed from a string of data symbols from the set AD as shown in
Data and spacer symbol selection is performed iteratively by evaluating simulated raw squiggle output, selecting candidate sequences, and generating and evaluating real output. When data alphabets AD and spacer alphabets AS are identified, machine learning algorithms may be applied to sequences assembled from the alphabets to aid decoding. Machine learning may be used for data decoding after spacer decoding, or it may be used for decoding both spacer and data symbols. In both cases, the neural network used for decoding should be trained with large amounts of ‘noisy’ data for which the underlying sequences/symbols are known. With the network trained sufficiently well, the raw signals generated when reading a DNA strand could be directly fed to it, and it would output the most likely sequence/symbol.
In some embodiments, it may be advantageous to perform tag decoding on spacer symbols S locally and on data symbols D locally, whilst in other embodiments it may be advantageous to perform tag decoding on S locally and on D remotely, and in still other embodiments it may be advantageous to perform tag decoding on both S and D remotely.
The alphabet is a set of symbols constructed from kD nucleotides (‘mers’). We also refer to such symbols as a letter or inner codeword. As described, in some embodiments, the ID tag is comprised of alternating letters (inner codewords) from the set AD and AS. Here, we disclose a methodology to select oligonucleotide inner codewords using dynamic time warping (DTW) cost as a metric, measured as either absolute distance or Euclidean distance. First, we constructed 5 sets of 500 random symbol sequences of length kD=8, 10, 12, 14 and 16 nucleotides, within the following constraints:
From the 500 candidate symbols, we selected alphabets of size |AD|=16, 64, 256 symbols using the absolute and Euclidean distance threshold metrics in DTW given in Table 1 and Table 2. Table 3 shows that kD symbol length selection is a trade-off between the code rate (bits nt⁻¹) and minimum absolute and Euclidean distance required for reliable decoding.
We disclose the following three approaches for picking the alphabet. For all cases symbol selection is performed iteratively by evaluating simulated raw squiggle output, selecting candidate sequences, and generating and evaluating real output.
This approach comprises computing pair-wise DTW cost between randomly generated k-mers, then picking a set where the minimum DTW cost is larger than some pre-defined threshold. Clustering algorithms, known to those skilled in the art, may also be applied to identify the best sets of symbols in terms of DTW or COW distance.
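One possible way to realise this selection is a greedy pass over the random candidates, sketched below. The per-base current levels in `TOY_LEVEL` are a hypothetical stand-in for simulated squiggles (the disclosure obtains signatures from Scrappie), and the threshold, alphabet size and greedy strategy are assumptions made for illustration.

```python
import random

# Hypothetical per-base levels; a real signature depends on the k-mer in the pore.
TOY_LEVEL = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}


def toy_signature(sequence):
    return [TOY_LEVEL[base] for base in sequence]


def dtw(a, b):
    inf = float("inf")
    acc = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    acc[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[len(a)][len(b)]


def pick_alphabet(candidates, threshold, size):
    """Keep a candidate only if its DTW cost to every chosen symbol exceeds threshold."""
    chosen = []
    for sequence in candidates:
        signature = toy_signature(sequence)
        if all(dtw(signature, toy_signature(other)) > threshold for other in chosen):
            chosen.append(sequence)
        if len(chosen) == size:
            break
    return chosen


rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = ["".join(rng.choice("ACGT") for _ in range(8)) for _ in range(200)]
alphabet = pick_alphabet(candidates, threshold=60.0, size=4)
```

A greedy pass does not guarantee the largest possible alphabet for a given threshold, which is one reason clustering or the structured searches described elsewhere in this disclosure may be preferred.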
Signatures for all possible 5-mers (a state of the nanopore) can be obtained from Scrappie. This would amount to 4^5=1,024 different signatures. Using these, a trellis search can be conducted to obtain a set of sequences that generate a signature set for which the minimum pair-wise DTW distance is larger than a certain pre-set threshold (Dmin).
The trellis built for the search would have kD−4 stages, each with 256 states, and 4 branches from each state. The search would start with a randomly generated DNA sequence of length kD, which would always be included in the alphabet picked. Picking a sequence for the alphabet amounts to finding a path along the trellis that creates a signature which has a DTW distance >Dmin with all sequences already included in the alphabet. The Viterbi algorithm could be modified to find such a path.
In this approach, DTW distance is not the metric for selecting the sequences for the alphabet AD; symbol error probability itself is used. First, similar to the trellis approach, a number of random sequences of length kD is generated. Signatures of all these are obtained from Scrappie. |AD| sequences are randomly picked for the alphabet, and then, random squiggles are generated for each (based on the distributions obtained from Scrappie), and ‘decoded’ using the signatures. Some of the sequences will then be removed due to high symbol error probabilities. Then, another set of sequences is added to the remaining ones, and the decoding test is conducted again. Searching continues in this manner until |AD| sequences are found with low symbol error rates.
Spacer symbols have four main purposes:
Ideal properties of spacers include sequences that:
Suppose f bases from the quaternary alphabet A, C, T, G are simultaneously inside one nanopore at any time, for example f=5 with bases (b5, b4, b3, b2, b1), and the output current signal A measured by the device estimates the base b3 (the middle base). Then there is a total of 4^5=1,024 possible output signals A(b)=F(b5, b4, b3, b2, b1). The duration T of each signal may also be variable and dependent on the 5 bases, i.e., T(b)=G(b5, b4, b3, b2, b1). Given that the nanopore reading frame is f bases, and assuming f=5, and raw current measurements occur at the mid-point of the reading frame, then the number of different states q in the signature generated by a strand of DNA of length b translocating the nanopore is q=b−f+1. This implies that the total number of states generated for an 8-mer DNA spacer symbol, for example, is q=8−5+1=4, with each of these states taking on one of 1,024 possible output signals, generating a total of 1,024^4≈1.1E12 possible signatures.
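The counting above can be reproduced with a short calculation. The sketch follows the counting used in the text, in which each state is treated as taking any of the 4^f output signals; the function names are illustrative.

```python
def num_states(b, f):
    """Number of states q = b - f + 1 for a strand of b bases and a frame of f bases."""
    if b < f:
        raise ValueError("the strand must be at least as long as the reading frame")
    return b - f + 1


def signals_per_state(f, alphabet_size=4):
    """Number of distinct f-mers, i.e. possible output signals for one state."""
    return alphabet_size ** f


def num_signatures(b, f, alphabet_size=4):
    """Signature count when every state is counted as an independent choice."""
    # Note: consecutive states share f-1 bases, so this is an upper-bound style count.
    return signals_per_state(f, alphabet_size) ** num_states(b, f)


q = num_states(8, 5)
per_state = signals_per_state(5)
total = num_signatures(8, 5)
```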
As raw data measurements occur at the mid-point of the nanopore and assuming a reading frame of 5 nucleotides for illustrative purposes, the signature produced by any DNA subsequence will be impacted by the two nucleotides immediately before and after. This means that only the middle 4-mers of an 8-mer DNA subsequence (N−f+1, where N is the length of a subsequence) are not affected by the memory of flanking sub-sequences. Therefore, the minimum theoretical length of the spacer/partition sequence S is kS=f, but preferably kS=f+1, f+2, f+3, f+4, or f+5. Optimum spacer length is a trade-off between the capacity to efficiently identify the spacers in the codeword signature and information rate, bounded by f.
Spacer symbol selection is performed iteratively by evaluating simulated raw squiggle output, selecting candidate sequences, and generating and evaluating real output. Spacer sequence selection was first performed by simulating ‘soft’ signatures from ‘hard’ inputs using Scrappie software. Simulated signatures of the following sequences (template/reverse complementary, T/RC) were generated and evaluated against the spacer design properties outlined above. DNA tags of length n=4 were constructed with 13 of 8-mer spacer sequences listed below. Analogue signatures for a selection of the 13 spacer symbol template and reverse complementary pairs are given in
Mean signatures of ID tags were simulated using Scrappie software and evaluated as spacers. These simulations are provided in
Spacers and spacer-symbols may be of size kS=5-16 nt, preferably 6-14 nt, preferably 6-12 nt, preferably 8-12 nt. In general spacers are of size f≤kS≤2f, where f is the number of bases in an oligonucleotide fragment that translocate a nanopore at any one time. Spacers may be any sequence, but preferably:
A more structured way of searching is choosing spacer sequences through brute force. The brute force method of searching involves generating an exhaustive or near-exhaustive set of possible spacer sequences of length kS, and picking symbols that generate a signature/s of a desired shape/s. After generating a set of random ‘hard’ sequences scrappie software was used to generate the corresponding average ‘soft’ current signatures. These signatures were then compared with the desired pattern/s, and close matches were picked as spacers. Again, brute force spacer symbol selection is performed iteratively by evaluating simulated raw squiggle output, selecting candidate sequences, and generating and evaluating real output.
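The brute force search can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to obtain the spacers in this disclosure: `toy_signature` is a hypothetical stand-in for a squiggle simulator such as Scrappie (it does not call Scrappie and its levels are not a pore model), and squared error is used as an assumed measure of closeness to the desired shape.

```python
from itertools import product

# Toy per-base levels, assumed for illustration only (NOT a nanopore model).
LEVEL = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def toy_signature(seq, frame=3):
    """Hypothetical simulator: mean toy level over each window of `frame` bases."""
    return [sum(LEVEL[b] for b in seq[i:i + frame]) / frame
            for i in range(len(seq) - frame + 1)]


def brute_force_spacers(k, target, top=3, simulate=toy_signature):
    """Rank every k-mer by squared error between its signature and `target`."""
    scored = []
    for bases in product("ACGT", repeat=k):      # exhaustive set of 4**k sequences
        seq = "".join(bases)
        sig = simulate(seq)
        if len(sig) != len(target):
            raise ValueError("target length must match signature length")
        err = sum((s - t) ** 2 for s, t in zip(sig, target))
        scored.append((err, seq))
    scored.sort()                                # closest matches first
    return [seq for _, seq in scored[:top]]


# Search all 5-mers for a flat, low-level signature.
print(brute_force_spacers(5, [0.0, 0.0, 0.0], top=1))  # ['AAAAA']
```

In practice the `simulate` argument would be replaced by simulated or measured signatures, and the shortlisted candidates would still be evaluated against real output as described above.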
Spacers and spacer-symbols may be of size kS=5-16 nt, preferably 6-14 nt, preferably 6-12 nt, preferably 8-12 nt. Spacers are of size f≤kS≤2f, where f is the number of bases in an oligonucleotide fragment that translocate a nanopore at any one time.
Here we disclose a method for increasing codeword rate r by using two alphabets, AD and AS, for an ID tag. The tag is constructed from alternating symbols from AD and AS, with each tag containing n symbols from AD and n+1 symbols from AS, as shown in
For an alternating tag of length n=4 that is comprised of 4 symbols from AD and 5 symbols from AS, i.e. Sj1Di1Sj2Di2Sj3Di3Sj4Di4Sj5, the total number of bits encoded is 52 over an encoding region of 88 nucleotides, which equates to a rate of 0.591 bits nt⁻¹. If spacers are not used to encode information, the equivalent codeword would contain 32 bits over an encoding region of 88 nucleotides, which equates to a rate of 0.364 bits nt⁻¹.
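The rate calculation can be reproduced with the sketch below. The alphabet sizes and symbol lengths used in the example call (|AD|=256, |AS|=16, kD=12, kS=8) are assumptions: they are not stated for this example in the text, and were chosen only because they give the 52 bits and 88 nucleotides quoted above.

```python
from math import log2


def tag_rate(n, size_d, size_s, k_d, k_s, spacers_carry_data=True):
    """Return (bits, nucleotides, bits per nucleotide) for an alternating tag.

    The tag holds n data symbols of length k_d from an alphabet of size size_d
    and n + 1 spacer symbols of length k_s from an alphabet of size size_s.
    """
    bits = n * log2(size_d)
    if spacers_carry_data:
        bits += (n + 1) * log2(size_s)
    length = n * k_d + (n + 1) * k_s
    return bits, length, bits / length


# Assumed parameters (see lead-in); spacers carrying data versus not.
print(tag_rate(4, 256, 16, 12, 8))
print(tag_rate(4, 256, 16, 12, 8, spacers_carry_data=False))
```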
The alphabets AD and AS may be of any size, and comprised of symbols and spacer symbols of size kD/S=5-16 nt, preferably 6-14 nt, preferably 6-12 nt, preferably 8-12 nt. Spacers are of size f≤kS≤2f, where f is the number of bases in an oligonucleotide fragment that translocate a nanopore at any one time.
Multiple spacers may also be used to encode information across multiple oligonucleotide strands in circumstances where it is desirable to use short oligonucleotide fragments (i.e. <200 nt), and there is a need to encode more information than can fit in a single fragment alone. In many cases short fragments are desirable because they are less likely to degrade, are less expensive to manufacture (both in terms of per nucleotide length and per mol) and are subject to a lower synthesis error rate.
Here we disclose a method to use spacers to encode an index to address individual strands to a location in a multi-strand ID tag or ‘datablock’. Refer also to
Consider the following example:
For an alternating ID tag of length n=4 that is comprised of 4 symbols from AD and 5 symbols from AS, i.e. Sj1Di1Sj2Di2Sj3Di3Sj4Di4Sj5, there are 256^4≈4.3 billion possible AD tags and 2^5=32 AS tags. In this embodiment, the AS tags are used as an index to assemble the AD tags into a ‘datablock’ or multistrand ID tag. This approach permits an essentially unlimited number of 32^(256^4) unique data blocks, although for practical applications each data block is not required to contain the full set of AS tags. If only four AS tags are used, for example, this would permit a multistrand ID tag space of 4^(256^4).
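The indexing idea in this example can be sketched as below, assuming |AS|=2 and five spacer positions so that a spacer configuration addresses one of 32 strand locations. The function names and the representation of a strand as a pair (spacer digits, data symbols) are hypothetical and used for illustration only.

```python
def index_to_spacers(index, n_spacers=5, size_s=2):
    """Write a strand index as n_spacers digits in base size_s (most significant first)."""
    if not 0 <= index < size_s ** n_spacers:
        raise ValueError("index out of range for this spacer configuration")
    digits = []
    for _ in range(n_spacers):
        index, d = divmod(index, size_s)
        digits.append(d)
    return digits[::-1]


def spacers_to_index(spacers, size_s=2):
    """Inverse of index_to_spacers."""
    index = 0
    for d in spacers:
        index = index * size_s + d
    return index


def assemble_datablock(strands, size_s=2):
    """Order decoded strands by their spacer index and concatenate their data symbols."""
    ordered = sorted(strands, key=lambda s: spacers_to_index(s[0], size_s))
    return [sym for _, data in ordered for sym in data]


# Two strands read in arbitrary order; the spacers restore the intended order.
reads = [(index_to_spacers(1), [9, 9, 9, 9]), (index_to_spacers(0), [1, 2, 3, 4])]
print(assemble_datablock(reads))  # [1, 2, 3, 4, 9, 9, 9, 9]
```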
The alphabets AD and AS may be of any size, and comprised of symbols and spacer symbols of size kD/S=5-16 nt, preferably 6-14 nt, preferably 6-12 nt, preferably 8-12 nt. Spacers are of size f≤kS≤2f, where f is the number of bases in an oligonucleotide fragment that translocate a nanopore at any one time.
Watermarking is the process of hiding information in a carrier signal to improve security. Here we disclose a methodology for DNA watermarking, where one or more oligonucleotide single strand ID tags, or one or more oligonucleotide ‘blocks’ or multistrand ID tags, or a combination of one or more oligonucleotide single strand ID tags and oligonucleotide blocks or multistrand ID tags, is hidden in a larger pool of oligonucleotide fragments. Consider oligonucleotide ID tags comprised of alternating symbols from a set of data symbols (alphabet AD) and a set of spacer symbols (alphabet AS). Watermarking is achieved by using the alphabet AS to encode information that identifies the correct tag/s in a larger set of tags. For example:
For an alternating ID tag of length n=4 that is comprised of 4 symbols from AD and 5 symbols from AS, i.e. Sj1Di1Sj2Di2Sj3Di3Sj4Di4Sj5, there is a total of 64^5≈1.074 billion possible configurations from the set AS. One or more configurations from the set AS may be used to identify the correct ID tag/information from a larger pool of ‘plausible’ tags. Plausible tags include any oligonucleotide strand encoded from the same alphabets and with the same parameterisation/form as correct tags, e.g. Sj1Di1Sj2Di2Sj3Di3Sj4Di4Sj5. Pools of >100,000 plausible oligonucleotide tags may be synthesised by commercial manufacturers such as IDT and Twist BioSciences. These pools may be added to the ‘correct’ tag/s at the same or similar molar concentration to achieve watermarking.
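Recovery of the correct tag/s from a pool of plausible tags can be sketched as a lookup on the decoded spacer configuration. The sketch assumes each decoded strand is available as a pair (spacer configuration, data symbols) and that the set of secret configurations is known to the decoder; both the representation and the function name are hypothetical.

```python
def recover_watermarked(pool, secret_configs):
    """Return the data symbols of strands whose spacer configuration is a secret one.

    pool           : iterable of (spacer_config, data_symbols) decoded from a mixed sample
    secret_configs : spacer configurations that mark the correct tag/s
    """
    secret = {tuple(cfg) for cfg in secret_configs}
    return [tuple(data) for cfg, data in pool if tuple(cfg) in secret]


# Size of the configuration space for |AS| = 64 and five spacer positions.
print(64 ** 5)  # 1073741824

pool = [
    ((3, 17, 40, 2, 63), (10, 20, 30, 40)),   # plausible decoy
    ((7, 7, 12, 55, 1), (1, 2, 3, 4)),        # correct tag
    ((0, 0, 0, 0, 0), (5, 6, 7, 8)),          # plausible decoy
]
print(recover_watermarked(pool, [(7, 7, 12, 55, 1)]))  # [(1, 2, 3, 4)]
```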
The alphabets AD and AS may be of any size, and comprised of symbols and spacer symbols of size kD/S=5-16 nt, preferably 6-14 nt, preferably 6-12 nt, preferably 8-12 nt. Spacers are of size f≤kS≤2f, where f is the number of bases in an oligonucleotide fragment that translocate a nanopore at any one time.
In some embodiments, it may be advantageous to perform tag decoding locally and watermark decoding locally, whilst in other embodiments it may be advantageous to perform tag decoding locally and watermark decoding remotely, and in yet still other embodiments it may be advantageous to perform tag decoding remotely and watermark decoding remotely.
Outer codes were also tested to improve error detection and correction capability. In some embodiments, the codeword is constructed with an inner code of ‘soft’ analogue symbols in combination with a ‘hard’ outer code. In these embodiments the inner ‘soft’ symbols may be mers of length 5-16 nt and selected using minimum mutual absolute or Euclidean distance in DTW as a metric. The outer ‘hard’ code may include linear block codes, for example: cyclic codes (e.g. Hamming codes), repetition codes, parity codes, polynomial codes, Reed-Solomon codes, algebraic geometric codes, or Reed-Muller codes. The outer ‘hard’ code may also include convolutional codes and product (block turbo) codes.
In one example, codewords were constructed from kD=12-mer data symbols selected using a minimum mutual absolute distance in DTW threshold of 44.5 over F64. Data symbols from AD were arranged into an alternating Hamming [n, k] codeword where n=7 and k=4, and where each D was flanked by an S. This gives the outer code CD an error detection capacity of two symbols and error correction capacity of one symbol.
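Selection of data symbols by a minimum mutual absolute distance in DTW can be sketched as follows. The greedy pass is an assumption made for illustration, since the selection algorithm itself is not specified here, and the toy signals and threshold are invented values rather than simulated signatures or the thresholds reported in this disclosure.

```python
def dtw_abs(a, b):
    """DTW cost between two signals using absolute difference as the local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)            # row 0 of the cumulative cost matrix
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            # best of: insertion, match, deletion
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]


def select_alphabet(signals, threshold, size):
    """Greedily keep candidates at DTW distance >= threshold from every kept symbol."""
    chosen = []
    for name, sig in signals.items():
        if all(dtw_abs(sig, signals[c]) >= threshold for c in chosen):
            chosen.append(name)
            if len(chosen) == size:
                break
    return chosen


# Toy candidate signatures (invented values, for illustration only).
candidates = {
    "s1": [0, 0, 0],
    "s2": [0, 0, 1],      # too close to s1, rejected
    "s3": [5, 5, 5],
    "s4": [10, 10, 10],
}
print(select_alphabet(candidates, threshold=10, size=3))  # ['s1', 's3', 's4']
```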
In other embodiments, the ‘soft’ analogue inner symbols are assembled into a codeword using a soft outer code. This soft outer code may include codes optimised for soft decoding such as a convolutional code, an LDPC code, or a turbo code.
In all embodiments, the outer code may be applied to the symbols of AD or the symbols of AS, or both the symbols of AD and AS, in an alternating codeword comprised of alternating symbols from AD and AS.
A similar scheme to using multiple fragments for a single message is one where we use a long outer code, such as a good NB-LDPC code. In this case, we first construct a codeword from the alphabet AD of length K(|AS|−1), where K is the number of codeword ‘segments’. Then this codeword is divided into K segments, each of length |AS|−1. The location of each segment in the long codeword is encoded using the spacer (or AS) alphabet. Since long codewords have better performance than shorter ones, a scheme like this can be expected to improve performance. However, at least one read of each segment of data is needed to decode the outer code, which might impact the efficiency of the system. Note that codewords of length K(|AS|−1) are just an example case; in general the outer code would be of length KL, with L<=|AS|(K+1).
Here we disclose a method to include unnatural ‘Hachimoji’ or ‘AEGIS’ nucleotides into synthetic oligonucleotide tags to increase the information rate and give better data and spacer alphabet design flexibility. AEGIS nucleotides include the pyrimidine bases Z and S and the purine bases P and B, which form the complementary hydrogen bonding pairs Z:P and S:B. AEGIS bases may be used to expand the number of nucleotides used to encode information in an oligonucleotide from four to eight, and thereby increase the theoretical maximum information density from 2 bits nt⁻¹ to 3 bits nt⁻¹. Data presented in
For the purpose of generating the figures, sequences containing AEGIS bases were first designed and manufactured. These were then sequenced using a nanopore device, first with the unnatural AEGIS bases present for the PCR amplification, and then with dNTPs only. The raw signals resulting from the sequencing runs were then clustered based on pair-wise DTW distance, and a consensus signal was generated for each primary cluster using DTW Barycenter Averaging (DBA). The regions of the consensus signals that are generated by the sequences containing the AEGIS bases were found by first locating the regions for the adjacent sub-sequences that do not contain the AEGIS bases, once more using DTW distances.
The inclusion of AEGIS bases may be used to generate a larger range of different raw current signatures, and thereby permit greater flexibility in data and spacer alphabet design. For example, by using symbol selection methodologies disclosed previously, data alphabet symbols AD and spacer alphabet symbols AS may be generated at larger mutual DTW and/or COW distance which may increase decoding efficiency and reliability. Additionally, AEGIS bases may be used to design larger data |AD| and spacer alphabets |AS| for a given minimum mutual DTW and/or COW distance compared to the same size alphabets constructed from conventional nucleotides alone. This surprising result permits the design of nanopore encoding systems with greater flexibility, improved information density, and improved decoding and sequence identification reliability.
In cases where outer codes are not used, the best option may be to use a maximum likelihood (ML) or a ML-based approach using any suitable distance metric, such as DTW. The most suitable distance metrics may be those that are closest to actual probabilities.
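Symbol-by-symbol decoding without an outer code can be sketched as a nearest-reference search. Treating the smallest distance as the most likely symbol is an assumption that only coincides with maximum likelihood under a matching noise model. The `l1` function is a simple stand-in used to keep the example short; a DTW cost would normally be supplied as the `distance` argument. All names are illustrative.

```python
def l1(a, b):
    """Stand-in distance for equal-length signals (sum of absolute differences)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ml_symbol(read, references, distance):
    """Return the reference symbol closest to `read` and the full cost vector.

    read       : measured signal segment for one symbol
    references : dict mapping symbol name -> reference signature
    distance   : callable(signal, signal) -> cost, e.g. a DTW cost
    """
    costs = {sym: distance(read, ref) for sym, ref in references.items()}
    best = min(costs, key=costs.get)
    return best, costs


references = {"D0": [1.0, 2.0, 3.0], "D1": [3.0, 2.0, 1.0]}
best, costs = ml_symbol([1.1, 2.0, 2.9], references, l1)
print(best)  # D0
```

The returned cost vector is kept because it can later be passed to an outer decoder instead of the hard decision alone.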
In cases where outer codes are used, decoding would depend on which code, and which codeword length, is used. For short codes over a small alphabet, such as an (n, k) code, where n is the codeword length and k is the number of data symbols, e.g. (7, 4) over F16, the DTW cost vectors obtained from decoding the inner code can be used for ML decoding of the outer code. For longer codes, or ones using larger alphabets, ML is not practical, in which case a more suitable decoder is used, e.g. BP for LDPC codes, Chase-Pyndiah decoding for product codes, etc. If the outer code is hard decoded, then it would work with the ML estimates for each symbol obtained from inner decoding. Once more, the specific decoding algorithm would depend on the code, e.g. the Berlekamp algorithm for RS codes, iterative hard decoding with product codes, etc. A number of codes would perform reasonably well with BP decoding (hard or soft), provided suitable parity-check matrices are first computed for them. Chase decoding is a good option for soft decoding any algebraic code.
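For a short outer code, ML decoding from the inner cost vectors can be sketched as an exhaustive search over the codebook. The example uses a toy length-3 repetition code over four symbols and invented cost values purely for illustration; it is not one of the codes tested in this disclosure, and summing per-position costs assumes they behave like negative log-likelihoods.

```python
def ml_codeword(cost_vectors, codebook):
    """Return the codeword with the smallest total inner-decoder cost.

    cost_vectors[i][s] : cost (e.g. DTW cost) of symbol s at codeword position i
    codebook           : iterable of codewords, each a tuple of symbol indices
    """
    def total(codeword):
        return sum(cost_vectors[i][s] for i, s in enumerate(codeword))
    return min(codebook, key=total)


# Toy outer code: length-3 repetition code over 4 symbols (assumed for illustration).
codebook = [(s, s, s) for s in range(4)]

# Invented cost vectors: positions 0 and 2 weakly favour symbol 0,
# position 1 strongly favours symbol 1.
cost_vectors = [
    [0.9, 1.0, 3.0, 3.0],
    [4.0, 0.1, 3.0, 3.0],
    [0.9, 1.0, 3.0, 3.0],
]
print(ml_codeword(cost_vectors, codebook))  # (1, 1, 1)
```

A hard majority vote on the per-position decisions would return symbol 0 here, whereas using the full cost vectors returns symbol 1, which illustrates why the cost vectors are passed to the outer decoder.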
Machine learning is an alternative approach that may be used for decoding. It may be used for data decoding, after the spacer decoding step in
To demonstrate our encoding approach using absolute distance in DTW to select AD, 500 symbols of each length kD=8, 10, 12, 14 and 16 were randomly generated within the following constraints:
The analogue current signatures of each kD length set of 500 symbols were then simulated using Scrappie software. Alphabets of size |AD|=16, 64 and 256 were then selected from the 500 simulated signatures using a minimum absolute distance in dynamic time warping (DTW) threshold of 59.5, 44.5 and 31.5, respectively (See Table 1). Error probabilities for template and complementary current signature for symbols in the F16 and F64 alphabets are given in
ID tags given below (ID_F16abs_001-012, ID_F64abs_001-004, and ID_F256abs_001-004) were synthesised by Macrogen and sequenced using the Oxford Nanopore MinION device and SQK-LSK109 protocol with R9.4.1 flowcells. The resulting raw analogue data in .fast5 file format was inputted into the decoder. Results for alphabets of size |AD|=16, 64, and 256 are given in Table 4, Table 5 and Table 6, respectively.
Results show that data symbol alphabets constructed using absolute distance in DTW outperformed those constructed using Euclidean distance in DTW, for |AD|<64.
F16, Absolute Distance, Spacer 1
F64, Absolute Distance, Spacer 1
F256, Absolute Distance, Spacer 1
To demonstrate our encoding approach using Euclidean distance in DTW to select AD, 500 symbols of each length kD=8, 10, 12, 14 and 16 were randomly generated within the following constraints:
The analogue current signatures of each kD length set of 500 symbols were then simulated using Scrappie software. Alphabets of size |AD|=16, 64 and 256 were then selected from the 500 simulated signatures using a minimum Euclidean distance in dynamic time warping (DTW) threshold of 6.8, 5.375 and 3.825, respectively (See Table 1). The sets of data symbol sequences for these F16, F64 and F256 alphabets selected using minimum Euclidean distance in DTW are given in Tables 11-16 and corresponding simulated current signatures di(t) are given in
ID tags listed below (ID_F16eu_001-012, ID_F64eu_001-004, and ID_F256eu_001-004) were synthesised by Macrogen and sequenced using the Oxford Nanopore SQK-LSK109 protocol and R9.4.1 flowcells. The resulting raw analogue data in .fast5 file format was inputted into the decoder. Results for alphabets of size |AD|=16, 64, and 256 are given in Table 7, Table 8, and Table 9, respectively.
Results show that data symbol alphabets constructed using Euclidean distance in DTW outperformed those constructed using absolute distance in DTW, for |AD|>64.
F16, Euclidean Distance, Spacer 1
F64, Euclidean Distance, Spacer 1
F256, Euclidean Distance, Spacer 1
To demonstrate the use of two alphabets to encode data, ID tags were assembled from alternating symbols from two different alphabets, AD and AS, where |AS|=2 and CS is the spacer configuration. As described previously, two alphabets may be used to increase the data rate r (bits nt−1), distribute information across multiple different oligonucleotide fragments, or identify hidden information in an oligonucleotide watermark. In the following example, ID tags were constructed using the following alphabets:
Specifically, the following ID tags that include spacer configurations CS encoding data were constructed:
Analogue output from the ID tag sequences above (ID1-ID10) is given in
To demonstrate the use of unnatural AEGIS modifications to improve symbol selection, four ID tags (ID_AEGIS_1-4) were manufactured with conventional DNA nucleotides from the set {A, C, G, T} and one or more AEGIS nucleotides from the set {P, Z, B, S}. These tags were manufactured by Firebird Biomolecular Science LLC, amplified with Phire Hotstart II DNA polymerase and ONT rapid attachment primers from the kit SQK-PBK004 in the presence of conventional free nucleotides only (dNTPs), and conventional and AEGIS free nucleotides (dXTPs). Samples were sequenced on an Oxford Nanopore MinION device using the SQK-PBK004 protocol and R9.4.1 flowcells.
Each sequence ID_AG_1-4 was amplified separately in the presence of dNTPs and dXTPs. When amplification was performed in the presence of dNTPs, any one of {A, C, G, or T} may be amplified into position adjacent to an AEGIS base {Z, P, B, S}, although a bias towards C and T replacing Z, and G and A replacing P, was observed.
The raw signals resulting from the sequencing runs were then clustered based on pair-wise DTW distance, and a consensus signal was generated for each primary cluster using DTW Barycenter Averaging (DBA). The regions of the consensus signals that are generated by the sequences containing the AEGIS bases were found by first locating the regions for the adjacent sub-sequences that do not contain the AEGIS bases, once more using DTW distances.
Table 10 gives the distance in DTW between sequences amplified in the presence of dNTPs and dXTPs. In all cases, tags amplified in the presence of dXTPs generated unique raw nanopore current signatures which were clearly detectable, in terms of DTW distance, from the same sequence amplified in the presence of dNTPs only. A visual inspection of
Table 11-Table 16 below provide alphabet sequences, which relate to the examples above with the following relationship between the examples and the sequence listing:
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Number | Date | Country | Kind |
---|---|---|---|
2020903611 | Oct 2020 | AU | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/AU2021/051162 | 10/6/2021 | WO |