This U.S. patent application claims priority under 35 U.S.C. § 119 to Indian application No. 202321043119, filed on Jun. 27, 2023. The entire contents of the aforementioned application are incorporated herein by reference.
The disclosure herein generally relates to the field of preserving privacy of a genome, and, more particularly, to a method and system for computing k-mer abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner.
A genome is an organism's complete set of deoxyribonucleic acid (DNA), that is, the collection of DNA molecules storing the genetic information in a cell. It can be represented as a set of long strings over the nucleotide alphabet {A, C, G, T}, each string corresponding to one chromosome. In a DNA sequencing experiment, many reads are produced from the genome. These reads are short substrings obtained from random locations of the genome. Estimates of genome size and some other characteristics are computed from a summary statistic of the reads, i.e., a k-mer abundance histogram. A k-mer is a substring of length exactly k, and the histogram summarizes the number of occurrences of individual k-mers in the input set of reads. The histogram gives insight into the approximate total number of distinct k-mers and the total number of k-mers with a given multiplicity (frequency) while preserving the privacy of critical genomic data.
K-mer abundance histogram estimation is a critical constituent of genome analysis and has several applications. For instance, k-mer abundance information in sequence data is useful in read error correction, parameter estimation for genome assembly, digital normalization, sampling error rate, etc. However, existing k-mer abundance estimation techniques do not address the problem of data privacy. Data privacy is especially important in fields like genome analysis, which involve sensitive data of a subject such as genetic testing information like DNA or any personal health information, which, when leaked, results in irreversible damage to the subject. Further, with the growing use of large-scale datasets containing the subject's genomic and clinical data for research and study purposes, it is important to ensure the privacy of the subject by generating a secure representation of genome data observations.
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for computing an abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner is provided. The processor-implemented method comprises receiving, via an input/output interface, a genome sequence as an input, wherein the genome is a string of a plurality of nucleotides. The input genome is divided into a series of chunks defined as sketches. Further, the processor-implemented method comprises estimating one or more distinct nucleotide strings for the received input genome sequence and one or more nucleotide strings with multiplicity using a privacy preserving kmerlight technique. Each of the one or more distinct nucleotide strings is hashed in each of the one or more sketches using a randomized hash function. A multilevel sampling is performed for each nucleotide string, which involves dividing these strings into subsets based on the length of a prefix. Furthermore, the processor-implemented method comprises determining a sampling level and a counter value of each nucleotide string based on the hash value of the nucleotide. The sampling level and the counter value of each nucleotide string are encrypted. The encrypted sampling level and the counter value are sent to a predefined server, wherein the predefined server checks a condition for collision for the encrypted input data to update the matrix using the encrypted sampling level and counter value.
Further, the processor-implemented method comprises determining an optimal sampling level and an associated optimal counter value for each multiplicity in the matrix, wherein the optimal sampling level is the level which has the most occurrences of a multiplicity in the matrix. Further, an approximation of the one or more distinct nucleotide strings for the given input genome, and of the one or more distinct nucleotide strings with a certain multiplicity, is determined based on the determined optimal sampling level and an associated optimal counter level. Finally, the processor-implemented method comprises computing an abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner based on the determined approximation of the total number of distinct nucleotide strings.
In another aspect, a system for computing an abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner is provided. The system comprises a memory storing a plurality of instructions and one or more Input/Output (I/O) interfaces to receive a genome sequence as an input, wherein the genome is a string of a plurality of nucleotides. Further, the system comprises one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to divide the string of the plurality of nucleotides into one or more sketches based on a predefined sketch size. Further, the one or more hardware processors are configured to estimate one or more distinct nucleotide strings for the received input genome sequence and one or more nucleotide strings with multiplicity using a privacy preserving kmerlight technique. Furthermore, the one or more hardware processors are configured to hash each of the one or more distinct nucleotide strings in each of the one or more sketches using a randomized hash function, wherein the hash value is used to determine a sampling level and a counter value of the nucleotide string.
Further, the one or more hardware processors are configured to perform a multilevel sampling for each nucleotide string, which involves dividing these strings into subsets based on the length of a prefix, wherein the prefix comprises one or more trailing zeros and one or more leading zeros in the hash value of the nucleotide. Further, the one or more hardware processors are configured to determine a sampling level and a counter value of each nucleotide string based on the hash value of the nucleotide, wherein a matrix of the sampling level and counter value is prepared for each of the one or more sketches. The matrix (one matrix created for the whole genome) is updated using the sampling level and counter value for each nucleotide string in each sketch. Furthermore, the one or more hardware processors are configured to encrypt the sampling level and the counter value of each nucleotide string, wherein the encrypted sampling level and the counter value are sent to a predefined server, wherein the predefined server checks a condition for collision for the encrypted input data to update the matrix of the encrypted sampling level and counter value, and wherein the matrix, also called the sketch matrix, is also encrypted.
Furthermore, the one or more hardware processors are configured to determine an encrypted optimal sampling level and an associated encrypted optimal counter value for each multiplicity from the encrypted matrix, wherein the optimal sampling level is the level which has the most occurrences of a multiplicity in the matrix. Furthermore, the one or more hardware processors are configured to determine an encrypted approximation of the one or more distinct nucleotide strings for the given input genome and of the one or more distinct nucleotide strings with a certain multiplicity based on the determined encrypted optimal sampling level and the associated encrypted optimal counter level. Finally, the one or more hardware processors are configured to compute an abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner based on the determined approximation of the total number of distinct nucleotide strings.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which, when executed by one or more hardware processors, cause a method for computing an abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner to be performed. The processor-implemented method includes receiving, via an input/output interface, a genome sequence as an input, wherein the genome is a string of a plurality of nucleotides. Furthermore, the processor-implemented method comprises estimating one or more distinct nucleotide strings for the received input genome sequence and one or more nucleotide strings with multiplicity using a privacy preserving kmerlight technique. Each of the one or more distinct nucleotide strings is hashed in each of the one or more sketches using a randomized hash function. A multilevel sampling is performed for each nucleotide string, which involves dividing these strings into subsets based on the length of a prefix. Further, the processor-implemented method comprises determining a sampling level and a counter value of each nucleotide string based on the hash value of the nucleotide. The sampling level and the counter value of each nucleotide string are encrypted. The encrypted sampling level and the counter value are sent to a predefined server, wherein the predefined server checks a condition for collision for the encrypted input data to update the matrix of the sampling level and counter value.
Further, the processor-implemented method comprises determining an optimal sampling level and an associated optimal counter value for each multiplicity in the matrix, wherein the optimal sampling level is the level which has the most occurrences of a multiplicity in the matrix. Further, an approximation of the one or more distinct nucleotide strings for the given input genome, and of the one or more distinct nucleotide strings with a certain multiplicity, is determined based on the determined optimal sampling level and an associated optimal counter level. Finally, the processor-implemented method comprises computing an abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner based on the determined approximation of the total number of distinct nucleotide strings.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, explain the disclosed principles:
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
A genome is typically a string of nucleotides (Adenine (A), Cytosine (C), Guanine (G), and Thymine (T)). For example, ATGCGTCTC . . . G. Genomes are generally large biological sequence strings, and these are rendered into a stream of sub-strings of suitable length called k-mers and are used for computational genomics and sequence analysis. One such critical analysis is k-mer abundance estimation. Given an input genome sequence, k-mer abundance estimation algorithm provides estimations such as the total number of distinct k-mers in the sequence and total number of k-mers with frequency/multiplicity i (referred to as Fi). There are several approaches in literature that can compute these estimates. However, these approaches do not address the privacy of the genome sequences.
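As a point of reference for the quantities named above, the following minimal Python sketch computes the exact, unencrypted k-mer abundance histogram of a short string. It only illustrates what the number of distinct k-mers and Fi denote. The function names are hypothetical, and this is not the privacy preserving kmerlight technique of the disclosure.

```python
from collections import Counter

def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every substring of length exactly k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def abundance_histogram(sequence: str, k: int) -> dict[int, int]:
    """Map each multiplicity i to F_i, the number of distinct k-mers seen exactly i times."""
    return dict(Counter(kmer_counts(sequence, k).values()))

def distinct_kmers(sequence: str, k: int) -> int:
    """Total number of distinct k-mers in the sequence."""
    return len(kmer_counts(sequence, k))

# "ATGATGC" with k=3 contains ATG twice and TGA, GAT, TGC once each.
hist = abundance_histogram("ATGATGC", 3)
```

An exact count like this needs memory proportional to the number of distinct k-mers, which is the cost a streaming sketch is meant to avoid.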
Embodiments herein provide a method and system for computing abundance histogram on a fully homomorphic encrypted genomic data in a privacy preserving manner. The system is configured to estimate abundance of nucleotide strings on the encrypted nucleotide strings using a privacy preserving kmerlight algorithm. Kmerlight is a streaming algorithm, which in general, processes a sequence of k-mers in a single pass using only a limited amount of memory and time. The kmerlight maintains an approximate summary, or a sketch, of the previously viewed k-mers and with each new k-mer the sketch is updated. When all the k-mers are processed, the sketch can be analyzed to provide the estimate of the k-mer abundance histogram. Kmerlight combines the techniques of sampling and hashing to maintain a sketch of k-mers and from the contents of the sketch computes an estimate of the histogram.
Referring now to the drawings, and more particularly to
In an embodiment, the network 106 may be a wireless or a wired network, or a combination thereof. In an example, the network 106 can be implemented as a computer network, as one of the different types of networks, such as virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 106 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices. The network devices within network 106 may interact with the system 100 through communication links.
The system 100 supports various connectivity options such as BLUETOOTH®, USB, ZigBee, and other cellular services. The network environment enables connection of various components of the system 100 using any communication link including Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 100 is implemented to operate as a stand-alone device. In another embodiment, the system 100 may be implemented to work as a loosely coupled device to a smart computing environment. Further, the system 100 comprises at least one memory with a plurality of instructions, one or more databases 112, and one or more hardware processors 108 which are communicatively coupled with the at least one memory to execute a plurality of modules 114 therein. The components and functionalities of the system 100 are described in further detail.
Further, a privacy enhancing technique is designed over the representations using a Fully Homomorphic Encryption (FHE). Genomic data is analyzed with identified efficient algorithms for real world deployment in encrypted domain. Analysis of k-mers, which are nucleotide strings of length k present in a genome sequence, is one of the fundamental operations in computational genomics. In particular, the problem of computing k-mer abundance estimation is considered in a genomic sequence.
Initially, at step 302 of the processor-implemented method 300, the one or more hardware processors 108 are configured by the programmed instructions to receive a genome sequence as an input. The genome is a string of a plurality of nucleotides, also called k-mers. Hereinafter, the terms “string of a plurality of nucleotides” and “k-mers” are used interchangeably.
At the next step 304 of the processor-implemented method 300, the one or more hardware processors 108 are configured by the programmed instructions to divide the string of the plurality of nucleotides into one or more sketches based on a predefined sketch size. In the sketching technique the input genome is subdivided into a series of chunks defined as sketches. Sketching is a method used in getting a simplified and summarized representation of large data sets that preserves important information while discarding the unnecessary details in a stream of data.
In one example, a user has a large stream of numbers and wants to compute the average value of the entire list. The system is configured to generate reads from the bigger stream of numbers and create a sketch for each read that stores only a few representative numbers from the bigger list. The system can compute the average of each sketch and use these averages to compute an overall average, which provides a rough estimate of the average for the entire list.
Each sketch has a state matrix indexed by a level and a counter, with a configurable size such as 64 levels and 2^20 counters. Each cell of the state matrix is updated as follows. The level number and counter value for each k-mer are obtained from the hash function in a sketch. The cell corresponding to this level number and counter number is then updated in the state matrix after checking that no collision exists, where a collision is the case in which two k-mers have the same level and counter number.
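The cell update with a collision check can be sketched in plaintext Python as below. The text does not specify what a cell stores, so this sketch makes assumptions: each cell holds a hypothetical `fingerprint` of the k-mer together with its count, and a cell hit by two different k-mers is marked as collided and no longer counted. The matrix is kept small for illustration.

```python
EMPTY = None
COLLIDED = "collided"  # marker for a cell hit by two different k-mers (assumption)

def new_state_matrix(levels: int = 64, counters: int = 16) -> list:
    """Create a levels x counters matrix of empty cells."""
    return [[EMPTY] * counters for _ in range(levels)]

def update_cell(matrix: list, level: int, counter: int, fingerprint) -> None:
    """Update one cell, checking for a collision before incrementing the count."""
    cell = matrix[level][counter]
    if cell is EMPTY:
        matrix[level][counter] = (fingerprint, 1)          # first k-mer in this cell
    elif cell == COLLIDED:
        return                                             # cell already unusable
    elif cell[0] == fingerprint:
        matrix[level][counter] = (fingerprint, cell[1] + 1)  # same k-mer seen again
    else:
        matrix[level][counter] = COLLIDED                  # different k-mer, same cell

matrix = new_state_matrix(levels=4, counters=4)
update_cell(matrix, 1, 2, "ATG")
update_cell(matrix, 1, 2, "ATG")
```

In the disclosure this check is performed by the server on encrypted values, so the branches above would have to be evaluated homomorphically rather than with plain `if` statements.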
At the next step 306 of the processor-implemented method 300, the one or more hardware processors 108 are configured by the programmed instructions to estimate one or more distinct k-mers for the received input genome sequence and one or more k-mers with multiplicity using a privacy preserving kmerlight technique. The one or more distinct k-mers i.e., F0 is computed as follows:
The one or more distinct k-mers are estimated using approximation formula provided, and to approximate nucleotides with multiplicity i, an optimal sampling level and counter number is determined from the sketch matrix and these numbers are used in the approximation formula to get the result as shown in
At the next step 308 of the processor-implemented method 300, the one or more hardware processors 108 are configured by the programmed instructions to hash each of the one or more distinct k-mers in each of the one or more sketches using a randomized hash function, wherein the hash value is used to determine a sampling level and a counter value of the k-mer. Each input k-mer is hashed using a randomized hash function: hash_val←H(k-mer).
Each k-mer can be hashed using any secure randomized hash function. In one aspect, the hash function used by the kmerlight algorithm is the murmur3 hash. It is a non-cryptographic hash function suitable for general hash-based lookup tables. The murmur3 hash yields 32-bit or 128-bit hash values, and it is optimized based on the system architecture.
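A seeded hash of a k-mer can be sketched as follows. Since murmur3 is not part of the Python standard library, this example substitutes `hashlib.blake2b` with a per-sketch salt as a stand-in. The function name `hash_kmer`, the `seed` parameter, and the 64-bit output width are assumptions made for illustration only.

```python
import hashlib

def hash_kmer(kmer: str, seed: int = 0, bits: int = 64) -> int:
    """Return a seeded hash of the k-mer as an unsigned integer of the given width.

    blake2b stands in for murmur3 here; the seed plays the role of the
    randomization, e.g. one seed per sketch (an assumption).
    """
    digest = hashlib.blake2b(
        kmer.encode("ascii"),
        digest_size=bits // 8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

hash_val = hash_kmer("ATGCG", seed=1)
```

Using a different seed for each sketch gives independent hash values for the same k-mer, which is what lets several sketches be combined into one estimate.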
At the next step 310 of the processor-implemented method 300, the one or more hardware processors 108 are configured by the programmed instructions to perform a multilevel sampling for each nucleotide string which involves dividing these strings into subsets based on length of a prefix, wherein the prefix comprises one or more trailing zeros and one or more leading zeros in the hash value of nucleotide. In multi-level sampling, k-mers are segregated into levels, based on the length of the prefix say w, of hash values. For each level w, counters are maintained that determine the number of times each k-mer has occurred. For example,
At the next step 312 of the processor-implemented method 300, the one or more hardware processors 108 are configured by the programmed instructions to determine a sampling level and a counter value of each nucleotide string based on the hash value of the nucleotide, wherein a matrix of the sampling level and counter value is prepared for each of the one or more sketches. In the hash value for a k-mer obtained from murmur3 hash function, the number of trailing zeros in the hash value will give the sampling level number. The counter number is computed as:
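The counter formula is not reproduced in this text, so the following Python sketch should be read as one plausible plaintext illustration rather than the formula of the disclosure. The sampling level follows the description above (number of trailing zeros in the hash value). The counter index, taken from the remaining high bits modulo the number of counters, is an assumption, as are the function names.

```python
def sampling_level(hash_val: int, max_level: int = 63) -> int:
    """Number of trailing zero bits in the hash value, capped at max_level."""
    if hash_val == 0:
        return max_level  # all bits zero: assign the deepest level
    trailing_zeros = (hash_val & -hash_val).bit_length() - 1
    return min(trailing_zeros, max_level)

def counter_index(hash_val: int, level: int, num_counters: int) -> int:
    """Hypothetical counter choice: drop the trailing zeros and the 1 bit
    that ends them, then reduce the remaining bits modulo the counter count."""
    return (hash_val >> (level + 1)) % num_counters

h = 0b101000
level = sampling_level(h)
counter = counter_index(h, level, num_counters=4)
```

With a uniformly distributed hash, a k-mer lands on level w with probability 2^-(w+1), so each level samples roughly half as many k-mers as the one before it.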
At the next step 314 of the processor-implemented method 300, the one or more hardware processors 108 are configured by the programmed instructions to encrypt the sampling level and the counter value of each nucleotide string. The encrypted sampling level and the counter value is sent to a predefined server, wherein the predefined server checks condition for collision for the encrypted input data to update the matrix of the sampling level and counter value.
Note that state-of-the-art data structures can handle encryption of larger integers only by increasing the parameter size. In data heavy applications, increasing the parameter size makes the application inefficient. In the present disclosure, a data structure called BigCipher is provided that is suitable for encrypting large integers with a minimal parameter set. The BigCipher is like a big integer structure, a radix-based implementation in base b, which can accommodate large integer encryption. The BigCipher is designed to overcome a limitation of current FHE libraries, which can only accommodate encryption of short integers. This data structure is especially important in applications dealing with streaming inputs of larger sizes that need to be encrypted.
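The radix idea behind such a structure can be sketched in Python as below. This is not the BigCipher implementation. It only shows how a large integer splits into small base-b digits that can each be handed to an encryption routine; the `encrypt` callable is a hypothetical placeholder for whatever FHE library is used.

```python
from typing import Callable

def to_radix_digits(n: int, base: int, size: int) -> list[int]:
    """Split n into `size` base-`base` digits, least significant digit first."""
    if n < 0 or n >= base ** size:
        raise ValueError("value does not fit in the requested number of digits")
    digits = []
    for _ in range(size):
        n, digit = divmod(n, base)
        digits.append(digit)
    return digits

def from_radix_digits(digits: list[int], base: int) -> int:
    """Recombine least-significant-first digits into the integer they represent."""
    return sum(d * base ** i for i, d in enumerate(digits))

def encrypt_digits(digits: list[int], encrypt: Callable[[int], object]) -> list:
    """Encrypt each small digit independently with a caller-supplied routine."""
    return [encrypt(d) for d in digits]

digits = to_radix_digits(123, base=4, size=6)
```

Because every digit is smaller than the base, each ciphertext only ever has to hold a short integer, whatever the size of the original value.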
Further, the one or more hardware processors 108 are configured by the programmed instructions to compute an approximate logarithm and an exact exponentiation of 2 in the encrypted domain, where the exponentiation can be computed for large integers without increasing the parameter set. This exponentiation function uses the idea of shifting. Usually, in FHE libraries, when exponentiation is computed for larger integers, the parameter sets need to be increased, which hinders performance and becomes increasingly inefficient as the size of the integer grows. However, in the present disclosure, the function can compute exponentiation of large integers without increasing the parameter sets. The input is encoded using base-4 encoding and independently encrypted into a BigCipher data structure. Herein, an encrypted index of the first non-zero coefficient of the input in the BigCipher is extracted. For example, for a number n=123, the representation of n in base 4 is shown below (with a size of 16 digits, which is enough to represent integers up to 4^16−1):
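The base-4 encoding and the shifting idea can be illustrated in plaintext with the short Python sketch below. It is an unencrypted illustration under stated assumptions, not the encrypted-domain routine of the disclosure, and the function names are hypothetical. It relies on the fact that 2^e in base 4 has a single non-zero digit, equal to 1 or 2, at position e // 2.

```python
def base4_digits(n: int, size: int = 16) -> list[int]:
    """Base-4 digits of n, most significant digit first, padded to `size` digits."""
    return [(n >> (2 * i)) & 3 for i in reversed(range(size))]

def first_nonzero_index(digits: list[int]):
    """Index of the first non-zero coefficient, or None if all digits are zero."""
    return next((i for i, d in enumerate(digits) if d != 0), None)

def pow2_base4(e: int, size: int = 16) -> list[int]:
    """Digits of 2**e in base 4, built by placing one digit instead of multiplying."""
    if not 0 <= e < 2 * size:
        raise ValueError("exponent does not fit in the digit array")
    digits = [0] * size
    digits[size - 1 - e // 2] = 1 << (e % 2)  # 1 for even e, 2 for odd e
    return digits

n_digits = base4_digits(123)  # ends in 1, 3, 2, 3 because 123 = 1*64 + 3*16 + 2*4 + 3
```

Placing a digit at a computed position is a shift, so no digit ever grows beyond the base, which is consistent with keeping the parameter set small.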
To compute a non-linear function over FHE, state-of-the-art techniques use polynomial approximations such as the Taylor series approximation technique. The main drawback of this technique is that the range of the input must be known, because the input has to be scaled to a smaller range for more accurate approximations. One such use case is logarithm approximation, where the input needs to be scaled down to (0,1]. Using BigCipher, the system does not need to scale down the input, however large it may be, and can still compute a logarithm approximation with an average error of 0.05 and a maximum error of 0.25. It is also important to note that the logarithm approximation algorithm of the present disclosure works with a small FHE parameter set and provides much better accuracy than polynomial approximation methods for large integers.
At the next step 316 of the processor-implemented method 300, one or more hardware processors 108 are configured by the programmed instructions to determine an optimal sampling level and associated optimal counter value for each multiplicity in the matrix. The optimal sampling level is the level which has the most occurrence of a multiplicity in the matrix.
To compute the optimal sampling level, the following steps are performed. w-optimal is the optimal sampling level and t-optimal is the optimal counter value, with an initial value of 0. t is used to compute t-optimal, wherein t is the number of k-mers with multiplicity i present in a row. For each sampling level w, if t>t-optimal, then w-optimal=w and t-optimal=t. Using the final t-optimal and w-optimal values, the system is configured to compute the F0 and Fi values.
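The selection loop described above can be sketched in plaintext Python as follows. The matrix layout (one row per sampling level, each cell holding a plain count with 0 for an empty cell) and the function name are assumptions for illustration. In the encrypted setting the comparisons would be evaluated homomorphically.

```python
def optimal_level(matrix: list[list[int]], multiplicity: int) -> tuple[int, int]:
    """Return (w_optimal, t_optimal) for the given multiplicity i.

    t is the number of cells in a row whose count equals the multiplicity;
    the row with the largest t wins, and ties keep the earlier level.
    """
    w_optimal, t_optimal = 0, 0
    for w, row in enumerate(matrix):
        t = sum(1 for count in row if count == multiplicity)
        if t > t_optimal:
            w_optimal, t_optimal = w, t
    return w_optimal, t_optimal

# Toy matrix with three levels and four counters per level.
toy = [[1, 2, 0, 1],
       [2, 2, 2, 0],
       [1, 0, 0, 0]]
w_opt, t_opt = optimal_level(toy, multiplicity=2)
```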
At the next step 318 of the processor-implemented method 300, the one or more hardware processors 108 are configured by the programmed instructions to determine an approximation of the one or more distinct k-mers for the given input genome and the one or more k-mers with a certain multiplicity based on the determined optimal sampling level and an associated optimal counter level as shown in Table 1.
Finally, at the last step 320 of the method 300, the one or more hardware processors 108 are configured by the programmed instructions to compute an abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner based on the determined approximation of the total number of distinct nucleotide strings. Fully Homomorphic Encryption (FHE) enables computations on encrypted data without the need for decryption, thereby preserving the privacy of the input data. For a set of FHE ciphertexts corresponding to a set of plaintexts, any arbitrary function can be evaluated without revealing the plaintexts. FHE supports addition and multiplication as primitive operations:
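In the usual textbook form, these two primitives can be written as below. The notation is an assumption introduced here for illustration: Enc and Dec denote encryption and decryption, m_1 and m_2 are plaintexts, and ⊕ and ⊗ denote the homomorphic addition and multiplication on ciphertexts.

```latex
\mathrm{Dec}\big(\mathrm{Enc}(m_1) \oplus \mathrm{Enc}(m_2)\big) = m_1 + m_2,
\qquad
\mathrm{Dec}\big(\mathrm{Enc}(m_1) \otimes \mathrm{Enc}(m_2)\big) = m_1 \cdot m_2
```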
A public key FHE scheme ξ consists of an additional algorithm Eval ξ along with the usual (KeyGen ξ, Enc ξ, Dec ξ) from any other public key scheme. Eval ξ is the evaluation algorithm used for computations on encrypted data. This algorithm takes as input a polynomial expression P and a set of ciphertexts sc={C0, C1, . . . , Cn} as inputs to P. The input and output of Eval ξ satisfy the following equation:
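A standard way to state this correctness condition is sketched below. The symbols pk and sk (the public and secret keys produced by KeyGen ξ) and m_i (the plaintext encrypted in C_i) are assumptions added here for illustration and do not appear in the text above.

```latex
\mathrm{Dec}_{\xi}\big(sk,\ \mathrm{Eval}_{\xi}(pk,\ P,\ C_0, C_1, \dots, C_n)\big)
  = P(m_0, m_1, \dots, m_n),
\qquad C_i = \mathrm{Enc}_{\xi}(pk,\ m_i)
```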
To improve the efficiency of the homomorphic operations and to reduce space complexity, one can leverage the homomorphic batching technique, where multiple plaintexts are batched into a single ciphertext. On this batched ciphertext, the operations are performed component-wise on the plaintexts and can be executed in parallel in a Single Instruction Multiple Data (SIMD) manner.
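The slot-wise behaviour of batching can be modelled in plaintext with the small Python sketch below. It does not perform any encryption. The class name `BatchedSlots` and the modulus value are hypothetical, and the sketch only shows that one addition or multiplication acts on every slot at once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BatchedSlots:
    """Plaintext model of a batched ciphertext: one value per slot."""
    slots: tuple
    modulus: int = 65537  # hypothetical plaintext modulus

    def _combine(self, other, op):
        if len(self.slots) != len(other.slots) or self.modulus != other.modulus:
            raise ValueError("batches must have the same slot count and modulus")
        combined = tuple(op(a, b) % self.modulus
                         for a, b in zip(self.slots, other.slots))
        return BatchedSlots(combined, self.modulus)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)  # slot-wise addition

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)  # slot-wise multiplication

a = BatchedSlots((1, 2, 3))
b = BatchedSlots((4, 5, 6))
total, product = a + b, a * b
```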
For a radix-based implementation of a number in base b, the logarithm computes the log value of the number in base b. The logarithm is expressed as the sum of a mantissa and an exponent, where the mantissa is the integer part and the exponent is the decimal part. To compute the mantissa, the index of the last non-zero coefficient of the radix representation of the number is found using zero-based indexing. To compute the decimal part, the values from the last non-zero index are copied and evaluated as a base-b representation of a decimal number. To compute log (n) in base 4:
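A plaintext Python sketch of a digit-based logarithm is given below. The integer part follows the description above (index of the highest non-zero digit). For the fractional part, this sketch substitutes a simple linear interpolation of the normalised value between 1 and b, which is an assumption made here and may differ from the evaluation used in the disclosure, so the error figures quoted elsewhere are not claimed for it. The function names are hypothetical.

```python
import math

def to_digits(n: int, base: int = 4) -> list[int]:
    """Radix digits of a non-negative integer, least significant digit first."""
    digits = []
    while n:
        n, digit = divmod(n, base)
        digits.append(digit)
    return digits or [0]

def approx_log(digits: list[int], base: int = 4) -> float:
    """Approximate log_base of the number given by its least-significant-first digits."""
    top = max((i for i, d in enumerate(digits) if d != 0), default=None)
    if top is None:
        raise ValueError("logarithm of zero is undefined")
    # Normalised value n / base**top, which lies in [1, base).
    normalised = sum(d * base ** (i - top) for i, d in enumerate(digits))
    # Integer part from the digit index, fractional part by linear interpolation.
    return top + (normalised - 1) / (base - 1)

estimate = approx_log(to_digits(123))  # integer part 3, since 4**3 <= 123 < 4**4
exact = math.log(123, 4)
```

The estimate is exact at powers of the base and underestimates in between, because the logarithm is concave on each interval while the interpolation is a straight line.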
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The embodiments of the present disclosure herein address unresolved problems of data privacy, especially in genome analysis, which involves sensitive data of a subject such as genetic testing information. This disclosure relates generally to a method and system for computing an abundance histogram on fully homomorphic encrypted genomic data in a privacy preserving manner. A privacy enhancing technique is designed over the representations using Fully Homomorphic Encryption. Genomic data is analyzed with identified efficient algorithms for real world deployment in the encrypted domain. Analysis of k-mers, which are nucleotide strings of length k present in a genome sequence, is one of the fundamental operations in computational genomics. In particular, the problem of computing k-mer abundance estimation in a genomic sequence is considered.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed, including, e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be, e.g., hardware means such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
202321043119 | Jun 2023 | IN | national |