Illustrated embodiments generally relate to data processing, and more particularly to malicious sequence detection for gene synthesizers.
Bioinformatics is an interdisciplinary field in which software programs are developed to process and understand biological data. Bioinformatics is used to understand protein sequences at a greater level of detail. With innovations in modern molecular biology, synthesizing such protein sequences has become relatively easy. Software programs may be used to understand protein sequences, identify a specific protein sequence, and synthesize it. When the software programs provide unrestricted access to the protein sequences, a software program may be abused to identify and synthesize a malicious sequence, such as that of an epidemic virus or bacterium. If a malicious sequence is varied slightly, the software program may not be able to identify it as malicious. Thus, it is challenging to provide software programs with access to protein sequences for analysis, to identify a varied malicious sequence, and to restrict synthesis of malicious sequences.
The claims set forth the embodiments with particularity. The embodiments are illustrated by way of examples and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. Various embodiments, together with their advantages, may be best understood from the following detailed description taken in conjunction with the accompanying drawings.
Embodiments of techniques for malicious sequence detection for gene synthesizers are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of the embodiments. A person of ordinary skill in the relevant art will recognize, however, that the embodiments can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In some instances, well-known structures, materials, or operations are not shown or described in detail.
Reference throughout this specification to “one embodiment”, “this embodiment” and similar phrases, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one of the one or more embodiments. Thus, the appearances of these phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are nucleic acids that express genes associated with living organisms. Artificial gene synthesis is a method used to create artificial genes in a laboratory based on DNA and RNA. Translation is a process by which a protein is synthesized from information contained in RNA. A sequence may be either a DNA/RNA sequence or a protein sequence. For example, DNA/RNA sequencing determines the sequence of individual genes. A sequence may be represented as a string of letters.
Isolation process 118 in
The sequence of interest or gene is provided to gene synthesizer 120. Gene synthesizer 120 may be a combination of hardware and a software application that enables synthesis of DNA, RNA, etc. In the comparison process 122 in the gene synthesizer 120, the isolated gene sequence is translated using an encoding mechanism such as base 4 encoding, so that the sequence is in a compact form for analysis. Base 4 encoding needs two bits per character, whereas an encoding mechanism that uses one byte per character, such as UTF-8, may lead to a string four times longer. The sequence translated using the base 4 encoding format is represented as a bit binary encoding. A sliding window technique is used to parse the bit binary encoding two bits at a time, i.e., one character at a time. The parsed bit sequence is input to a locality sensitive hasher (LSH). The sliding window approach is used to parse a portion of the sequence or a set of bits and add it to an array of buckets. The parsed bit sequence is compared with the previously stored sequences or sets of bits to determine a match. If a match is determined, the number of times the match occurred is also stored. A hash of the sequence of interest is generated based on the bit binary encoding, the array of buckets, etc.
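The base 4 translation step can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the character-to-digit mapping A=0, C=1, G=2, U=3 that the ‘ACGU’ example in this description uses, and the function names and the zero-padding of the last byte are hypothetical choices made for the sketch.

```python
# Assumed mapping, taken from the 'ACGU' -> '0123' example in the description.
BASE4 = {"A": 0, "C": 1, "G": 2, "U": 3}


def to_base4(sequence: str) -> str:
    """Return the base 4 digit string for an RNA sequence."""
    return "".join(str(BASE4[ch]) for ch in sequence.upper())


def to_bits(sequence: str) -> str:
    """Return the bit binary encoding, two bits per character."""
    return "".join(format(BASE4[ch], "02b") for ch in sequence.upper())


def to_bytes(sequence: str) -> bytes:
    """Pack the bit string into bytes; zero-padding the tail is an assumption."""
    bits = to_bits(sequence)
    bits += "0" * (-len(bits) % 8)  # pad up to a whole number of bytes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

Under these assumptions, four characters fit in one byte, so an eight-character sequence packs into two bytes instead of the eight bytes a one-byte-per-character encoding would need.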
The generated hash is compared with a list of malicious hashes corresponding to malicious sequences to identify a match. Based on the extent of the match, a similarity score is computed, and a result with similarity score 124 is displayed in a user interface. If the similarity score is above a threshold score, the sequence of interest or gene is determined to be malicious, is prevented from being synthesized, and is sent for further analysis. The threshold score may be a user-defined or pre-defined threshold score that can be varied dynamically before analysis of sequences. Since the original malicious sequences are not stored in any database, users may not have direct access to the malicious sequences. Thus, legal requirements are complied with. Even if the sequence of interest is a variant, such as a phenotype, of any malicious sequence, the LSH is capable of identifying it.
Consider a portion of sequence ‘ACGU’ 212. The base 4 encoding corresponding to this portion is ‘0123’ 214, and the bit binary encoding is ‘00011011’ 216. The bit binary encoding ‘00011011’ 216 is 8 bits long, and these 8 bits represent one byte. The sliding window technique is used to perform byte level parsing of the bit binary encoding ‘00011011’ 216. The sequence ‘ACGU’ 212 represented by the bit binary encoding ‘00011011’ 216 has to be parsed one character at a time, but byte level parsing covers four characters at a time. Therefore, when the bit binary encoding ‘00011011’ 216 is parsed using the sliding window technique, the bit binary encoding is shifted by 2 bits, as shown in 218. The bit binary encoding ‘00011011’ 218 is shifted by its leading 2 bits ‘00’, and the next 2 bits ‘00’ corresponding to character ‘A’ 220 in sequence ‘C’ 202 are concatenated at the end of the bit binary encoding, as shown in 222. The sliding window parses the bit binary encoding ‘01101100’ 222. The bit binary encoding ‘01101100’ 222 is shifted by its leading 2 bits ‘01’, and the next 2 bits ‘11’ corresponding to character ‘U’ 224 in sequence ‘C’ 202 are concatenated at the end, as shown in 226. The sliding window parses the bit binary encoding ‘10110011’ 226. The bit binary encoding ‘10110011’ 226 is shifted by its leading 2 bits ‘10’, and the next 2 bits ‘00’ corresponding to character ‘A’ 228 in sequence ‘C’ 202 are concatenated at the end, as shown in ‘11001100’ 230. The sliding window parses the bit binary encoding ‘11001100’ 230. Alternating between parsing the window and shifting by two bits results in advancing two bits at a time, i.e., one character of the sequence at a time. This process continues until the complete sequence is parsed.
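The shift-and-concatenate step walked through above can be sketched as follows. The sketch assumes the A=0, C=1, G=2, U=3 mapping of the ‘ACGU’ example and assumes that the sequence continues with the characters ‘A’, ‘U’, ‘A’ as in the walkthrough; the function name `byte_windows` is hypothetical.

```python
# Assumed mapping, taken from the 'ACGU' -> '0123' example in the description.
BASE4 = {"A": 0, "C": 1, "G": 2, "U": 3}


def byte_windows(sequence: str):
    """Yield each 8-bit window as a bit string, advancing one character per step."""
    window = 0
    for i, ch in enumerate(sequence):
        # Shift out the leading 2 bits, append the next character's 2 bits,
        # and keep only the low 8 bits so the window stays one byte wide.
        window = ((window << 2) | BASE4[ch]) & 0xFF
        if i >= 3:  # the window is full once four characters have been read
            yield format(window, "08b")


for bits in byte_windows("ACGUAUA"):
    print(bits)
```

Run on ‘ACGUAUA’, the loop prints ‘00011011’, ‘01101100’, ‘10110011’, and ‘11001100’, matching the encodings at 216, 222, 226, and 230.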
The parsed binary encoding string is an input to a locality sensitive hasher (LSH). LSH identifies similarities between objects using probability distributions over hash functions. Similar inputs are likely to have the same or similar hashes. Accordingly, even if sequences vary slightly, they are likely to have similar hashes. Various algorithms or hash functions may be used in LSH. In the illustration below, the Trend Micro Locality Sensitive Hash (TLSH) function may be used. In the TLSH function, a sliding window approach is used to parse a sequence of 5 bytes, i.e., 20 characters, at a time to populate an array of buckets. The parsed sliding window content is compared with the sliding window content previously stored in the array of buckets to determine whether a match can be identified. If the parsed sliding window content does not match any previously stored content, it is added to a new bucket in the array of buckets, and parsing of the bit binary encoding using the sliding window continues.
If the parsed window content matches an entry in the array of buckets, the number of times the match is identified is recorded in the corresponding bucket count. This process continues iteratively until the complete bit binary encoding is parsed using the sliding window approach and the contents of the sliding window are added to the array of buckets. The quartiles of the bucket counts are computed such that 75% of the bucket counts are greater than or equal to the first quartile (q1), 50% of the bucket counts are greater than or equal to the second quartile (q2), and 25% of the bucket counts are greater than or equal to the third quartile (q3). A quartile is a type of quantile: q1 is the middle value between the smallest bucket count and the median of the bucket counts, q2 is the median of the bucket counts, and q3 is the middle value between the median and the highest bucket count. A hash is generated based on the bit binary encoding, the quartiles q1, q2, and q3, the array of buckets, etc., as shown in hash ‘H2’ 232 in
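The bucket counting and quartile computation described above can be sketched as follows. This is a simplified illustration that follows the wording of this description rather than any reference TLSH implementation: each distinct window is treated as its own bucket, windows are taken over characters (20 characters correspond to 5 bytes at two bits per character), and the quartiles use the median-of-halves definition given above. The function names are hypothetical, and at least two buckets are assumed.

```python
from collections import Counter
from statistics import median


def bucket_counts(sequence: str, window_chars: int = 20) -> Counter:
    """Count how many times each sliding window content occurs."""
    counts = Counter()
    for start in range(len(sequence) - window_chars + 1):
        # A new window content creates a bucket; a repeat increments its count.
        counts[sequence[start:start + window_chars]] += 1
    return counts


def count_quartiles(counts):
    """Return (q1, q2, q3) of the bucket counts using the median-of-halves rule."""
    values = sorted(counts)
    mid = len(values) // 2
    lower = values[:mid]
    # For an odd number of counts the median itself is left out of both halves.
    upper = values[mid + 1:] if len(values) % 2 else values[mid:]
    return median(lower), median(values), median(upper)


# A short window is used here only to keep the example readable.
buckets = bucket_counts("ACGUACGU", window_chars=4)
q1, q2, q3 = count_quartiles(buckets.values())
```

In this small example the window ‘ACGU’ occurs twice and the other three windows occur once each, so the quartiles summarize the counts 1, 1, 1, and 2.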
Consider a malicious RNA sequence ‘R’ 234, and a hash ‘H1’ 236 generated for the sequence ‘R’ 234 as shown in
LSH 308 parses the bit sequences and generates a hash value referred to as LSH value 310. The LSH value 310 may be generated for a complete sequence, a portion of a sequence, or a sub-sequence. The generated LSH value 310 is compared with a list of malicious hashes corresponding to malicious sequences to identify a match. Based on the extent of the match, a similarity score is computed. A user or an application may define a threshold of similarity score. If the computed similarity score is above the user-defined threshold of similarity score, the sequence of interest or gene is determined to be malicious and is sent to the output queue for critical/rejected sequences 312. If the computed similarity score is below the user-defined threshold of similarity score, the sequence of interest or gene is determined to be non-malicious and is sent to the output queue for acceptable sequences 314.
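The comparison and routing step can be sketched as follows. The description does not specify how the similarity score is computed, so the score used here (the fraction of equal bits between two equal-length hexadecimal hash strings) is purely an assumption for illustration. The names `similarity` and `route` are hypothetical, and a score exactly equal to the threshold is treated as acceptable, which the description leaves open.

```python
from collections import deque


def similarity(hash_a: str, hash_b: str) -> float:
    """Hypothetical score: fraction of equal bits between two hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    total_bits = 4 * len(hash_a)
    differing = bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
    return 1.0 - differing / total_bits


def route(lsh_value, malicious_hashes, threshold, rejected, accepted):
    """Queue the hash as rejected if its best score is above the threshold."""
    best = max((similarity(lsh_value, m) for m in malicious_hashes), default=0.0)
    target = rejected if best > threshold else accepted
    target.append((lsh_value, best))
    return best


rejected_queue = deque()   # stands in for output queue 312
accepted_queue = deque()   # stands in for output queue 314
route("fe", ["ff"], 0.8, rejected_queue, accepted_queue)
route("00", ["ff"], 0.8, rejected_queue, accepted_queue)
```

Only hashes are compared in this sketch, which is consistent with the point made earlier that the original malicious sequences need not be stored.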
Some embodiments may include the above-described methods being written as one or more software components. These components, and the functionality associated with each, may be used by client, server, distributed, or peer computer systems. These components may be written in a computer language corresponding to one or more programming languages such as functional, declarative, procedural, object-oriented, lower level languages and the like. They may be linked to other components via various application programming interfaces and then compiled into one complete application for a server or a client. Alternatively, the components may be implemented in server and client applications. Further, these components may be linked together via various distributed programming protocols. Some example embodiments may include remote procedure calls being used to implement one or more of these components across a distributed programming environment. For example, a logic level may reside on a first computer system that is remotely located from a second computer system containing an interface level (e.g., a graphical user interface). These first and second computer systems can be configured in a server-client, peer-to-peer, or some other configuration. The clients can vary in complexity from mobile and handheld devices, to thin clients and on to thick clients or even other servers.
The above-illustrated software components are tangibly stored on a computer readable storage medium as instructions. The term “computer readable storage medium” should be taken to include a single medium or multiple media that stores one or more sets of instructions. The term “computer readable storage medium” should be taken to include any physical article that is capable of undergoing a set of physical changes to physically store, encode, or otherwise carry a set of instructions for execution by a computer system which causes the computer system to perform any of the methods or process steps described, represented, or illustrated herein. Examples of computer readable storage media include, but are not limited to: magnetic media, such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute, such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and ROM and RAM devices. Examples of computer readable instructions include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment may be implemented using Java, C++, or other object-oriented programming language and development tools. Another embodiment may be implemented in hard-wired circuitry in place of, or in combination with machine readable software instructions.
A data source is an information resource. Data sources include sources of data that enable data storage and retrieval. Data sources may include databases, such as relational, transactional, hierarchical, multi-dimensional (e.g., OLAP), object oriented databases, and the like. Further data sources include tabular data (e.g., spreadsheets, delimited text files), data tagged with a markup language (e.g., XML data), transactional data, unstructured data (e.g., text files, screen scrapings), hierarchical data (e.g., data in a file system, XML data), files, a plurality of reports, and any other data source accessible through an established protocol, such as Open Database Connectivity (ODBC), produced by an underlying software system (e.g., ERP system), and the like. Data sources may also include a data source where the data is not tangibly stored or is otherwise ephemeral, such as data streams, broadcast data, and the like. These data sources can include associated data foundations, semantic layers, management systems, security systems and so on.
In the above description, numerous specific details are set forth to provide a thorough understanding of embodiments. One skilled in the relevant art will recognize, however, that the embodiments can be practiced without one or more of the specific details or with other methods, components, techniques, etc. In other instances, well-known operations or structures are not shown or described in detail.
Although the processes illustrated and described herein include series of steps, it will be appreciated that the different embodiments are not limited by the illustrated ordering of steps, as some steps may occur in different orders, some concurrently with other steps apart from that shown and described herein. In addition, not all illustrated steps may be required to implement a methodology in accordance with the one or more embodiments. Moreover, it will be appreciated that the processes may be implemented in association with the apparatus and systems illustrated and described herein as well as in association with other systems not illustrated.
The above descriptions and illustrations of embodiments, including what is described in the Abstract, are not intended to be exhaustive or to limit the one or more embodiments to the precise forms disclosed. While specific embodiments of, and examples for, the one or more embodiments are described herein for illustrative purposes, various equivalent modifications are possible within the scope, as those skilled in the relevant art will recognize. These modifications can be made in light of the above detailed description. Rather, the scope is to be determined by the following claims, which are to be interpreted in accordance with established doctrines of claim construction.