The present invention relates generally to lossy compression of a network that can be quickly and fully or partially decompressed (fully restored), and more specifically to a method for reduction of required computer resources when using a very large network that comprises a similarity graph such as Protein Connectivity Network (PCN).
In a network that comprises a similarity graph there is similarity between neighboring nodes above a predefined threshold (e.g. 60%). An example of such a similarity graph is a Protein Connectivity Network (“PCN”). A PCN is a graph that can be used to solve different problems of computational biology, mainly to assist in the prediction of protein structure and functionality. The PCN consists of nodes that are small fragments of protein sequences, and an edge between nodes reflects high similarity between fragments. Each node is described by an index, the protein it belongs to, and its offset within that protein.
If a protein database contains over 320,000 proteins, that builds up to more than 4.5×10⁷ nodes and over 4.7×10⁸ edges. The size of the graph requires massive storage space, and executing queries is time consuming. For perspective, the STRING Consortium database presently has a collection of 9.6 million proteins covering a mere 2031 organisms. Now consider that here we are examining peptide fragments falling within certain selected ranges of amino acid residue length, that each position can be any of 25 amino acids, and that the fragments are potentially from random parts, with random overlaps, of an otherwise unknown protein that we are seeking to characterize functionally. It is easy to see how massively complex, and resource-draining, a query to a naïve network can be.
It is an object of the disclosed technique to provide a novel method for network lossy compression.
In accordance with the disclosed technique, there is thus provided a method of compressing a network characterized by nodes with a plurality of repetitions of character sequences and a plurality of edges. Each edge in the network connects a pair of nodes based on a first similarity threshold. The method comprises clustering of said nodes according to a second similarity threshold that is higher than said first similarity threshold.
According to some embodiments of the present invention, the method is for a network that is a Protein Connectivity Network (“PCN”).
According to some other embodiments of the present invention, the method further comprising the following steps: calculating similarity value between the nodes of each edge to identify nodes having similarity above the second similarity threshold value and performing the following steps for the identified nodes: (i) confirming whether the identified nodes are associated to a cluster; (ii) creating new clusters for identified nodes not previously associated to a cluster, and assigning the new cluster as root cluster; (iii) adding an unassociated node to the root cluster of an associated node in case only one node is associated to a cluster, as long as the number of nodes associated to the root cluster of the associated node is less than a predefined value; and (iv) merging two root clusters of the nodes of edge into one of the clusters in case the two nodes are associated to different root clusters, and sum of numbers of nodes associated to these root clusters is less than the predefined value.
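The steps (i) to (iv) above may be sketched, by way of non-limiting illustration only, as a size-capped union-find pass over the edges. The names `cluster_nodes`, `find_root`, `similarity`, `second_threshold` and `max_size` are hypothetical and are not taken from the disclosure; the sketch assumes that edges connect distinct nodes and follows the "less than the predefined value" wording literally.

```python
UNASSIGNED = -1  # hypothetical marker for a node not yet in any cluster


def find_root(parent, c):
    """Follow parent pointers from cluster c to its root, compressing the path."""
    root = c
    while parent[root] != root:
        root = parent[root]
    while parent[c] != root:
        parent[c], c = root, parent[c]
    return root


def cluster_nodes(num_nodes, edges, similarity, second_threshold, max_size):
    """Cluster nodes joined by edges whose similarity meets the second threshold."""
    parent = []  # parent[c] == c marks a root cluster
    size = []    # node count, meaningful for root clusters only
    node_cluster = [UNASSIGNED] * num_nodes
    for u, v in edges:
        if similarity(u, v) < second_threshold:
            continue
        cu, cv = node_cluster[u], node_cluster[v]
        if cu == UNASSIGNED and cv == UNASSIGNED:
            # (ii) neither node is clustered: open a new root cluster of two nodes
            parent.append(len(parent))
            size.append(2)
            node_cluster[u] = node_cluster[v] = len(parent) - 1
        elif cu == UNASSIGNED or cv == UNASSIGNED:
            # (iii) one node is clustered: add the other to its root if there is room
            free, taken = (u, cv) if cu == UNASSIGNED else (v, cu)
            root = find_root(parent, taken)
            if size[root] < max_size:
                size[root] += 1
                node_cluster[free] = root
        else:
            # (iv) both nodes are clustered: merge the roots if the sum stays below the cap
            ru, rv = find_root(parent, cu), find_root(parent, cv)
            if ru != rv and size[ru] + size[rv] < max_size:
                parent[rv] = ru
                size[ru] += size[rv]
    return parent, size, node_cluster
```

In this sketch a merged cluster keeps its entry in the list and simply points to the surviving root, so root clusters can later be counted and renumbered in a separate pass.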
According to some other embodiments of the present invention the method further comprising creating an empty dynamic list for cluster entries, each cluster entry comprising a pointer variable pointing to a parent cluster and a node number variable corresponding to the number of nodes in a cluster; (a) creating a list of node entries, each node entry comprising a variable indicating the cluster number that the node is assigned to, said variable being initialized as unassigned; (b) the creating of a new cluster further includes: (i) defining a new entry in the list of clusters, as a root cluster with number of nodes equal to two; and (ii) associating both nodes to the root cluster by associating the corresponding node entries in the list of nodes to the root cluster; (c) the adding of an unassociated node to the root cluster of an associated node further includes: (i) searching for the root cluster of the node already associated to this cluster; (ii) updating the corresponding pointer to the parent cluster in the list of clusters to point to the root cluster; and (iii) adding the node to the cluster, when the adding of the unassociated node to the root cluster does not exceed the predefined value, by increasing the number of nodes in the corresponding entry in the list of clusters by one and updating the cluster that the node is associated to in the corresponding entry of the node.
(d) the merging of two root clusters into one of the clusters further includes
According to some other embodiments of the present invention the method further comprising: a) calculating amount of root clusters; b) renumbering the clusters; c) associating of nodes with new numbers of clusters (after renumbering); d) creating an output file of content of clusters; and e) building connections between the clusters for each said edge where nodes of said edges are connected to different clusters.
The disclosed technique will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which:
As mentioned hereinabove, it is an object of the disclosed technique to provide a novel method for network lossy compression. There is thus provided a method of compressing a network characterized by nodes with a plurality of repetitions of character sequences and a plurality of edges. Each edge in the network connects a pair of nodes based on a first similarity threshold. The method comprises clustering of said nodes according to a second similarity threshold that is higher than said first similarity threshold.
The present invention provides an implementation of an efficient platform to execute queries on a reduced network, thus allowing researchers around the globe to use the network in their own research easily and quickly. The reduced network is generated by using compression techniques, such as multilevel approaches based on graph clustering, while allowing an efficient way to quickly restore it (fully or partially) for use in queries and for navigational needs.
A network, as used herein, means a “similarity network” characterized by nodes having some attributes (for example, words of some text, coordinates, and so on), with some function defined on these attributes that allows a similarity (or distance) to be calculated between each pair of nodes (for example, Hamming distance for words, Euclidean distance for coordinates, and so on); and a plurality of edges, each edge connecting a pair of said nodes based on some similarity (or distance) threshold.
The term “node” or “sequence fragment” or “sub sequence” refers hereinafter to a sequence of characters.
As used herein, the term “protein fragment” refers hereinafter to a protein sequence or a part thereof comprising less than about 25 amino acids, preferably between about 15 and 25 amino acids, and more particularly about 20 amino acids.
The term “root cluster” means a cluster that does not point to (i.e. is not included in) another cluster, wherein other clusters can point to it.
The term “parent cluster” means a cluster to which some other cluster points.
The term “usual cluster” refers hereinafter to any cluster in a tree that is not root cluster.
The term “child cluster” refers hereinafter to a cluster which points to another cluster.
The term “Hamming distance” refers hereinafter to the number of positions at which the corresponding symbols of two strings of equal length are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
In the context of the present invention, the term string refers to a sequence of characters. In a non-limiting example of PCN the term string refers to protein sequence or protein fragment, preferably comprising about 20 amino acids and the terms position or symbol refers to a single amino acid within the protein fragment or sequence.
The term “first similarity threshold” refers to the similarity value between the nodes in the original network. For example, in PCN, the similarity value between the nodes corresponding to the protein sequence fragments in the network may be determined according to a Hamming distance between two protein sequence fragments or may be determined according to any other similarity calculation method. If this value is higher than or equal to the first similarity threshold, for example 60% identity, the nodes are connected by an edge and become neighboring.
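By way of a minimal, non-limiting sketch, the Hamming distance and the resulting identity test against the first similarity threshold may be expressed as follows. The function names and the default threshold of 0.60 are illustrative assumptions; fragments are assumed to be non-empty strings of equal length.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def identity(a: str, b: str) -> float:
    """Fraction of positions at which the two fragments carry the same symbol."""
    return (len(a) - hamming_distance(a, b)) / len(a)


def is_edge(a: str, b: str, first_threshold: float = 0.60) -> bool:
    """True when the fragments are similar enough to be neighboring nodes."""
    return identity(a, b) >= first_threshold
```

For fragments of 20 amino acids, a 60% identity threshold in this sketch corresponds to a Hamming distance of at most 8.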
The term “second similarity threshold” refers to the similarity threshold that influences the construction of clusters in the compressed network. It defines when joining of two neighboring nodes into the same cluster should happen.
The term “edge” is defined hereinafter as the link between the corresponding nodes of protein fragments having sufficiently high sequence-wise similarity to satisfy a predefined threshold. According to one exemplary embodiment, an edge is defined as the link between nodes of amino acid sequence similarity of 60% or more.
The term “relatedness” or “resistance” refers hereinafter to similarity or dissimilarity between nodes in a network and in PCN to protein fragments or sequences, determined according to predefined weights or properties.
The term “lossy compression” refers to usage of approximate data or partial data to demonstrate content.
As described hereinabove, an example of such a huge network is a Protein Connectivity Network (“PCN”). A PCN can be very large in size, requiring many gigabytes of memory, both persistent and active, and consuming a considerable amount of computing resources and runtime when used for executing queries.
The purpose of the present invention is to compress such a very large network characterized by a first similarity threshold between neighboring nodes and nodes with a plurality of repetitions of sequence of characters, by using a clustering algorithm.
A compression is performed by dividing a huge network such as a PCN into a set of clusters, where the clusters are considered as super-nodes in the compressed network. The method is similar to the multilevel approach described in “Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445-452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, N. Mex., 1993”, with the super-nodes calculated as clusters.
In the new compressed network, only information about cluster content and connections is stored (clusters are connected if at least one connection between corresponding nodes of the clusters exists in the original network). This approach conserves a significant amount of space, while maintaining the general structure of the network. In other words, the disclosed technique creates a smaller graph in which each group of nodes is well connected and loose nodes are removed.
The compression is achieved by eliminating the need to save edges internal to the clusters and multiple edges connecting any two clusters (i.e. if two clusters are connected by several edges, these will correspond to only one edge in the compressed network).
In other words, the compression is based mainly on omitting all edges between two nodes inside a cluster. It is effective because restoring the edges is performed by calculating the similarity between relatively small finite groups. However this approach has two implications. On one hand, if the clusters are too small and only a small amount of edges can be removed then, a very small compression of the network may be generated. On the other hand, if the clusters are too big, restoring them to the original network state won't be feasible in a reasonably reduced amount of time. Therefore, in order to prevent generation of huge clusters the size of the clusters is limited to maximal size which is defined by the user.
In an exemplary embodiment, one approach to handle interconnecting edges between two different clusters or between a cluster and an external node may comprise retaining only one edge between connected clusters. While this approach may yield a great compression of the network, it may also cause a much longer recovery time since the similarity between each node pair within the connected clusters has to be calculated.
Another exemplary embodiment of the present invention includes putting a weight on the edge between clusters that indicates how many interconnecting edges there are.
The present invention may be very effective for similarity graphs in general and specifically for PCN, because a high level of compression can be achieved. Moreover, the original network can be quickly reconstructed in spite of the fact that the compression is “lossy”. The extremely fast run time of the decompression is due to (i) the fact that the time of reconstruction of the edges in a similarity graph is O(n²), where n is the number of nodes, so that the reconstruction for many small groups (clusters) can be much quicker than for one large group (the whole graph); and (ii) an effective approach to clustering that limits the maximal size of a cluster, so that the data lost is not great compared to the compression achieved. Loading the reduced network into memory allows performing very fast traversing queries over the network with little or no overhead of redundant input/output calls.
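The quadratic restoration cost discussed above may be illustrated by the following minimal sketch, which recomputes the edges inside one cluster by comparing every pair of its members. The function names, the `sequences` mapping and the use of Hamming-based identity are assumptions made for illustration; any other similarity function could be substituted.

```python
from itertools import combinations


def restore_cluster_edges(members, sequences, first_threshold):
    """Recompute the omitted edges inside one cluster by pairwise comparison."""
    edges = []
    for u, v in combinations(members, 2):
        a, b = sequences[u], sequences[v]
        matches = sum(x == y for x, y in zip(a, b))
        if matches / len(a) >= first_threshold:
            edges.append((u, v))
    return edges


def pair_count(cluster_sizes):
    """Total number of pairwise comparisons needed to restore all clusters."""
    return sum(m * (m - 1) // 2 for m in cluster_sizes)
```

Under these assumptions, `pair_count([1000] * 1000)` is 499,500,000 comparisons for one million nodes split into clusters of 1000, against `pair_count([1_000_000])`, which is 499,999,500,000 comparisons for the same nodes treated as a single group.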
Additionally, there are many tasks where the compressed network can be used without first being decompressed. For example, the task of sequence annotation of proteins does not require reconstruction of the original network from the compressed network, i.e. it may be performed on a compressed network.
Compressing a very large PCN, on the order of several gigabytes in size, down to mere tens of megabytes in size according to the disclosed techniques enables storing, searching, and querying such a very large PCN efficiently and relatively quickly. The entire reduced PCN may be loaded into a machine's operating memory, and the runtime complexity of compression is linear in the number of edges. A node in the network represents a protein sequence or a fragment or subsequence thereof. A node in the network may be bound by edges to one or more other protein sequences represented by nodes in the network.
An embodiment of the present invention will be explained below referring to the drawings.
As shown in
In an exemplary embodiment of the invention, where the network is a PCN, one purpose of the disclosed method is to build “biologically justified” or rational clusters as sub-graphs of the original PCN. The original PCN consists of nodes connected by edges that satisfy the first similarity threshold value, whereas the edges within the clusters connect nodes (i.e. peptide sequences) with a similarity threshold value higher than that of the original PCN.
The calculations of similarity are based on finding the connected components of a subgraph of the original network obtained with the increased similarity threshold. The similarity can be calculated on the basis of the Hamming distance (see Damian Szklarczyk, Andrea Franceschini, Michael Kuhn, Milan Simonovic, “The STRING database in 2011: Functional,” Nucleic Acids Research, vol. 39, pp. 561-568, 2011) or on the resistance value of the corresponding edge, or may be calculated according to any other method.
Exemplary embodiments of the present invention may use an electrical model for defining relatedness through a network. This approach takes into account the network parameters, as they directly influence the electric properties that represent connectivity through the network. Such properties include conductivity or, conversely, resistance. The approach has been more fully disclosed in “Frenkel, Zakharia, Zeev Frenkel, Edward Trifonov and Sagi Snir. Structural relatedness via flow networks in protein sequence space. Journal of Theoretical Biology, London: Elsevier, 2009, Vol. 260, July, p. 438-444. ISSN 0022-5193.”
In a specific case the resistance is calculated as follows:
(1) An electrical voltage of 1 V between the nodes of interest is considered.
(2) The electrical current i between the nodes is calculated. The current through the network may be calculated using Ohm's law and Kirchhoff's current law.
(3) The resistance through the network is then calculated by dividing the voltage by the current through the network. Increasing resistance indicates decreasing similarity and vice versa.
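Steps (1) to (3) may be sketched as follows, purely as an illustration. The sketch assumes that every edge is given an explicit resistance, that the network is connected and free of self-loops, and that a small dense linear solve is acceptable; how resistances are assigned to edges in the cited model is not reproduced here, and the name `effective_resistance` is hypothetical.

```python
def effective_resistance(num_nodes, edges, source, sink):
    """Resistance between source and sink; edges are (u, v, resistance) triples."""
    # Conductance matrix: g[u][v] is the summed conductance between u and v.
    g = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v, r in edges:
        g[u][v] += 1.0 / r
        g[v][u] += 1.0 / r
    # (1) Fix 1 V at the source and 0 V at the sink; other potentials are unknown.
    free = [n for n in range(num_nodes) if n not in (source, sink)]
    index = {n: i for i, n in enumerate(free)}
    k = len(free)
    # (2) Kirchhoff's current law at each free node n: sum_j g[n][j] * (V_n - V_j) = 0.
    a = [[0.0] * (k + 1) for _ in range(k)]
    for n in free:
        row = a[index[n]]
        row[index[n]] = sum(g[n])
        for m in free:
            if m != n:
                row[index[m]] -= g[n][m]
        row[k] = g[n][source]  # contribution of the 1 V source
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row_i: abs(a[row_i][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("network is not connected")
        a[col], a[pivot] = a[pivot], a[col]
        for row_i in range(k):
            if row_i != col:
                factor = a[row_i][col] / a[col][col]
                for c in range(col, k + 1):
                    a[row_i][c] -= factor * a[col][c]
    volt = [0.0] * num_nodes
    volt[source] = 1.0
    for n in free:
        volt[n] = a[index[n]][k] / a[index[n]][index[n]]
    # Ohm's law on every edge leaving the source gives the total current.
    current = sum(g[source][j] * (volt[source] - volt[j]) for j in range(num_nodes))
    if current == 0:
        raise ValueError("no path between source and sink")
    # (3) Resistance is voltage divided by current.
    return 1.0 / current
```

With unit resistances, a single path of two edges gives a resistance of 2, while two such paths in parallel give a resistance of 1, which reflects the intent that more connecting routes indicate higher relatedness.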
To compress the original network into a reduced network consisting of clusters, the following steps are performed and include:
Clustering 1106;
Calculating the amount of root clusters 1108;
Renumbering of the clusters 1110;
Associating of nodes with new numbers of clusters (after renumbering) 1112;
Creating an output file of content of clusters 1114; and
building connections between the clusters for each said edge where nodes of said edges are connected to different clusters 1116 to yield a new PCN 1118 clusters 1120 and are detailed below with reference to
The clustering process begins with creating an empty dynamic list of clusters 10. The structure of the clusters list is as follows: for each cluster, the first member of an entry is a variable used to indicate a pointer to the parent cluster (or to indicate that the cluster is a root, in the case where the pointer is null or points to the cluster itself), and the second member of the structure signifies the number of nodes in this cluster.
After the creating of an empty dynamic list of clusters comes the step of creating an empty list of nodes 14. The structure of the list of nodes is as follows: for each node, a variable indicating the cluster number, or a pointer to the cluster, that the node is assigned to, initialized to an unassigned cluster.
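The two lists described above may be laid out, for example, as in the following sketch. The names `ClusterEntry`, `is_root` and `new_node_list` are hypothetical, and the use of `None` as the unassigned and null marker is an assumption; the sketch only illustrates the entry layout and the root test.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusterEntry:
    parent: Optional[int] = None  # index of the parent cluster; None marks a root
    node_count: int = 0           # number of nodes in this cluster


def is_root(clusters: List[ClusterEntry], index: int) -> bool:
    """A cluster is a root when its pointer is null or points to itself."""
    parent = clusters[index].parent
    return parent is None or parent == index


def new_node_list(num_nodes: int) -> List[Optional[int]]:
    """One entry per node holding its cluster number, initially unassigned."""
    return [None] * num_nodes
```

Storing indices rather than object references keeps both lists compact and lets the cluster list grow dynamically as new clusters are created.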
Computer resources, such as running time, i.e. building time of the compressed network and restoration time, along with disk space consumption (for the compressed network) may be affected by two parameters: a) the second similarity threshold; and b) maximum number of nodes in each cluster.
In
the number of edges in the compressed network;
disk consumption;
restoration time;
number of clusters; and
building time.
A second similarity threshold value for sequence similarity (for the construction of clusters in the compressed network) is predefined. The second similarity threshold value is above the first similarity threshold value in the network before the compression.
Restoring clusters with large amount of nodes will no longer be feasible in a reasonable time frame. Therefore, to prevent huge clusters (i.e. having large number of nodes) the size of the clusters is limited to a pre-set maximal size. A threshold value for a maximal number of nodes in a cluster is predefined. A user may select the parameters according to needs of speed of decompression, size and available disk space.
Thus, for example, in PCN where each position in a protein sequence can be filled with any of 20 amino acid letter values, a researcher skilled in the art may set the second similarity threshold for protein sequence similarity of 80%-90%.
When the clustering of the nodes is performed, for each edge of the original network 20, the following steps are performed:
Referring to
checking if the cluster is a root cluster 324 by either:
The purpose of renumbering of the clusters 340 is to create a new list of clusters that contains only root clusters, and reassign nodes to this list.
Creating a new list of clusters containing only the root clusters 342.
For each node in the list of nodes, reassigning accordingly the cluster the node is associated to 344.
Creating an output file of content of clusters 346.
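The renumbering and output steps above may be sketched as follows. The list layout (`parent[c] == c` for a root, `-1` for an unassigned node) and the one-line-per-cluster, tab-separated output format are assumptions made for illustration only; the disclosure does not specify a file format.

```python
def renumber_clusters(parent, node_cluster, unassigned=-1):
    """Keep only root clusters, renumber them densely, and reassign the nodes."""
    def root_of(c):
        while parent[c] != c:
            c = parent[c]
        return c

    # New list of clusters containing only the root clusters.
    new_index = {}
    for c in range(len(parent)):
        if parent[c] == c:
            new_index[c] = len(new_index)
    # Reassign every node to the new number of its root cluster.
    new_assignment = [
        unassigned if c == unassigned else new_index[root_of(c)]
        for c in node_cluster
    ]
    contents = [[] for _ in new_index]
    for node, c in enumerate(new_assignment):
        if c != unassigned:
            contents[c].append(node)
    return new_assignment, contents


def write_clusters(contents, stream):
    """Write one line per cluster: cluster number, a tab, then its node indices."""
    for number, nodes in enumerate(contents):
        stream.write(f"{number}\t{' '.join(map(str, nodes))}\n")
```

The length of the returned `contents` list is the number of root clusters, so the counting step falls out of the same pass.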
The clusters that were found are used to create a reduced network which does not include any edges inside the clusters. These removed edges account for the data loss, and the limited size of the clusters ensures a short restoration time. However, unlike other compression methods, which are used only to save storage, the present invention yields a reduced network which may be used for queries.
Building the reduced network 350 is performed as follows:
For each edge in the original network 352, perform the following steps:
If the nodes of the edge are associated to different clusters in the corresponding entry in the list of nodes, set a connection between the two clusters if not already connected 356.
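A minimal sketch of this step is given below. It also counts how many original edges join each pair of clusters, in the manner of the weighted-edge embodiment mentioned earlier; the function name is hypothetical, and skipping edges that touch an unassigned node is an assumption consistent with loose nodes being removed.

```python
from collections import Counter


def build_reduced_network(edges, node_cluster, unassigned=-1):
    """Connect clusters that share at least one original edge, counting such edges."""
    weights = Counter()
    for u, v in edges:
        cu, cv = node_cluster[u], node_cluster[v]
        if cu == unassigned or cv == unassigned or cu == cv:
            continue  # internal edges and edges to loose nodes are not stored
        weights[(min(cu, cv), max(cu, cv))] += 1
    return weights
```

The keys of the returned counter are the connections of the compressed network; the counts may be kept as edge weights or discarded when only one edge per cluster pair is retained.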
With reference to
A. An empty new network with a number of nodes equal to the number of proteins in the protein database is created 400, and every connection between the nodes is initialized as not connected 410.
B. For each pair of nodes in the original PCN 420, if the nodes belong to different proteins 430 and no connection is set already between them 440, build a connection between the two nodes in the new network 450.
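Steps A and B may be sketched as follows, assuming that "each pair of nodes" refers to each pair of nodes connected by an edge in the original PCN and that `node_protein` (a hypothetical name) maps a node index to the index of the protein it belongs to.

```python
def build_protein_network(node_pairs, node_protein):
    """Connect two proteins when any of their fragments are connected in the PCN."""
    connections = set()  # step A: the new network starts with no connections
    for u, v in node_pairs:
        pu, pv = node_protein[u], node_protein[v]
        if pu != pv:
            # Step B: the set ignores a connection that is already present.
            connections.add((min(pu, pv), max(pu, pv)))
    return connections
```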
Another exemplary embodiment of a method for network compression comprises first building the network, i.e. generating the edges, by walking through the protein sequence space (the network being part of the input for the other methods described above), then applying a clustering algorithm on these edges to generate the list of clusters, and then connecting the clusters using the same steps as in the first method.
The purpose of building first the PCN network is to provide efficient clustering when performing the clustering stage (many internal connections and few external connections).
The input is as follows:
Protein sequence database;
Parameters for the building of the network: size of words and Hamming distance threshold for edge setting;
The sequence relatedness is commonly established by the observation of high similarity between two compared sequences. If more sequences are found that are related to one of the original sequences, a network can be formed, from connected points (sequences) in the sequence space. By further comparing all the points with all sequences available, keeping the threshold of pair-wise similarity constant, one generates an exhaustive network of sequence kinship.
Building the PCN network is the same task as solving the k-mismatch search problem (alternatively called “string matching with k mismatch”). Any algorithm solving “approximate string matching” may be used, particularly the one described in “Evolutionary Networks in Formatted Protein Sequence Space, Journal of computational biology, 2007 October; 14(8):1044-57 by Frenkel, Z. M. and Trifonov, E. N”, Chap 2.2.
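For illustration only, the network-building task may be sketched with a brute-force comparison of all fragment pairs. This is not the cited algorithm: it is a quadratic stand-in that is practical only for small inputs, and the names `build_network`, `word_size` and `max_mismatches` are hypothetical.

```python
from itertools import combinations


def build_network(proteins, word_size, max_mismatches):
    """Cut proteins into fixed-size words and link words within k mismatches."""
    nodes = []  # each node: (protein index, offset within the protein, fragment)
    for p, seq in enumerate(proteins):
        for offset in range(len(seq) - word_size + 1):
            nodes.append((p, offset, seq[offset:offset + word_size]))
    edges = []
    for (i, a), (j, b) in combinations(enumerate(nodes), 2):
        mismatches = sum(x != y for x, y in zip(a[2], b[2]))
        if mismatches <= max_mismatches:
            edges.append((i, j))
    return nodes, edges
```

For words of 20 amino acids and a 60% identity threshold, `max_mismatches` would be 8 in this sketch.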
It is to be understood that the present invention is not limited to the embodiments described above, but encompasses any and all embodiments within the scope of the following claims. It will be appreciated by persons skilled in the art that the disclosed technique is not limited to what has been particularly shown and described hereinabove. Rather the scope of the disclosed technique is defined only by the claims, which follow.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IL2016/051220 | 11/10/2016 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62253708 | Nov 2015 | US |