1. Technical Field
This disclosure relates generally to distributed computer networks and in particular to techniques for maintaining the contents of selected objects (e.g., directories, files, and the like) in synchronization (“sync”) across multiple participating computers (a “sync network”).
2. Background of the Related Art
Remote access technologies, products and systems enable a user of a remote computer to access and control a host computer over a network. Internet-accessible architectures that provide their users with remote access capabilities (e.g., remote control, file transfer, display screen sharing, chat, computer management and the like) also are well-known in the prior art. Typically, these architectures are implemented as a Web-based “service,” such as LogMeIn, GoToMyPC, WebEx, Adobe Connect, and others. An individual (or subscriber) who uses the service has a host computer that he or she desires to access from a remote location. Using the LogMeIn service, for example, the individual can access his or her host computer using a client computer that runs web browser software.
Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads among peers. Peers are equally privileged, equipotent participants in the application. They form a peer-to-peer network of nodes. Peers make a portion of their resources, such as processing power, disk storage or network bandwidth, directly available to other network participants, without the need for central coordination.
A real-time file synchronization application that ensures selected storage objects (e.g., folders or files) always have identical content on all participating computers generates significant network traffic, and keeping such a highly-volatile network connected demands substantial computational power and bandwidth.
This disclosure describes a technique to build an optimal network that is capable of transferring a virtually unlimited amount of data between a virtually unlimited number of computers using a de-centralized approach to minimize operating and computational costs.
According to this disclosure, one or more synchronization networks are constructed, preferably by arranging an ordered list of resource identifiers (e.g., IP addresses and/or other parameters) into connected circle graphs, ensuring geographic optimization where nearby computers can communicate with each other efficiently. This approach minimizes bandwidth usage, reduces response time and distributes the computational load across participating computers.
According to a more specific aspect, a method is operated at a coordinating entity to organize a set of hosts into a synchronization network. The coordinating entity maintains information that a particular host is online and available to be organized into the synchronization network. To that end, the coordinating entity assigns an identifier (a node identifier) to each host that is online, and that identifier is unique within the particular synchronization network. The coordinating entity then orders the node identifiers for the set of hosts (based on a given characteristic of the node identifiers). In one embodiment, the ordering organizes the synchronization network in a particular topology, such as a circle. Based on the ordering, the coordinating entity provides each host that is online with a list of K online hosts to enable each host to establish and maintain connections with K of its neighbor hosts. In this scheme, K is a value that is the same for all hosts within the synchronization network.
The foregoing has outlined some of the more pertinent features of the subject matter. These features should be construed to be merely illustrative. Many other beneficial results can be attained by applying the disclosed invention in a different manner or by modifying the invention as will be described.
For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The following describes a technique to build a network capable of transferring a virtually unlimited amount of data between a virtually unlimited number of computers, with low operating and computational costs. For convenience of description, the network is sometimes referred to herein as a “synchronization network” (or a “sync network”). Using the remote access architecture of
In this architecture, the gateway 210 coordinates hosts 202, 204, 206 and/or 208 using a control connection (dotted line 214), and the gateway 210 also relays data between hosts that are not capable of communication in a non-relayed manner (solid line 216). The database (DB) 212 holds host details (e.g., name, ID, online status, description, IP address, geo-location, and the like), assignments of hosts to synchronization networks, and other information as necessary to build up and optimize networks, as will be described in more detail below. Together, the gateway 210 and its associated database 212 comprise a coordinating entity that organizes and maintains a synchronization network. The data connection (the solid line 216) is a connection that transfers application data between hosts. The control connection (the dotted line 214) is a connection that hosts use to communicate with the gateway 210. Preferably, no application data is transferred over the control connection, although this is not a limitation.
According to this disclosure, a plurality or set of hosts are organized by the coordinating entity into one or more synchronization network(s), and (as noted above) there may be multiple gateways/databases. The number of hosts that comprise a particular synchronization network at any time may vary, as one or more hosts “leave” the network or “join” the network. Thus, the synchronization network (as an overall entity) is a dynamic construct that may be changing randomly or deterministically.
For convenience, a host is sometimes referred to herein as a participating machine (typically, e.g., hardware and software components supported thereon). For discussion purposes, a machine is referred to sometimes as a “computer” but this is not a limitation in that it implies a fixed computing resource. A participating machine may be a handheld or mobile device, such as any wireless client device, e.g., a cellphone, pager, a personal digital assistant (PDA, e.g., with GPRS NIC), a mobile computer with a smartphone client, or the like. More generally, a host is an endpoint of a communication. According to this disclosure, with respect to a particular synchronization network, endpoints are first assigned a unique ID (NodeID) (or, more generally, a “node identifier”) and ordered into a graph, such as a “circle” graph. Each node is then responsible for ensuring it is always connected to its nearest K neighbors where K≧2. By selecting the NodeID prudently, this approach guarantees that related computers are close to each other in a circle graph. Preferably, the logic for creating the NodeID is maintained in the coordinating entity database 212; as a consequence, an optimization can be changed at any time without endpoint deployment. As will also be described in more detail below, this circle graph can be further optimized by introducing so-called “crossover” connections between distant nodes.
Thus, according to an embodiment, a unique ID (NodeID) is created and assigned to each endpoint in a synchronization network. As noted above, a particular machine may be (and often is) an endpoint in more than one synchronization network. In one embodiment, endpoints form a circle (or circular) network in which they are ordered by their NodeID. NodeIDs are configured such that related computers are near each other. For example, to organize computers by their geographical distance (which is not a limitation), a NodeID utilizes an IP address with some additional network related parameters. This embodiment is described in more detail below.
The following provides additional details regarding this technique. By way of background, the problem of making sure that a network (graph) with arbitrary topology is connected requires that the coordinating entity be aware when computers are added to the network, computers are removed from the network, computers come online, and computers go off-line. According to a feature of this disclosure, the computational cost for keeping the network “connected” is dramatically reduced by using a graph topology, such as a circle graph. This is because a circle graph always stays connected if all nodes have at least K≧2 neighbors, where K is the degree of a node.
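The connectivity property described above can be illustrated with a minimal sketch. It assumes the simplest case, K=2, in which each node links to its immediate predecessor and successor in NodeID order; the function names `ring_edges` and `is_connected` are hypothetical and do not appear in the disclosure. The sketch rebuilds the ring from whichever nodes are currently online and checks, by breadth-first search, that the result is a single connected component.

```python
from collections import deque


def ring_edges(node_ids):
    """Link each node to its predecessor and successor in NodeID order (K=2)."""
    ring = sorted(node_ids)
    n = len(ring)
    adjacency = {node: set() for node in ring}
    for i, node in enumerate(ring):
        for neighbor in (ring[(i + 1) % n], ring[(i - 1) % n]):
            if neighbor != node:  # a single-node ring has no neighbors
                adjacency[node].add(neighbor)
    return adjacency


def is_connected(adjacency):
    """Breadth-first search: True if every node is reachable from the first."""
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(adjacency)
```

In this sketch, when hosts go off-line the ring is simply rebuilt from the remaining ordered NodeIDs, so no global connectivity analysis is needed; each node only has to know its own nearest online neighbors.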
According to this disclosure, the above network is sustained (for connectivity purposes) by enforcing a pair of rules: (1) nodes within the network are ordered, and (2) each host is always connected with exactly K of its nearest online neighbors.
Thus, according to this technique, when some particular set of hosts need to communicate among each other, a circle network is created. The following algorithm may be used to build up such a network.
With the above as the preferred constraints, the following algorithm is then used to create and maintain a synchronization network:
Thus, according to the disclosed technique, each synchronization network forms a graph, such as a circle graph. Within the circle graph, hosts are ordered by the NodeID. Each host maintains connections with K of its closest neighbors. By default, K is the same for all hosts within a sync network but, preferably, its value depends on the number of hosts (N) in the network.
According to the algorithm, a host is responsible for ensuring it is always connected with exactly K of its nearest online neighbors. Preferably, a host initiates a connection in two steps, which are now described. First, the host indicates to the gateway that a new connection is needed, forwarding a host identifier. This communication also requests that the gateway provide the requesting host with the host's closest online hosts. The gateway executes a “get closest online hosts” routine for this purpose. The gateway responds to the communication and the request by obtaining (from the database) the “K” nearest online neighbors of the requesting node within the network. The gateway provides this “list” to the requesting host. Second, the host, having received the list from the gateway, starts connecting one by one to the nodes identified in the list. If any such connection attempt returns an error, preferably the host calls the gateway and again requests the list of its closest online neighbors.
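The two-step procedure above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the `Gateway` class, the `connect_to_neighbors` function, the `try_connect` callback and the `max_rounds` retry limit are all hypothetical names, "nearest" is taken to mean the fewest positions away in NodeID order (alternating between the two sides of the circle), and an in-memory set stands in for the database.

```python
class Gateway:
    """Stand-in for the coordinating entity; `online` plays the role of the database."""

    def __init__(self, k):
        self.k = k
        self.online = set()

    def get_closest_online_hosts(self, node_id):
        """Return up to K online NodeIDs nearest to node_id on the ordered circle."""
        ring = sorted(self.online | {node_id})
        n = len(ring)
        i = ring.index(node_id)
        closest = []
        for step in range(1, n):
            # Look one position further out on each side of the requesting node.
            for idx in ((i + step) % n, (i - step) % n):
                candidate = ring[idx]
                if (candidate != node_id and candidate not in closest
                        and len(closest) < self.k):
                    closest.append(candidate)
        return closest


def connect_to_neighbors(gateway, node_id, try_connect, max_rounds=5):
    """Host side: fetch the list, connect one by one, re-request the list on error."""
    connected = set()
    for _ in range(max_rounds):  # assumed retry bound, not from the disclosure
        failed = False
        for peer in gateway.get_closest_online_hosts(node_id):
            if peer in connected:
                continue
            if try_connect(peer):
                connected.add(peer)
            else:
                failed = True  # go back to the gateway for a fresh list
                break
        if not failed:
            break
    return connected
```

Here `try_connect` is whatever routine the host uses to open a data connection to a peer; it is expected to return True on success and False on error. The sketch assumes the gateway's view of which hosts are online is refreshed between rounds, so a peer that could not be reached eventually drops out of the returned list.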
When a host loses any of its mandatory connections, preferably it calls the gateway's “get closest online hosts” routine to attempt to maintain the circle graph. The host need not call this function when an extra connection builds up from another host, however.
The technique provides numerous advantages. It provides a highly scalable and optimized network enabling the efficient transfer of data while minimizing operating cost. The technique is much more scalable and cost effective than current approaches, owing to the non-centralized architecture and the unique use and assignment of NodeIDs, which result in an efficient and optimized network. The technique enables hosts to create a peer-to-peer network that spares server and database resources by eliminating the need for centrally maintained connections. The technique optimizes the network from an arbitrary perspective, and the connection logic can be changed in a central database without any need to deploy or change the endpoint software. The technique also is advantageous as it favors P2P connections and limits the number of relayed connections. It also ensures that the nodes that are physically closer are likely to be connected, which provides geographic balancing. In addition, the technique provides a degree of bandwidth balancing, as hosts with stronger bandwidth/capacity typically have more connections.
The “ordering” of the NodeIDs may be based on one or more characteristic(s) of the identifiers. In one non-limiting embodiment, the ordering is based on byte order for the IP addresses of the hosts. In an alternative, the ordering is based on the values generated by concatenating a host IP address with an additional value, where the additional value is one of: a flag representing a P2P capability associated with the host, a load associated with the host, an amount of storage or processing capability associated with the host, a latency associated with the host, geo-data associated with the host, other physical, network and/or content characteristics, or some combination thereof.
Preferably, the NodeID is unique within a particular synchronization network. Preferably, and as noted above, the NodeID is numerical, which facilitates the defining of an ordering amongst the hosts that are participating in the sync network. As noted above, each host maintains connections with K of its closest neighbors, where K is the same for all hosts within a given sync network. The value K may be dynamically configurable. By selecting the NodeID prudently, the technique described herein ensures that related computers are close to each other in a circle graph. By maintaining the logic for creating the NodeID only in the gateway and/or database, the logic can be changed at any time without host deployment changes.
As described above, preferably the NodeID is created such that the most significant bytes are comprised of the public IP address of the host. This approach provides several benefits. First, nodes that are on the same physical local area network (LAN) typically share the same public IP address. Therefore, if two nodes (from the same LAN) are next to each other in the graph, they can easily connect to each other, e.g., by using a P2P connection library that is capable of establishing a direct LAN (or hairpin NAT) connection, and this connection may be made without leaving the LAN (thus resulting in no Internet traffic). Further, nodes that are on the same ISP likely have the same class C or class B subnet; therefore, they will connect to each other as well, thereby minimizing the amount of traffic that has to leave the ISP's network.
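A minimal sketch of such a NodeID follows. It assumes IPv4 addresses and a hypothetical two-byte suffix appended after the four address bytes to keep identifiers unique among hosts sharing a public IP address; the function name `make_node_id` and the suffix width are illustrative choices, not details taken from the disclosure.

```python
import ipaddress


def make_node_id(public_ip, suffix):
    """Build a numeric NodeID whose most significant bytes are the public IP.

    `suffix` (0..65535) is an assumed per-host value used only to keep the
    NodeID unique among hosts behind the same public address.
    """
    ip_bytes = ipaddress.IPv4Address(public_ip).packed  # 4 bytes, big-endian
    return int.from_bytes(ip_bytes + suffix.to_bytes(2, "big"), "big")


def order_hosts(hosts):
    """Sort (name, public_ip, suffix) tuples by NodeID and return the names."""
    return [name for name, ip, suffix in
            sorted(hosts, key=lambda h: make_node_id(h[1], h[2]))]
```

Because the address occupies the high-order bytes, sorting by NodeID is the same as sorting by IP address first and suffix second. Hosts behind the same public address therefore end up adjacent on the circle, and hosts in the same address block end up near each other.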
As a particular circle graph becomes large, communications between distant points can become slow as the number of hops between the hosts increases. To address this issue, a particular circle network can be optimized by creating additional connections between particular (distant) endpoints. These are so-called “crossover” connections, such as connection 502, as indicated in the example network 500 shown in
One technique for implementing such crossover connections is extending the circle graph buildup algorithm (get closest online hosts) to include an arbitrary number (L, where L≧0) of distant hosts, in addition to the ‘K’ nearest online hosts. These connections are not mandatory, so if they fail to build up, the host does not necessarily have to go back to the gateway for a new list. Creating crossover connections between distant nodes that are capable of communicating in a non-relayed manner reduces the amount of relayed traffic, which results in a more cost-efficient and better-performing network.
Although not meant to be limiting, the coordinating entity may select the one or more optimization connections randomly, deterministically, or a combination thereof. In one approach, the coordinating entity selects the furthest host across the diameter of the circle graph.
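One way to combine the deterministic and random selection described above is sketched below. The function name `crossover_hosts` and its parameters are hypothetical. The sketch assumes that `ring` is the list of online NodeIDs already sorted in circle order, that the K mandatory neighbors are split evenly between the two sides of the host, and that the diametrically opposite host is always chosen first, with any remaining slots filled at random from the other distant hosts.

```python
import random


def crossover_hosts(ring, me, k, l, rng=None):
    """Pick up to L distant hosts for optional crossover connections."""
    rng = rng or random.Random()
    n = len(ring)
    i = ring.index(me)
    near = (k + 1) // 2  # assumed reach of the K mandatory neighbors per side
    # A host is "distant" if it lies beyond that reach in both directions.
    distant = [ring[j] for j in range(n)
               if min((j - i) % n, (i - j) % n) > near]
    if l <= 0 or not distant:
        return []
    # Deterministic part: the host directly across the circle.
    opposite = ring[(i + n // 2) % n]
    picks = [opposite] if opposite in distant else []
    # Random part: fill the remaining slots from the other distant hosts.
    rest = [host for host in distant if host not in picks]
    picks += rng.sample(rest, min(l - len(picks), len(rest)))
    return picks[:l]
```

Since these connections are optional, a caller in this sketch would simply skip any candidate that cannot be reached rather than requesting a new list.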
While the above describes a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary, as alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, or the like. References in the specification to a given embodiment indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic.
While the disclosed subject matter has been described in the context of a method or process, the subject disclosure also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including an optical disk, a CD-ROM, and a magneto-optical disk, a read-only memory (ROM), a random access memory (RAM), a magnetic or optical card, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
While given components of the system have been described separately, one of ordinary skill will appreciate that some of the functions may be combined or shared in given instructions, program sequences, code portions, and the like.
Having described our invention, what we now claim is as follows.