The present invention relates to data transmission systems, and more particularly to improving the data transmission through compression.
The delivery of very large data sets is commonplace on today's Internet. For example, software updates, video-on-demand and peer-to-peer downloads of files typically involve data and files whose size can range from a few megabytes to several gigabytes or more. Moreover, the use and download of large data files like video and music over the internet is becoming more and more common among consumers.
Today's Internet has evolved considerably since the early days of the network, and the development of fast data transmission technologies for the consumer has made this possible. It is commonplace for a consumer to have a fixed Internet connection whose speed is on the order of megabits per second. Such speeds already allow the viewing or download of video files, easy download of music, having a data storage on the Internet, transmitting large files over e-mail and many other useful services for the consumer. All these services have been made possible by fixed connections that are significantly faster than those available 10 years ago; a good connection in the last decade would have been a connection of a few hundred kilobits per second.
There are more than four billion devices allowing mobile communication in the world today. At their fastest, the connection speed of these devices to the Internet is on the order of a few megabits per second, which already allows the same kind of useful services that have become commonplace over the fixed Internet. However, the speed of the mobile networks can be clearly lower, e.g. in rural areas. The mobile communication devices can have a large memory space available for the user's desired content. The memory capacity of a multimedia-enabled mobile communication device (e.g. a smartphone) can be more than 10 gigabytes.
Receiving data to the user device from the network and transmitting data to the network therefore requires efficient solutions. One technology that may help in the data transmission is caching, where a file that already exists in the device is not sent again from the network to the device. Caching technology is commonplace in Internet browsers today. Another technology that may help in transmission of files is data synchronization technology such as SyncML. Data synchronization generally allows retransmitting to the device only those files that have been changed or created (so-called fast synchronization) after a previous synchronization (which may be a so-called slow synchronization). Yet another technology that may help in transmission of files is so-called binary delta compression, where only the changed part of a file is transmitted. Unfortunately, these existing technologies are of little help regarding transmission speed in many situations, such as where large new files need to be transmitted from the network to the user device, since according to these existing technologies, complete new files need to be transmitted. These existing technologies may also suffer from other shortcomings such as significant processing overhead.
There is, therefore, a need for a solution that would alleviate the challenges where large files or large amounts of data need to be transmitted between the network and the user device, or between different user devices, or between different network elements.
Now there has been invented an improved method and technical equipment implementing the method, by which the above problems are alleviated. Various aspects of the invention include a method, an apparatus, a server, a client and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, there is offered a method for data transmission at an apparatus using a first data connection. The method comprises forming at least a first client data chunk and a second client data chunk in the memory of the apparatus, wherein the first client data chunk corresponds to a first server data chunk and the second client data chunk corresponds to a second server data chunk, forming a first client digest for the first client data chunk in the memory of the apparatus, forming a second client digest for the second client data chunk in the memory of the apparatus, forming a parent client digest indicative of the first client digest and the second client digest in the memory of the apparatus, sending the parent client digest to a server, in response to the sending of the parent client digest, receiving instructions from the server for forming a first client data item using the first client data chunk and the second client data chunk, and forming the first client data item in the memory of the apparatus using the first client data chunk and the second client data chunk.
According to an embodiment, the method further comprises selecting the first client data chunk using a chunk selection function, wherein the chunk selection function is common for the server and the client.
According to an embodiment, the method further comprises making the first client data chunk and the first server data chunk correspond to each other over a second data connection prior to receiving the parent client digest at the server, wherein the second data connection is faster than the first data connection.
According to an embodiment, the method further comprises forming a plurality of parent client digests using a plurality of client digests in the forming of each parent client digest, and sending the plurality of parent client digests to the server using a digest negotiation protocol.
According to a second aspect, there is offered an apparatus comprising a processor and memory. The memory of the apparatus includes computer program code, and the memory and the computer program code are configured to, with the processor, cause the apparatus to form at least a first client data chunk and a second client data chunk in the memory of the apparatus, wherein the first client data chunk corresponds to a first server data chunk and the second client data chunk corresponds to a second server data chunk, to form a first client digest for the first client data chunk in the memory of the apparatus, to form a second client digest for the second client data chunk in the memory of the apparatus, to form a parent client digest indicative of the first client digest and the second client digest in the memory of the apparatus, to provide the server with access to the parent client digest, in response to the providing of the access to parent client digest, to receive instructions from the server for forming a first client data item using the first client data chunk and the second client data chunk, and to form the first client data item in the memory of the apparatus using the first client data chunk and the second client data chunk.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to select the first client data chunk using a chunk selection function, wherein the chunk selection function is common for the server and the client.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to monitor the access of the first client data chunk to form first access monitoring information, and to modify the chunk selection function based on the first access monitoring information.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to modify the chunk selection function to select larger chunks if the access monitoring information indicates frequent access.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to make the first client data chunk in the memory of the apparatus and the first server data chunk correspond to each other over a second data connection prior to receiving the parent client digest at the server, wherein the second data connection is faster than the first data connection.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to compute the first client digest, the second client digest and the parent client digest using a hash function, and to form a directed acyclic graph representation of the first client digest, the second client digest and the parent client digest.
According to a third aspect, there is offered a method for data transmission at an apparatus using a first data connection. The method comprises forming at least a first server data chunk and a second server data chunk, wherein the first server data chunk corresponds to a first client data chunk and the second server data chunk corresponds to a second client data chunk, forming a first server digest for the first server data chunk in the memory of the apparatus, forming a second server digest for the second server data chunk in the memory of the apparatus, forming a parent server digest indicative of the first server digest and the second server digest in the memory of the apparatus, receiving a parent client digest originating from a client, comparing the parent client digest and the parent server digest, and, in response to the comparing, providing the client with access to instructions for forming a first client data item using the first client data chunk and the second client data chunk.
According to an embodiment, the method further comprises forming a plurality of parent server digests using a plurality of server digests in the forming of each parent server digest, and receiving a plurality of parent client digests originating from a client using a digest negotiation protocol.
According to an embodiment, the method further comprises selecting the first server data chunk using a chunk selection function, wherein the chunk selection function is common for the server and the client.
According to an embodiment, the method further comprises monitoring the access of the first server data chunk to form first access monitoring information, and providing access to the first server data chunk for the client based on the first access monitoring information.
According to a fourth aspect, there is offered an apparatus comprising a processor and memory. The memory of the apparatus includes computer program code configured to, with the processor, cause the apparatus to form at least a first server data chunk and a second server data chunk, wherein the first server data chunk corresponds to a first client data chunk and the second server data chunk corresponds to a second client data chunk, to form a first server digest for the first server data chunk in the memory of the apparatus, to form a second server digest for the second server data chunk in the memory of the apparatus, to form a parent server digest indicative of the first server digest and the second server digest in the memory of the apparatus, to receive a parent client digest originating from a client, to compare the parent client digest and the parent server digest, and, in response to the comparing, to provide the client with access to instructions for forming a first client data item using the first client data chunk and the second client data chunk.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to select the first client data chunk using a chunk selection function, wherein the chunk selection function is common for the server and the client.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to monitor the access of server data to form first access monitoring information, and to modify the chunk selection function based on the first access monitoring information.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to form a plurality of parent client digests using a plurality of client digests in the forming of each parent client digest, and to send the plurality of parent client digests to the server using a digest negotiation protocol.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to form the plurality of parent client digests comprising a first parent client digest and a second parent client digest, wherein both the first parent client digest and the second parent client digest relate to the first client data item, and to use at least partly different client digests in the forming of the first parent client digest than in the forming of the second parent client digest.
According to an embodiment, the apparatus further comprises computer program code configured to, with the processor, cause the apparatus to compute the first server digest, the second server digest and the parent server digest using a hash function, and to form a directed acyclic graph representation of the first server digest, the second server digest and the parent server digest.
According to a fifth aspect, there is offered a computer program product stored on computer readable medium comprising computer program code that is configured to, when executed on a processor, cause an apparatus to form at least a first client data chunk and a second client data chunk in the memory of the apparatus, wherein the first client data chunk corresponds to a first server data chunk and the second client data chunk corresponds to a second server data chunk, to form a first client digest for the first client data chunk in the memory of the apparatus, to form a second client digest for the second client data chunk in the memory of the apparatus, to form a parent client digest indicative of the first client digest and the second client digest in the memory of the apparatus, to provide the server with access to the parent client digest, in response to the providing of the access to parent client digest, to receive instructions from the server for forming a first client data item using the first client data chunk and the second client data chunk, and to form the first client data item in the memory of the apparatus using the first client data chunk and the second client data chunk.
According to a sixth aspect, there is offered a computer program product stored on computer readable medium comprising computer program code that is configured to, when executed on a processor, cause an apparatus to form at least a first server data chunk and a second server data chunk, wherein the first server data chunk corresponds to a first client data chunk and the second server data chunk corresponds to a second client data chunk, to form a first server digest for the first server data chunk in the memory of the apparatus, to form a second server digest for the second server data chunk in the memory of the apparatus, to form a parent server digest indicative of the first server digest and the second server digest in the memory of the apparatus, to receive a parent client digest originating from a client, to compare the parent client digest and the parent server digest, and, in response to the comparing, to provide the client with access to instructions for forming a first client data item using the first client data chunk and the second client data chunk.
The different aspects and embodiments of the invention offer several advantages. The communication of the parent digests enables reduced data communication between the server and the client. The forming of a plurality of parent digests enables the most efficient data compression to be selected. The monitoring of access information allows the data compression to be improved by selecting the formation of parent digests in an optimal manner. The use of a fast data connection in making the data chunks at the server and at the client correspond to each other enables the bulk of the data to be communicated using a fast connection, and the smaller amount of data comprising the digests to be communicated using a possibly slower connection.
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
a shows a system employing remote differential compression in updating a data file, where the system uses MD4 hashes to identify data chunks;
b shows a system and devices according to an embodiment where a mobile device is in operative connection with at least one server, and data can be transferred according to the embodiment between these devices;
In the following, several embodiments of the invention will be described in the context of data transmission between two devices over a network. It is to be noted, however, that the invention is not limited to network environments, but can be implemented in other environments, as well, such as any environments where two devices are in data connection with each other and inside a single device where two elements of the single device are in data connection with each other. In fact, the different embodiments have applications widely in any environment where optimization of data transmission is required.
One of the problems the embodiments seek to alleviate is the cost of data transmission, which is reduced by reducing the number of bytes transmitted, the required delivery time and the processing overhead. The problem is relevant for devices that operate at the edge of the network, for example wireless and mobile communication devices. Different embodiments are motivated by the fact that storage capacity is evolving faster than wireless data transmission rates. This means that by storing a sufficient amount of data at the mobile device out-of-band, and being able to inform a server about this data, compression of data can be performed by relying on this out-of-band shared information. This also reduces processing requirements, since only the compressed fragments are encrypted and signed.
a shows one possible way of reducing data transmission costs between a client 101 and a server 130. The client 101 has an original file 102 stored in its memory. The original file consists of four sections that can be represented by so-called digests or hash values 103-106, in this case computed using the well-known MD4 algorithm. The server 130 has an updated file 131 stored in its memory, consisting of five sections that can be represented by hash values 132-136. This updated file 131 has been formed by modifying the original file 102 by replacing one element with two updated elements. The client 101 seeks to update its original file 102 to the updated version 107 that is identical to the version 131 of the file stored on the server. In order to achieve this, the client 101 sends a request 121 to the server. The server responds by sending the digests or hash values 132-136 of the sections of the file in a message 122. The client 101 compares the hash values 132-136 sent by the server to the hash values 103-106 it has in its own memory. The client detects that the hash values 103 and 132, the hash values 104 and 133 and the hash values 106 and 136 are identical, and also that it does not hold the data corresponding to the server hash values 134 and 135. It therefore requests the data chunks corresponding to the hash values 134 and 135 from the server in a message 123. The server sends the data chunks in a message 124 to the client, and the client is able to construct the updated file.
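The exchange of messages 121 to 124 described above can be illustrated with a minimal sketch. The chunk contents and the function names digest, missing_digests and rebuild are hypothetical, and SHA-256 is used as an assumed stand-in for MD4, which is not available in every build of the Python hashlib module.

```python
import hashlib


def digest(chunk: bytes) -> str:
    # Assumed stand-in hash; the described system uses MD4.
    return hashlib.sha256(chunk).hexdigest()


def missing_digests(client_chunks, server_digests):
    """Client side: list the server digests for which no local chunk exists."""
    known = {digest(c) for c in client_chunks}
    return [d for d in server_digests if d not in known]


def rebuild(client_chunks, server_digests, fetched):
    """Client side: assemble the updated file in the order given by the server."""
    store = {digest(c): c for c in client_chunks}
    store.update(fetched)
    return b"".join(store[d] for d in server_digests)


# Original file 102 (four sections) and updated file 131 (five sections),
# where one section has been replaced by two updated sections.
original = [b"alpha", b"beta", b"gamma", b"delta"]
updated = [b"alpha", b"beta", b"gamma-1", b"gamma-2", b"delta"]

server_digests = [digest(c) for c in updated]        # sent in message 122
wanted = missing_digests(original, server_digests)   # requested in message 123
server_store = {digest(c): c for c in updated}
fetched = {d: server_store[d] for d in wanted}       # delivered in message 124
result = rebuild(original, server_digests, fetched)
```

In this sketch only the two new sections travel as data; the three unchanged sections are identified by their digests alone.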
In the operation of
b illustrates a system and devices according to an embodiment. The system comprises a device 150, possibly a mobile terminal at the use of an end-user, and in connection to the network NW 170, and servers 180 and 190 in connection to the network NW 170. The devices can be either in fixed connection or in mobile connection with the network NW such as GPRS, UMTS, WLAN, Bluetooth, 10 Mbit/s, 100 Mbit/s or Gigabit ethernet or other wireless or wired data communication protocols. The device 150 may comprise a display 152 for displaying information to the user, memory 154 for storing data, a processor 156 for processing data, communication module 158 for connecting to the network 170 and for sending and receiving information, and a keyboard 160 for receiving input from the user. The server 180 may comprise memory 184 for storing data, a processor 186 for processing data, and a communication module 188 for connecting to the network 170 and for sending and receiving information. The server 190 may comprise memory 194 for storing data, a processor 196 for processing data, and a communication module 198 for connecting to the network 170 and for sending and receiving information. The devices 150, 180 and 190 comprise memory for storing data and they are able to send messages and data between each other via the network 170.
The server 220, the client 230 or the data manager 240 may also monitor 209, 212 the used data sets with the help of adaptive data set updaters 224 and 234 (at the source or at the destination), and if there is frequent activity pertaining to a certain domain or service, it may check whether a data set is available for compression. If the data set is available, the system may load new data sets 210, and it may even use compression for the transmission of the new data set. In this monitoring, the data sets 221 and/or 231 may also be kept the same, and new data set compositions and reference skeletons based on data access frequency may be created. The system may thus support updates to the reference skeletons (new digest structures) that better reflect frequent interaction patterns. This may improve the efficiency of data communication.
The shared data 221 and 231 can be data files, they can be partial data files or the shared data can be data especially composed for the purpose of differential compression. In an embodiment it may be assumed that the base data sets are not mutable. This means that e.g. Merkle trees may offer a very convenient method for generating a digest structure for a data set. The data set (assuming a large file) then has a single hash value that uniquely identifies the data set in question. Moreover, it is possible to apply the Merkle tree procedure using different block sizes (fixed or varying) for the same data set. Thus we can represent elements of the same data set using compact labels. This may further improve the efficiency of data communication.
The manager component 240 can be the same as a content server or a web site, or it can be a different element such as a proxy. The manager component may be located on the network (Internet), it may be provided by a Content Distribution Network (CDN) or it may be provided by a large web site (OVI, Facebook, etc.). The manager 240 may be the source of the bulk loaded data. It may accept frequency data as input and, as a response to the frequency data, it may output digest structure or hash tree information. It is possible to use the system without this manager component 240; however, employing the activity information or the usage patterns may increase the performance of the system.
The manager component may not be directly involved in the communications. If a data set whose hash value is not recognized is met, the manager can be consulted. The manager can also be informed (by servers or clients) about how well a given chunking partitioning (digest structure or hash tree) works and give feedback to create better partitions. A server can do this also without the manager by simply creating a new digest structure or hash tree and instructing the client how to construct it based on the existing data set.
Existing synchronization and caching techniques may be improved by employing a negotiation phase in data communications that is used to identify bulk data sets loaded by a client beforehand. The knowledge of the bulk data sets is then used to optimize communications. A special signature, digest or hashing scheme is used to identify parts of a bulk data set. In the negotiation phase the client informs the server about supported data sets. These data sets may be chosen by the client for example on the basis of a MIME type used in communication or in another way that enables the use of the type of the data being communicated. The server may then be allowed to choose a selection of the data sets for the differential compression.
The client may inform the server about the data sets it supports, and the server can then decide which one to use and send the compressed data. The representation for the compression can use any of a number of compression techniques, including delta compression by simply referring to parts of the bulk data set. Since referring to a part of a document will require at least a pointer and a size field, it is expected that there is a minimum required length for the elements to be considered for delta compression. One simple approach is to divide the file into blocks of a fixed size and then compute the signatures, digests or hashes of the blocks.
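The fixed-size block approach mentioned above may be sketched as follows. The block size of 8 bytes, the sample data and the names split_blocks and block_index are assumptions made for illustration only; a practical block size would be considerably larger so that a pointer and a size field remain small relative to the referenced data.

```python
import hashlib

BLOCK_SIZE = 8  # assumed value, deliberately tiny for illustration


def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Divide the data into consecutive blocks of a fixed size."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_index(data: bytes, size: int = BLOCK_SIZE):
    """Map each block digest to the (offset, length) of its first occurrence."""
    index = {}
    for n, block in enumerate(split_blocks(data, size)):
        key = hashlib.sha1(block).hexdigest()
        index.setdefault(key, (n * size, len(block)))
    return index


bulk = b"The quick brown fox jumps over the lazy dog."
blocks = split_blocks(bulk)
index = block_index(bulk)
```

Each entry of the index is exactly the pointer and size field that a delta-compressed message would need in order to refer to that part of the bulk data set.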
According to an embodiment of the invention there is also offered a protocol for exchanging information on multi-level hash representations or hash trees. The digests or hashes are composed into a multi-level acyclic representation, or a tree, and the composition of the trees can effectively be communicated from the server to the client or vice versa.
According to an embodiment of the invention, the shared data sets for the client and the server are based on the profile of the user of the client. The data can be operating system files, software, multimedia data such as music, video or images, cached web sites and web content, or any other data. These data sets are then installed to the server and to the client. They are partitioned, signatures or digests are formed for the partitions either before or after installation, and the digests or signatures are formed into a multi-level structure of digests or hashes. In this multi-level structure or tree, at least two signatures or digests or hashes are combined and a parent digest is computed for them. This parent digest may be formed using a Merkle tree, and the parent digest may be used to identify the data sets. The parent digest may then be used in the communication enabling differential compression.
The data sets may also be updated using the same compressed data communications according to an embodiment. The existing data set can be used to send and receive differentially compressed updates to the server and the client. The update data may be partitioned into chunks. An algorithm may be used to find unchanged chunks in the shared data (hash lookup). An algorithm may also find shared chunks that need minimal changes, and those chunks may be updated separately.
It is to be understood that the above embodiments of the invention can also be combined. For example, the web browsing scheme of
The digests or signatures for the data chunks can be computed in a hierarchical fashion, for example by using a Merkle tree. Then the signatures can be checked (top-down) against the bulk data, for example using hash table lookup. This requires that the bulk data has a hash table-based lookup index. This is a reasonable requirement and will result in a small delta compression overhead, since the hashes or the hash tree are computed first and the lookups are then performed in constant time.
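The top-down check against a hash table may be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the class Node, the functions build, labels and match_top_down, the rule of promoting an odd node unchanged and the one-byte chunks are all hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


class Node:
    def __init__(self, label, children=(), data=None):
        self.label = label
        self.children = list(children)
        self.data = data  # payload bytes for leaves, None for interior nodes


def build(chunks):
    """Build a binary hash tree over the chunks; an odd node is promoted as is."""
    nodes = [Node(h(c), data=c) for c in chunks]
    while len(nodes) > 1:
        nxt = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]
            if len(pair) == 1:
                nxt.append(pair[0])
            else:
                nxt.append(Node(h(pair[0].label + pair[1].label), pair))
        nodes = nxt
    return nodes[0]


def labels(node):
    """Yield the labels of all nodes in the tree."""
    yield node.label
    for child in node.children:
        yield from labels(child)


def match_top_down(node, index, out):
    """Refer to the largest known subtrees; send unmatched leaves as data."""
    if node.label in index:
        out.append(("ref", node.label))
    elif node.children:
        for child in node.children:
            match_top_down(child, index, out)
    else:
        out.append(("data", node.data))
    return out


bulk = [b"a", b"b", b"c", b"d"]
index = set(labels(build(bulk)))      # hash table-based lookup index
message = [b"a", b"b", b"c", b"X"]
encoded = match_top_down(build(message), index, [])
```

Here the first two chunks are covered by a single reference to their common parent digest, the third chunk by a reference to its own digest, and only the last chunk is carried as data.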
A Merkle tree is a complete binary tree that has a hash function $h$ and an assignment $\varphi$. The function $h$ is a one-way hash function such as SHA-1. The assignment $\varphi$ maps the set of nodes to the set of $k$-length strings: $n \mapsto \varphi(n) \in \{0,1\}^k$. For any interior node $n_{parent}$, the assignment $\varphi$ must satisfy $\varphi(n_{parent}) = h(\varphi(n_{left}) \,\|\, \varphi(n_{right}))$. The value of $\varphi(l)$ for a leaf node $l$ can be chosen arbitrarily. It is clear that this construction can be extended to cover trees whose nodes have more than two children.
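The definition above can be illustrated by computing the label of the root node from the leaf labels. In this minimal sketch the leaf labels are chosen to be the SHA-1 hashes of the data chunks, which is one possible choice since the leaf values are arbitrary; the chunk contents and the name merkle_root are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """One-way hash function h; SHA-1 as named in the text."""
    return hashlib.sha1(data).digest()


def merkle_root(leaf_labels):
    """Label of the root of a complete binary tree over the given leaf labels."""
    count = len(leaf_labels)
    if count == 0 or count & (count - 1):
        raise ValueError("a complete binary tree needs a power-of-two leaf count")
    level = list(leaf_labels)
    while len(level) > 1:
        # Parent label = h(left label || right label)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


chunks = [b"chunk-0", b"chunk-1", b"chunk-2", b"chunk-3"]
leaves = [h(c) for c in chunks]
root = merkle_root(leaves)
```

Because every parent label depends on both child labels, a change in any single chunk changes the root label.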
In a practical implementation, a Merkle-tree based construction can be used to represent the delta signatures or the data chunk digests. Merkle trees are meaningless unless the sender and receiver have the same bulk data set as a common reference. Therefore, they have intrinsic security properties. Merkle trees can be applied to a data set (file) to partition it into fixed or variable sized chunks and then derive a common hash label for the whole data set. The partitioning can be based on an expected update frequency (some types of data may be such that they are typically modified more often than others). The hash tree can cover a part of the file, a whole file, parts of at least two files or parts and wholes of at least two files. A Merkle tree gives a way to distinguish between data sets and to refer to certain parts of a data set. Merkle trees can also be used to verify data during the loading of a data set.
Merkle trees offer a way to generate a number of partitions for a large data set and to derive a very compact representation for them. The motivation is that it may turn out that a new access distribution is identified that emphasizes certain larger sequential data blocks in the file. In that case, a new Merkle tree that has this more frequently accessed data as an atomic block can simply be generated, together with a new hash root value which uniquely identifies this new "skeleton" for the data set. It is then sufficient to update the clients with this new tree (update the block size algorithm). This offers flexibility.
Practically, the embodiment of the invention may happen as follows. The client 1001 sends root hashes of the data sets to the server. The data sets may be application specific (one for messaging, another one for office documents, etc.). The mapping can be done automatically based on, for example, MIME type. It is also possible to send a Bloom filter (a probabilistic data structure) that covers all the supported root hashes (data set identifiers). With a Bloom filter, it is possible to detect whether a certain root hash is supported by the client 1001 or not without sending all the root hashes as values themselves in the communication to the server 1002. The server 1002 then checks whether or not the data set is supported. If not, then normal operation according to state of the art technologies is assumed (normal HTTP transmission, for example). If the data set is supported, the server 1002 sends a differentially compressed version of the data to the client 1001. This can be based on a single data set or multiple data sets. The server can perform the differential compression beforehand or it can be done on the fly. When the client 1001 receives the differentially compressed data, it can reconstruct the original data by looking up the chunks (and parts of chunks) from the local data sets involved, using the digests sent in the server response 1004.
Differential compression enables the server to send to the client only the data that are different from what exists at the client already. If the client already has data chunks that allow it to build most of the data that the server is transmitting, the server will detect this and not send those parts. The server sends the client the data that the client does not have and instructions on how to update the data that the client already has. The client can then reconstruct the data although not all of the data are sent from the server to the client. The forming of the differential information can happen on the fly or it can be precomputed before the client requests the data.
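The generation of a new skeleton for an unchanged data set may be sketched as follows. The byte offsets, the sample data and the names partition and root_label are hypothetical, and promoting an odd node unchanged to the next level is an assumed convention for trees whose leaf count is not a power of two.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def partition(data: bytes, boundaries):
    """Cut the data set at the given byte offsets."""
    cuts = [0, *boundaries, len(data)]
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]


def root_label(chunks):
    """Root hash over at least one chunk; an odd node is promoted unchanged."""
    level = [h(c) for c in chunks]
    while len(level) > 1:
        paired = [h(level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0]


data = bytes(range(64))

# Initial skeleton: four blocks of equal size.
uniform = partition(data, [16, 32, 48])

# Suppose monitoring shows that bytes 16..47 are usually accessed together:
# the new skeleton keeps them as one atomic block.
hot_block = partition(data, [16, 48])

old_root = root_label(uniform)
new_root = root_label(hot_block)
```

The data set itself is not modified; only the partitioning changes, and the new root value identifies the new skeleton.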
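The advertisement of supported root hashes by means of a Bloom filter may be sketched as follows. The filter size, the number of hash positions, the data set names and the function choose_transfer are assumptions made for illustration; the text does not prescribe a particular Bloom filter construction.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the default sizes are assumed values."""

    def __init__(self, bits: int = 1024, hashes: int = 4):
        self.bits = bits
        self.hashes = hashes
        self.field = 0  # bit array stored in one Python integer

    def _positions(self, item: bytes):
        # Derive several bit positions by salting SHA-256 with a counter.
        for i in range(self.hashes):
            d = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(d[:8], "big") % self.bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.field |= 1 << p

    def __contains__(self, item: bytes) -> bool:
        return all((self.field >> p) & 1 for p in self._positions(item))


def choose_transfer(server_root: bytes, client_filter: BloomFilter) -> str:
    """Server side: select differential or plain transfer for one data set."""
    return "differential" if server_root in client_filter else "plain"


# Client side: root hashes of the locally installed data sets (made-up names).
client_roots = [hashlib.sha1(name).digest() for name in (b"messaging", b"office")]
advert = BloomFilter()
for root in client_roots:
    advert.add(root)
```

A Bloom filter never reports a stored root hash as absent, but it may occasionally report an unknown root hash as present, so a practical system would presumably need a fallback to normal transmission when the client cannot resolve a referenced data set.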
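The forming of the differential information at the server and the reconstruction at the client may be sketched as follows. The instruction format, consisting of ("ref", digest) and ("data", bytes) pairs, and the names encode and decode are hypothetical; the text does not fix a wire format.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


def encode(target_chunks, client_digests):
    """Server side: refer to chunks the client holds, include the rest as data."""
    return [
        ("ref", digest(c)) if digest(c) in client_digests else ("data", c)
        for c in target_chunks
    ]


def decode(instructions, local_chunks):
    """Client side: rebuild the data from local chunks and received data."""
    store = {digest(c): c for c in local_chunks}
    return b"".join(store[arg] if op == "ref" else arg for op, arg in instructions)


shared = [b"header", b"body", b"footer"]        # data set held by both sides
target = [b"header", b"new body", b"footer"]    # data the client requests

instructions = encode(target, {digest(c) for c in shared})
rebuilt = decode(instructions, shared)
```

Only the chunk that the client does not hold is transmitted in full; the other two chunks are replaced by references.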
The hashes or digests can be communicated using HTTP headers as follows. The TE request-header field in the HTTP 1.1 protocol indicates what extension transfer-codings the client is willing to accept in the response.
The client sends a request to the server using the GET operation of the http protocol. The host name www.example.com is indicated in the Host section of the request. Above, TE stands for transfer encodings and is used to indicate the type of compression the client supports. The TE field includes the SHA-1 hashes of the data sets.
The server responds by sending the 200 OK message to the client, indicating a successful request. This is followed by information about the server, date and time information, and a content description. The TE field (transfer encoding) indicates that differential compression is used. The content-type field shows which type of content is in question. The transmitted data is, in this example, differentially encoded content, in which information about the necessary changes with respect to the client data sets is sent.
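The header exchange can be sketched as follows. The coding name "diffcomp" and the parameter name "ds" are hypothetical placeholders, since the exact header syntax is not reproduced here; the sketch only assumes that the TE field carries one SHA-1 hash per data set, as described above. The helper names build_request and parse_headers are likewise illustrative.

```python
import hashlib


def build_request(host, path, data_sets):
    """Build a GET request whose TE field lists one SHA-1 hash per data set.

    "diffcomp" and "ds" are hypothetical names chosen for this sketch.
    """
    hashes = [hashlib.sha1(ds).hexdigest() for ds in data_sets]
    te_value = ", ".join(f"diffcomp;ds={h}" for h in hashes)
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"TE: {te_value}\r\n"
        "\r\n"
    )


def parse_headers(message):
    """Split an HTTP message head into its start line and a header dict."""
    head = message.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers


request = build_request(
    "www.example.com", "/index.html", [b"messaging", b"office documents"]
)
start_line, headers = parse_headers(request)
offered_hashes = [
    part.split("ds=", 1)[1] for part in headers["te"].split(", ")
]
```

A server receiving such a request would compare the offered hashes with the data sets it holds and fall back to a normal HTTP response when none of them match.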
In delay-tolerant or delay-enabled operation, the client delays the transmission of messages to wait for similar kinds of messages, thereby decreasing networking and processing costs. Similarly, the server may delay the sending of a message in order to accumulate more data that can be compressed. For the client, this is mostly useful for applications that generate a lot of non-interactive requests that do not require immediate feedback from the server (for example, document editing). This feature can be indicated in the HTTP header so that the server will know that the client supports it.
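One possible way to realize the delay on the sending side is a small batching queue, sketched below. The class name DelayTolerantBatcher and the two thresholds (a maximum number of queued messages and a maximum delay in seconds) are assumptions for illustration; the description does not prescribe a particular policy. The current time is passed in explicitly so that the behavior is easy to follow.

```python
class DelayTolerantBatcher:
    """Hold back messages until enough have accumulated or a deadline passes."""

    def __init__(self, max_messages=3, max_delay=5.0):
        self.max_messages = max_messages
        self.max_delay = max_delay
        self.pending = []
        self.first_at = None  # arrival time of the oldest pending message

    def submit(self, message, now):
        """Queue a message; return a batch to send, or None to keep waiting."""
        if not self.pending:
            self.first_at = now
        self.pending.append(message)
        return self._flush_if_due(now)

    def poll(self, now):
        """Called periodically; flushes when the oldest message is overdue."""
        return self._flush_if_due(now)

    def _flush_if_due(self, now):
        if not self.pending:
            return None
        full = len(self.pending) >= self.max_messages
        overdue = now - self.first_at >= self.max_delay
        if full or overdue:
            batch, self.pending = self.pending, []
            return batch
        return None


batcher = DelayTolerantBatcher(max_messages=3, max_delay=5.0)
first = batcher.submit("edit-1", now=0.0)    # held back
second = batcher.submit("edit-2", now=1.0)   # held back
third = batcher.submit("edit-3", now=2.0)    # batch is full and is released
```

The released batch can then be compressed and transmitted as a single unit, which gives the compressor more similar data to work with than sending each message separately.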
Various embodiments can be implemented in a practical setting in several ways. Different embodiments can be implemented as an add-on for current Internet content delivery protocols. The data sets and digests can be identified in a header of a protocol, such as the HTTP or SIP header, thus making it possible to deploy the system in a transparent fashion. The data being delivered between two devices (peer-to-peer, server-to-client, network element to network element) may for example be video, music, images, maps, user files, calendar information, visual presentations, books and articles, and spreadsheets. Web browsing in which the same content is used many times, e.g. a popular website or a popular set of images, may use an embodiment as presented earlier. Various embodiments may be applicable for verifying the data for malware. The embodiments may be useful in applications for delivering data and software in a cloud computing environment, including computing results and input data for computing. The embodiments may be used in BitTorrent-like data deliveries where data is coming from a number of sources to a single client, or in broadcasts where data is being sent from a single source to multiple recipients. Streaming data can also be supported, but it may require that the signatures (and delta coding) are computed in real time. Applications may be found in the compression of any messages transmitted between two devices or inside devices. Ways for data clustering based on commonalities in data may be offered, since this happens automatically due to identifying the data sets. Subscription services for updates (e.g. software updates and distribution) may be offered. Virus and malware scanning based on differentially compressed updates using cloud services may be done. If the OS and libraries are shared with the cloud service, modifications to the OS and libraries may be checked. The security service may maintain a set of suspicious update signatures together with a description of how the corresponding update message will look.
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a terminal device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment. Yet further, a network device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
It is obvious that the present invention is not limited solely to the above-presented embodiments, but it can be modified within the scope of the appended claims.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/FI09/50347 | 4/30/2009 | WO | 00 | 10/31/2011 |