Consider a mobile or personal computing device to be backed up using an online or “cloud” service provider. All new and changed data on the device has to be uploaded to the service provider's storage for every backup. Routine backups may occur weekly, daily, or even more frequently. Uploading the data to be backed up consumes expensive and sometimes slow bandwidth. Reducing bandwidth consumption can add value for consumers and enterprises, especially those using an asymmetrical link such as a cable modem or digital subscriber line (DSL).
A number of techniques are directed at improving backup operations. These include, for example, compression algorithms and deduplication. While compression algorithms may reduce the amount of data that has to be transferred for backup, compression/decompression may increase the time it takes to complete a backup operation. Deduplication also reduces the amount of data that has to be transferred for backup, but uses extensive indexing which can also increase the time it takes to complete a backup operation.
a-b show example architectures for reducing backup bandwidth by remembering downloads to a computing device, including executable machine readable instructions.
a-c are flowcharts illustrating example operations that may be implemented to reduce backup bandwidth by remembering downloads to a computing device.
In an era of electronic data, backups are routine for enterprises and even individuals who desire to back up their personal computers, laptops, tablets, and mobile devices. In an effort to provide backup service regardless of a user's location, and to make the backup process as seamless and effortless as possible, online or cloud backup services have become commonplace. As noted above, however, uploading the data to be backed up can be slow and/or expensive, especially over asymmetrical network connections (e.g., upload speeds are sometimes only one-tenth of download speeds).
Much of the data found on computing devices is retrieved from online or network locations (e.g., the Internet and/or enterprise networks). The systems and methods disclosed herein track data on devices that has been downloaded from a network. Example data that is available from these networks may include, but is not limited to, email, application software and “mobile apps,” and PDF documents. In an example, the systems and methods remember the newly downloaded data at a location off of the computing device, and/or remember the source from which the newly downloaded data came. As such, the backup provider is able to retrieve the data without having to upload that data from the device.
An example system may include program code stored on one or more non-transient computer-readable storage mediums. The program code is executable by one or more processors to remember information for a download to a computing device, and back up the computing device to a different system. The information remembered for the download is used to provide a backup of the computing device without copying some of the downloaded data present on the computing device from the computing device.
In an example, the program code is further executable by the one or more processors to determine which pieces of data on the computing device are available from a source of the downloaded data, and retrieve those pieces of the data from the source of the downloaded data instead of from the computing device. In another example, the program code is further executable by the one or more processors to route the download for the computing device through at least one proxy node, store a copy of the downloaded data at the at least one proxy node, and remember that the downloaded data is stored at the at least one proxy node (e.g., for restore operations). It is noted that requests from modern mobile device browsers are often already routed through online proxies, which may be modified as described herein so as not to add latency.
It is noted that the systems and methods described herein may be implemented orthogonal to existing backup techniques, and indeed may even be practiced in combination with those techniques. For example, the techniques disclosed herein may be integrated with deduplication, where for example, deduplication is used to transfer a modified version of downloaded data present on the computing device by deduplicating it against the originally downloaded data, which can be retrieved from other than the computing device using the systems and methods described herein and the remembered information.
Other backup techniques, now known or later developed, may also be used to back up data on the computing device that has not been downloaded (e.g., created on the computing device by taking a picture) or that was downloaded but had no information remembered about it for whatever reason.
The specific bandwidth savings realized by using the techniques described herein depend at least to some extent on empirical factors that can be determined on a case-by-case basis. An example factor includes how much new or “unique” data is downloaded to a device between backup operations. It is noted that the term “unique” is used herein to mean either “actually unique” or sufficiently far down a long tail that it is not cost effective to deduplicate against that data.
It is noted that in an example, the systems and methods described herein are directed generally to backing up the computing device, not the downloaded data. By the time the backup occurs, some of the downloaded data may no longer be present on the computing device. The systems and methods described herein allow for cases where the user modified the downloaded data and/or the download source has been updated since the last backup.
Before continuing, it is noted that as used herein, the terms “includes” and “including” mean, but are not limited to, “includes” or “including” and “includes at least” or “including at least.” The term “based on” means “based on” and “based at least in part on.”
The communication network 120 may provide a user 101 with access to network sites 130 (e.g., a website), including one or more content sources 135a-c. The content source 135a-c may be a remote source of content (e.g., provided on a wide area network or WAN such as the Internet or an enterprise network), and/or a distributed source of content.
The content source 135a-c may include any type of content. For example, the content source 135a-c may include email services, applications, databases and other storage resources for providing documents, videos, audio, and other data files. There is no limit to the type or amount of content that may be provided by a source. In addition, the content may include unprocessed or “raw” data, or the content may undergo at least some level of processing.
The computing devices 110 may access the network sites 130 via communications network 120. The communications network 120 may be accessed through any suitable connection, such as a carrier network 140a (e.g., a 3G or 4G network) and/or wired or wireless access point or WAP 140b (e.g., WiFi).
Typically in consumer systems, download speeds are much faster than upload speeds. Thus, users may experience fast downloads, but an online or cloud backup service may prove slow when uploading data from the computing devices 110. Also, the user may be subject to bandwidth caps (e.g., a limit to how much bandwidth he may consume per month) and may wish to spend the limited bandwidth available watching movies, for example, rather than running backups. Therefore, the system 100 may include a backup service 150 to reduce backup bandwidth by remembering downloads to the computing devices 110.
The backup service 150 may be configured as server computer(s) 152 with computer-readable storage(s) 154. For purposes of illustration, the backup service 150 may be an online service executing program code or backup code 155. The backup code 155 may be executable by one or more processors (e.g., by server computer(s) 152) to back up the computing devices 110 to a system different from the computing devices 110 (e.g., storage 154 or other storage system). The backup service 150 may arrange for information to be remembered for a download. For example, instructions for using the service may instruct the user to set his or her browser to use a proxy on the mobile device, or the user may download an “app” including some or all of backup code 155 to the mobile device to set up and/or perform backup. Other examples are also contemplated. The remembered information enables backing up the computing device 110 without having to upload from the computing device 110 at least some of the downloaded data present on it.
In an example, the backup code 155 may determine which pieces of data on the computing device(s) 110 are available from an online source that provided the downloaded data. As such, the backup service 150 can retrieve those pieces of the data directly from the online source of the downloaded data without having to upload those pieces of data from the computing devices 110. In another example, the backup service 150 may arrange for downloads for the computing device(s) 110 to be routed through proxy node(s) 160. A copy of the downloaded data is stored at the proxy node(s) 160. Accordingly, the backup service 150 only has to remember that a copy of the downloaded data is stored at the proxy node and use that copy, instead of having to upload the downloaded data from the computing device(s) 110. In this way, the backup service reduces the amount of data that needs to be uploaded during a backup operation, while still having the data available for restore operations.
The program code (e.g., backup code 155) may be implemented using application programming interfaces (APIs) and related support infrastructure. In an example, the operations described herein may be executed by program code residing on the computing device(s) 110 (e.g., as an “app” on a mobile device), at the backup service 150 (e.g., a separate computer system having more processing capability, such as a server computer 152 or plurality of server computers 152), and/or at the proxy node(s) 160.
Program code used to implement features of the system can be better understood with reference to
a-b show example architectures for reducing backup bandwidth by remembering downloads to a computing device, including executable machine readable instructions. The program code discussed above with reference to
The program code may include the machine readable instructions, and may be structured as self-contained modules. These modules can be integrated within a self-standing tool, or may be implemented as agents that run on top of an existing program code.
In the example shown in
In a first illustration, all downloads are routed through a proxy 250. Rememberer 255 in proxy 250 may remember downloads by computing device(s) 230 for a given time, such as all downloads since the last backup or all downloads during the last 24 hours. Different computing devices 230 may each be assigned to different proxy nodes. Or the same proxy 250 may be used for multiple computing devices 230, with an individual proxy 250 remembering which device downloaded the corresponding data 220a.
The proxy 250 may be provided by an Internet service provider (ISP) for the computing device 230, or as a separate backup provider node. In the case of an ISP, the ISP itself may be providing the backup service 210, or the ISP may be a “middleman” that remembers data for a separate backup service 210. In the case of a non-ISP proxy, the computing device software may fetch information through the proxy.
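The per-device bookkeeping attributed to the rememberer 255 can be pictured with a minimal sketch. Everything named here (the `Rememberer` class, the `Download` record, the device identifiers, and the example URLs) is hypothetical and chosen only for illustration; the description above does not prescribe any particular data structure.

```python
from collections import defaultdict
from typing import NamedTuple


class Download(NamedTuple):
    url: str          # where the data was downloaded from
    timestamp: float  # when the download passed through the proxy
    data: bytes       # copy of the downloaded data kept at the proxy


class Rememberer:
    """Hypothetical record of downloads seen by one proxy, keyed by device."""

    def __init__(self):
        self._by_device = defaultdict(list)

    def record(self, device_id, url, data, timestamp):
        # One proxy may serve several devices, so remember who downloaded what.
        self._by_device[device_id].append(Download(url, timestamp, data))

    def downloads_since(self, device_id, since):
        # E.g., all downloads since the last backup of this device.
        return [d for d in self._by_device.get(device_id, []) if d.timestamp > since]


rememberer = Rememberer()
rememberer.record("phone-1", "https://example.com/a.pdf", b"pdf bytes", timestamp=100.0)
rememberer.record("phone-1", "https://example.com/b.mp3", b"mp3 bytes", timestamp=300.0)
rememberer.record("tablet-2", "https://example.com/a.pdf", b"pdf bytes", timestamp=150.0)
recent = rememberer.downloads_since("phone-1", since=200.0)
```

Passing the time of the last backup as `since` corresponds to remembering "all downloads since the last backup"; a fixed 24-hour window would simply use a different cutoff.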
In the example shown in
In the example shown in
In both of the examples shown in
In both of these examples, the backup service 210 uses the remembered information 270a-b to reduce backup bandwidth. To do this, the backup service 210 associates pieces of the data found in the computing device storage 231 with previously made downloads. There are many ways that this can be implemented. An example is to remember which file name each download is (initially) saved to. At backup time, if a given file has to be backed up because the file has changed since the last backup (e.g., newer modified time), the file name can be compared against the remembered information to see if the file originally resulted from a download, and if so which download.
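As a rough sketch of the file-name approach just described, the following assumes a hypothetical `DownloadRecord` holding the source URL and the path the download was initially saved to, and assumes modification times and the last backup time are plain numeric timestamps. None of these names come from the description itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadRecord:
    url: str         # remembered source of the download
    saved_path: str  # file name the download was (initially) saved to


def find_download_for_file(path, mtime, last_backup_time, records):
    """Return the remembered download a changed file came from, or None."""
    if mtime <= last_backup_time:
        return None  # file has not changed since the last backup
    for record in records:
        if record.saved_path == path:
            return record
    return None  # changed file did not originate from a remembered download


records = [DownloadRecord("https://example.com/manual.pdf", "/docs/manual.pdf")]
match = find_download_for_file(
    "/docs/manual.pdf", mtime=200.0, last_backup_time=100.0, records=records
)
```

A file that matches a record is a candidate for retrieval from somewhere other than the device; a file with no match would be backed up by other means.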
Another example implementation is to remember a hash of each entire download's data as part of the remembered information about that download. At backup time, the backup agent 232 can hash each entire changed or new file and check to see if any download's hash matches that hash. If so, that file includes the downloaded data from that download. A similarity signature may be substituted for the hash here, wherein mostly similar or identical files are likely to have identical similarity signatures, while other files have different similarity signatures. This allows a file to continue to be associated with a download even if it is modified somewhat.
Similarity signatures have been used in other applications. However, similarity signatures have not been used as described herein.
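The whole-file hash comparison described above might look like the following sketch. SHA-256 is an assumption (no particular hash function is named in the description), and the helper names and the in-memory `index` dictionary are hypothetical. Similarity signatures are not shown.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def remember_download(index: dict, url: str, data: bytes) -> None:
    # Remember a hash of the entire download alongside its source.
    index[sha256_hex(data)] = url


def match_file(index: dict, file_data: bytes):
    # At backup time, hash the whole changed or new file and look it up.
    return index.get(sha256_hex(file_data))


index = {}
remember_download(index, "https://example.com/song.mp3", b"original song bytes")
unmodified_source = match_file(index, b"original song bytes")
modified_source = match_file(index, b"original song bytes, edited")
```

An exact hash only matches unmodified files, which is why the text suggests a similarity signature when a file should stay associated with its download after small edits.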
Yet another example implementation involves keeping track at the chunk level, rather than the file level. Here, each file (stored or downloaded) is divided into chunks and information about each chunk (including its hash) is remembered. It is noted that a chunk is a small (e.g., 4-8 KB average size) piece of data. Data may be divided into chunks using landmarks so that local changes tend to change only a few chunks.
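One way to picture landmark-based chunking is the toy sketch below. It uses a simple rolling sum over a sliding window as the landmark test, which is an assumption made for brevity; practical systems typically use a stronger rolling fingerprint. The function name and all parameter values are hypothetical, and the sizes are far smaller than the 4-8 KB average mentioned above.

```python
def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3F,
                     min_size: int = 32, max_size: int = 1024):
    """Yield (offset, length) pairs that cover data with contiguous chunks."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Maintain the sum of the last `window` bytes; it depends only on
        # nearby content, not on where the previous chunk started.
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        size = i - start + 1
        at_landmark = (rolling & mask) == mask and size >= min_size
        if at_landmark or size >= max_size:
            yield (start, size)
            start = i + 1
    if start < len(data):
        yield (start, len(data) - start)  # trailing partial chunk


data = bytes(range(256)) * 20
chunks = list(chunk_boundaries(data))
```

Because a landmark is decided by the bytes in the window rather than by absolute position, an insertion or deletion tends to shift boundaries only near the change, so chunks elsewhere keep their hashes.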
When data is downloaded by computing device 230, the data may be chunked and the hashes of the chunks remembered as part of the remembered information about that download. The information about each chunk may include its length and offset in the downloaded data. This allows retrieving the chunk's data from a copy of the downloaded data. At backup time, modified files may be chunked and each of their hashes looked up to see if they are part of any download. Even if a file that was originally downloaded has been modified, many of its chunks may not have been modified. Similarity signatures can be substituted for hashes here as well.
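The chunk-level bookkeeping above can be sketched as follows. For brevity this sketch splits data into fixed-size 8-byte chunks rather than landmark-based chunks, and uses SHA-256; both are assumptions, as are the names `ChunkRef`, `remember_chunks`, and `lookup_file`.

```python
import hashlib
from typing import NamedTuple


class ChunkRef(NamedTuple):
    download_id: str  # which remembered download holds this chunk
    offset: int       # where the chunk starts in the downloaded data
    length: int       # how many bytes the chunk spans


def split_fixed(data: bytes, size: int = 8):
    return [(off, data[off:off + size]) for off in range(0, len(data), size)]


def remember_chunks(index: dict, download_id: str, data: bytes, size: int = 8):
    # At download time: remember each chunk's hash with its offset and length.
    for off, chunk in split_fixed(data, size):
        digest = hashlib.sha256(chunk).hexdigest()
        index.setdefault(digest, ChunkRef(download_id, off, len(chunk)))


def lookup_file(index: dict, file_data: bytes, size: int = 8):
    # At backup time: chunk the modified file and look up each chunk's hash.
    return [(chunk, index.get(hashlib.sha256(chunk).hexdigest()))
            for _, chunk in split_fixed(file_data, size)]


download = b"AAAAAAAABBBBBBBBCCCCCCCC"
modified = b"AAAAAAAAXXXXXXXXCCCCCCCC"
index = {}
remember_chunks(index, "d1", download)
results = lookup_file(index, modified)
```

Chunks that map to a `ChunkRef` can be cut out of a copy of the download using the remembered offset and length; chunks that map to `None` are new to the device.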
With these methods, pieces of data are found on computing device storage 231 that are associated with recent downloads. In some cases, a data piece is known to be the same as originally downloaded (e.g., hashes match). In those cases, the backup service 210 attempts to retrieve the piece of data without having to upload it from computing device 230. The backup service 210 may do this by attempting to fetch the data from the copy made at proxy 250 when the download occurred (
In some cases, a data piece may not be known to be the same as the originally downloaded data piece (e.g., similarity signatures were used or the file the download was made to is known to have changed due to its modification time). Here, the piece of data resides on computing device 230 and the associated piece of originally downloaded data can usually be retrieved by the backup service 210. While these may be different, the data may not be that different, having only small local changes. To efficiently transfer the piece of data on computing device 230 to backup store 280, the backup service 210 may do a low bandwidth mode deduplication against the piece of data that the backup service is able to retrieve.
Here, both pieces of data are broken up into sub-pieces of data (e.g., a file may be broken up into chunks or large-sized chunks broken up into smaller chunks). A hash is computed for each sub-piece of data, and the resulting lists of hashes are compared. Sub-pieces of data on computing device storage 231 that share their hash with a sub-piece of data that is retrievable by backup service 210 need not be uploaded to backup service 210. Instead, these sub-pieces of data can be directly retrieved by backup service 210. The other sub-pieces of data on computing device storage 231 can be uploaded from computing device 230. They include data that is not part of the original download. Backup service 210 can then combine all the sub-pieces of data that have been acquired to re-create the piece present on computing device storage 231.
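The hash-list comparison above can be sketched under some simplifying assumptions: fixed-size 4-byte sub-pieces, SHA-256, and hypothetical helpers `plan_upload` (run on the device, given only the set of hashes the service can already retrieve) and `reassemble` (run by the service).

```python
import hashlib


def digest(piece: bytes) -> bytes:
    return hashlib.sha256(piece).digest()


def split(data: bytes, size: int = 4):
    return [data[i:i + size] for i in range(0, len(data), size)]


def plan_upload(device_piece: bytes, available_hashes: set, size: int = 4):
    """Return the ordered hash list and only the sub-pieces needing upload."""
    recipe, uploads = [], {}
    for sub in split(device_piece, size):
        d = digest(sub)
        recipe.append(d)
        if d not in available_hashes:
            uploads[d] = sub  # not part of the original download
    return recipe, uploads


def reassemble(recipe, uploads, retrievable_piece: bytes, size: int = 4):
    """Combine retrieved and uploaded sub-pieces into the device's piece."""
    pool = {digest(sub): sub for sub in split(retrievable_piece, size)}
    pool.update(uploads)
    return b"".join(pool[d] for d in recipe)


original = b"aaaabbbbccccdddd"     # retrievable by the backup service
on_device = b"aaaaBBBBccccdddd"    # locally modified copy
service_hashes = {digest(sub) for sub in split(original)}
recipe, uploads = plan_upload(on_device, service_hashes)
rebuilt = reassemble(recipe, uploads, original)
```

In this example only one of four sub-pieces crosses the upload link, along with the short hash list.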
To reduce the amount of storage needed, some optimizations may be implemented. For example, when the computing device 230 knows that the data it has downloaded is not being saved, the information remembered about that download may be discarded. This may involve the computing device 230 signaling the backup service 210 or proxy 250 to discard that information, including the copy of the downloaded data, immediately.
In another example, recently downloaded data that is not seen during the next backup was evidently not saved by the computing device 230, and its associated remembered information (including the copy of the downloaded data at a proxy 250, if any) can be deleted. It is noted that in the case of multiple devices downloading the same data, any copy of the downloaded data at proxy 250 may be discarded only after it is known that no other computing device 230 using the proxy 250 saved it but has not yet been backed up. Potentially, downloads whose data is known not to be saved by any of the computing devices 230 (except possibly computing devices 230 that have missed the last couple of backups) may have their associated remembered information discarded as well.
Remembered copies of the downloaded data's chunks (e.g., at proxy 250) not incorporated into a backup (e.g., in backup store 280) may be discarded after every device that downloaded the downloaded data has completed a backup. These chunks were downloaded, but not kept by the computing device 230 or were modified to produce new chunks.
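The rule that a shared copy is kept until every downloading device has completed a backup can be sketched with a simple bookkeeping structure. The `pending` mapping (download identifier to the set of devices that downloaded it and have not yet been backed up) and the function name are hypothetical.

```python
def complete_backup(pending: dict, device_id: str):
    """Mark device_id as backed up; return downloads whose copies may go."""
    discardable = []
    for download_id in list(pending):
        devices = pending[download_id]
        devices.discard(device_id)
        if not devices:
            # Every device that downloaded this data has now been backed up,
            # so anything worth keeping is already in the backup store.
            discardable.append(download_id)
            del pending[download_id]
    return discardable


pending = {"d1": {"A", "B"}, "d2": {"A"}}
after_a = complete_backup(pending, "A")
remaining = {k: set(v) for k, v in pending.items()}
after_b = complete_backup(pending, "B")
```

A policy that tolerates devices that have missed several backups, as mentioned above, would need an additional timeout or exclusion rule not shown here.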
In another example, heuristics may be deployed to discard first remembered information about data thought least likely to be saved. For example, MP3s and PDFs are more likely to be saved than HTML pages, and thus information about downloads of HTML pages may be discarded before information about downloads of MP3 and PDF files.
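A file-type heuristic of this kind could be sketched as below. The likelihood values are invented for illustration only, and the table, default, and function names are hypothetical; the description states only the relative ordering of MP3 and PDF files versus HTML pages.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative, assumed likelihoods that a download of this type is saved.
SAVE_LIKELIHOOD = {".mp3": 0.9, ".pdf": 0.8, ".html": 0.1}
DEFAULT_LIKELIHOOD = 0.5  # assumed value for unlisted types


def save_likelihood(url: str) -> float:
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return SAVE_LIKELIHOOD.get(extension, DEFAULT_LIKELIHOOD)


def discard_order(urls):
    # Least likely to be saved comes first, so it is discarded first.
    return sorted(urls, key=save_likelihood)


order = discard_order([
    "https://example.com/song.mp3",
    "https://example.com/index.html",
    "https://example.com/paper.pdf",
])
```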
In a second illustration, the computing device 230 (or the ISP or proxy 250) remembers where data was downloaded from (e.g., URL, any cookies used, etc.) and the hashes, lengths, and offsets of the chunks that make up the downloaded data. Hashing can be done either on the computing device 230 or a node that the data passes through during a download (e.g., proxy 250). During a backup operation, deduplication is done as usual except that the remembered hash lists are also consulted. Only if a chunk matches a remembered hash does the backup service 210 use this information; in that case the backup service 210 is given (or already has) the associated information needed either to try to retrieve the downloaded data directly from the network node 240 and extract the corresponding chunk(s), or to extract the chunk(s) directly from the copy made at proxy 250.
It is possible that the retrieval from the network node 240 fails (e.g., non-cookie form of password protection; cookie has expired; SSL being used). It is also possible that the retrieval appears to work, but the returned data at the location of the desired chunk has a different hash because the underlying data at the network node 240 has changed. In either case, the chunk may be uploaded from the computing device(s) 230.
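The fallback behavior described above can be sketched as a single function. The callables `fetch_download` and `upload_chunk` are hypothetical stand-ins for retrieving the download from the network node and for uploading the chunk from the device; treating a failed retrieval as an `OSError` and using SHA-256 are likewise assumptions.

```python
import hashlib


def obtain_chunk(fetch_download, upload_chunk, offset, length, expected_hash):
    """Try the original source first; fall back to uploading from the device."""
    try:
        data = fetch_download()
    except OSError:
        data = None  # e.g., expired cookie or password-protected source
    if data is not None:
        candidate = data[offset:offset + length]
        # The retrieval may appear to work yet return changed content,
        # so verify the chunk against the remembered hash.
        if hashlib.sha256(candidate).hexdigest() == expected_hash:
            return candidate, "source"
    return upload_chunk(), "device"


chunk = b"world"
expected = hashlib.sha256(chunk).hexdigest()
result = obtain_chunk(lambda: b"hello world", lambda: chunk, 6, 5, expected)
```

Verifying the hash before accepting the retrieved bytes is what makes it safe to try the source at all: a stale or altered source simply results in an upload.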
In cases where data requires a current SSL connection for retrieval, the computing device 230 may assist the backup service 210 by opening a new SSL connection through the backup service 210, which the backup service 210 then uses to retrieve the downloaded data. In another example, computing device 230 may be configured to trust not only SSL certificates signed via one of the usual roots of trust (e.g., VERISIGN or DIGICERT), but to also trust certificates issued by the internet service provider (ISP) or the backup provider, such that backup service 210 or proxy 250 may perform a “man-in-the-middle” (MITM) “attack” against computing device 230 and hence access the data (or the identifier for the data and associated authentication information such as cookies) by bypassing the SSL encryption. Although bypassing SSL via a MITM attack may be controversial, and raises some reputational risk for the provider of the backup service, for mobile devices which use exceptionally expensive bandwidth, performing a MITM against SSL may be implemented.
While some data may no longer be retrievable (and hence needs to be uploaded), this illustration (
The illustrations described above may also be combined. For example, data that is hard to retrieve (e.g. SSL, certain dynamically changing websites) may be directly remembered, and data that is easy to retrieve may only be remembered by location and hash(es). Likewise, some files may be remembered at the whole file level, and other files may be remembered at the chunk level. The more likely a file seems to be only partially saved (e.g., saved then partially overwritten or changed), the more that may be remembered at the chunk level.
It is noted that local deduplication may also be implemented, at least at the file level in order to conserve space, and store only a single copy of data at proxy 250 and/or at backup store 280.
The computing device storage 300 also includes a variety of downloaded data. For example, application software 330a may have been downloaded from the Internet or other network site (e.g., an enterprise network) for installation on the computing device. In another example, downloaded data 330b such as videos, music, and PDF files may have been downloaded from the Internet or other network site.
The computing device in this illustration may be associated with an online or cloud backup service 340, which backs up data in the computing device storage 300 in an off-site data store 345 (e.g., in the cloud or at an enterprise data center). Uploading 301 all of the data from the computing device storage 300 to the data store 345 consumes expensive and potentially limited bandwidth that could be used to speed up other network communications, and can slow processes at the computing device during the backup process.
Instead, the backup service 340 may use remembered information about the data stored on the computing device storage 300. This remembered information 350 may be kept by the computing device and includes at least the sources of the downloaded data. For example, the computing device may have downloaded 302 application software 330a and/or downloaded data 330b from network site(s) 360. Accordingly, backup agent 232 remembers that application software 330a and/or downloaded data 330b was downloaded from the network site(s) 360, and therefore does not upload application software 330a and/or downloaded data 330b as part of the backup.
Only data that was not downloaded (e.g., locally provided data 310a, locally installed application software 310b, and locally generated data 310c) is uploaded 301 to the data store 345. In an example, the backup service 340 retrieves the downloaded data 330a-b directly from the source 360 and stores the downloaded data 330a-b in the data store 345 as part of the backup process.
The computing device in this illustration may be associated with an online or cloud backup service 440, which backs up data in the computing device storage 400 at data store 445. Again, uploading 401 all of the data from the computing device storage 400 to the data store 445 consumes expensive and limited bandwidth that could be used to speed up other network communications, and can slow processes at the computing device during the backup process.
Instead, the backup service 440 may use remembered information 450 (e.g., provided by the proxy node(s) 470) about the data downloaded to the computing device storage 400. In this illustration, all downloads 402 to the computing device storage 400 were via the proxy 470. For example, when the computing device downloaded 402 application software 430a and/or downloaded data 430b from network site(s) 460, the proxy 470 remembered information about the downloaded information (e.g., a URL) and/or also stored a copy of that data, e.g., in data store 475 (although the proxy may also be associated with data store 445 of the backup service 440).
Accordingly, the backup service 440 and/or proxy node(s) 470 remembers that application software 430a and/or downloaded data 430b was downloaded via the proxy 470, and therefore does not have to upload application software 430a and/or downloaded data 430b from the computing device as part of the backup.
Again, only data that was not downloaded (e.g., locally provided data 410a, locally installed application software 410b, and locally generated data 410c) is uploaded 401 by the backup service 440 to the data store 445. In an example, the backup service 440 retrieves the application software 430a and downloaded data 430b from the proxy 470.
It is noted that the backup service in any of these illustrations (
Although shown separately, the techniques illustrated by
Before continuing, it should be noted that the examples described above are provided for purposes of illustration, and are not intended to be limiting. Other devices and/or device configurations may be utilized to carry out the operations described herein.
a-c are flowcharts illustrating example operations that may be implemented to reduce bandwidth usage of a computing device. Operations may be embodied as logic instructions on one or more computer-readable media. When executed on one or more processors, the logic instructions cause a general purpose computing device to be programmed as a special-purpose machine that implements the described operations. In an example, the components and connections depicted in the figures may be used.
The operations shown and described herein are provided to illustrate example implementations. It is noted that the operations are not limited to the ordering shown. Still other operations may also be implemented.
a illustrates sub operations 530 and 535. Operation 530 includes remembering information for repeating the download, including a source of the downloaded data. Accordingly, operation 535 may include backing up the computing device by retrieving at least some of the downloaded data from the source of the downloaded data instead of from the computing device.
b illustrates sub operations 540 and 545. Operation 540 includes remembering one or more signatures for one or more pieces of the downloaded data. Accordingly, operation 545 includes backing up the computing device by using the one or more signatures to determine which pieces of data on the computing device are available from a remembered location.
c illustrates sub operations 550-556. Operation 550 includes remembering information for the download by routing the download for the computing device through at least one proxy node. Operation 552 includes storing the downloaded data at or via the at least one proxy node. Operation 554 includes remembering that the downloaded data is stored at or via the at least one proxy node. Accordingly, operation 556 includes backing up the computing device by retrieving some of the downloaded data from the at least one proxy node instead of from the computing device.
The operations may be implemented at least in part using an end-user interface (e.g., web-based interface). In an example, the end-user is able to make predetermined selections to configure the backup operation, and the operations described above are implemented on a back-end device to present results to a user. The user can then make further selections. It is also noted that various of the operations described herein may be automated or partially automated.
It is noted that the examples shown and described are provided for purposes of illustration and are not intended to be limiting. Still other examples are also contemplated.