The present invention generally relates, in a first aspect, to a method for content delivery through a Content Delivery Network (CDN), and more particularly to a method comprising using a tracker for coordinating entities of said Content Delivery Network.
A second aspect of the invention relates to a tracker designed in order to implement the method of the first aspect.
In a P2P network like BitTorrent [1][2], the tracker acts as the central coordinating entity for the P2P transfer of files among requesting end users. In BitTorrent, torrent files are served from a web site; the tracker maintains information about all clients participating in each torrent.
In a classic download (typically an HTTP or FTP request), a client connects to the server that holds the content and the file transfer occurs over a single connection. The BitTorrent protocol differs from a classical download in several ways: (a) BitTorrent requests many small blocks of data over different TCP connections to different machines. (b) BitTorrent downloads blocks of a file in a "rarest-first" mode, in which the rarest pieces of the file among a peer's neighbours are downloaded first. This ensures that if one or more peers leave the torrent, the rare file blocks remain available for download. In a classical download, the file is downloaded sequentially and all at once [1][2].
Since other clients behave as servers in distributing content, BitTorrent downloads are very cost effective for content owners. In addition, the BitTorrent protocol has greater resistance to flash crowds than serving content from a server over a single connection. On the downside, since a client (or peer) downloads pieces of files from peers at different rates, it may see longer download times than a peer downloading a file at a high rate over a single connection.
CDNs have been around for well over a decade. As a consequence, a significant number of CDN designs exist. However, none of them uses a tracker as an element to coordinate the elements of the CDN. Existing CDN designs rely on a hierarchy of DNS servers [4], use HTTP redirection [5][7] to identify an end point, or use the requesting user's location to determine the closest edge location that is best positioned [6] to serve content.
Only CoralCDN [8] is based on a P2P architecture, motivated in part by the original BitTorrent protocol [2]. However, that CDN is based on a DHT and is trackerless. Only the original BitTorrent protocol uses the tracker as a central entity that aids peers in data sharing.
Trackers in P2P networks like BitTorrent have been designed to serve two primary purposes: (1) keeping track of every active torrent, thereby identifying both the network and the end users uploading and downloading files; and (2) keeping track of the fragments of a file that each client possesses, thereby assisting peers in efficient data sharing. When a peer requests content for download, the tracker returns a list of peers that are part of the torrent. The client then connects to those peers and starts downloading the content file. Several P2P tracker designs have been proposed and implemented [2]. However, they are very similar in design and function. The key difference between the implementations lies in how the trackers identify fast peers for file sharing to speed up download times.
In [10], the tracker references information about other peers (which may be associated with different trackers) so that they can come together to form a P2P cloud and speed up content sharing. This is merely a variant of the tracker design in the BitTorrent architecture. Similarly, [11] uses a variety of criteria to find fast peers to speed up content download.
The tracker in the service provider's CDN is designed with different services in mind: the tracker coordinates all of the various entities in the CDN. In addition, on request, the tracker helps an end point (or content server) identify other end points in its neighbourhood that can help exchange content when needed. Identifying peers for P2P content distribution forms only a very small part of the tracker design.
Next, terminology and definitions useful for understanding the present invention, as well as the proposals cited in the present section, are included.
PoP: A point-of-presence is an artificial demarcation or interface point between two communication entities. It is an access point to the Internet that houses servers, switches, routers and call aggregators. ISPs typically have multiple PoPs.
Autonomous System (AS): An autonomous system is a collection of IP routing prefixes that are under the control of one or more network operators and presents a common, clearly defined routing policy to the Internet.
Content Delivery Network (CDN): This refers to a system of nodes (or computers) that contain copies of customer content that is stored and placed at various points in a network (or public Internet). When content is replicated at various points in the network, bandwidth is better utilized throughout the network and users have faster access times to content. This way, the origin server that holds the original copy of the content is not a bottleneck.
ISP DNS Resolver: Residential users connect to an ISP. Any request to resolve an address is sent to a DNS resolver maintained by the ISP. The ISP DNS resolver will send the DNS request to one or more DNS servers within the ISP's administrative domain.
URL: Simply put, a Uniform Resource Locator (URL) is the address of a web page on the world-wide web. Every URL is unique; if two URLs are identical, they point to the same resource.
URL (or HTTP) Redirection: URL redirection is also known as URL forwarding. A page may need redirection if: (1) its domain name has changed, (2) meaningful aliases are needed for long or frequently changing URLs, (3) users misspell the domain name when typing it, or (4) visitor traffic needs to be steered. For the purpose of the present invention, a typical redirection service is one that redirects users to the desired content. A redirection link can be used as a permanent address for content that frequently changes hosts (much like DNS).
Bucket: A bucket is a logical container for a customer that holds the CDN customer's content. A bucket either makes a link between origin server URL and CDN URL or it may contain the content itself (that is uploaded into the bucket at the entry point). An end point will replicate files from the origin server to files in the bucket. Each file in a bucket may be mapped to exactly one file in the origin server. A bucket has several attributes associated with it—time from and time until the content is valid, geo-blocking of content, etc. Mechanisms are also in place to ensure that new versions of the content at the origin server get pushed to the bucket at the end points and old versions are removed.
A customer may have as many buckets as she wants. A bucket is really a directory that contains content files. A bucket may contain sub-directories and content files within each of the sub-directories.
Geo-location: It is the identification of the real-world geographic location of an Internet-connected device. The device may be a computer, mobile device or an appliance that allows an end user to connect to the Internet. The IP-address geo-location data can include information such as the country, region, city, zip code and latitude/longitude of a user.
Operating Business (OB): An OB is an arbitrary geographic area in which the service provider's CDN is installed. An OB may operate in more than one region. A region is an arbitrary geographic area and may represent a country, part of a country or even a set of countries. An OB may consist of more than one region. An OB may be composed of one or more ISPs. Each region in an OB contains exactly one tracker. An OB has exactly one instance of the Topology Server.
Partition ID: It is a global mapping of IP address prefixes into integers. This is a one-to-one mapping, so no two IP address prefixes within an OB's domain can map to the same PID.
Consistent Hashing: This method provides hash-table functionality in such a way that adding or removing a slot does not significantly alter the mapping of keys to slots. Consistent hashing is a way of distributing requests among a large and changing population of web servers. The addition or removal of a web server does not significantly alter the load on the other servers.
MD5: In cryptography, MD5 is a widely used cryptographic hash function with a 128-bit hash value. MD5 is commonly used to test the integrity of files. An MD5 hash is typically expressed as a hexadecimal number.
DSLAM: A DSLAM is a network device that resides in a telephone exchange of a service provider. It connects multiple customer Digital Subscriber Lines (DSLs) to a high-speed Internet backbone using multiplexing. This allows the telephone lines to make a faster connection to the Internet. Typically, a DSLAM serves several hundred residents (no more than a few thousand residents at the most).
Distributed Hash Table (DHT): A distributed hash table is a class of distributed system that provides a lookup service similar to a hash table, storing (key, value) pairs. Any node can retrieve the value associated with a given key. The responsibility of maintaining the mapping from keys to values is distributed among the nodes in such a way that any change in the set of participants causes minimal disruption.
DHTs are used to build many complex services such as distributed file systems, peer-to-peer file sharing and content distribution systems.
The role of trackers used in P2P data transfer of bittorrent is next described:
A BitTorrent tracker [1] is a server that assists communication between peers in the BitTorrent protocol [2].
If the tracker is taken offline, the peers will be unable to share the P2P files. More recently, the tracker functionality was decentralized using DHT making the torrents more independent from the tracker.
The requirements of a tracker in a CDN are significantly different from that of a BitTorrent tracker.
It is necessary to offer an alternative to the state of the art that covers the gaps found therein, particularly those related to the lack of proposals providing the requirements a tracker implemented in a CDN needs to have.
Said requirements are:
To address the stated needs, the present invention relates to a method for content delivery through a Content Delivery Network, which comprises using a tracker for coordinating the entities that make up the infrastructure of said CDN. Said tracker has a CDN layer comprising interfaces for at least part of said entities and a network layer for providing network and communication services to said CDN layer.
Said CDN infrastructure entities are one or more of the following: origin servers, trackers, end points, topology servers, DNS servers and an entry point.
According to a second aspect, the present invention relates to a tracker for content delivery through a Content Delivery Network that comprises a CDN layer and a network layer to carry out the method of the first aspect of the invention. The tracker of the second aspect is therefore designed to perform the tasks described in the first aspect.
Other embodiments of the method of the first aspect of the invention are described in appended claims 2 to 20, and in a subsequent section related to the detailed description of several embodiments. Said embodiments are also valid for describing the tracker of the second aspect of the invention.
The previous and other advantages and features will be more fully understood from the following detailed description of embodiments, with reference to the attached drawings, which must be considered in an illustrative and non-limiting manner, in which:
Next, each component of the CDN service provider's sub-system is described. The infrastructure consists of Origin Servers, Trackers, End Points and Entry Point.
A CDN customer has two options for uploading content. The customer can either upload files into the bucket or give URLs of the content files that reside at the CDN customer's website. Once content is downloaded by the CDN infrastructure, the files are moved to another directory for post-processing. The post-processing steps involve checking the files for consistency and any errors. Only then is the downloaded file moved to the origin server. The origin server contains the master copy of the data.
Next, with reference to
The primary functions of the CDN tracker are: coordinating the various CDN elements; helping synchronize information between CDN elements and end points; participating in DNS resolution; identifying the least loaded end points that are best positioned to serve content to requesting end users; and using current network information to identify the least cost path between a requesting end user and a serving end point. The tracker is also the element that maps content to end points using consistent hashing [1][9].
The tracker detailed in this invention is the entity that enables intelligence and coordination among elements in the CDN infrastructure. The tracker also helps balance the load across all the end points in the OB that deploys the CDN. Generally, there is exactly one tracker deployed per region in an OB. The tracker design consists of two layers: the network layer and the CDN layer (see
The network layer provides transport and communication services to the CDN layer. The transport services use standard protocols: TCP and HTTP. The tracker also participates in DNS resolution. The CDN layer consists of: a consistent hashing module, a neighbour management module, a load balancer module and a DNS resolution module.
In addition, the tracker has web services interfaces for communication with the end points, the CDN content manager, the Topology Server and the DNS server.
The tracker maintains interfaces with the following four entities of the CDN eco-system: end point, CDN manager, Topology Server and DNS server. The communication with each of the CDN entities occurs via RPCs. The RPCs may take any format: XML, binary, JSON object, REST API call etc., with HTTP as the transport mechanism. The interfaces between the tracker and the other CDN entities are the following (see
End Point:
The tracker (a) maintains information about content at each end point and (b) collects statistics periodically from each end point. The tracker maintains the following information for each end point: the number of outbound bytes and inbound bytes between two reporting periods, the available free disk space and the number of active connections for each bucket. The tracker uses this information to infer the load on an end point.
In response to the end point statistics, the tracker returns a list of active neighbours to the end point. This ensures that at each time, every end point has a fresh set of active neighbours that it can use for P2P communication.
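The load inference described above can be sketched as follows. This is a hypothetical illustration, not the actual tracker implementation: the field names, thresholds and capacity figures are assumptions.

```python
# Sketch of how a tracker might infer end point load from the periodically
# reported statistics. Field names and thresholds are illustrative only.

def infer_load(stats, max_connections=1000, min_free_disk=10 * 2**30,
               link_capacity_bytes=10 * 2**30):
    """Return a load score in [0, 1]; higher means busier."""
    conn_load = min(stats["active_connections"] / max_connections, 1.0)
    # An end point running out of disk is treated as fully loaded.
    disk_pressure = 1.0 if stats["free_disk_bytes"] < min_free_disk else 0.0
    # Outbound traffic between two reporting periods, normalised to capacity.
    link_load = min(stats["outbound_bytes"] / link_capacity_bytes, 1.0)
    return max(conn_load, disk_pressure, link_load)

stats = {"active_connections": 250,
         "free_disk_bytes": 500 * 2**30,
         "outbound_bytes": 2 * 2**30}
print(infer_load(stats))  # 0.25
```

A real deployment would combine the reported parameters according to the provider's own policy; taking the maximum of the normalised components is just one simple choice.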
CDN Manager:
Any change in the meta-data of a bucket (or the file in a bucket) by a customer is reflected at the CDN Manager immediately. Since the tracker synchronizes the buckets with the CDN Manager periodically, any change in the bucket meta-data is reflected at the tracker. The tracker also synchronizes with the end points frequently. So, any change in the meta-data of the bucket (or any file in a bucket) at the CDN manager is propagated to the end points in a very short time.
DNS Server:
The tracker gets a file that contains information about regions, called regionsdb, from the TLD DNS server.
This information is useful for an end point in order to determine the region of an originating request. If the region of the originating request is not the same as that of the end point, the end point returns an HTTP 302 while encoding the region as part of the URL. When the end user makes a request for the new URL, the TLD DNS server identifies the correct region and forwards the request to the DNS server authoritative for that region.
The regionsdb is also useful in performing geo-blocking of clients from content that may not be viewed from certain locations.
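The redirect-on-region-mismatch behaviour described above can be sketched as follows. The region labels and the redirect URL format are assumptions for illustration; the invention does not specify how the region is encoded in the URL.

```python
# Sketch of the region check an end point might perform using the regionsdb.
# Region labels and the URL encoding scheme are hypothetical.

def handle_request(request_region, endpoint_region, bucket_id, path):
    if request_region == endpoint_region:
        return 200, path  # serve the content locally
    # Encode the correct region in the redirect URL so that the TLD DNS
    # server can forward the request to the authoritative DNS server
    # for that region.
    redirect = f"http://b{bucket_id}.{request_region}.t-cdn.net/{bucket_id}/{path}"
    return 302, redirect

status, location = handle_request("es-mad", "es-bcn", 87, "video01.fly")
print(status, location)  # 302 http://b87.es-mad.t-cdn.net/87/video01.fly
```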
Topology Server:
The tracker fetches information about the partitions (or subnets), pidlocdb, and the cost matrix between partitions (called costmatrix) from the Topology Server. Both pieces of information are fetched periodically.
The interaction of the tracker is summarized as follows, and illustrated in
Tracker and End Points:
(1) allbuckets: This is called to get information about buckets. No information is returned if the buckets have not changed since last request (i.e. no new bucket was created and there was no change in meta-data of any bucket).
end points->tracker: HTTP GET request
tracker->end points: HTTP response
(2) updateNodeStats: Called periodically by end points to report node level statistics (via HTTP POST). In return, a list of active neighbours is piggybacked to the end point.
end points->tracker: HTTP POST
tracker->end points: HTTP response to POST and list of active neighbours of the end point serving the statistics.
(3) updateRegionsdb: Called to get the latest regionsdb. Only new updates are sent rather than the entire database. Every time a new region is created or removed by an OB, the regionsdb table is updated. Since end points help resolve DNS requests, the latest regionsdb table needs to be propagated to the end points as soon as a new region is created.
end points->Tracker: HTTP GET request
Tracker->end points: HTTP response with a copy of the regionsdb.
(4) pidlocdb: Called by an end point to retrieve the PID and IP prefix/mask associated with each region in an OB.
end points->Tracker: HTTP GET request
Tracker->end points: HTTP response with a copy of the pidlocdb
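The four RPCs above can be modelled in-memory as follows. This is a behavioural sketch only: the real exchange is HTTP GET/POST, and all data values besides the RPC names are invented for illustration.

```python
# Minimal in-memory model of the four tracker RPCs used by end points.
# Payload contents are hypothetical; only the RPC names come from the text.

class Tracker:
    def __init__(self):
        self.buckets = {"87": {"meta": "v1"}}
        self.regionsdb = {"es": ["mad", "bcn"]}
        self.pidlocdb = {1: "10.0.0.0/8"}
        self.known_endpoints = {}

    def allbuckets(self, since_version=None):
        # Return nothing if the buckets have not changed since the last call.
        return None if since_version == "v1" else self.buckets

    def update_node_stats(self, endpoint, stats):
        # Record the statistics and piggyback a list of active neighbours.
        self.known_endpoints[endpoint] = stats
        return [e for e in self.known_endpoints if e != endpoint]

    def update_regionsdb(self):
        return self.regionsdb

    def get_pidlocdb(self):
        return self.pidlocdb

t = Tracker()
print(t.allbuckets())                            # {'87': {'meta': 'v1'}}
print(t.allbuckets(since_version="v1"))          # None
t.update_node_stats("ep1", {"conns": 3})
print(t.update_node_stats("ep2", {"conns": 5}))  # ['ep1']
```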
Tracker and DNS Server:
(1) Get regionsdb to identify the list of end points for a bucket id and geographic information. In case of changes in the regionsdb, only the updates are sent.
tracker->DNS Server: HTTP GET request
DNS Server->tracker: HTTP response with a copy of the regionsdb.
Tracker and CDN Manager:
(1) allbuckets: Get all buckets from the publication manager that resides at the publication server.
tracker->CDN manager: HTTP GET
CDN manager->tracker: HTTP response
(2) geodb: Get the latest geo-location database from the CDN manager. This is useful in order to ensure that the end points allow requests for content to proceed only if the requesting end user belongs to a region where the content may be shown.
tracker->CDN manager: HTTP GET
CDN manager->tracker: HTTP response with a copy of the geo-location database.
Tracker and Topology Server:
(1) pidlocdb: Get the list of PIDs (partition IDs and the corresponding IP prefixes) from the topology server that maintains the latest PID/IP prefixes pairs for all regions.
tracker->topology server: HTTP GET
topology server->tracker: HTTP response with a copy of the PID location database.
(2) costmatrix: Get the unidirectional cost of transferring data between all PIDs (path between PID in row i and PID in column j for all i and j, where i and j are PIDs). If the path between two PIDs does not exist, the matrix location for such a path contains a negative value that is not considered in calculating the cost.
tracker->topology server: HTTP GET
topology server->tracker: HTTP response with a copy of the cost matrix.
The tracker uses the costmatrix received from the Topology server to determine routing between source and destination (requesting end user) PIDs.
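The costmatrix lookup described above can be sketched as follows. The PID numbering and cost values are invented; the sketch only illustrates the rule that negative entries mark non-existent paths and are excluded from the cost calculation.

```python
# Sketch of selecting the least-cost serving PID from the costmatrix.
# A negative entry marks a non-existent path and is skipped, per the text.

def best_serving_pid(costmatrix, user_pid, candidate_pids):
    """Pick the candidate PID with the lowest cost to the user's PID."""
    best, best_cost = None, None
    for pid in candidate_pids:
        cost = costmatrix[pid][user_pid]
        if cost < 0:  # no path exists between these two PIDs
            continue
        if best_cost is None or cost < best_cost:
            best, best_cost = pid, cost
    return best

costmatrix = {
    0: {0: 0, 1: 5, 2: -1},
    1: {0: 5, 1: 0, 2: 3},
    2: {0: -1, 1: 3, 2: 0},
}
print(best_serving_pid(costmatrix, user_pid=2, candidate_pids=[0, 1]))  # 1
```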
Since all requests to identify the end point that is best positioned to serve content come through the tracker, it is the natural element to balance end user requests across all of the end points.
As per the design of the DNS sub-system in the service provider's CDN, the tracker load-balances the requests across end points that are not heavily loaded. This allows the CDN infrastructure to scale with the number of requests. The end points in turn either (a) send an HTTP 302 Redirect message to the requesting end user or (b) identify themselves as best positioned to serve the content.
The tracker may load-balance the requests by any one of the following algorithms: (a) round-robin, (b) geographic location, giving preference to end points in the same region, or (c) any policy that associates content with a small subset of end points (either because of the popularity of the content or because the end points are configured to serve only certain types of content).
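Two of these policies can be sketched briefly. The end point names and region labels are hypothetical; the sketch shows plain round-robin selection and a geographic filter that prefers end points in the requester's region.

```python
# Sketch of two load-balancing policies: round-robin and same-region
# preference. End point names and regions are illustrative only.
import itertools

def round_robin(endpoints):
    """Return a picker that cycles through the end points in order."""
    cycle = itertools.cycle(endpoints)
    return lambda: next(cycle)

def prefer_same_region(endpoints, request_region):
    """Prefer end points in the requester's region; fall back to all."""
    same = [e for e in endpoints if e["region"] == request_region]
    return same if same else endpoints

eps = [{"name": "BCN4", "region": "bcn"}, {"name": "MAD2", "region": "mad"}]
pick = round_robin(eps)
print(pick()["name"], pick()["name"], pick()["name"])       # BCN4 MAD2 BCN4
print([e["name"] for e in prefer_same_region(eps, "mad")])  # ['MAD2']
```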
The resource management mechanism is designed to allow the CDN to balance the requests across the CDN's end points. To balance the load, we use consistent hashing.
A key reason to use consistent hashing is that adding a node or taking down a node does not significantly change the mapping of content to end points. In contrast, with traditional hash tables, changing the number of end points causes nearly all the content to be remapped to different end points.
The resource management mechanism at the tracker accomplishes the following: (1) It maps content to end points that are distributed geographically within a country or a region. (2) It maintains a mapping of IP subnet addresses to partition IDs. By identifying the PID of the end user, and knowing the PID of the content, the end point knows whether the requested content may be served or must be geo-blocked. (3) The end point uses a PID matrix that has weights associated with every pair of PIDs. This allows the resource management mechanism to identify the best PID (and therefore, the subnet) that can serve the content. Subsequently, the tracker forwards the request to the end point that has the content in the PID identified in the previous step.
The end point serves as a redirector for a client request. As part of this redirection, the end point needs to identify the PID that may best serve the content. This identification is performed using consistent hashing at the end points.
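The consistent hashing technique referred to above can be sketched as a minimal hash ring. MD5 is used as the hash function since the document mentions it; the class name, replica count and end point names are assumptions, and a production ring would add features such as weighting.

```python
# Minimal consistent-hashing ring mapping content keys to end points.
# MD5 is taken from the document's terminology; all other names are
# illustrative.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, endpoints, replicas=100):
        self._ring = []  # sorted list of (hash, endpoint) points
        for ep in endpoints:
            # Virtual nodes (replicas) smooth the load distribution.
            for i in range(replicas):
                h = int(hashlib.md5(f"{ep}:{i}".encode()).hexdigest(), 16)
                bisect.insort(self._ring, (h, ep))

    def endpoint_for(self, content_key):
        """Return the first ring point at or after the key's hash."""
        h = int(hashlib.md5(content_key.encode()).hexdigest(), 16)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["ep-bcn", "ep-mad", "ep-glb"])
ep = ring.endpoint_for("87/video01.fly")
# The same key always maps to the same end point while membership is stable.
assert ep == ring.endpoint_for("87/video01.fly")
```

Removing one end point from the constructor list leaves most keys mapped to the same end points, which is exactly the property the tracker relies on when nodes join or leave.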
From time to time, end points may need to be brought down either for maintenance or because they need to be replaced/upgraded. For ease of administration, we provide the CDN administrator, the ability to bring down end point(s). Similarly, we also provide API calls to enable and disable end points.
End points can be disabled at the tracker with an API call. The API /api/tracker/policies/disablenodes is called with a JSON object like: {'disabled_endpoints': [node0, node1, . . . , nodeN−1]}. Here, node0 to nodeN−1 are a list of IP addresses that need to be disabled by the tracker. A detailed description for disabling an end point is presented in
Prior to disabling an end point, the tracker ensures that (1) no end user is accessing content at the end point (and if they are accessing content, the tracker ensures that the end point finishes processing ongoing requests). (2) The end point is no longer considered to be part of the CDN infrastructure when directing subsequent requests for content from end users to end points.
The corresponding API call to enable endpoints, namely enablenodes is called with a JSON object like: {‘enable_endpoints’: [node0, node1, . . . , nodeN−1]}. Here, node0 to nodeN−1 are a list of IP addresses that need to be enabled by the tracker.
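The construction of these two requests can be sketched as follows. The disablenodes path and JSON keys come from the text; the full path of the enablenodes call is an assumption, inferred by analogy.

```python
# Illustrative construction of the disable/enable requests described above.
# The enablenodes path is assumed by analogy with disablenodes.
import json

def disable_request(nodes):
    return ("/api/tracker/policies/disablenodes",
            json.dumps({"disabled_endpoints": nodes}))

def enable_request(nodes):
    return ("/api/tracker/policies/enablenodes",
            json.dumps({"enable_endpoints": nodes}))

path, body = disable_request(["10.0.0.1", "10.0.0.2"])
print(path)  # /api/tracker/policies/disablenodes
print(body)  # {"disabled_endpoints": ["10.0.0.1", "10.0.0.2"]}
```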
When an end point joins: An end point is handed the address of the tracker as part of the configuration. As part of the initialization, the end point contacts the tracker. The end point keeps an open connection with the tracker. This allows the tracker to know the status of every end point. The end points use this connection to send the node statistics to the tracker periodically.
If the connection closes unexpectedly, the end point will attempt to reconnect with the tracker by opening another connection.
When an end point leaves unexpectedly: If the tracker does not receive statistics update from an end point for a period of time (or the connection with the tracker breaks), it assumes that the end point is no longer part of the CDN infrastructure. As a result, the tracker does not take into account such a node for content distribution (and hence, for consistent hashing or as a neighbour for the other end points).
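The liveness rule above can be sketched as a timeout on the last statistics report. The timeout value and class structure are assumptions; the text only specifies that silence "for a period of time" removes an end point from content distribution.

```python
# Sketch of end point liveness tracking: an end point that has not reported
# statistics within a timeout is dropped. The timeout value is an assumption.
import time

class Liveness:
    def __init__(self, timeout_s=120):
        self.timeout_s = timeout_s
        self.last_report = {}

    def report(self, endpoint, now=None):
        """Record the time of an end point's latest statistics update."""
        self.last_report[endpoint] = now if now is not None else time.time()

    def active_endpoints(self, now=None):
        """End points still considered part of the CDN infrastructure."""
        now = now if now is not None else time.time()
        return [ep for ep, t in self.last_report.items()
                if now - t <= self.timeout_s]

lv = Liveness(timeout_s=120)
lv.report("ep1", now=1000.0)
lv.report("ep2", now=1090.0)
print(lv.active_endpoints(now=1150.0))  # ['ep2']
```

Only the active set would then feed into consistent hashing and neighbour selection, mirroring the behaviour described in the text.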
The tracker is responsible for returning a list of end points that are best positioned to serve the requested content to the end user. This is described as part of the DNS resolution process, which deals with returning a list of end points to the end user.
The tracker has a list of parameters for each end point to aid in geo-location. These parameters are:
IP address: The tracker infers the geographic location of the end point using its IP address and the mask.
Site ID: This provides better location information. A tracker may use the Site ID to determine whether two end points may exchange data using a P2P protocol. Within the same datacenter, a CDN service provider may label clusters of machines on different floors with different site IDs (network connectivity between floors may vary).
PID: The tracker may determine the PID of the end point using the pidlocdb database to infer the partition ID and then use the Site ID to infer if two machines are really co-located.
The tracker also has access to a Geo-IP database (called geodb) that can be used to identify the location of an IP address (end point). The IP address, together with the geodb, helps the tracker resolve the location of an end point when needed.
While a very fine-grained Geo-IP database may resolve an IP address at the block level, using Site ID we are able to resolve the location of a cluster of machines within a datacenter. This gives our tracker better resolution when identifying geo-located machines. Note that we may use PID database instead of a Geo-IP database without compromising on the accuracy of geo-location.
In addition, the tracker maintains the following information about each end point (this information is reported by each end point every minute or every few minutes):
These parameters allow the tracker to infer which end points may be regarded as busy. Since end points report their parameters every 30 seconds (or every few minutes), the tracker always has the latest information for every end point. Individual CDN service providers may use a combination of the above parameters to decide what constitutes a busy end point. When responding to end user requests, the tracker does not use end points identified as busy.
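A "busy" predicate built from the reported parameters can be sketched as follows. The thresholds are illustrative, since, as noted above, each CDN service provider chooses its own combination of parameters.

```python
# Sketch of a "busy end point" predicate from reported statistics.
# All threshold values are illustrative assumptions.

def is_busy(stats, max_conns=800, min_free_bytes=5 * 2**30,
            max_outbound_bytes=8 * 2**30):
    return (stats["active_connections"] > max_conns
            or stats["free_disk_bytes"] < min_free_bytes
            or stats["outbound_bytes"] > max_outbound_bytes)

def serving_candidates(endpoints):
    # Busy end points are excluded when responding to end user requests.
    return [name for name, st in endpoints.items() if not is_busy(st)]

endpoints = {
    "ep1": {"active_connections": 900, "free_disk_bytes": 50 * 2**30,
            "outbound_bytes": 2**30},
    "ep2": {"active_connections": 100, "free_disk_bytes": 50 * 2**30,
            "outbound_bytes": 2**30},
}
print(serving_candidates(endpoints))  # ['ep2']
```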
As part of the DNS resolution request, the tracker must find end points that are geographically close to the requesting end users.
In describing the DNS resolution, the following assumptions are made: The end user has made a request for video01.fly that generates a request to the CDN of the format bucket_id.t-cdn.net/bucket_id/video01.fly. Using a bucket_id=87, the request is of the form b87.t-cdn.net/87/video01.fly.
If the end user fails to connect to the end point BCN4, the end user tries to connect to BCN2, MAD2 and GLB2 in that order.
The tracker has a neighbour manager module. When the tracker sends a list of neighbours to a requesting end point, it orders the end points (neighbours) as follows:
First, the tracker orders the end points by IP addresses (or IP prefixes), so that the end points returned first are part of the same datacenter. The tracker may also use the PID and/or Geo-IP database to infer this information.
For the set of IP addresses that belong to the same prefix, but different site IDs, it orders the neighbours by site ID.
The set of IP addresses received by an end point are then used to engage in P2P communication when sharing content between end points in the same datacenter.
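The ordering steps above can be sketched as a sort key. The prefix length, IP addresses and site IDs are assumptions for illustration; the real tracker may instead consult the PID or Geo-IP database.

```python
# Sketch of neighbour ordering: same-prefix (same datacenter) end points
# first, then ordered by site ID. Addresses and IDs are hypothetical.
import ipaddress

def order_neighbours(requester_ip, neighbours, prefix_len=24):
    req_net = ipaddress.ip_network(f"{requester_ip}/{prefix_len}",
                                   strict=False)
    def key(n):
        same_prefix = ipaddress.ip_address(n["ip"]) in req_net
        # Same-prefix neighbours sort first; within a prefix, by site ID.
        return (0 if same_prefix else 1, n["site_id"])
    return sorted(neighbours, key=key)

neighbours = [
    {"ip": "10.1.2.7", "site_id": 2},
    {"ip": "192.0.2.9", "site_id": 1},
    {"ip": "10.1.2.5", "site_id": 1},
]
ordered = order_neighbours("10.1.2.3", neighbours)
print([n["ip"] for n in ordered])  # ['10.1.2.5', '10.1.2.7', '192.0.2.9']
```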
The service provider may need to implement a number of policies in the CDN. The tracker is the element best placed both to implement and to police these policies. The policies that may be implemented are:
a) Enabling and disabling end points.
b) Reserving content buckets to reside on specific end points (and no others).
c) Reserving end points to serve a type of content (e.g. serving only live content from an end point).
As seen above, the tracker is the most appropriate place to implement and police the policies of the CDN service.
A person skilled in the art could introduce changes and modifications in the embodiments described without departing from the scope of the invention as it is defined in the attached claims.
Number | Date | Country | Kind
---|---|---|---
P201130757 | May 2011 | ES | national

Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/EP2012/058362 | 5/7/2012 | WO | 00 | 3/12/2014