1. Field of the Invention
The present invention relates to computers and computer networks. More particularly, the invention relates to profiling an Internet endpoint associated with an Internet Protocol (IP) address.
2. Background of the Related Art
Profiling what users are doing on the Internet at a global scale, e.g., which applications and protocols users use, which sites the users access, and who the users try to talk to, raises intriguing and important questions for a number of reasons. For example, the profiling results can reveal regional characteristics of cultural and behavioral patterns, important user usage pattern trends, potential exploitation of security vulnerabilities, early indication of user acceptance of a new product or service, etc. The profiling results can be used for various purposes such as strategic development, product/service marketing, network traffic engineering, security enhancement, etc.
The most common way to answer the above questions is to analyze network traces. However, access to network traces at a global scale is limited, and the processing power required to analyze network traces in large volume makes state-of-the-art packet-level traffic classification tools inapplicable in this scenario.
The Internet is composed of machines (e.g., computers or other devices with Internet access) associated with IP addresses for identifying and communicating with each other on the Internet. The Internet and the IP addresses are well known to those skilled in the art. These machines are called endpoints on the Internet. An Internet endpoint may act as a server, a client, or a peer in the communication activity on the Internet. In the vast majority of scenarios, information about servers, such as the IP address, is publicly available for users to access. In peer-to-peer (p2p) based communication, in which all endpoints can act as both clients and servers, the association between an endpoint and the p2p application becomes publicly visible. Even in the classical client-server communication scenario, information about clients, such as website user access logs, forums, proxy logs, etc., also stays publicly available. Given that many forms of communication and various endpoint behaviors do get captured and archived, an enormous amount of information valuable for profiling or characterizing endpoint behavior at a global scale is publicly available but has not been systematically utilized for such purpose.
In general, in one aspect, the present invention relates to a method of profiling an Internet endpoint associated with an Internet endpoint domain name, the method includes generating a profiling rule using an Internet search engine, obtaining a search result by inputting the Internet endpoint domain name to the Internet search engine, and classifying the Internet endpoint based on the search result using the profiling rule.
In general, in one aspect, in the method, generating the profiling rule includes obtaining a seed search result by inputting a seed set to the Internet search engine, the seed set comprising a plurality of randomly chosen Internet endpoint identifiers, the seed search result comprising a plurality of hit texts and a Uniform Resource Locator (URL) associated with a hit text of the plurality of hit texts, the plurality of hit texts comprising a plurality of phrases, the hit text comprising a phrase of the plurality of phrases, the Internet endpoint identifiers comprising at least one selected from a group consisting of an Internet Protocol (IP) address, a domain name, and an IP prefix, ranking the plurality of phrases to generate a rank of the phrase based on a count of the phrase in the plurality of hit texts, adding the phrase to a key phrase list if the count exceeds a pre-determined threshold, assigning a URL class to the URL if the phrase is added to the key phrase list, the URL class being determined from the phrase based on semantics, determining an IP tag associated with the URL class, the IP tag being determined from the URL class and the phrase based on semantics, and associating the phrase with the URL class in the profiling rule.
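By way of illustration only, the ranking and thresholding of phrases described above might be sketched as follows. The function names `extract_phrases` and `build_key_phrase_list`, the tokenization into single words and bi-words, and the sample hit texts are assumptions introduced for this sketch and are not taken from the specification.

```python
from collections import Counter
import re

def extract_phrases(hit_text):
    """Return lowercase single words and adjacent bi-words from one hit text."""
    words = re.findall(r"[a-z0-9]+", hit_text.lower())
    biwords = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + biwords

def build_key_phrase_list(hit_texts, threshold):
    """Rank phrases by their count across all hit texts and keep
    only those whose count exceeds the pre-determined threshold."""
    counts = Counter()
    for text in hit_texts:
        counts.update(extract_phrases(text))
    # most_common() returns (phrase, count) pairs sorted by descending count
    return [(phrase, n) for phrase, n in counts.most_common() if n > threshold]

# Hypothetical hit texts standing in for a seed search result
sample_hits = ["mail server list", "public mail server", "game server"]
key_phrases = build_key_phrase_list(sample_hits, threshold=1)
```

Assigning a URL class and an IP tag to each retained phrase is a semantic step in the description above and is deliberately left out of this sketch.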
In general, in one aspect, the present invention relates to a computer readable medium, embodying instructions executable by the computer to perform method steps for profiling an Internet endpoint associated with an Internet Protocol (IP) prefix, the instructions comprising functionality for generating a profiling rule using an Internet search engine, obtaining a search result by inputting the IP prefix to the Internet search engine, and classifying the Internet endpoint based on the search result using the profiling rule.
In general, in one aspect, in the computer readable medium, generating the profiling rule includes obtaining a seed search result by inputting a seed set to the Internet search engine, the seed set comprising a plurality of randomly chosen Internet endpoint identifiers, the seed search result comprising a plurality of hit texts and a Uniform Resource Locator (URL) associated with a hit text of the plurality of hit texts, the plurality of hit texts comprising a plurality of phrases, the hit text comprising a phrase of the plurality of phrases, the Internet endpoint identifiers comprising at least one selected from a group consisting of an Internet Protocol (IP) address, a domain name, and an IP prefix, ranking the plurality of phrases to generate a rank of the phrase based on a count of the phrase in the plurality of hit texts, adding the phrase to a key phrase list if the count exceeds a pre-determined threshold, assigning a URL class to the URL if the phrase is added to the key phrase list, the URL class being determined from the phrase based on semantics, determining an IP tag associated with the URL class, the IP tag being determined from the URL class and the phrase based on semantics, and associating the phrase with the URL class in the profiling rule.
In general, in one aspect, the present invention relates to a system for profiling an Internet endpoint associated with an Internet endpoint identifier. The system includes a hardware processor, a profiling rule generator executing on the hardware processor and configured to generate a profiling rule using an Internet search engine, and a profiler operatively coupled to the profiling rule generator, executing on the hardware processor, and configured to obtain a search result by inputting the Internet endpoint identifier as a search phrase to the Internet search engine and classify the Internet endpoint based on the search result using the profiling rule.
Other aspects and advantages of the invention will be apparent from the following description and the appended claims.
Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.
In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. In other instances, well-known features have not been described in detail to avoid obscuring the invention.
In one or more embodiments of the invention, the Internet endpoint identifier (101) is an IP address, or a portion thereof, of the Internet endpoint. Generally, the IP address of the Internet endpoint is interpreted as composed of two parts: a network-identifying prefix (i.e., IP prefix of the Internet endpoint) followed by a host identifier within that network.
In one or more embodiments of the invention, the Internet endpoint identifier (101) is an IP prefix of the Internet endpoint. For example, the IP prefix of the Internet endpoint may be determined by correlating the IP address of the Internet endpoint with BGP (Border Gateway Protocol) or WHOIS protocol to find a longest matching prefix. In one or more embodiments, BGP routing tables are used to find the longest matching prefix corresponding to an IP-address. In one or more embodiments WHOIS is used to find the longest prefix registered by an entity, which matches this IP-address.
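By way of illustration only, finding the longest matching prefix for an IP address might be sketched as follows. The function name `longest_matching_prefix` is hypothetical, and the list of candidate prefixes is assumed to have been extracted beforehand from a BGP routing table or from WHOIS records; obtaining that list is outside the scope of this sketch.

```python
import ipaddress

def longest_matching_prefix(ip, prefixes):
    """Return the most specific prefix in `prefixes` that contains `ip`,
    or None if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        # A longer prefix length means a more specific (smaller) network
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best

# Hypothetical candidate prefixes (assumed to come from BGP or WHOIS data)
candidates = ["200.0.0.0/8", "200.101.0.0/16", "200.101.18.0/24", "10.0.0.0/8"]
match = longest_matching_prefix("200.101.18.182", candidates)
```

A linear scan is used here for clarity; a production routing lookup would more likely use a trie, which is a design choice not addressed in the description above.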
In one or more embodiments of the invention, the Internet endpoint identifier (101) is a domain name of the Internet endpoint. The Internet endpoint domain name may be a hostname, a top level domain name, a second level domain name, a registered domain name provided by domain name registrars, or other forms of domain names known to one skilled in the art. For example, the Internet endpoint domain name may be determined by performing a reverse domain name service (DNS) query using an IP address of the Internet endpoint.
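By way of illustration only, a reverse DNS query for an IP address might be sketched as follows using the Python standard library. The function names `ptr_query_name` and `reverse_lookup` are hypothetical. The first function only constructs the name that a reverse (PTR) query would ask for; the second performs an actual lookup, whose result depends on the resolver and network in use and may be empty for many addresses.

```python
import ipaddress
import socket

def ptr_query_name(ip):
    """Build the in-addr.arpa name that a reverse DNS query would look up."""
    return ipaddress.ip_address(ip).reverse_pointer

def reverse_lookup(ip):
    """Return the hostname registered for `ip`, or None if the lookup fails.
    Requires network access; the result depends on the live DNS."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None

query_name = ptr_query_name("200.101.18.182")
```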
An example of Internet endpoint identifier (101) used for illustration below includes 200.101.18.182. An example of Internet search engine (102) that may be used in system (100) includes Google™ search engine, which is a product by Google Corporation. An Internet search engine typically includes the following: (1) Crawler, which collects a snapshot of all the web-pages in at least a portion of the Internet (e.g., World Wide Web), (2) Indexer, which uses the snapshot collected by the crawler to build a reverse-index, defined as a mapping of phrases or words to the web-pages they occur in, (3) Search Infrastructure, where a search query (defined as a combination of phrases or words via operators such as OR, AND, NOT, etc.) is then resolved by making use of the reverse-index. All the documents that match a query are then ranked via pre-determined criteria such as popularity of the web-page, relevance of the query with respect to the web-page in terms of where the query phrases/words appear in the document, etc. Finally, all the web-pages matching a query are returned to the user, where for each result, the user sees the URL for the web-page and the hit-text defined as a portion of the web-page text with the best match with the query.
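By way of illustration only, the reverse-index and the resolution of an AND query described above might be sketched as follows. This is a toy model and not a description of how any actual search engine is implemented; the function names `build_reverse_index` and `search_and`, the whitespace tokenization, and the sample pages and URLs are assumptions introduced for this sketch.

```python
from collections import defaultdict

def build_reverse_index(pages):
    """Map each word to the set of page URLs in which it occurs.
    `pages` is a dict of URL -> page text (a stand-in for a crawler snapshot)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search_and(index, words):
    """Resolve a query that combines all `words` with the AND operator."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()

# Hypothetical snapshot of three web pages
snapshot = {
    "http://a.example/log": "client 200.101.18.182 connected",
    "http://b.example/list": "proxy list 200.101.18.182",
    "http://c.example/about": "proxy software",
}
index = build_reverse_index(snapshot)
```

Ranking of the matching pages and extraction of the hit text are omitted here.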
In the system (100), the Internet search engine (102) receives the Internet endpoint identifier (101) and generates a search result (103), which may be input into the Internet endpoint profiler (104) for generating the IP tag (108) that characterizes the Internet endpoint corresponding to the Internet endpoint identifier (101). The Internet endpoint profiler (104) uses the search result (103) received from the Internet search engine (102) to configure a website cache (108). The Internet endpoint profiler (104) also includes a rapid match module (105) and an IP tagging module (107). As shown in
Further as shown in
The hit text may include one or more phrases (e.g., a word, bi-word, or other word combination forming a phrase), from which a key phrase list (104a) may be formed based on a ranking scheme. The website cache (108) has multiple entries. Each website cache entry may include a search result domain name (e.g., domain name (108a)) and an associated key phrase (e.g., key phrase (108b)) from the key phrase list (104a). Generating the key phrase list (104a) to form a profiling rule and configuring the website cache (108) are described in more detail in reference to
Example IP prefixes containing the IP address “200.101.18.182” include “200.101.18.0/24” containing total 2^8 IP addresses, “200.101.0.0/16” containing total 2^16 IP addresses, “200.0.0.0/8” containing total 2^24 IP addresses, etc. For example, if the IP prefix “200.101.18.0/24” is used as the search phrase, the collection, or a portion thereof, of the 2^8 IP addresses “200.101.18.0” through “200.101.18.255” are used as search phrases and inputted into the search engine one at a time to obtain multiple search result entries in the search result (103).
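By way of illustration only, expanding an IP prefix into the individual IP addresses to be used as search phrases might be sketched as follows. The function name `search_phrases_for_prefix` is hypothetical; submitting each phrase to a search engine is not shown.

```python
import ipaddress

def search_phrases_for_prefix(prefix):
    """List every IP address covered by `prefix`, as strings,
    so that each can be submitted as a search phrase one at a time."""
    return [str(addr) for addr in ipaddress.ip_network(prefix)]

phrases = search_phrases_for_prefix("200.101.18.0/24")

# Address counts for the example prefixes
count_16 = ipaddress.ip_network("200.101.0.0/16").num_addresses
count_8 = ipaddress.ip_network("200.0.0.0/8").num_addresses
```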
If the domain name does not contain a phrase in the key phrase list in the comparison of Step 222, the domain name is looked up, for example in a list, table, cache, or other suitable data structure (e.g., website cache (108)). If the domain name is not found in any entry of the data structure (Step 224), a new entry containing the domain name is added to the data structure (Step 226). This new entry may start out with a null value as the key phrase. The Internet endpoint may not be classified at this point. A counter may be initialized for tracking additional Internet endpoint identifiers inputted into the Internet search engine that also come up with the same domain name in the corresponding search results. Once the occurrence of this domain name exceeds a pre-determined threshold, the value of this new entry is determined, to complete the new entry, based on all related phrases associated with this domain name from these search results (Step 228). Accordingly, the previous Internet endpoint, which produced this domain name in the search results may now be classified by the completed entry in the data structure (e.g., website cache (108)) (Step 225).
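By way of illustration only, the handling of a search result domain name described above might be sketched as follows. The class name `WebsiteCache`, the method name `observe`, and the data layout are hypothetical. The description does not specify how the value of a new entry is derived from the related phrases; this sketch assumes, purely for illustration, that the most frequently observed key phrase is chosen.

```python
from collections import Counter

class WebsiteCache:
    def __init__(self, key_phrases, threshold):
        self.key_phrases = key_phrases
        self.threshold = threshold
        self.entries = {}       # domain name -> key phrase (None until completed)
        self.phrases_seen = {}  # domain name -> Counter of key phrases observed
        self.occurrences = Counter()

    def observe(self, domain, hit_phrases):
        """Process one search result entry; return a key phrase that
        classifies the endpoint, or None if it cannot be classified yet."""
        # Step 222: does the domain name itself contain a key phrase?
        for phrase in self.key_phrases:
            if phrase in domain:
                return phrase
        # Steps 224 and 226: add a new entry with a null key phrase
        if domain not in self.entries:
            self.entries[domain] = None
            self.phrases_seen[domain] = Counter()
        if self.entries[domain] is not None:
            return self.entries[domain]  # Step 225: entry already completed
        self.occurrences[domain] += 1
        self.phrases_seen[domain].update(
            p for p in hit_phrases if p in self.key_phrases)
        # Step 228: complete the entry once the occurrence count exceeds
        # the threshold (assumption: pick the most frequent key phrase)
        if self.occurrences[domain] > self.threshold and self.phrases_seen[domain]:
            self.entries[domain] = self.phrases_seen[domain].most_common(1)[0][0]
        return self.entries[domain]

cache = WebsiteCache(key_phrases=["forum", "proxy"], threshold=2)
```

In this sketch, endpoints that produced the domain name before the entry was completed would be classified by looking the domain name up again after completion.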
While the key phrases in these exemplary search results are determined based on specific method steps described with respect to
Exemplary Internet endpoint profiling has been conducted over a large collection of associated IP addresses. TABLE 2 shows the networks in four geographical areas: Asia (China), South America (Brazil), North America (US), and Europe (France). The Asian and South American Internet service provider (ISP) networks studied serve IP addresses in the /17 and /18 range, while the North American and European ISP networks studied serve larger IP address ranges. The “/XX” notation used here specifies an IP prefix including 2^(32-XX) IP addresses (e.g., a /24 prefix includes 2^8 IP addresses). Some of the IP ranges in TABLE 2 are anonymized for privacy reasons.
An exemplary website cache (e.g., described with respect to
An exemplary key phrase list (e.g., described with respect to
Although the method of Internet endpoint profiling described above does not require network traffic traces, available network traces allow other classification methods requiring network traces to be compared to the method of the present invention. A graphlet based approach for classifying network traffic known as “BLINC” to those skilled in the art is described in T. Karagiannis et al., “Multilevel Traffic Classification in the Dark,” ACM SIGCOMM, 2005. TABLE 5 shows an exemplary comparison for the South American region comparing profiling results using the method of the present invention against that of BLINC based on available network traces. In TABLE 5, URL classes are listed in the leftmost column using notations described in the bottom row of TABLE 5. Profiling results based on the entire network trace are listed under the heading “Pkt. trace”. Profiling results based on the one percent sampled entire network trace are listed under the heading “1:100 Sampled trace”. In each case, the total number of endpoints classified in each URL class is listed under the heading “Tot.” and is further broken down into three categories listed under the headings “B∩U”, “B-U”, and “U-B”. The notation “B∩U” represents endpoints classified by both BLINC and the method of the present invention. The notation “B-U” represents endpoints classified by BLINC but not by the method of the present invention. The notation “U-B” represents endpoints classified by the method of the present invention but not by BLINC.
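By way of illustration only, the breakdown into the three categories “B∩U”, “B-U”, and “U-B” might be computed with set operations as follows. The function name `compare_classifications` and the sample endpoint addresses are hypothetical and do not reproduce any data from TABLE 5.

```python
def compare_classifications(blinc_endpoints, method_endpoints):
    """Split classified endpoints into the three categories used in the
    comparison: classified by both, by BLINC only, and by the method only."""
    b, u = set(blinc_endpoints), set(method_endpoints)
    return {
        "B∩U": b & u,  # classified by both approaches
        "B-U": b - u,  # classified by BLINC only
        "U-B": u - b,  # classified by the search-based method only
    }

# Hypothetical endpoints classified into one URL class by each approach
blinc = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
method = ["192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5"]
breakdown = compare_classifications(blinc, method)
total = len(set(blinc) | set(method))
```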
Furthermore, although the method of Internet endpoint profiling described above does not require network traffic traces, the method can be applied to classify network traffic traces for a pre-determined region, for example Asia, South America, etc. First, top-ranked IP addresses determined based on a pre-determined criterion (e.g., top 5% of IP addresses ranked by traffic flow distribution of the Internet such as associated traffic flows of the South American region) are tagged using the method described above to generate a collection of IP tags. Based on a study using available network traffic traces in the South American region, it is discovered that the majority of all IP network traffic relates to the top 5% of IP addresses. Next, a set of IP tags is selected or otherwise identified from the collection of IP tags based on semantics. For example, a set of server tags can be identified by considering IP tags from the collection relating to server activities. Exemplary server tags are shown in the left column in TABLE 6. This concentrated distribution in the empirical statistics is then used to effectively classify traffic flows based on this relatively small number of server tags. For example, two endpoints (i.e., source endpoint and destination endpoint) for each traffic flow trace to be classified are tagged and compared with the server tags. The traffic flow trace can be classified if either of the two endpoints, when tagged, matches any of the server tags. Due to the concentrated distribution of the empirical statistics, in the majority of cases the traffic flow trace can be classified based on semantics according to a server tag matching the tag of either of the two endpoints. TABLE 6 shows exemplary server tags and corresponding traffic classifications (shown in the right column in TABLE 6). For example, if an endpoint of the traffic flow trace is identified as being tagged with “website”, the traffic flow trace is classified as “Browsing”.
It will be understood from the foregoing description that various modifications and changes may be made in the preferred and alternative embodiments of the present invention without departing from its true spirit. For example, although the examples given above relate to a TCP/IP or an OSI network data model and Google™ search engine, the invention may be applied to other network data models and/or Internet search engines known to one skilled in the art. Furthermore, the scope of the hit text may be supplemented by variations of the examples described or include subset or superset of the examples given above, the method may be performed in a different sequence, the components provided may be integrated or separate, the devices included herein may be manually and/or automatically activated to perform the desired operation. The activation (e.g., applying the seed set, generating the profiling rule, inputting the IP address of the endpoint, classifying the endpoint, etc.) may be performed as desired and/or based on data generated, conditions detected and/or analysis of results from the network traffic.
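By way of illustration only, classifying a traffic flow trace by matching the tags of its two endpoints against a set of server tags might be sketched as follows. Only the mapping from “website” to “Browsing” is taken from the description above; the function name `classify_flow`, the data layout of `ip_tags`, and the sample addresses are assumptions introduced for this sketch.

```python
# Server tag -> traffic class. Only this one pair appears in the description;
# further pairs would come from TABLE 6.
SERVER_TAG_TO_CLASS = {"website": "Browsing"}

def classify_flow(src_ip, dst_ip, ip_tags, tag_to_class=SERVER_TAG_TO_CLASS):
    """Return the traffic class of a flow if either endpoint carries a
    server tag, or None if neither endpoint matches any server tag."""
    for ip in (src_ip, dst_ip):
        for tag in ip_tags.get(ip, ()):
            if tag in tag_to_class:
                return tag_to_class[tag]
    return None

# Hypothetical IP tags produced by profiling the top-ranked IP addresses
ip_tags = {"198.51.100.7": {"website"}}
flow_class = classify_flow("203.0.113.5", "198.51.100.7", ip_tags)
```

Because only the top-ranked IP addresses need to be tagged in advance, the lookup table in this sketch stays small relative to the number of flows it can classify.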
Embodiments of the invention may be implemented on virtually any type of computer regardless of the platform being used. For example, as shown in
Further, those skilled in the art will appreciate that one or more elements of the aforementioned computer system (400) may be located at a remote location and connected to the other elements over a network. Further, embodiments of the invention may be implemented on a distributed system having a plurality of nodes, where each portion of the invention (e.g., various modules of
This description is intended for purposes of illustration only and should not be construed in a limiting sense. The scope of this invention should be determined only by the language of the claims that follow. The term “comprising” within the claims is intended to mean “including at least” such that the recited listing of elements in a claim is an open group. “A,” “an” and other singular terms are intended to include the plural forms thereof unless specifically excluded. While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.
This application is a continuation in part application of U.S. patent application Ser. No. 12/104,723 filed Apr. 17, 2008 and entitled “System and Method For Internet Endpoint Profiling,” now issued as U.S. Pat. No. 8,019,764 to which this application claims benefit.
Number | Name | Date | Kind |
---|---|---|---|
6721269 | Cao et al. | Apr 2004 | B2 |
7801896 | Szabo | Sep 2010 | B2 |
20070061266 | Moore et al. | Mar 2007 | A1 |
Entry |
---|
Louis Plissonneau, Jean-Laurent Costeux, & Patrick Brown, “Analysis of Peer-to-Peer Traffic on ADSL” Pam2005, LNCS 3431, pp. 69-82, 2005. |
Anthony McGregor, Mark Hall, Perry Lorier & James Brunskill, “Flow Clustering Using Machine Learning Techniques”. |
Andrew Moore & Konstantina Papagiannaki, “Toward the Accurate Identification of Network Applications”. |
Peter Cheeseman & John Stutz, “Bayesian Classification (Autoclass): Theory and Results”. |
Laurent Bernaille, Renata Teixeira & Kave Salamatian “Early Application Identification”. |
Google Webmaster Help Center: Google 101: How Google Crawls, indexes and Serves the web. |
Alan Mislove, Massimiliano Marcon, Peter Druschel, Bobby Bhattacharjee, & Krishna Gummadi “Measurement and Analysis of Online Social Networks”. |
Jeffrey Erman, Martin Arlitt & Anirban Mahanti, “Traffic Classification Using Clustering Algorithms”. |
M. Patrick Collins, Timothy Shimeall, Sidney Faber, Jeff Janies, Rhiannon Weaver, Markus Shon “Using Uncleanliness to Predict Future Botnet Addresses”. |
Matthew Roughan, Subhabrata Sen, Oliver Spatscheck, Nick Duffield “Class-of-Service Mapping for QoS: A Statistical Signature-based approach to IP Traffic Classification”. |
Harsha Madhyastha, Tomas Isdal, Michael Piatek, Colin Dixon, Thomas Anderson, Arvind Krishnamurthy, Arun Venkataramani “iPlane: An Information Plane for Distributed Services”. |
Patrick Haffner, Subhabrata Sen, Oliver Spatscheck, Dongmei Wang “ACAS:Automated Construction of Application Signatures”. |
Thomas Karagiannis, Konstantina Papagiannaki, Nina Taft & Michalis Faloutsos “Profiling the End Host”. |
Subhabrata Sen, Oliver Spatscheck, Dongmei Wang “Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures”. |
Patrick Verkaik, Oliver Spatscheck, Jacobus Van Der Merwe & Alex C. Snoeren, “Primed: Community-of-interest-based DDOS mitigation” AT&T Lab Research, UC San Diego. |
Yinglian Xie, Fang Yu, Kannan Achan, Eliot Gillum, Moises Goldszmidt, & Ted Wobber, “How Dynamic are IP Addresses?” Microsoft Research, Silicon Valley. |
Jian Liang, Rakesh Kumar, Yongjian Xi, & Keith Ross, “Pollution in P2P File Sharing Systems”. |
Thomas Karagiannis, Konstantina Papagiannaki & Michalis Faloutsos, “BLINC: Multilevel Traffic Classification in the Dark”. |
Thomas Karagiannis, Andre Broido, Nevil Brownlee, KC Claffy, & Michalis Faloutsos, “Is P2P Dying or Just Hiding?”. |
Jianning Mai, Chen-Nee Chuah, Ashwin Sridharan, Tao Ye, Hui Zang, “Is Sampled Data Sufficient for Anomaly Detection?”. |
Patrick McDaniel, Subhabrata Sen, Oliver Spatscheck, Jacobus Van Der Merwe, Bill Aiello, Charles Kalmanek, “Enterprise Security: A Community of Interest Based Approach”. |
Andrew Moore & Denis Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques”. |
“Unconstrained Endpoint Profiling (Googling the Internet)” Sigcomm 2008 Submission #103, 14 Pages. |
Number | Date | Country | |
---|---|---|---|
Parent | 12104723 | Apr 2008 | US |
Child | 13072173 | US |