The present invention relates to a method or algorithm for differentiating between browser links (or URL's) actually visited on a page and those which are simply embedded on a given Web site.
Web Browsing has become a part of everyday life. At work one may use a Web Browser to access e-mail, interact with customers, or look up information on the Internet. Children use the Web and thus Web Browsers to review assignments from class, turn in homework, or simply socialize with their friends. In the home, people use Web Browsers to read news, manage bills, or plan a vacation, among other uses.
The effectiveness of Web based advertising is an important question with significant economic implications. Businesses such as Google have been extremely successful based on Web based advertising models. In the Prior Art, it was relatively straightforward to count the number of times a specific web page had been downloaded to a device. Such counting may be accomplished using techniques that prevent web pages from being cached, effectively allowing the server to count every time the page is downloaded (referred to as “hits”). But if references to a web site are embedded in other web sites, the question remains how many of these “hits” are counted because a user requested that the URL (Uniform Resource Locator) be downloaded, and how many are counted merely because the URL was present in another web page. Prior Art techniques for counting “hits” may thus be inaccurate, and advertisers may be charged improperly for advertising services. For businesses to understand the value of using embedded links for advertising, it would be valuable to know how frequently URLs presented to users are visited.
Another problem with Prior Art web browsing relates to parental monitoring of Web usage. Many web sites, while they themselves may be harmless, may include embedded links that may not be appropriate for children. In the Prior Art, parents may be able to block specific web sites using parental blocking software or services. However, such blocking software may block entire websites only, thus preventing access to web pages with content acceptable for children as well as to more objectionable material. For example, research and encyclopedia sites may contain web pages with information that a child may wish to access to complete a homework assignment or paper. However, links within such pages may lead to other pages with objectionable images or adult content. It would be useful to allow a child to selectively visit a page with non-objectionable material, even if the page contains links to objectionable material, while at the same time blocking links to the objectionable material pages. It would also be useful to parents to know if a particular web page was actually selected by the user, or if it was downloaded only because that particular page was referenced by an embedded URL.
For businesses to understand the value of using embedded links for advertising, it would be valuable to know how frequently URLs presented to users are visited. The present invention provides a method and algorithm for determining if a URL was simply presented to the user or if it was actually visited by the user. The power of this method is that, given a few pieces of data, a determination can be made whether the user actually clicked the link or whether the link was downloaded merely because the user visited a site in which it is embedded.
With regard to parental monitoring of Web usage, the algorithm and method of the present invention may be used in an application to provide information to parents indicating whether a particular web page was actually selected by the user or if it was downloaded only because it was an embedded URL. This information may also be used within parental blocking software to allow access to web pages that may contain content appropriate for children, while blocking links on such pages which may lead to inappropriate material.
The present invention includes a method and apparatus for differentiating between browser links (or URL's) actually visited on a page versus those links which are simply embedded on a given Web site. Embedded URL's are downloaded simply because they exist on an accessed page, not because they have been specifically requested by the browser user (examples of embedded URL's include but are not limited to images, ads, style-sheets, and the like). In particular, the present invention is directed at classifying browser links for data mining, security, and other purposes.
The method of the invention uses existing browser histories and packet processing to determine the reason the web browser is accessing the requested URL. The result of this classification may be used for different purposes, such as saving URL history and classification for later upload to a server, or for blocking of URL loading and/or display on a user device.
The method or algorithm for classifying downloaded links or URLs is based on the reason behind the download. Downloads are classified into categories, for example, a “visited” URL or an “embedded” URL. Categorizing these downloads allows other applications to collect information for storage, upload, or other action. The algorithm of the present invention uses information from the browser history and packet streams to obtain and categorize the links or URL's for classification.
For the purposes of this description, a “requested” URL is defined as any URL being accessed through an HTTP (Hyper-Text Transfer Protocol) request from the web browser. A “visited” URL is the actual URL being visited by the user. An “embedded” URL is any URL that is requested while loading a visited URL, for example, images, ads, or style-sheets.
HTTP requests contain two descriptive fields used in the classification algorithm. The first of these fields is the “Host” field. This field is required in an HTTP request and gives the address that is hosting the current requested URL. The second of these fields is the “Referer” field, which is the address that referred the browser or user to the current requested URL. The “Referer” field is optional in HTTP requests.
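As an illustration of the two fields described above, the following is a minimal sketch of extracting the “Host” and “Referer” fields from an intercepted HTTP/1.x request. The function name `parse_request_fields` and the returned dictionary layout are assumptions made for this example only; they are not part of the described method, and the packet interception itself is outside the scope of the sketch.

```python
from typing import Dict, Optional


def parse_request_fields(raw_request: bytes) -> Dict[str, Optional[str]]:
    """Extract the request target, Host, and optional Referer fields.

    Hypothetical helper: assumes raw_request holds one complete
    HTTP/1.x request head (request line plus header fields).
    """
    # The header section ends at the first blank line.
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")

    # Request line, e.g. "GET /index.html HTTP/1.1".
    method, target, _version = lines[0].split(" ", 2)

    # Header field names are case-insensitive, so store them lowercased.
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    return {
        "method": method,
        "target": target,
        "host": headers.get("host"),        # required in HTTP/1.1 requests
        "referer": headers.get("referer"),  # optional; None when absent
    }
```

A request without a “Referer” field simply yields `None` for that entry, which corresponds to the case the classification handles separately.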
The algorithm of the present invention classifies the request into either a “visited” URL or “embedded” URL using these fields and allows for storage into one or more databases. These databases can be remotely or locally located and can take many different forms. The database for “visited” URL's is represented by component 350 of
Packets received on a device implementing this algorithm are intercepted in a device specific manner. Packets may be analyzed directly or duplicated and provided to the algorithm (component 330 of
If the requested URL is not first, as determined by step 410, then the domain is compared against the “stored domain” in step 420. If the domains are the same, and the requested URL is not in the browser history as determined in step 440, then it is determined that the requested URL is an “embedded” URL and database 340 may be updated. If the requested URL is in the browser history, as determined in step 440, then the requested URL is classified as a “visited” URL in database 350.
If the domain of the requested URL is different from the “stored domain”, as determined in step 420, then the optional “Referer” field may be examined in step 450. If the “Referer” field does not exist in the HTTP request, and the requested URL appears in the browser history, as determined in step 460, then the requested URL is classified as a “visited” URL and database 350 is updated. If the “Referer” field does not exist in the HTTP request, as determined by step 450, and the requested URL is not in the browser history, as determined in step 460, then the requested URL is classified as an “embedded” URL and database 340 is updated.
If the “Referer” field exists in the HTTP request, as determined in step 450, then the domain of the referer (the “referer domain”) is compared against the “stored domain” in step 470. If they are the same, and the requested URL is in the browser history, then the requested URL is classified as a “visited” URL and database 350 is updated. If the “stored domain” and the “referer domain” are the same, as determined in step 470, but the requested URL is not in the browser history, then the requested URL is classified as an “embedded” URL and database 340 is updated.
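The decision steps described above may be sketched as follows. This is an illustrative sketch only, not a definitive implementation. The class and function names are hypothetical. Two behaviors are assumptions not stated in the steps above: the first request is treated as “visited” and its domain becomes the “stored domain”, and a request whose “referer domain” differs from the “stored domain” is returned as undetermined. The sketch also never updates the “stored domain” after the first request, which is a further simplifying assumption.

```python
from typing import Optional, Set
from urllib.parse import urlsplit

VISITED = "visited"            # would update database 350
EMBEDDED = "embedded"          # would update database 340
UNDETERMINED = "undetermined"  # branch not covered by the steps above


def domain_of(url_or_host: str) -> str:
    """Return the lowercased host name of a URL or bare Host value."""
    text = url_or_host if "//" in url_or_host else "//" + url_or_host
    return (urlsplit(text).hostname or "").lower()


class UrlClassifier:
    """Hypothetical classifier following steps 410 through 470."""

    def __init__(self, browser_history: Set[str]) -> None:
        self.browser_history = browser_history
        self.stored_domain: Optional[str] = None

    def classify(self, url: str, host: str,
                 referer: Optional[str] = None) -> str:
        request_domain = domain_of(host)
        in_history = url in self.browser_history

        # Step 410: is this the first requested URL?
        if self.stored_domain is None:
            # Assumption: the first request sets the stored domain
            # and is treated as a visited URL.
            self.stored_domain = request_domain
            return VISITED

        # Step 420: same domain as the stored domain?
        if request_domain == self.stored_domain:
            # Step 440: browser history decides.
            return VISITED if in_history else EMBEDDED

        # Step 450: does the optional Referer field exist?
        if referer is None:
            # Step 460: browser history decides.
            return VISITED if in_history else EMBEDDED

        # Step 470: compare the referer domain with the stored domain.
        if domain_of(referer) == self.stored_domain:
            return VISITED if in_history else EMBEDDED

        # Assumption: this branch is not described above.
        return UNDETERMINED
```

For example, with a browser history containing only `http://www.example.com/`, a first request for that URL is classified as “visited”, and a subsequent request for `http://www.example.com/logo.png` on the same domain is classified as “embedded” because it does not appear in the history.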
Referring to
Referring back to
Referring back to
The examples illustrated in
For parental control or other types of access restriction software, the algorithm may be used to allow a user to access a page containing an embedded URL which is on a blacklist, while preventing the user from visiting the blacklisted page itself. As the user browses the web, the URLs are classified according to the algorithm 330. If a URL is determined to be an embedded URL 340, the user's access to a page with that embedded URL may be allowed. However, if the URL is a visit (or attempted visit) to a blacklisted URL (determined by comparing the visited URL database 350 with a predetermined blacklist database), then access to that URL may be denied or logged. In addition, the present invention may be used by web crawlers or the like to determine whether a blacklisted URL is embedded in another web page, in order to determine whether additional web pages should be black-listed.
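The access decision described above may be sketched as a small function that takes the classification result and a blacklist. The function name `decide_access`, the use of an in-memory set for the blacklist, and the list used as a log are all assumptions for illustration; the description does not specify how the blacklist or log is stored.

```python
from typing import List, Set


def decide_access(url: str, classification: str,
                  blacklist: Set[str], log: List[str]) -> bool:
    """Return True if the request may proceed, False if it is denied.

    Hypothetical policy: only a *visited* (or attempted visit to a)
    blacklisted URL is denied and logged. A blacklisted URL that is
    merely embedded in another page does not block that page.
    """
    if classification == "visited" and url in blacklist:
        log.append(url)  # record the denied visit for later review
        return False
    return True
```

Under this sketch, a page that merely embeds a blacklisted URL remains accessible, while a direct visit to the blacklisted URL is denied and recorded.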
While the preferred embodiment and various alternative embodiments of the invention have been disclosed and described in detail herein, it may be apparent to those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope thereof.