This invention relates to servers for generating a feed of updating content, to corresponding methods of generating such feeds, and corresponding apparatus and software.
There are many websites on the world wide web that publish feeds of updating content via RSS (or similar). These feeds let internet users, or third party services, automatically monitor them as a simple method of detecting when new content is available. However, there are many other websites that do not publish convenient RSS feeds, yet still contain updating information that internet users, or third party services, would wish to keep up to date with.
There are a number of existing services that attempt to alleviate this problem. For example, there are web services that allow users to configure one or more URLs that will be periodically checked for new or changed content. When a change has been detected, the user is notified of the change, typically either by email or via an RSS feed populated with the new contents as they are detected. Examples of this kind of service include http://pape2rss.com and http://www.infominder.com. There are other services available that take this a step further and allow the users to specify filters or specific fields within a web page to limit how much of the page is monitored for change. An example of this kind of page filtering can be found at http://femtoo.com.
All of these services only go as far as monitoring the page itself for change, and notifying the users of the changes to the page, sometimes additionally conveying information about what that change was. The present applicants have recognised the need for an improved method and system for creating feeds of updating content.
The present invention provides a system and method for the easy creation of feeds of updating content. In the first aspect the user is able to control which sub-regions of a target web page are monitored for change. In the second aspect, the articles provide content in the feed which is rich and relevant. Both aspects may be combined to provide a rich and relevant feed about user selected elements.
The updates may be sent to a user device, directly or via a publishing server. The user device may be a mobile device, which may be any kind of mobile computing device, including laptop and hand held computers, portable music players, portable multimedia players and mobile phones. On such a device, the display screen may have limited space. The present invention addresses this problem in two ways: first, by restricting the update to a sub-region of the target web page and, secondly, by outputting an article based on an updated link within the sub-region. Such an article contains a summary of the updated link rather than simply the full web page of the link. Both the specification of the sub-region and the template for the summary can be user-defined, which permits a better user experience.
According to a first aspect of the invention, there is provided a system and method for monitoring changes to a target web page.
The system is configured to display said target web page to a user; receive a user specification of at least one sub-region within said displayed target web page; download, at a subsequent time, said target web page associated with said user specification; identify said at least one sub-region of said user specification within said target web page; determine whether or not there have been any changes to said at least one sub-region, and if there are any changes, output an update comprising data from said at least one sub-region.
The system may comprise a region selection tool which is configured to carry out the displaying and receiving steps. The system may comprise a target page crawler which is configured to carry out the identifying, determining and outputting steps. The region selection tool may receive the user specification from a user and may output the user specification to the target page crawler. Alternatively, the region selection tool may output the user specification to a specification database and the target page crawler may access the user specification from the specification database.
The system may further comprise an article crawler which is configured to download a new web page associated with said new link; generate an article derived from said new web page; and output said article as said update. The region selection tool may be configured to display said new web page to a user; and receive a user defined template of an article to be based on said new web page from said user. The region selection tool may output the user defined template to the article crawler for the generation of the article.
According to a second aspect of the invention, there is provided a system and method for providing articles containing updates about links within a sub-region of a target web page.
The system is configured to access a user sub-region specification specifying a target web page and at least one of said sub-regions within said target web page; download said target web page associated with said user specification; identify said at least one sub-region of said user specification within said target web page; determine whether or not there is a new link within said at least one sub-region, and if there is a new link, download a new web page associated with said new link; generate an article derived from said new web page; and output said article as an update.
The system may comprise a target page crawler configured to carry out the accessing, downloading, identifying and determining steps and an article crawler which is configured to carry out the downloading of the new web page, generating and outputting steps. The target page crawler may output the new link to the article crawler.
According to the combined aspect of the invention, there is provided a system which is configured to display said target web page to a user; receive a user specification of at least one sub-region within said displayed target web page; download, at a subsequent time, said target web page associated with said user specification; identify said at least one sub-region of said user specification within said target web page; determine whether or not there is a new link within said at least one sub-region, and if there is a new link, download a new web page associated with said new link; generate an article derived from said new web page; and output said article as an update.
The system may comprise a region selection tool which is configured to carry out the displaying and receiving steps, a target page crawler configured to carry out the downloading, identifying and determining steps and an article crawler which is configured to carry out the downloading of the new web page, generating and outputting steps.
In each embodiment of the invention, the region selection tool, target page crawler and article crawler may be implemented as modules on a single server or a plurality of interconnected servers. One or more of the modules may be provided on a user device.
The invention further provides processor control code to implement the above-described systems and methods, for example on a general purpose computer system or on a digital signal processor (DSP). The code is provided on a physical data carrier such as a disk, CD- or DVD-ROM, programmed memory such as non-volatile memory (eg Flash) or read-only memory (Firmware). Code (and/or data) to implement embodiments of the invention may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as C, or assembly code. As the skilled person will appreciate such code and/or data may be distributed between a plurality of coupled components in communication with one another.
The invention is diagrammatically illustrated, by way of example, in the accompanying drawings, in which:
a is a schematic block diagram of the components of one arrangement of the system;
b is a schematic block diagram of the components of one arrangement of the system;
Furthermore, as shown in
The overall topology of the components of the system of the invention is illustrated in
The system also comprises a target page crawler 26 which is shown as connected to the sub-region specification database 22. The target page crawler 26 may thus access sub-region specifications from the database, e.g. in an automated manner. Alternatively, the target page crawler 26 may be connected to the region selection tool 20 to receive the sub-region specification direct. As described in more detail with reference to
The target page crawler may be termed a page-region monitor tool. This is a service that periodically monitors one or more web pages and sub-region specifications. The service could be deployed as software running on servers in support of many users, each with several different web pages to monitor. Alternatively, the service could be run on the personal computer of the user, either when activated by the user, or automatically in the background on a periodic basis.
The target page crawler 26 is also connected to an article crawler 30 to which it passes new links. The article crawler 30 also comprises a crawler service which crawls the contents of the identified new links, generates new items (articles) in the feed from the content of the crawled links as described in more detail with reference to
This system can be formed of many servers and databases distributed across a network, or in principle they can be consolidated at a single location or machine. In the arrangement of
A plurality of users connected to the Internet via desktop computers 12 or mobile devices 10 can receive a feed from the feed publisher. The users receiving a feed (‘mobile users’) on mobile devices may alternatively be connected to a wireless network managed by a network operator, which is in turn connected to the Internet via a WAP gateway, IP router or other similar device (not shown explicitly). In the arrangement of
b also shows an additional component, a crawler monitor 50, which is provided to monitor the outputs of the target page crawler or the article crawler to detect breakages. A breakage might occur if the target web pages were to change in format or structure, thus potentially invalidating the region specification (e.g. an XPath) for monitoring. This crawler monitoring component could be used to notify (e.g. by email or SMS) the maintainers of the target page crawler or the article crawler and/or the user.
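By way of illustration only, the breakage check performed by such a crawler monitor might be sketched as follows. The function name `check_region`, the returned messages and the use of Python's `xml.etree.ElementTree` (which requires well-formed markup and supports only a subset of XPath) are assumptions made for this sketch and are not part of the described system.

```python
import xml.etree.ElementTree as ET


def check_region(page_xml, region_path):
    """Return None if the region still resolves, else a short breakage message.

    Hypothetical helper: assumes the page is well-formed XHTML and that
    region_path uses the limited XPath subset supported by ElementTree.
    """
    try:
        root = ET.fromstring(page_xml)
    except ET.ParseError:
        return "page could not be parsed"
    if root.find(region_path) is None:
        # The page structure changed so that the stored specification
        # no longer matches anything: a candidate breakage.
        return "region not found"
    return None


if __name__ == "__main__":
    page = "<html><body><section><p>x</p></section></body></html>"
    problem = check_region(page, "./body/div")
    if problem:
        print("notify maintainer:", problem)
```

A non-None result could then be used to trigger the notification (e.g. by email or SMS) mentioned above.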
An alternative to a mouse-driven interface for the tool could be a touch-driven interface for use e.g. on a tablet or mobile phone. In such an embodiment, the sub-regions of a given page could be selected with appropriate gestures. For example, dragging and pinching would achieve the typical pan and zoom functions, whilst tapping on an otherwise inactive area of the page could be used to select a region or frame of the page. In this example, the choice of zone to pick could be difficult to predict if there are several overlapping candidate zones. The interface could allow for this by cycling the selection through the set of candidates with each successive tap in the same area. For example, the first tap might select the smallest candidate region, which might be an HTML <p> tag. A second tap might then expand the selection to the smallest enclosing <div> tag. A third tap might then make a selection further “up” the document object model, with each successive tap exchanging the currently selected zone for a larger candidate zone, eventually returning the selection to the first candidate. In such a way, a touch-driven interface could be used to select and refine the sub-regions to monitor. The advantage of using a touch-driven interface would be especially relevant to an embodiment where the resulting published feeds were consumable on a touch-driven portable device.
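The tap-cycling behaviour described above might, purely as a sketch, be modelled as follows. The names `candidate_zones` and `TapSelector` are hypothetical, and the document object model is approximated here by Python's `xml.etree.ElementTree`, which assumes well-formed markup.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


def candidate_zones(root, tapped):
    """List the tapped element and its ancestors, smallest zone first."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    chain = [tapped]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


@dataclass
class TapSelector:
    """Cycles the selection through the candidate zones on repeated taps."""
    candidates: List[str]   # ordered from smallest to largest zone
    index: int = -1         # -1 means nothing has been selected yet

    def tap(self) -> str:
        # Each tap selects the next larger zone; after the largest zone
        # the selection wraps around to the first (smallest) candidate.
        self.index = (self.index + 1) % len(self.candidates)
        return self.candidates[self.index]


if __name__ == "__main__":
    doc = ET.fromstring("<html><body><div><p>text</p></div></body></html>")
    zones = [el.tag for el in candidate_zones(doc, doc.find(".//p"))]
    selector = TapSelector(zones)
    print([selector.tap() for _ in range(5)])
```

In this sketch every ancestor is treated as a candidate; a practical tool might restrict the candidates to container elements such as <div> or <table>.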
As set out in step S106, the tool optionally displays a textual representation of a selected sub-region. Candidate sub-regions would typically be HTML container elements such as <div> or <table> elements, but could also be the boundary of any graphically grouped collection of HTML objects. In the preferred embodiment, the tool also displays a textual representation of the current selection specification, for example as the XPath of the selected framing element. Such an XPath could then be modified directly by the user for more manual, less graphical, control of the region(s) to monitor. For example, as shown in
The graphical user interface and associated tool could be of further assistance to the user if it also highlighted any contained links (e.g. <a> elements). This would help the user to see which links would be included and which would be excluded by the current region selection.
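As a sketch of how a textual selection specification such as an XPath might be derived from a selected element, the following example builds a positional path and checks that it resolves back to the same element. The function name `xpath_of` is hypothetical, and the example assumes well-formed markup and the limited XPath subset supported by Python's `xml.etree.ElementTree`; it is not intended as the form of specification used by the tool.

```python
import xml.etree.ElementTree as ET


def xpath_of(root, target):
    """Return a positional path from root to target, e.g. ./body[1]/div[2]."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # XPath positions are 1-based and count siblings with the same tag.
        same_tag = [child for child in parent if child.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


if __name__ == "__main__":
    page = ET.fromstring(
        "<html><body><div id='nav'/>"
        "<div id='news'><p>a</p><p>b</p></div></body></html>"
    )
    selected = page.findall(".//p")[1]
    path = xpath_of(page, selected)
    print(path)
    print(page.find(path) is selected)
```

A path of this kind is what a user could edit by hand, for instance by removing the final step to widen the selection to the enclosing element.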
At step S108, the tool determines whether or not the selection of sub-regions is finished. If not, the user could then optionally make additional selections or modify the existing selection by either narrowing or widening its scope. Finally, once the selections have been finalised, and whether or not the specification was exposed to the user as in step S106, the tool outputs the sub-region specification. The output may be stored in a database or may be sent direct to the page-region monitor tool.
after a user has set up the specification on the region selection tool, or may be automated by the page-region monitor tool itself by accessing the specification database. At step S202, for each test of a candidate web page, the crawler downloads the contents (HTML) of the page. At step S204, the tool identifies the subset of that HTML corresponding to the sub-region specification(s) associated with that page. The crawler maintains a history, of at least the last crawl, of the contents of each sub-region in a history database. As at step S208, on each new crawl, the crawler service compares the current contents with the previous contents of a sub-region to determine if there are new links (e.g. <a> elements). Each new link (i.e. a link that is now present that was not present in the previous crawl) is output at step S210, e.g. to the article crawler.
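The comparison of successive crawls described above might be sketched as follows. The function names `links_in_region` and `new_links` are hypothetical, the history database is reduced to a set held in memory, and the page is assumed to be well-formed XHTML so that Python's `xml.etree.ElementTree` can be used; real-world HTML would generally need a more tolerant parser.

```python
import xml.etree.ElementTree as ET


def links_in_region(page_xml, region_path):
    """Collect the href of every <a> element inside the specified sub-region."""
    root = ET.fromstring(page_xml)
    region = root.find(region_path)
    if region is None:
        return set()
    return {a.get("href") for a in region.iter("a") if a.get("href")}


def new_links(previous, current):
    """Links present in the current crawl that were absent from the previous one."""
    return current - previous


if __name__ == "__main__":
    earlier = ("<html><body><div><a href='/a'>A</a></div>"
               "<p><a href='/x'>X</a></p></body></html>")
    later = ("<html><body><div><a href='/a'>A</a><a href='/b'>B</a></div>"
             "<p><a href='/y'>Y</a></p></body></html>")
    history = links_in_region(earlier, "./body/div")
    current = links_in_region(later, "./body/div")
    print(new_links(history, current))
```

In this example the link added outside the monitored <div> is ignored, since only the contents of the specified sub-region are compared.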
In the preferred embodiment, the new item contains a title, a thumbnail image and a summary paragraph of text. The title could be extracted from either the anchor text of the originating link or the <title> element of the page itself. The image could be the biggest image on the page (excluding the background image), or could be chosen by some other algorithm that determines the most representative image on the page. The summary paragraph could be identified as the first contiguous run of text data longer than 10 words, or by some other algorithm that determines the best text summary of the contents of the page.
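One possible way of extracting the title and summary paragraph is sketched below using Python's standard `html.parser` module. The names `ArticleExtractor` and `build_item` are hypothetical. The sketch assumes that a "contiguous run of text" is a single text node (so inline markup splits a run), that script and style contents are ignored, and that the <title> element is preferred over the anchor text; these are illustrative choices only.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Records the <title> text and the first text run above a word count."""
    SKIPPED = {"script", "style"}

    def __init__(self, min_words=10):
        super().__init__()
        self.min_words = min_words
        self.title = None
        self.summary = None
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self._skip_depth:
            return
        if self._in_title:
            if self.title is None:
                self.title = text
        elif self.summary is None and len(text.split()) > self.min_words:
            self.summary = text


def build_item(html, anchor_text):
    """Build a feed item; fall back to the link's anchor text for the title."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title or anchor_text, "summary": parser.summary}


if __name__ == "__main__":
    page = ("<html><head><title>Big News</title></head><body>"
            "<p>Short intro.</p><p>This paragraph has clearly more than "
            "ten words in it, so it qualifies.</p></body></html>")
    print(build_item(page, anchor_text="link text"))
```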
Alternative algorithms for deciding on the best (or most representative) image on a crawled page include algorithms that use one or more of the following: comparing source URLs to known ad-provider lists (ad-blocking), looking for images with reasonable aspect ratio (to e.g. exclude long/thin images more likely to be page decoration than representative of the page content), applying a minimum and/or maximum size or area of an image (to e.g. exclude iconography or background images), consideration of the entropy per pixel (e.g. to help select photographs over line-based iconography), ignoring common images (either common to current page, or common across several pages from the same site), ignoring images with common advert dimensions, ignoring images occurring too near the top of the web page (to e.g. exclude logo images).
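A few of these heuristics (minimum area, aspect ratio and common advert dimensions) might be combined as in the following sketch. The names `ImageInfo` and `pick_image`, the threshold values and the particular list of advert sizes are assumptions chosen for illustration; the image dimensions are taken as already known rather than fetched.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ImageInfo:
    url: str
    width: int
    height: int


# Illustrative set of widely used banner dimensions (width, height).
AD_SIZES = {(728, 90), (300, 250), (160, 600), (468, 60)}


def pick_image(images: Iterable[ImageInfo],
               min_area: int = 10_000,
               max_aspect: float = 3.0) -> Optional[ImageInfo]:
    """Return the largest image that survives the filters, or None."""
    best = None
    for img in images:
        if img.width <= 0 or img.height <= 0:
            continue
        if (img.width, img.height) in AD_SIZES:
            continue                      # likely an advert
        area = img.width * img.height
        if area < min_area:
            continue                      # likely an icon
        aspect = max(img.width, img.height) / min(img.width, img.height)
        if aspect > max_aspect:
            continue                      # long/thin page decoration
        if best is None or area > best.width * best.height:
            best = img
    return best


if __name__ == "__main__":
    found = pick_image([
        ImageInfo("logo.png", 32, 32),
        ImageInfo("banner.gif", 728, 90),
        ImageInfo("rule.png", 900, 100),
        ImageInfo("photo.jpg", 640, 480),
    ])
    print(found.url if found else None)
```

Other criteria listed above, such as entropy per pixel or position on the page, could be added as further filters in the same loop.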
Another embodiment, described below, provides further tooling for the user to define how to construct an item, or article, from the target page. Still other embodiments might choose to simply use the contents of the whole page as the article contents.
Finally, at step S306, the article crawler passes each new item that it has constructed to the publishing component of the system.
Optionally, the region selection tool can be extended to provide support for how an item is generated in the article crawler. For example, as shown in
In all of the above embodiments, the feed may be received and/or the sub-region specification may be conducted on a mobile device which may be any kind of mobile computing device, including laptop and hand held computers, portable music players, portable multimedia players, mobile phones. Users can use mobile devices such as phone-like handsets communicating over a wireless network, or any kind of wirelessly-connected mobile devices including PDAs, notepads, point-of-sale terminals, laptops etc. Each device typically comprises one or more CPUs, memory, I/O devices such as keypad, keyboard, microphone, touchscreen, a display and a wireless network radio interface.
These devices can typically run web browsers or microbrowser applications e.g. Openwave™, Access™, Opera™, Mozilla™, browsers, which can access web pages across the Internet. These may be normal HTML web pages, or they may be pages formatted specifically for mobile devices using various subsets and variants of HTML, including cHTML, WML, DHTML, XHTML, XHTML Basic and XHTML Mobile Profile. The browsers allow the users to click on hyperlinks within web pages which contain URLs (uniform resource locators) which direct the browser to retrieve a new web page.
The Web server can be a PC type computer or other conventional type capable of running any HTTP (Hyper-Text-Transfer-Protocol) compatible server software as is widely available. The Web server has a connection to the Internet. These systems can be implemented on a wide variety of hardware and software platforms.
The servers for crawling or metacrawling can be implemented using standard hardware. The hardware components of any server typically include: a central processing unit (CPU), an Input/Output (I/O) Controller, a system power and clock source; display driver; RAM; ROM; and a hard disk drive. A network interface provides connection to a computer network such as Ethernet, TCP/IP or other popular protocol network interfaces. The functionality may be embodied in software residing in computer-readable media (such as the hard drive, RAM, or ROM). A typical software hierarchy for the system can include a BIOS (Basic Input Output System) which is a set of low level computer hardware instructions, usually stored in ROM, for communications between an operating system, device driver(s) and hardware. Device drivers are hardware specific code used to communicate between the operating system and hardware peripherals. Applications are software applications written typically in C/C++, Java, assembler or equivalent which implement the desired functionality, running on top of and thus dependent on the operating system for interaction with other software code and hardware. The operating system loads after BIOS initializes, and controls and runs the hardware. Examples of operating systems include Linux™, Solaris™, Unix™, OSX™, Windows XP™ and equivalents.
The region selection tool may provide for user login. The user is identified by registering a username and password and then subsequently by logging in with the same username and password. The registration process is a one-time process per user. In a preferred embodiment, the login process is also a one-time process per user by caching their credentials (or a unique key representing their identity) in a cookie. However, where cookies are not supported then the user is required to provide username and password per result publication. The user could be required to login at the first page of the graphical user interface, however, in the preferred embodiment, the user is only prompted for login (if not already identified) when first attempting to connect.
No doubt many other effective alternatives will occur to the skilled person. It will be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the spirit and scope of the claims appended hereto.