The present invention relates generally to the field of programmed software agents and in particular to a new and useful software agent for retrieving changing information from predetermined networked web sites.
Many different types of networks are presently known. Local area networks (LANs) and wide area networks (WANs) are typically internal to an organization. These networks are usually isolated from outside users or other networks, but may be interconnected. The Internet is a large global network of interconnected computers.
A particular computer or a file containing information on such a computer may be found through an “address” or URL (uniform resource locator). Any computer which is connected to a network, and especially to the Internet, must have an address which identifies it to the other computers on the network.
Computers which are permanently connected to a network may have files identified by specific URLs which can be accessed by other, remote computer users also connected to the network. When the files contain text and graphics in HTML (Hypertext Markup Language) or similar languages, these files are often referred to as “web pages”. Web pages can be viewed by different users with a software application known as a web browser, such as Netscape's NAVIGATOR browser or Microsoft's INTERNET EXPLORER browser. Each web page that is stored on one of these networked computers has a distinctive URL which can consistently be used to locate the web page and its current content for display in a browser application window.
Web page files which are in HTML or a similar language contain formatting and presentation instructions that can be used by a remote user's web browser to display the content of the web page on their local computer. The text and graphics on the web page that the remote user actually sees are typically referred to as “content”.
In recent years, the Internet computer network has become increasingly commercial and continues to grow in size at a rapid rate. It is possible to find massive amounts of information on trivial subjects in a short period of time using the Internet. However, due to the commercial nature of some sites, advertising has become a major portion of many web sites. On some web pages, the amount of advertising can dwarf the information content of the page. Other pages contain so much information that it is difficult for a user to discern which information is most relevant to that user.
The formatting of web pages using HTML and related languages divides content into particular sections, or structures. Often, only one or two of the structures of a particular web page will contain useful information content. The remainder of the page may be advertising or irrelevant information.
Search engines exist to help users find information content on web pages by indexing the pages of owners who register with the search engine against the terms which appear in their web pages. When a user accesses a search engine, the terms entered into the search engine are compared to the previously indexed terms and a listing of hyperlinks to potentially relevant sites is presented to the user. The listing of hyperlinks is generated based on the search engine's best guess of which sites are most relevant using a weighting of the search terms. A search engine is not an exceptionally accurate way to find information. However, when a source location is not known, it provides a good starting point.
Agent software, sometimes referred to as “intelligent agents”, “robots”, “bots” or “spiders”, is generally known in the art of computers. The term intelligent agent can be used to mean a broad range of software programs having pre-programmed logic for performing particular functions. The particular functions, programming and purpose vary from agent to agent. Most software referred to as intelligent agents operates on many different computers across a network. That is, the agent functions are distributed and require the cooperation of at least two computers.
Agents may be used to perform commercial transactions, such as the intelligent agent disclosed by U.S. Pat. No. 5,983,200. The agent is used to execute tasks electronically using given information and learned information. The agent quickly performs actions across a network which would otherwise be very time-consuming for the user who enabled the agent.
Software agents which can be programmed to perform particular functions are thus very useful and have many different applications.
Agent software executing on a user's personal computer which can retrieve, format and display content from many different remote sources to the user's local personal computer is not presently known.
It is therefore an object of the present invention to provide search agent software for retrieving changing information from known remote computer locations.
It is a further object of the invention to provide a software agent which executes on a local computer to retrieve information from remote data sources.
Yet another object of the invention is to provide a software agent that can recognize retrieved content formats for storage and publication purposes.
Accordingly, a software agent is provided which executes instructions on a local user's computer to retrieve potentially changing information content from remote data sources over a computer network, such as the Internet. Different types of software agents are available to retrieve different types of information content from remote sites.
The agent has pre-programmed agent information which the agent uses in conjunction with agent tools and routine libraries to find and identify desired information content. The agent information includes the URL of a remote web page, called the target web page, containing the desired information content, called the target content. The agent retrieves the target web page identified by the programmed URL to the local computer. The agent parses the target web page using pre-programmed agent information to identify target content structures in the target web page.
Target content can be found by the agent, even if the specific information content changes, and in certain cases, even if the position of the target content changes within the target web page. The agent tools include algorithms for searching the target web site for the web page structure containing the target content, even when the target web site has changed form.
Once the target content is found in the retrieved web page, the target content is saved by the agent in a known structure with some formatting information from the original target web page.
A method for creating the agent involves specifying the type of agent and supplying agent information, including identifying the agent with a name and brief description, identifying the URL of a target web page, identifying start marker text, and identifying end marker text, followed by generating the agent programming using the target URL, agent information and the agent tools and routine libraries. After generating the agent, the accuracy of the agent can be verified by running the agent to ensure it retrieves the target content from the specified section of the target web page.
The various features of novelty which characterize the invention are pointed out with particularity in the claims annexed to and forming a part of this disclosure. For a better understanding of the invention, its operating advantages and specific objects attained by its uses, reference is made to the accompanying drawings and descriptive matter in which a preferred embodiment of the invention is illustrated.
In the drawings:
The agent software of the invention is particularly advantageous since it is stored on and executes on a local computer where the user of the agent software is located. Execution of agent routines on other computers is not required for the agent to function; the agent software only requires access to the information stored on remote computers to perform its functions. The agent of the invention can be used to create a personal Internet portal for an individual user by retrieving, formatting and storing content from one or more specific remote locations. The stored content can then be put into a personal publication presenting the content from many different remote locations on a single, local page.
The creation and use of the agent software will now be described in greater detail.
Referring now to the drawings, in which like reference numerals are used to refer to the same or similar elements,
The computer network 500 includes multiple data sources 20. Each data source 20 has a unique URL, called a target source or target web page, which can be accessed by the agent software 10 and contains desired information content, called target content. The possible forms for the target source are not limited to traditional web pages, and include HTML documents, XML documents, text files, graphic files, mail messages, database files and other similar types of computer files. Each agent 10 includes a link to a single data source 20. The data sources 20 could be accessed by a conventional web browser and the information content is in a format readable by the conventional web browser.
The agent software 10 resides entirely on the user's computer 5 and, when activated, downloads the target web page located at a specified URL of the data sources 20. Many agents 10 can operate on a single user's computer to retrieve target content from many different target web pages.
Instructions which distinguish the current agent 10 from other agents are input to an agent builder program 115 using the user interface 15 of computer 5. The agent builder program 115 converts the input instructions into smart agent information 120. The smart agent information 120 is essentially data with parameters that can be used by the other agent software modules.
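As an illustration only, the agent information could be held as a small record of parameters such as the one sketched below. The class name `AgentInfo`, its field names, and the `validate` helper are hypothetical and do not appear in this disclosure; the fields simply mirror the items supplied when an agent is created (agent type, name, description, target URL, start marker text and end marker text).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentInfo:
    """Hypothetical parameter record; the field names are illustrative only."""
    agent_type: str                      # "smart", "search", "custom" or "rss"
    name: str
    description: str
    target_url: str
    start_marker: Optional[str] = None   # not needed for RSS-type agents
    end_marker: Optional[str] = None     # not needed for RSS-type agents

    def validate(self) -> None:
        """Raise ValueError if a non-RSS agent lacks its marker text."""
        if self.agent_type != "rss" and not (self.start_marker and self.end_marker):
            raise ValueError("start and end marker text are required")


info = AgentInfo(
    agent_type="smart",
    name="Headlines",
    description="Front-page headlines",
    target_url="http://example.com/news",   # placeholder URL
    start_marker="Top Stories",
    end_marker="More news",
)
info.validate()
```

A frozen record is used here so that the parameters cannot be altered once the agent has been generated; this is a design choice of the sketch, not a requirement of the agent.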
All agents 10 include a foundation 100. The foundation 100 has various agent tool and library routines used by the agent 10 to perform its functions. Tools and library routines may include a function to request and retrieve a target web site from a URL specified by the smart agent information 120, checking algorithms for verifying the accuracy of an agent and other common programming routines that can be combined to produce larger program functions. The foundation 100 further includes communications protocols and HTML and RSS parsing routines, as described in more detail below.
The smart agent engine 110 uses the foundation 100 elements to produce program instructions for the agent 10 based on the smart agent information 120. The smart agent engine 110 includes a predefined process for applying the tools and library routines to the problem presented by the smart agent information 120. A smart agent is the basic agent of the agent software 10.
A search agent includes the search agent information 130. The search agent information 130 adds a place holder to the smart agent information 120 for entering search terms or other information, such as a username/password combination. The search agent may be used to retrieve search results from a known remote site (the target web site) offering indexed, searchable information, among other things. The search agent information 130 causes additional instructions to be added to the program created by the smart agent engine 110.
A custom agent module 150, as shown in
An RSS-type agent 10 is shown in
An RSS-type agent is a simplified version of the smart agent of
The RSS type agent 10 includes the foundation 100 like a smart agent, but the RSS agent engine 112 and RSS agent information 122 are simplified. The RSS agent information 122 consists simply of the URL location of the desired RSS format data to be retrieved. The RSS agent engine 112 contains program instructions designed to specifically retrieve and store content in RSS format that is modified only by the URL location in the RSS agent information 122.
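A minimal sketch of reading RSS-format content is given below, assuming the content has already been retrieved from the URL as text and uses plain, un-namespaced `item`, `title` and `link` elements. The function name `read_rss` and the sample feed are hypothetical; the parsing relies only on Python's standard `xml.etree.ElementTree` module and is not the RSS agent engine 112 itself.

```python
import xml.etree.ElementTree as ET


def read_rss(xml_text):
    """Return the title and link of every <item> element in an RSS document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


# Hypothetical sample feed standing in for content retrieved from a URL.
sample = (
    '<rss version="2.0"><channel><title>Example</title>'
    "<item><title>First</title><link>http://example.com/1</link></item>"
    "<item><title>Second</title></item>"
    "</channel></rss>"
)
entries = read_rss(sample)
```

Because the whole document is taken, no start or end marker text is involved; the only input that varies between agents of this kind is the URL.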
The steps for creating an agent 10 to retrieve information content from all or part of a known web site are displayed in the flow chart of
Once the target URL is identified, optionally, the content of the target web page can be displayed 215 with the user interface 15 in a browser window for reference.
The target page is then parsed 217 by the agent builder 115 to determine the structure of the target page. The syntax and structure are analyzed and decomposed by the agent builder 115 and a parse tree is constructed. The parse tree represents all of the major structural elements found in the target web page, using well-known semantics associated with HTML syntax. The hierarchy of the original target page is determined, along with nodes that correspond to each structural element found in the target document. Plain text, links, image references and all other web page components are related to the HTML syntax elements enclosing them in the target page definition and placed into the parse tree structure as elements of the tree. It should be noted that images and non-text elements are not downloaded, since they are the result of separate HTTP (Hypertext Transfer Protocol) transactions different from the one required to retrieve the target web page.
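The construction of a parse tree from a page can be sketched as follows. This is an illustrative approximation built on Python's standard `html.parser` module, not the parser of the agent builder 115; the names `Node`, `TreeBuilder`, `parse` and `outline`, and the short `VOID_TAGS` list, are assumptions made for the sketch. Image references appear in the tree as nodes, but nothing is downloaded.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag          # element name, or "#text" for text nodes
        self.parent = parent
        self.children = []
        self.text = ""          # only used by "#text" nodes


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node      # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():             # ignore white-space-only runs
            text = Node("#text", self.current)
            text.text = data
            self.current.children.append(text)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one per node."""
    label = repr(node.text.strip()) if node.tag == "#text" else node.tag
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = parse("<html><body><p>Hi <b>there</b></p><img src='x.png'></body></html>")
```

Calling `"\n".join(outline(tree))` gives an indented listing of the hierarchy, which is a convenient way to inspect where each piece of text sits relative to its enclosing elements.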
In all cases, the original HTML formatting information, structural information and content from the target page are maintained in a form that allows the original version of the target page to be recreated in a functionally equivalent form.
For smart, search and custom agents, the target content of the web page is selected by a user and identified 220 for the agent in two steps. The user selects a unique text at the beginning of the target content and identifies the text for the agent 10. This text is referred to as the start marker text for the target content. Then, a second unique text near the end of the target content is selected and identified for the agent 10. This text is referred to as the end marker text.
The start and end marker text identify a section of the target web page containing content that is desired by a user. The actual text content found in that structure may change periodically; the marker texts are only used to identify the structure within the target page where the target content is initially located on the web site.
Identification 220 of the start and end marker text in the target content can occur in at least three ways. The user can identify the text by manually entering the marker text into an agent builder application window on the user interface 15, the user can cut and paste text from the target web page into the agent builder 115, or the user can select the text in the browser window displaying the target web page and direct the agent builder 115 to retrieve the selected text and use that for the input for the identification 220.
Start and end marker text may consist of plain text, stylized text, HTML syntax elements such as tags or comments, or any other text-based information contained in the target web page.
In all cases, the start and end marker text is used to identify an approximate, human readable location in the precise structure of the target web page that the agent builder 115 can use as a starting point to determine the actual physical location within the web page structure and syntax. The human readable and identifiable location may consist of a single block of content from the target page delineating the entire area of interest, or, it may consist of discontinuous areas of text to be considered the start and end markers for the area of interest.
The unique text used for the start and end marker text does not need to be precisely at the beginning or the end of the content. The agent builder 115 contains an algorithm for checking the identified text in the target page against the marker text and determining which section or sections of the target web page are intended to be selected.
The marker text is distilled into a case-insensitive version of the text identified 220 by the user, with all unnecessary white space and intermediate formatting removed. The agent builder 115 then searches 230 the parse tree for a sequence of text-based content that matches the marker text. The marker text can span multiple nodes of the parse tree and be physically separated by intervening HTML formatting tags. The agent builder 115 can reassemble the linear stream of content-oriented information from the raw HTML information using the structural information in the parse tree. The content stream is compared to the distilled marker text to ensure that the correct structure has been located 230.
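The distilling and matching of marker text can be sketched as shown below, under stated assumptions. Here "distilled" is taken to mean lower-cased with runs of white space collapsed to a single space, and the content stream is rebuilt by concatenating the text of the page in document order while remembering which piece of text each character came from. The names `distill`, `TextChunks`, `content_stream` and `locate` are hypothetical, and the sketch uses Python's standard `html.parser` module rather than the parse tree of the agent builder 115.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # partial list


def distill(text):
    """Lower-case the text and collapse every run of white space."""
    return re.sub(r"\s+", " ", text).strip().lower()


class TextChunks(HTMLParser):
    """Collects (path of open tags, text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:   # pop back to the matching tag
                pass

    def handle_data(self, data):
        self.chunks.append(("/".join(self.stack), data))


def content_stream(chunks):
    """Build the distilled stream plus the owning chunk index of each character."""
    chars, owners = [], []
    for index, (_path, data) in enumerate(chunks):
        for ch in data.lower():
            if ch.isspace():
                if chars and chars[-1] != " ":
                    chars.append(" ")
                    owners.append(index)
            else:
                chars.append(ch)
                owners.append(index)
    return "".join(chars), owners


def locate(html, marker):
    """Return the tag paths of the text chunks spanned by the marker, or None."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    stream, owners = content_stream(parser.chunks)
    needle = distill(marker)
    pos = stream.find(needle) if needle else -1
    if pos < 0:
        return None
    first, last = owners[pos], owners[pos + len(needle) - 1]
    return [path for path, _ in parser.chunks[first:last + 1]]


page = ("<table><tr><td><b>Top</b>  Stories\n today</td></tr>"
        "<tr><td>Ads</td></tr></table>")
spanned = locate(page, "TOP   stories")
```

In this example the marker is split by a `<b>` tag in the page, yet it is still found, and `spanned` lists the two enclosing tag paths. One limitation of the simplified stream is that text from adjacent cells runs together with no separator; a fuller implementation would likely treat block-level boundaries as white space.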
As an example of the parsing, assume the following represents the structure of a simple HTML document:
<html>
<head><title>This is a test</title></head>
<body>
<table>
<tr>
</tr>
<tr>
<table>
<tr>
</tr>
</table>
The second table row is excluded from the target content since, even though it was a part of the same table, or parent object, it was outside the target object, namely the first row.
Once the marker text is found 230 in the target page, the structural location within the parse tree is stored. This is done for both the start and end marker text.
If the agent 10 is an RSS agent, then the start and end marker text is not necessary, because the RSS content at the target URL is intended to be taken in its entirety. The RSS content corresponds to the entire desired content and so it is not in a section of a target web page that must be identified like other non-RSS content may be. Thus, steps 220 and 230 may be skipped for RSS agents.
Returning to
The agent builder 115 moves back and forth through the parse tree hierarchy to determine a common structural element containing all of the start and end marker text. Then, program instructions are generated to identify the same location in future, changed versions of the target page. This feature permits the agent to repeatedly and accurately retrieve changing content from the same location of a target page. These instructions are combined with program instructions for automating the download, analysis and extraction steps of the agent execution process (explained below) using the foundation 100 elements. The resulting agent 10 program is stored for future execution.
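One way to picture the search for a common structural element is as a longest-shared-prefix computation over the two stored locations. The sketch below assumes that each location is recorded as a list of (tag, child index) steps from the root of the parse tree; that representation and the name `common_container` are assumptions made for illustration and are not stated in this disclosure.

```python
def common_container(start_path, end_path):
    """Return the longest shared prefix of two root-to-node paths."""
    shared = []
    for start_step, end_step in zip(start_path, end_path):
        if start_step != end_step:
            break                    # the paths diverge below this point
        shared.append(start_step)
    return shared


# Hypothetical locations of the start and end marker text in a parse tree.
start = [("html", 0), ("body", 1), ("table", 0), ("tr", 0), ("td", 0)]
end = [("html", 0), ("body", 1), ("table", 0), ("tr", 0), ("td", 2)]
container = common_container(start, end)
```

Here both markers lie in different cells of the same table row, so `container` ends at that row; a second row of the same table would fall outside it.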
To use a constructed agent 10, a similar process to the one described above is followed. As shown in
The program instructions generated by the agent creation are used to locate 320 the structural location in the parse tree where the target content was originally found, without regard to the current content at the structural location in the current version of the web page. If the structural location is the same as when the agent 10 was first programmed, the target content will be retrieved, formatted with the surrounding HTML information and stored and/or displayed 340 for the user on the local machine 5.
When the target content is identified in the structure of a retrieved page, the content text is extracted and HTML content is regenerated around the content text based on the structure surrounding the content text in the current version of the retrieved page. The structure of the original target document that was used to create the agent 10 is only relevant to the evaluation step insofar as the original structure was used to generate the program instructions used by the agent to retrieve and evaluate the current version of the target page.
If the structural location cannot be found or has changed from the originally programmed agent information, the agent 10 can evaluate 330 the parse tree to attempt to determine the current location of the target content. The evaluation of a retrieved target page is based on a series of rules derived from the standard syntax of HTML documents. The target content area is by definition contained within some set of hierarchical HTML tags, provided that it has not been eliminated entirely from the target page. The software agent 10 embodies knowledge of these tags, their relationships, and proper syntax and semantics. The agent 10 includes algorithms using this knowledge to determine where the target content structure has been moved to within the target page.
A primary benefit of the agent 10 is that multiple agents 10 can be used to quickly retrieve target content from many different remote sources, all of which can then be displayed in a single application window page.
The retrieved target content is stored on the local user's computer 5 in a format which is known to the software agent application 10. The retrieved target content is, very simply, data, which is stored on the user's computer 5 in a standard format and can be accessed repeatedly by a display program. The data includes the content text and HTML formatting information.
One or more predefined display structures, called publication templates, can be used to arrange the stored target content into personal web pages having different formats, such as a newspaper, web portal, etc. The publication templates are programmed with instructions for accessing particular parts of the stored target content and displaying it in a user application window, such as a browser window.
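The two-stage lookup described above (first the stored structural location, then an evaluation of the tree if that fails) can be sketched as follows. The rules actually used by the agent 10 are not detailed here, so the fallback in this sketch is a deliberately simple stand-in: it looks for the first node reachable through the same sequence of tags, ignoring sibling positions. The tree representation (a `(tag, children)` tuple) and the names `follow`, `find_by_tags` and `locate` are assumptions made for illustration.

```python
def follow(tree, path):
    """Follow (tag, child index) steps from the root; None if any step fails."""
    node = tree
    for tag, index in path:
        children = node[1]
        if index >= len(children) or children[index][0] != tag:
            return None
        node = children[index]
    return node


def find_by_tags(tree, tags):
    """Stand-in fallback: depth-first search for the same tag sequence."""
    if not tags:
        return tree
    for child in tree[1]:
        if child[0] == tags[0]:
            found = find_by_tags(child, tags[1:])
            if found is not None:
                return found
    return None


def locate(tree, path):
    node = follow(tree, path)
    if node is None:                 # structure changed: evaluate the tree
        node = find_by_tags(tree, [tag for tag, _ in path])
    return node


stored_path = [("table", 0), ("tr", 0)]
row = ("tr", [])
original = ("body", [("table", [row])])
# A later version of the page with an extra element inserted before the table.
changed = ("body", [("div", []), ("table", [row])])
```

With these inputs, `locate(original, stored_path)` succeeds on the stored location directly, while `locate(changed, stored_path)` fails the direct lookup and recovers the row through the fallback.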
As an example, five agents are programmed to retrieve content consisting of the current news headlines and opening paragraphs of each story from five magazines and newspapers available on remote Internet web sites. A scheduling application activates the agents every hour. Each of the five agents executes its programmed instructions and retrieves, formats and stores the target content from its news source on the user's computer 5. After the target content is stored, the user selects a publication template which will display only the headlines from each news publication in its own section on a page in three columns. The associated first paragraph of the story, which is part of the retrieved target content but is not desired, will not be displayed using the selected publication template. The template specifies where the content from each publication will begin and which components of the target content text will be displayed. The template may also display information such as the URL from which the content was retrieved, the time of retrieval (to show how up to date it is) and the content provider name.
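The headline-only template of this example can be sketched as shown below. The layout of the stored content (a list of records with `provider` and `stories` keys, each story having a `headline` and a `lead`) and the function name `render_headlines` are assumptions made for illustration; the actual stored format is not specified here. The output is a simple HTML table with one section per publication, three sections per row.

```python
from html import escape


def render_headlines(stored, columns=3):
    """Render one cell per source, showing headlines only."""
    cells = []
    for source in stored:
        items = "".join(
            f"<li>{escape(story['headline'])}</li>" for story in source["stories"]
        )
        cells.append(
            f"<td><h2>{escape(source['provider'])}</h2><ul>{items}</ul></td>"
        )
    rows = [
        "<tr>" + "".join(cells[i:i + columns]) + "</tr>"
        for i in range(0, len(cells), columns)
    ]
    return "<table>" + "".join(rows) + "</table>"


# Hypothetical stored content from five agents.
stored = [
    {"provider": f"Paper {n}",
     "stories": [{"headline": f"Headline {n}", "lead": f"Opening paragraph {n}"}]}
    for n in range(1, 6)
]
page = render_headlines(stored)
```

The opening paragraphs remain in the stored content but are simply never read by this template, so a different template could display them without the agents having to run again.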
Thus, used in combination in a single software application, the agent 10 and the publication template provide a very powerful tool for retrieving changing target content and displaying the target content in a succinct, useful manner. Such a software application can permit a user to retrieve only desired information from a target web page and screen undesirable content which is of no interest to the user. The application operates faster since it executes on the local user's computer, and only requires an Internet connection to retrieve the target content. Once the target content is retrieved, all operations occur entirely on the user's computer, with no Internet interaction being necessary.
The agent's content generation functions permit it to generate the stored output in any standard text-based format presently known. The agent includes gateway interfaces which permit the agent to communicate using standard network protocols with a wide variety of network services, such as e-mail, HTTP, FTP, etc. The agent includes translation services for converting between disparate types of formats, such as XML, HTML, and WML/WAP.
The agent software is executed at the application level of any operating system. The agent 10 is a peer application to a web browser and any other user-accessible applications, such as word processors, spreadsheets, or games. The agent 10 has the ability to act as an intermediary for the web browser software, allowing the browser to communicate with the agent 10 and the agent to act as a proxy on behalf of the browser for subsequent downstream HTTP requests to remote URLs. The agent also acts as a server of web content to the browser on the local computer 5. The agent software is implemented entirely on the local computer 5.
While a specific embodiment of the invention has been shown and described in detail to illustrate the application of the principles of the invention, it will be understood that the invention may be embodied otherwise without departing from such principles.
Number | Name | Date | Kind |
---|---|---|---|
5339392 | Risberg et al. | Aug 1994 | A |
5416917 | Adair et al. | May 1995 | A |
5587902 | Kugimiya | Dec 1996 | A |
5623653 | Matsuno et al. | Apr 1997 | A |
5649186 | Ferguson | Jul 1997 | A |
5710918 | Lagarde et al. | Jan 1998 | A |
5721908 | Lagarde et al. | Feb 1998 | A |
5727159 | Kikinis | Mar 1998 | A |
5761673 | Bookman et al. | Jun 1998 | A |
5768528 | Stumm | Jun 1998 | A |
5826258 | Gupta et al. | Oct 1998 | A |
5864863 | Burrows | Jan 1999 | A |
5898836 | Freivald et al. | Apr 1999 | A |
5963937 | Yamasaki et al. | Oct 1999 | A |
5974441 | Rogers et al. | Oct 1999 | A |
5978828 | Greer et al. | Nov 1999 | A |
5978842 | Noble et al. | Nov 1999 | A |
5983200 | Slotznick | Nov 1999 | A |
5983267 | Shklar et al. | Nov 1999 | A |
5987403 | Sugimura | Nov 1999 | A |
5996000 | Shuster | Nov 1999 | A |
6012083 | Savaitzky et al. | Jan 2000 | A |
6021426 | Douglis et al. | Feb 2000 | A |
6023697 | Bates et al. | Feb 2000 | A |
6029175 | Chow et al. | Feb 2000 | A |
6067541 | Raju et al. | May 2000 | A |
6067559 | Allard et al. | May 2000 | A |
6070185 | Anupam et al. | May 2000 | A |
6085186 | Christianson et al. | Jul 2000 | A |
6085193 | Malkin et al. | Jul 2000 | A |
6088731 | Kiraly et al. | Jul 2000 | A |
6092099 | Irie et al. | Jul 2000 | A |
6094662 | Hawes | Jul 2000 | A |
6108686 | Williams, Jr. | Aug 2000 | A |
6195608 | Berliner et al. | Feb 2001 | B1 |
6199097 | Hachiya et al. | Mar 2001 | B1 |
6205456 | Nakao | Mar 2001 | B1 |
6209007 | Kelley et al. | Mar 2001 | B1 |
6226642 | Beranek et al. | May 2001 | B1 |
6242966 | Shiotsuka | Jun 2001 | B1 |
6253239 | Shklar et al. | Jun 2001 | B1 |
6324565 | Holt, III | Nov 2001 | B1 |
6339775 | Zamanian et al. | Jan 2002 | B1 |
6366933 | Ball et al. | Apr 2002 | B1 |
6535896 | Britton et al. | Mar 2003 | B2 |
6567816 | Desai et al. | May 2003 | B1 |
6605120 | Fields et al. | Aug 2003 | B1 |
6606525 | Muthuswamy et al. | Aug 2003 | B1 |
6681369 | Meunier et al. | Jan 2004 | B2 |
6738804 | Lo | May 2004 | B1 |
6826594 | Pettersen | Nov 2004 | B1 |
6836774 | Melbin | Dec 2004 | B2 |
6897217 | Neustadt et al. | May 2005 | B2 |
6988135 | Martin et al. | Jan 2006 | B2 |
7000008 | Bautista-Lloyd et al. | Feb 2006 | B2 |
20020004813 | Agrawal et al. | Jan 2002 | A1 |
Number | Date | Country |
---|---|---|
0774722 | May 1997 | EP |
2329309 | Mar 1999 | GB |