The present invention relates generally to information extraction and, more particularly, to methods and systems for real-time extraction of user-specified information.
Search engines can be used to locate individual documents from a large collection of documents, such as the World Wide Web (WWW), or from documents stored on computers of an intranet. Search engines can compile and organize an index of documents by crawling or reading documents, such as web pages. Generally, the crawling of documents occurs on a regular schedule, e.g., daily or weekly. While the regularly scheduled crawl is sufficient for gathering relatively static data, some of the content on the web is “real-time.”
Real-time data on the web is data that is updated after short intervals. Real-time data is most useful to a user during the interval between scheduled crawls. One example of such data is the current price of a stock. Another example is the current score of a sporting event.
Web sites exist that allow a user to view frequent updates of this real-time data. However, these sites often provide more information than a user is interested in viewing. For example, a typical web page on a sports-oriented web site displays scores for multiple games or includes a variety of content in addition to the content that the user wishes to view, such as advertisements. A user may wish to view only one of these scores or only a portion of the displayed page. Also, pages containing real-time data may not automatically refresh.
Embodiments of the present invention comprise methods and systems for real-time extraction of user-specified information. One aspect of one embodiment of the present invention comprises receiving a selection of a portion of a web page, wherein the selection comprises a first set of data; dynamically generating an extraction pattern based at least in part on the selection; and extracting a second set of data from the web page based at least in part on the extraction pattern.
This illustrative embodiment is mentioned not to limit or define the invention, but to provide one example to aid understanding thereof. Illustrative embodiments are discussed in the Detailed Description, and further description of the invention is provided there. Advantages offered by the various embodiments of the present invention may be further understood by examining this specification.
These and other features, aspects, and advantages of the present invention are better understood when the following Detailed Description is read with reference to the accompanying drawings, wherein:
Embodiments of the present invention comprise methods and systems for real-time extraction of user-specified information. There are multiple embodiments of the present invention. By way of introduction and example, one illustrative embodiment of the present invention provides a method for extracting updated content from a portion of a web page. The content of a web page may be updated frequently, such as content including sports scores and stock quotes. Users of web browsers may desire to view updates of this content without having to view the entire web page and without having to continually refresh the web page. One embodiment of the present invention provides a method that allows a user of a web browser to select a portion of a web page to be separately displayed and periodically updated. The method may be implemented, for example, as an extension to an application, such as the Google browser toolbar application, or integrated in an application, such as an Internet browser application.
In one method according to the present invention, a user of a web browser selects a desired portion of content on a web page and then clicks on a button on the browser toolbar. Clicking the button causes a new display window to open on the user's display that includes only the content selected by the user. The content displayed in the display window is then periodically updated from the web page without any user intervention. To update the displayed content, the method dynamically generates an extraction pattern by which content corresponding to the user's selection is periodically extracted. The extraction pattern, such as an extraction wrapper, can be generated based on the location of the user's selection in the web page structure. The location may be a location in the Document Object Model (DOM) tree structure of the web page or may be otherwise determined.
For example, a user can select a baseball box score for an ongoing game on a sports or news-oriented web page and then click on a button on a browser toolbar to indicate that he wants to receive updated displays of this selection. The baseball box score is displayed in a separate display window and an extraction pattern is generated based on the location of the box score in the DOM tree structure of the web page. The extraction pattern is then used to periodically extract the box score data from the web page. The display window is periodically updated using the extracted box score data. In one embodiment, the user can modify preferences related to the display, such as the period between updates.
This introduction is given to acquaint the reader with the general subject matter of the application. By no means is the invention limited to such subject matter. Illustrative embodiments are described below.
Various systems in accordance with the present invention may be constructed.
Referring now to the drawings in which like numerals indicate like elements throughout the several figures,
In one embodiment, an extraction processor 112 may reside on a client device, such as client device 102, connected to the network 106. When a user specifies a Uniform Resource Locator (URL), the client device 102 issues a request to the web server 156 for a particular web page. The web server 156 responds to the request by sending the web page to the client 102. The web server 156 may provide static and dynamic web pages. The user then selects a portion of the web page containing a data set. The extraction processor 112 determines a pattern for extracting the selected data from the web page and then extracts the data, causing the data to be displayed in a separate display on the client device 102. The extraction processor 112 then periodically requests updated web pages from the web server 156. Upon receiving the updated page, the extraction processor 112 extracts an updated data set from the portion of the updated page corresponding to the user selection and causes the updated data set to be displayed to the user.
Examples of client device 102 are personal computers, personal digital assistants, cellular phones, mobile phones, smart phones, pagers, digital tablets, laptop computers, Internet appliances, and other processor-based devices. In general, a client device 102 may be any suitable type of processor-based platform that is connected to a network 106 and that interacts with one or more application programs. The client device 102 can contain a processor 108 coupled to a computer readable medium, such as memory 110. Client device 102 may operate on any operating system capable of supporting a browser or browser-enabled application, such as Microsoft® Windows® or Linux. The client device 102 is, for example, a personal computer executing a browser application program such as Microsoft Corporation's Internet Explorer™, Netscape Communication Corporation's Netscape Navigator™, Mozilla Organization's Firefox, Apple Computer, Inc.'s Safari™, Opera Software's Opera Web Browser, or the open source Linux Browser.
Memory 110 of the client device 102 contains a real-time information extraction application program, also known as an extraction processor 112. The extraction processor 112 comprises a software application including program code executable by the processor 108 or a hardware application that is configured to facilitate identifying and extracting information from a portion of a web page and displaying or otherwise outputting the original and updated portion of the web page to a user.
The extraction processor 112 illustrated in
The extraction processor 112 includes program code for receiving a selection of a portion of a web page from a user. The extraction processor 112 also includes program code for generating an extraction pattern based on the selection by the user. The extraction pattern provides a means for the extraction processor 112 to identify the content of interest to the user when the page is subsequently updated, such as when a sports score or stock price is updated.
The extraction processor 112 also includes code for extracting the original and updated content based on the extraction pattern. After the extraction processor 112 extracts the content, the extraction processor 112 causes the updated content to be displayed in a window on the user's display device. In other embodiments, other means of performing the functions may be implemented. These systems and methods are described in greater detail below.
The server device 150 shown in
Such processors may include a microprocessor, an ASIC, and state machines. Such processors include, or may be in communication with, computer-readable media, which store program code or instructions that, when executed by the processor, cause the processor to perform actions. Embodiments of computer-readable media include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the processor 152 of server device 150, with computer-readable instructions. Other examples of suitable media include, but are not limited to, a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, optical media, magnetic tape media, or any other suitable medium from which a computer processor can read instructions. Also, various other forms of computer-readable media may transmit or carry program code or instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may comprise program code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, and JavaScript. Program code running on the server device 150 may include web server software, such as the open source Apache Web Server and the Internet Information Server (IIS) from Microsoft Corporation.
It should be noted that the present invention may comprise systems having different architecture than that which is shown in
When the user indicates that the extraction processor 112 should extract the selected portion of the web page 202 shown in
In one embodiment of the present invention, the extraction processor (112) utilizes the document object model (DOM) tree to determine the location of a selection in a web page. When the user selects a portion of a web page, the user is, in effect, selecting a sub-tree of the underlying structure of the web page.
The first row 404 of the table 402 includes two cells 406, 412. The first cell 406 includes an anchor 408, which is used to create a hyperlink on the rendered page. The text 410 associated with the anchor 408 is “San Francisco.” The second cell 412 of the first row 404 includes text 414, which corresponds to the number of runs scored by San Francisco. In the embodiment shown in
Similarly, the second row 416 of the table 402 includes two cells 418, 424. The first cell 418 includes an anchor 420 with anchor text 420 equal to “Pittsburgh.” The second cell 424 of the second row 416 includes text corresponding to the number of runs scored by Pittsburgh, in this case, 2.
The first row 504 of the table 502 includes two cells 506, 512. The first cell 506 includes an anchor 508, which is used to create a hyperlink on the rendered page. The text 510 associated with the anchor 508 is “San Francisco.” The second cell 512 of the first row 504 includes text 514, which corresponds to the number of runs scored by San Francisco. In the updated information of the embodiment shown in
Similarly, the second row 516 of the table 502 includes two cells 518, 524. The first cell 518 includes an anchor 520 with anchor text 520 equal to “Pittsburgh.” The second cell 524 of the second row 516 includes text corresponding to the number of runs scored by Pittsburgh. In the updated page, Pittsburgh now has 3 runs.
In one embodiment of the present invention, the extraction processor (112) receives a selection of the table 402 shown in
Subsequently, the extraction processor (112) requests an updated version of the page. The extraction processor (112) then retrieves the location of the table 402 from memory. The extraction processor (112) uses the location of the table 402 to find the information contained in table 502.
In one embodiment, the extraction processor (112) utilizes the context around the scores to determine whether or not the updated information is equivalent to the user's selection in the original page. The extraction processor (112) first determines the parent, table cell 524, of the changed score 526. The extraction processor (112) then determines the parent, table row 516, of table cell 524. The extraction processor (112) then compares the information contained in table row 516 to the information contained in table row 416. In this case, the two sets of data are very similar, differing in only one respect: the contents of cell 524 differ from the contents of cell 424. Accordingly, the extraction processor (112) displays the table 502 as the updated equivalent to table 402. In other embodiments, the extraction processor (112) uses other methods to compare the similarity between the original and updated information.
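The row comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular embodiment: the helper names `rowSimilarity` and `isEquivalentRow`, the modeling of a table row as an array of cell strings, and the 0.5 threshold are all hypothetical choices made for the example.

```javascript
// Hypothetical sketch: a table row is modeled as an array of cell strings.
// rowSimilarity returns the fraction of cell positions whose contents match.
function rowSimilarity(originalRow, updatedRow) {
  const length = Math.max(originalRow.length, updatedRow.length);
  if (length === 0) return 1; // two empty rows are trivially identical
  let matches = 0;
  for (let i = 0; i < length; i++) {
    if (originalRow[i] === updatedRow[i]) matches++;
  }
  return matches / length;
}

// Assumed threshold: rows sharing at least half of their cells count as equivalent.
const SIMILARITY_THRESHOLD = 0.5;

function isEquivalentRow(originalRow, updatedRow) {
  return rowSimilarity(originalRow, updatedRow) >= SIMILARITY_THRESHOLD;
}

// Row 416 (original page) and row 516 (updated page) differ only in the score cell.
const row416 = ["Pittsburgh", "2"];
const row516 = ["Pittsburgh", "3"];
console.log(isEquivalentRow(row416, row516));
console.log(isEquivalentRow(row416, ["Advertisement", "Buy now"]));
```

A position-by-position comparison is only one possible similarity measure; an implementation could equally weight the team-name cell more heavily than the score cell, since the score is the part expected to change.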
Various methods may be used to determine the location of a selection within a document. For example, in one embodiment, path labeling is used to determine the location.
The path of a node v is the sequence of nodes from the root of a tree to the node v. Various types of paths may be defined. In the embodiment shown in
In contrast, the tag path is equal to “A” for node A 602, “A.B” for node B 604, “A.B.C” for node C 606, “A.B.D” for node D 608, and “A.B.E” for node E 610. The tag path is also “A.B” for node B 612 and “A.B.C” for node C 614.
In one embodiment of the present invention, the extraction processor (112) uses the sibling path to determine the location of a selection. The selection may span multiple nodes. For instance, a selection of the nodes C 606, D 608, and E 610 can be represented by “0.0[1-3]”. When an updated page is received, the extraction processor (112) then uses the sibling path location stored in memory to locate the information in the updated page.
While the sibling path is useful for finding the same node on multiple pages, the tag path is useful for finding similar nodes on the same page, because multiple nodes on a single page, such as nodes 606 and 614, may share the same tag path. In contrast, the sibling paths of these two nodes 606, 614 are distinct. In another embodiment, the tag path may be used to determine context. For instance, the extraction processor (112) may use the tag path of the stored location to locate other similar nodes and store the content of those nodes in memory. When the updated page is received, the extraction processor (112) locates the updated content using the sibling path and then uses the tag path to validate that the path to the content is the same and to compare stored context with context of the updated information in the updated page. If similar, the extraction processor (112) concludes that the identified information corresponds to the information selected by the user in the original page.
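The difference between the two path types can be sketched for a tree shaped like the one described above (node A with two B children, the first holding C, D, and E and the second holding a single C). The plain-object tree model and the helper names `makeNode`, `findTrail`, `tagPath`, and `siblingPath` are hypothetical, and the zero-based numbering with the root written as 0 is an assumption that may differ from the numbering used in the figures.

```javascript
// Hypothetical tree model: each node is a plain object { tag, children }.
function makeNode(tag, children = []) {
  return { tag, children };
}

// Returns the steps from root to target as [{ node, index }], where index is the
// node's zero-based position among its siblings, or null if target is not in the tree.
function findTrail(root, target) {
  if (root === target) return [{ node: root, index: 0 }];
  for (let i = 0; i < root.children.length; i++) {
    const trail = findTrail(root.children[i], target);
    if (trail) {
      trail[0].index = i; // record the child's position under this parent
      return [{ node: root, index: 0 }, ...trail];
    }
  }
  return null;
}

// Tag path: the tag names along the trail, joined with dots.
function tagPath(root, target) {
  const trail = findTrail(root, target);
  return trail ? trail.map((step) => step.node.tag).join(".") : null;
}

// Sibling path: the sibling positions along the trail, joined with dots.
function siblingPath(root, target) {
  const trail = findTrail(root, target);
  return trail ? trail.map((step) => step.index).join(".") : null;
}

const c606 = makeNode("C");
const d608 = makeNode("D");
const e610 = makeNode("E");
const b604 = makeNode("B", [c606, d608, e610]);
const c614 = makeNode("C");
const b612 = makeNode("B", [c614]);
const a602 = makeNode("A", [b604, b612]);

console.log(tagPath(a602, c606), tagPath(a602, c614));         // A.B.C A.B.C
console.log(siblingPath(a602, c606), siblingPath(a602, c614)); // 0.0.0 0.1.0
```

Both C nodes share one tag path, which is what makes the tag path suitable for finding similar nodes, while their sibling paths differ, which is what makes the sibling path suitable for pinpointing one node.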
Various methods in accordance with embodiments of the present invention may be carried out.
The extraction processor 112 then dynamically generates an extraction pattern for the portion of content selected by the user 704. The extraction pattern may comprise, for example, the location within the web page at which the selection begins and the location at which it ends. In another embodiment, the extraction pattern comprises the location at which the selection begins and an indicator of how much data to extract. For example, the determined location may be the location associated with a row in a table, and the extraction pattern may include an indicator specifying that two table rows are included in the selection starting at the determined location.
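A pattern of the second kind, a start location plus a count, can be sketched as follows. The pattern shape `{ startPath, count }`, the helper `applyPattern`, and the plain-object table are hypothetical stand-ins chosen for illustration; the start location is assumed to be a list of zero-based child indices from the root.

```javascript
// Hypothetical pattern: startPath lists zero-based child indices from the root
// down to the first selected node; count is how many sibling nodes to extract.
function applyPattern(root, pattern) {
  const parentPath = pattern.startPath.slice(0, -1);
  const startIndex = pattern.startPath[pattern.startPath.length - 1];
  let parent = root;
  for (const index of parentPath) {
    parent = (parent.children || [])[index];
    if (!parent) return []; // the page structure changed; nothing to extract
  }
  return (parent.children || []).slice(startIndex, startIndex + pattern.count);
}

// Stand-in for a page containing a three-row score table.
const scoreTable = {
  tag: "body",
  children: [
    {
      tag: "table",
      children: [
        { tag: "tr", text: "San Francisco 7" },
        { tag: "tr", text: "Pittsburgh 2" },
        { tag: "tr", text: "Chicago 1" },
      ],
    },
  ],
};

// "Two table rows, starting at the first row of the first table."
const twoRowPattern = { startPath: [0, 0], count: 2 };
const selectedRows = applyPattern(scoreTable, twoRowPattern);
console.log(selectedRows.map((row) => row.text));
```

Because the pattern stores positions rather than content, the same pattern can be reapplied to each refreshed copy of the page as long as the surrounding structure stays stable.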
The extraction pattern may also be referred to as a wrapper. Generating the extraction pattern for information in web documents may also be referred to as wrapper induction. Data on web pages, such as the web page shown in
This repetitive structure is due to the way in which the web page is created. For example, when the web server 156 receives a request for data, it typically searches a data store for each game to be displayed and data associated with the game, such as the score. For each retrieved record from the data store, the web server 156 typically executes a script, such as a CGI (Common Gateway Interface) script, and uses an HTML (Hypertext Markup Language) template to display the data. Once the HTML template is filled in with data from each of the retrieved records, the completed HTML page is sent to the requestor. In web servers utilizing eXtensible Markup Language (XML), eXtensible Style Sheet Language (XSL) is used to transform XML data into an HTML page.
Since an HTML template is used to construct the portion of the web page containing data of interest to the user, the structure of the web page containing the selected portion should remain relatively constant after each update of the data. Accordingly, by determining the location of the data in the page in which the user selection occurs, the extraction processor (112) is able to determine where to search in the updated page for the corresponding updated content.
Further, the context of the data that is updated is likely to remain the same or similar between updates. For example, in the portion of the web page selected by the user in
Referring still to
For example, when the user selects the window 204 shown in
The extraction processor (112) next receives the updated web page 808. The extraction processor (112) may receive the page in various ways. For example, in one embodiment, the extraction processor (112) includes code that causes the program to pause for a specified time period, e.g., five minutes. At the end of the period, the extraction processor (112) executes a Java applet or JavaScript to retrieve data from the web site of the web page in which the user made the original selection. In response to the request, the extraction processor (112) receives the HTML page from the web server (156).
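The pause-and-retrieve cycle can be sketched as a small poller. This is a simplified illustration under several assumptions: `createPoller` and its callbacks `fetchPage`, `extract`, and `display` are hypothetical names, the fetch is modeled as synchronous (a real network request would be asynchronous), and refreshing the display only when the content changes is an optimization added for the example rather than something the description requires.

```javascript
// Hypothetical poller: fetchPage, extract, and display are supplied by the caller.
function createPoller(fetchPage, extract, display) {
  let lastShown = null;

  // One polling cycle: fetch the page, extract the selection,
  // and refresh the display if the extracted content changed.
  function tick() {
    const pageText = fetchPage();
    const content = extract(pageText);
    if (content !== lastShown) {
      lastShown = content;
      display(content);
    }
    return content;
  }

  // Runs tick every periodMs milliseconds; returns a function that stops polling.
  function start(periodMs) {
    const timer = setInterval(tick, periodMs);
    return () => clearInterval(timer);
  }

  return { tick, start };
}

// Stub pages standing in for successive responses from the web server.
const stubPages = ["SF 7, PIT 2", "SF 7, PIT 2", "SF 7, PIT 3"];
let requestCount = 0;
const shownContent = [];
const scorePoller = createPoller(
  () => stubPages[Math.min(requestCount++, stubPages.length - 1)],
  (pageText) => pageText, // stand-in extraction: the whole stub page is the selection
  (content) => shownContent.push(content)
);

scorePoller.tick();
scorePoller.tick();
scorePoller.tick();
console.log(shownContent); // [ 'SF 7, PIT 2', 'SF 7, PIT 3' ]
```

With this sketch, a five-minute period such as the one mentioned above would correspond to calling `scorePoller.start(5 * 60 * 1000)`.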
In response to receiving the HTML page, the extraction processor (112) retrieves the structure location from memory (110). The structure location may be, for example, the sibling path to the table 402 shown in
As discussed above, the information present in the HTML document and represented by the document object model is hierarchical. For example, a subset of a web page selected by a user, including the name of a sports team, may include the following:
The HTML shown above has five nodes. The root node is parentTag, which has two children, childTag0 and childTag1. The node childTag0 has a name, “txtHomeTeam.” The node childTag1 also has a name, “txtScore.” The node childTag0 also has a child, the text “San Francisco.” And the node childTag1 has a child, the text “7.” In one embodiment, the extraction processor 112 stores the location of the selection based on the name of the first childTag, “txtHomeTeam.”
The extraction processor 112 then pauses for a specified period of time. After pausing, the extraction processor 112 retrieves the updated page. For example, the extraction processor 112 may execute code such as:
The read method of the InputStream object can then be used to retrieve the text of the updated page. The extraction processor 112 can then search the content of the page and extract updated information. Various other implementations may be used by embodiments of the present invention.
The extraction processor 112 then retrieves the user selection in the updated page based on the name. To retrieve the location of the node childTag0 by name, an extraction processor 112 according to one embodiment of the present invention may execute code similar to the following:
pageLocation = document.getElementById("txtHomeTeam");
To then extract the name and score, the extraction processor may execute the following:
After this code has executed, the value of the teamName variable is equal to “San Francisco” and the value of the Score variable is equal to “7.” The extraction processor 112 can then use this data to update the display window. Alternatively, the extraction processor 112 may simply extract the HTML itself for display in the display window.
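Name-based retrieval of this kind can be sketched without a browser by modeling the five-node structure as plain objects. The object layout and the helpers `findByName` and `textOf` are hypothetical stand-ins for DOM calls, chosen so that the example is self-contained.

```javascript
// Plain-object stand-in for the five-node structure described above.
const scorePage = {
  tag: "parentTag",
  children: [
    { tag: "childTag0", name: "txtHomeTeam", children: [{ text: "San Francisco" }] },
    { tag: "childTag1", name: "txtScore", children: [{ text: "7" }] },
  ],
};

// Depth-first search for the first node whose name matches; null if none does.
function findByName(node, name) {
  if (node.name === name) return node;
  for (const child of node.children || []) {
    const found = findByName(child, name);
    if (found) return found;
  }
  return null;
}

// Concatenates the text of a node and all of its descendants.
function textOf(node) {
  if (node === null) return "";
  const ownText = node.text || "";
  return ownText + (node.children || []).map((child) => textOf(child)).join("");
}

const homeTeamText = textOf(findByName(scorePage, "txtHomeTeam"));
const scoreText = textOf(findByName(scorePage, "txtScore"));
console.log(homeTeamText, scoreText); // San Francisco 7
```

In a browser, the corresponding lookups would typically go through `document.getElementById` and the `textContent` property; note that `getElementById` matches an element's `id` attribute, so the names used here would need to be present as ids.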
Various methods may be implemented to ensure that the correct updated content is displayed to the user. For example, in one embodiment, the context, e.g., content near the updated content, is utilized to ensure that the correct updated content is displayed to the user.
In the embodiment shown in
The DOM tree shown in
The extraction processor (112) next receives an updated web page 909. The extraction processor (112) retrieves information in the updated web page at the structure location previously stored in memory 910. In other words, the extraction processor (112) looks for updated information in the updated web page at the same location at which the extraction processor (112) found the original content selected by the user in the original page.
The extraction processor (112) then identifies the context of the information retrieved in the updated web page 912. In the example described above in relation to
If the context of the original information is similar to the updated context 913, then the extraction processor (112) displays the updated information 916. The process illustrated in blocks 909-916 is repeated periodically as the web page is refreshed. If the context is not similar 913, the extraction processor (112) retrieves information that is near the stored structure location 914. Information near the stored structure location may be defined in various ways. For example, the extraction processor (112) may retrieve information that shares the parent of the selected content within the DOM tree or may be adjacent to the selected content within the DOM tree.
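The fallback search can be sketched for the case in which "near" means sibling rows under the same parent. The helper `locateUpdatedRow`, the modeling of rows as arrays of cell strings, and the similarity test based on the team-name cell are hypothetical choices for this example.

```javascript
// Hypothetical helper: updatedRows is an array of rows, each an array of cell strings.
// Checks the stored index first, then sibling rows at increasing distance from it.
// Returns the index of the first similar row, or -1 if no row is similar.
function locateUpdatedRow(updatedRows, storedIndex, originalRow, isSimilar) {
  const maxDistance = Math.max(storedIndex, updatedRows.length);
  for (let distance = 0; distance <= maxDistance; distance++) {
    const candidates =
      distance === 0 ? [storedIndex] : [storedIndex - distance, storedIndex + distance];
    for (const index of candidates) {
      const candidate = updatedRows[index];
      if (candidate !== undefined && isSimilar(originalRow, candidate)) {
        return index;
      }
    }
  }
  return -1;
}

// Assumed context test: the team-name cell must be unchanged.
const sameTeam = (originalRow, candidate) => originalRow[0] === candidate[0];

// The original selection was the Pittsburgh row, stored at index 1.
const originalSelection = ["Pittsburgh", "2"];
// In the updated page a new game was inserted at the top, shifting the rows down.
const refreshedRows = [
  ["Chicago", "1"],
  ["San Francisco", "7"],
  ["Pittsburgh", "3"],
];

const foundIndex = locateUpdatedRow(refreshedRows, 1, originalSelection, sameTeam);
console.log(foundIndex, refreshedRows[foundIndex]);
```

Searching outward from the stored location reflects the expectation that template-generated pages shift content by small amounts, so the nearest similar sibling is the most likely match.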
The extraction processor (112) then retrieves the context of the newly retrieved information 912 and compares the context of the newly retrieved information with the context of the originally selected information 913. The extraction processor continues repeating these processes until the updated content is found. In the example described above in relation to
In the embodiment shown in
The foregoing description of the embodiments, including preferred embodiments, of the invention has been presented only for the purpose of illustration and description and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Numerous modifications and adaptations thereof will be apparent to those skilled in the art without departing from the spirit and scope of the present invention.