1. Technical Field
The present invention generally relates to network data extraction and data tabulation and, in particular, to a process for generating data tables from web content pages in real time.
2. Background
The Internet remains a valuable source of information for various needs. A large volume of data is accessible to the user who wishes to conduct research, query multiple databases and websites, and download data of interest. Such data is most usefully aggregated for presentation in summarized tabular form. Although this can be done manually by the user by opening and populating a spreadsheet, for example, such an approach may become very tedious if the amount of data is large. What is needed is a method for automatically converting downloaded data into tabular form by a process which remains under control of the user.
In one aspect of the present invention, a computer implemented method for acquiring specified web-based data in a tabular format, the method comprising the steps of: performing a web searching operation to acquire web pages containing predefined data; and placing the predefined data into columns of a structural table to form a modifiable table, the characteristics and positions of the modifiable table columns being subsequently determined by a user.
In another aspect of the present invention, a computer readable storage medium having embodied thereon a program, the program being executable by a processor to perform a method for acquiring specified web-based data in a tabular format, the method comprising: performing a web searching operation to acquire web pages having predefined data; and placing the predefined data into columns of a modifiable table, the characteristics and positions of the table columns being determined by a user.
In still another aspect of the present invention, a device suitable for acquiring specified web-based data and converting to a modifiable table, the device comprising: means for acquiring web-based data; a memory for storing a tabulation application that, when executed, functions to convert the acquired web-based data into a modifiable table; and a display for displaying at least one of the web-based data and the modifiable table to a user.
The disclosed invention provides a device and method for generating data tables from web content pages in real time, where either a user can cluster selected web pages, or the device can assemble the cluster. Once the data are in a cluster, the user or the device can convert the data into tabular data. The format of the generated table is a function of the type of data retrieved from the web pages. For example, changes made to the cluster automatically change the corresponding table. If new data indicates a new, different column in the table, the additional column is automatically incorporated into the table. The disclosed device and method function to find similarity among the web pages and produce a user-modifiable table based on such similar attributes.
There is shown in
The processing unit 110 may include a processor 140 operating to execute a tabulation application 150 resident in a memory 155. The tabulation application 150 may be implemented as a program, software, code, or other instructions stored in the memory 155. Alternatively, the memory 155 and the tabulation application 150 may be provided as a single component, as a firmware chip (not shown), for example. A removable memory 160 and a network port 165 may be provided in the computer 100 for inputting data and software. The network port 165 may provide for an Ethernet connection as shown, for example, or may be a wireless port (not shown). The network port 165 may thus be used to communicate with any communication network such as the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), an intranet, an extranet, a private network, a public network, one or more mobile device networks, a combination of these networks, or other communication network.
The tabulation application 150 functions to convert data extracted from a plurality of web pages into tabular form, in accordance with an aspect of the present invention. That is, the tabulation application 150 makes “suggestions,” and the human user assists in the table modification process. As shown in a generalized flow diagram 200 in
For example, if the user has accessed a website offering books for sale, the tabulation application 150 may induce an extractor that extracts from a specified book web page: the title of the specified book, the price for the specified book, whether the book is hardcover or softcover, the number of pages in the book, and the ISBN for the book.
The extracted data is used by the tabulation application 150 to generate one or more tables 185, at step 230. The tables include all the data fields extracted from the similarly-formatted web pages. A graphics module 170 enables the user to view in the user display 135 a downloaded cluster 175 of these web pages 180 and the one or more tables 185 generated from the data extracted from the web pages 180, as described in greater detail below. If the one or more generated tables 185 are acceptable to the user, at decision block 240, the tabulation application 150 may pause or stop, at step 250.
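The table generation at step 230 can be illustrated with a minimal sketch, assuming each web page has already been reduced to a dictionary of extracted fields. The function name `build_table` and the field names are hypothetical and are used only for illustration. The sketch shows one way a table may come to include every data field found on the pages, with a new column being added when a later page supplies a field not seen before.

```python
def build_table(records):
    """Build (columns, rows) from per-page field dictionaries.

    Hypothetical helper: columns are the union of all field names, in
    the order in which they are first seen. Missing values become "".
    """
    columns = []
    for record in records:
        for field in record:
            if field not in columns:
                columns.append(field)  # a new field yields a new column
    rows = [[record.get(col, "") for col in columns] for record in records]
    return columns, rows


# Illustrative data only; not taken from any real web site.
pages = [
    {"title": "Book A", "price": "12.99"},
    {"title": "Book B", "price": "7.05", "isbn": "0000000000"},
]
columns, rows = build_table(pages)
```

In this sketch the first row receives an empty cell for the column that only the second page provides.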
However, if any of the one or more generated tables 185 are not acceptable to the user, at decision block 240, the user provides formatting feedback to the tabulation application 150, at step 260, and the process returns to step 220. For example, the extractor may have extracted pricing data for a plurality of book web pages, in the example provided above, and created two separate cost fields, one cost field comprising dollar amounts and the other cost field comprising cents.
This table may not be acceptable to a user who prefers a single cost field containing both dollars and cents separated by a decimal point. Accordingly, the user collaborates with the tabulation application 150 by giving feedback in the form of modifications to the one or more tables. As described in greater detail below, the tabulation application 150 responds by re-learning the extractor such that the extractor subsequently operates to regenerate the one or more tables in accordance with the preferences of the user. This feedback process may include one or more cycles of providing formatting feedback and re-learning the extractor with the tabulation application 150.
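The dollars-and-cents example can be sketched as follows. This is only an illustration of the resulting table change, not of the extractor re-learning itself. The function name `merge_cost_fields` and the keys `cost_dollars`, `cost_cents`, and `cost` are assumptions introduced for the sketch.

```python
def merge_cost_fields(rows, dollars_key="cost_dollars",
                      cents_key="cost_cents", merged_key="cost"):
    """Replace separate dollar and cent fields with one cost field.

    Hypothetical helper: each row is a dict, and the merged value is a
    string such as "12.99" with cents padded to two digits.
    """
    merged_rows = []
    for row in rows:
        # Keep every field except the two that are being merged.
        new_row = {k: v for k, v in row.items()
                   if k not in (dollars_key, cents_key)}
        dollars = int(row[dollars_key])
        cents = int(row[cents_key])
        new_row[merged_key] = f"{dollars}.{cents:02d}"
        merged_rows.append(new_row)
    return merged_rows
```

Applied to rows that hold `"12"` and `"99"` in the two cost fields, the sketch yields a single `cost` value of `"12.99"`.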
Relevant web pages 180 are retrieved, analyzed for page content, and aggregated into one or more clusters 175 of web pages 180, at step 310, preferably so that segments from similar relational columns are grouped together. The clustering operation is typically performed by grouping similar web pages 180 together for presentation to the user. In an exemplary embodiment, the web page capture, data extraction, and text-segment clustering can be performed as described in commonly-assigned patent application publication US 2008/0114800 “Method and system for automatically extracting data from websites,” incorporated herein by reference in its entirety.
Generally, site extraction at step 305 can be performed by discovering low-level structure; clustering pages and text segments to find a consistent global structure; and finding the relational form of the data from page and text-segment clusters. The discovery process may begin by first spidering a set of HTML web pages 420 in a web site 410 having data for extraction, such as the web pages 420 shown in
In an exemplary embodiment, one or more of the software experts 430 may use URL patterns, list structures, templates, and page layouts that can provide clues about groups of pages having similar types of data, for performing data extraction and clustering. The software experts 430 find substructures and output page hints 440 and data hints 450 to indicate the similarities and dissimilarities between items (i.e., pages or text segments). Each heterogeneous expert 430 may be configured to focus on a particular type of structure and work independently from other experts 430 to examine URL patterns on the web site 410.
A “URL” software expert 430, for example, may be helpful for identifying the web pages 420 that should go into a page cluster 460 and may thereby generate page-hints for pairs of pages whose URLs are similar. The URL software expert 430 typically computes the similarity of the URLs of two web pages 420 based on the length of the longest common subsequence of characters. It is appreciated in the relevant art that web pages 420 that contain the same type of data are usually generated by filling an HTML template with data values.
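A minimal sketch of a URL similarity measure based on the longest common subsequence of characters is given below. The description above does not state how the subsequence length is converted into a score, so the division by the length of the longer URL is an assumption of this sketch. The function names `lcs_length` and `url_similarity` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def url_similarity(url_a, url_b):
    """Score in [0, 1]; the normalization is an assumption of this sketch."""
    longest = max(len(url_a), len(url_b))
    if longest == 0:
        return 1.0
    return lcs_length(url_a, url_b) / longest
```

Under this sketch, two URLs that differ only in a trailing identifier score higher than two URLs that share only a host name, which is the kind of signal a page hint could be built on.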
A “list structure” software expert 430 may operate by searching for repeating patterns in a Document Object Model (DOM) structure within each web page 420, particularly when the DOM structure is well-formed and reflects the structure of the underlying data. For web pages 420 in which special characters are used to format lists, rather than using HTML formatting tags, the list structure software expert 430 may not function as well as another software expert.
A “template” software expert 430 may be used to search for, or otherwise identify, token sequences that are common across pages. Token hints may be generated for such sequences whereby token sequences on the HTML web pages 420 can be arranged into a table cluster 470, so that eventually each table cluster 470 contains the data in a column of one of the underlying tables. The template expert 430 is more effective for identifying simple template structure shared by multiple web pages 420, and less effective for execution with a web site 410 that contains one or more web pages 420 not generated by the same grammar as other web pages 420. The template expert 430 typically determines the similarity of two pages by comparing the longest common sequence of tokens to the length of the web pages 420 of interest. The longer the sequence, the more likely the web pages 420 are to be placed into the same cluster.
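The comparison of the longest common token sequence to page length can be sketched as follows. The tokenizer, the use of a contiguous run of tokens, and the division by the token count of the longer page are all assumptions of this sketch. The names `tokenize` and `template_similarity` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split HTML into tag tokens and whitespace-separated text tokens."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def template_similarity(html_a, html_b):
    """Longest common contiguous token run, relative to the longer page."""
    a, b = tokenize(html_a), tokenize(html_b)
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b))
```

Pages filled from the same HTML template tend to share long runs of tag tokens, so under this sketch they receive a higher score than pages with unrelated markup.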
A “layout” software expert 430 may use the visual representation of a web page 420, which reflects the structure of the data of interest. The layout expert 430 may find DOM nodes that are aligned in vertical columns in the display 135, and can generate token hints for the token sequences represented by these nodes. The layout expert 430 typically analyzes the visual appearance of vertical columns on the page. To accomplish this analysis, the layout expert 430 may generate a histogram of the counts of HTML elements that are positioned at each x-coordinate on the display 135. The similarity of these generated histograms is a good indicator that the relevant web pages 420 are of the same page type. However, it may be more difficult to ascertain the similarity of two web pages 420 when a first web page 420 contains a short list of items, and a second web page 420 contains a long list of items.
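The histogram comparison can be sketched as follows, assuming the x-coordinates of the rendered HTML elements have already been obtained from a rendering step that is not shown. The bucket width and the use of cosine similarity are assumptions of this sketch, and the names `x_histogram` and `histogram_similarity` are hypothetical.

```python
import math
from collections import Counter


def x_histogram(x_coords, bucket=10):
    """Count elements per x-coordinate bucket (bucket width is assumed)."""
    return Counter((x // bucket) * bucket for x in x_coords)


def histogram_similarity(h1, h2):
    """Cosine similarity of two histograms; 0.0 if either is empty."""
    dot = sum(h1[k] * h2[k] for k in h1.keys() & h2.keys())
    norm1 = math.sqrt(sum(v * v for v in h1.values()))
    norm2 = math.sqrt(sum(v * v for v in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Cosine similarity ignores the overall number of elements, but it still changes when a long list alters the proportion of elements in one column, which is consistent with the difficulty noted above for short and long lists.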
After the software experts 430 have analyzed the input web pages 420, the operation may sometimes result in conflicting hints. To avoid complicating the clustering process, a probabilistic approach may be employed that provides a flexible framework for combining multiple hints in a principled way. In particular, a generative probabilistic model may be employed that assigns probabilities to hints (both token hints and page hints) given a clustering. This in turn enables searches for clusterings that maximize the probability of observing the generated page hints 440 and data hints 450.
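The idea of scoring a clustering by the probability of the observed hints can be illustrated with a deliberately simplified sketch. It is not the generative model referred to above. It assumes that each page hint is a triple of two page names and a flag saying whether an expert judged them similar, that hints are independent, and that the two probabilities `p_same` and `p_diff` are fixed constants. All names and values are hypothetical.

```python
import math


def log_likelihood(assignment, hints, p_same=0.9, p_diff=0.2):
    """Log-probability of the hints given a page-to-cluster assignment.

    p_same: assumed chance of a "similar" hint for pages in one cluster.
    p_diff: assumed chance of a "similar" hint for pages in different clusters.
    """
    total = 0.0
    for page_a, page_b, says_similar in hints:
        same_cluster = assignment[page_a] == assignment[page_b]
        p_similar = p_same if same_cluster else p_diff
        total += math.log(p_similar if says_similar else 1.0 - p_similar)
    return total


def best_clustering(candidates, hints):
    """Pick the candidate assignment with the highest log-likelihood."""
    return max(candidates, key=lambda c: log_likelihood(c, hints))
```

In this sketch, a single dissenting hint lowers the score of a clustering without ruling it out, so conflicting hints from different experts can be weighed against one another.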
Referring again to the flow chart 300 of
Alternatively, the user may provide sample URLs, at step 330, so as to direct the process of capturing web pages 180. The tabulation application 150 may generate one or more tables, based on the structure of the web pages 180 downloaded in step 330.
If the modified table is acceptable to the user, at decision block 325, a table agent may be created, based on the modified table properties, and additional content may be harvested, at step 340. If the modified table is not acceptable to the user, at decision block 325, the tabulation application 150 may re-learn the extractor, at step 345, after the user provides collaborative feedback and guidance. The one or more tables are regenerated, at step 350, in accordance with the re-learned user preferences, and the user again determines whether the new tables are acceptable, at decision block 325.
When the user has retrieved a plurality of web pages 180 of particular interest, the user may select two or more of the relevant web pages of greatest interest in
As can be appreciated by one skilled in the relevant art, web pages typically include HTML coding for page formatting and presentation. The tabulation application 150 may function to use this HTML coding to identify fields, lists, and columns in the web pages 510, 520, and 530, or to create columns from the web pages 510, 520, and 530, for placement into a tabular format.
In an exemplary embodiment, the tabulation application 150 can propose a particular extractor, and may attempt to find landmarks and slots in the web page similar to what interests the user. A landmark may identify data fields and a slot may be a data field on a page. This procedure can begin with the acquisition of one web page, and then expand or contract the page columns as more web page data types are identified. Or, a predetermined number of web pages can be clustered, and then the most common attributes can be appropriated for the suggested table, as explained in greater detail below.
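The relationship between landmarks and slots can be sketched as follows, assuming that a slot is the text found between a leading landmark string and a trailing landmark string. This is only one possible reading of the terms, and the function name `extract_slot` and the sample markup are hypothetical.

```python
def extract_slot(page, before, after):
    """Return the text between two landmark strings, or None if absent."""
    start = page.find(before)
    if start == -1:
        return None
    start += len(before)
    end = page.find(after, start)
    if end == -1:
        return None
    return page[start:end].strip()


# Illustrative markup only; not taken from any real web site.
page = "<h1>Book A</h1><span class='price'>$12.99</span>"
title = extract_slot(page, "<h1>", "</h1>")
price = extract_slot(page, "<span class='price'>", "</span>")
```

Because the same landmarks recur on pages generated from one template, the same pair of strings can be reused to fill the corresponding column for each page in a cluster.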
In the diagrammatical example of
In response to the web page capture and selection, the tabulation application 150 automatically generates one or more initial, or proposed, structural tables, such as at steps 315 and 335 in
In the example provided, the similar field data F1a, F1b, and F1c have been placed into the first field data column F1 by the tabulation application 150, the similar field data F2a, F4b, and F2c have been placed into the second field data column F2, and the similar field data F3a, F3b, and F3c have been placed into the third field data column F3. Each of the list data L1a and L1b has been placed into a respective one of the list tables 620 and 630. Each of the list tables 620 and 630 comprises one or more fields. In the example provided, the list table 620 occurs on URL-a and URL-b, and includes three rows of two fields, labeled as column F4 and column F5. The list table 630 occurs on URL-c and includes three fields, labeled as column F6, column F7, and column F8.
If the format of the table set 600 is acceptable to the user, at decision block 240 in
Operation of a “markup” command is illustrated in
The user can also specify whether “hidden” data should appear in the table set 600, or if the data should remain hidden. As used herein, the term “hidden” refers to data that may not be visible on the web page 180, but is present in the HTML of the web page 180. In an exemplary embodiment, the table set 600 may have the option of showing only visible data, visible data with links, or all data. The tabulation application 150 thus automatically “learns” the table format preferred by the user. The tabulation application 150 then continues to capture additional, similar web pages and extract tabular data for placement into the structural table 98 formatted in accordance with the “learned” user preferences.
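The three display options can be sketched with the standard-library HTML parser. The sketch treats text nodes as visible data, `href` attribute values as links, and all other attribute values as hidden data. That mapping is an assumption, as are the mode names and the class name `FieldCollector`. The sketch does not handle script or style content.

```python
from html.parser import HTMLParser


class FieldCollector(HTMLParser):
    """Collect values from HTML according to a display mode.

    Modes (hypothetical names): "visible", "visible_with_links", "all".
    """

    def __init__(self, mode="visible"):
        super().__init__()
        self.mode = mode
        self.values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "href":
                if self.mode in ("visible_with_links", "all"):
                    self.values.append(value)
            elif self.mode == "all":
                self.values.append(value)  # hidden attribute data

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.values.append(text)  # visible text


def collect(html, mode="visible"):
    collector = FieldCollector(mode)
    collector.feed(html)
    collector.close()
    return collector.values
```

Switching the mode changes only which values are surfaced for the table; the underlying page is parsed the same way in each case.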
An exemplary embodiment of the process for generating modifiable data tables from web content pages in real time may be described with reference to the plurality of screen shots provided in
The user may “click” on the “send” button 940 to initiate a process by which the tabulation application 150 extracts relevant data identified by the HTML coding and begins populating a structural table 1200, as shown in
The exemplary embodiment of the process for generating data tables from web content pages in real time may be further described with reference to a flow chart 1300 provided in
Commands available to the user include, but are not limited to: a “Merge right/left” command that combines a slot (i.e., data field) to the right/left with a current slot; an “Expand right/left” command that expands the current slot one token to the right/left; a “Delete” command that hides the current slot; a “Name” command that names the column; a “markup” command that allows the user to change the data contents of a table cell and then enable the tabulation application 150 to re-induce the extractor for the respective data column; and a “Filter HTML” command that removes HTML text from a slot.
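Several of these commands can be sketched as operations on a simple in-memory table. The sketch covers only “Merge right”, “Delete”, “Name”, and “Filter HTML”; it does not model token-level expansion or re-induction of the extractor. The class name `EditableTable`, its method names, and the separator argument are hypothetical.

```python
import re


class EditableTable:
    """Minimal table supporting a few of the user commands."""

    def __init__(self, names, rows):
        self.names = list(names)
        self.rows = [list(row) for row in rows]
        self.hidden = [False] * len(self.names)

    def merge_right(self, col, sep=""):
        """Combine the slot to the right with the current slot."""
        for row in self.rows:
            row[col] = f"{row[col]}{sep}{row[col + 1]}"
            del row[col + 1]
        del self.names[col + 1]
        del self.hidden[col + 1]

    def delete(self, col):
        """Hide the column; its data is kept, as the command only hides."""
        self.hidden[col] = True

    def name(self, col, new_name):
        self.names[col] = new_name

    def filter_html(self, col):
        """Remove HTML tags from every cell in the column."""
        for row in self.rows:
            row[col] = re.sub(r"<[^>]+>", "", row[col])

    def visible(self):
        keep = [i for i, h in enumerate(self.hidden) if not h]
        return ([self.names[i] for i in keep],
                [[row[i] for i in keep] for row in self.rows])
```

For example, merging a dollars column with the cents column to its right, using `"."` as the separator, yields a single cost column that can then be renamed.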
At step 1330 in
The user has next elected to modify the headings for the columns 1220, 1230, and 1265, at step 1340 in
In an exemplary alternative embodiment, the user can also format the data within a column to suit his preferences, and then save the format of the resulting table. The user may select a cell in the column intended for modification, cut and paste into the cell data in the display format selected by the user, and the corresponding display changes are made down the column by the tabulation application 150. The user may select a desired display format, make the change to a cell, and initiate the change in display to the rest of the cells in the column.
The components shown in
The mass storage device 2030, which may be implemented with a magnetic disk drive or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by the processor unit 2010. The mass storage device 2030 can store the system software for implementing embodiments of the present invention, for purposes of loading the system software into the main memory 2020.
The portable storage device 2040 operates in conjunction with a portable non-volatile storage medium (not shown), such as a floppy disk, a compact disk (CD), or a digital versatile disc (DVD), to input and output data and code to and from the computer system 2000. The system software for implementing embodiments of the present invention may be stored on such a portable medium and input to the computer system 2000 via the portable storage device 2040.
The input devices 2070 provide a portion of a user interface, and may include an alpha-numeric keypad, such as a keyboard, for inputting alpha-numeric and other information, or a pointing device, such as a mouse, a trackball, stylus, or cursor direction keys. As noted above, the computer system 2000 may comprise one or more output devices 2060. Exemplary output devices include speakers, printers, network interfaces, and monitors.
The display system 2080 may include a liquid crystal display (LCD), a plasma display, or other suitable display device (not shown). The display system 2080 receives textual and graphical information from the system software, which processes the information for output to the display device.
The peripherals 2090 may include any type of computer support device that adds functionality to the computer system 2000. The peripherals 2090 may include a modem or a router, for example.
The components contained in the computer system 2000 are those typically found in computer systems that may be suitable for use with embodiments of the present invention and are intended to represent a broad category of such computer components that are well known in the art. Thus, the computer system 2000 can comprise a personal computer, a hand held computing device, a cellular telephone, a personal digital assistant (PDA), a mobile computing device, a workstation, a server, a minicomputer, a mainframe computer, or any other computing device. The computer system 2000 can also include different bus configurations, networked platforms, multi-processor platforms, etc. Various operating systems can be used including Unix, Linux, Windows, Macintosh OS, Palm OS, and other suitable operating systems.
The above description is illustrative and not restrictive. Many variations will become apparent to those of skill in the art upon review of this disclosure. The scope should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents.