The Internet contains vast amounts of information from which users can find relevant content. However, the relevant content is generally embedded in HTML pages which are designed to be rendered for viewing. As a result, extracting the content of a webpage for other purposes (e.g., to create an RSS feed, to insert appropriately-formatted data into a spreadsheet, and so forth) is somewhat difficult.
A wrapper may be defined as a software program that extracts desired information from a source page and transforms it into a structured format. Because of the numerous ways in which source data may be structured and reformatted as desired, and because pages and formats change over time, various types of wrappers exist. Manually coding such wrappers is difficult, and thus automatic methods, referred to as wrapper induction, are often used to develop wrappers for webpages.
Automatic wrapper induction methods can generally be divided into two types of approaches, based on how the wrappers are generated. One type of approach generates wrappers without pre-labeled training samples. In practice, however, these methods are difficult to apply to commercial systems that demand high extraction accuracy and performance.
The other type of approach uses manually labeled examples as training data. For manually labeled training, a user defines an extraction schema and labels a set of training pages with the schema. Then, the labeled training pages are fed into a wrapper-induction system that generates one or more wrappers. When a new page is acquired, a wrapper is selected to extract data and fit the extracted data into the pre-defined schema.
While such training-based wrapper generation methods can achieve desirable extraction performance, they can be problematic, as labeling is costly and error-prone. There is thus a tradeoff between high extraction accuracy and labeling costs.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which a wrapper tool uses existing wrappers to automatically assign labels to records of a webpage when an existing wrapper corresponds to that record. For unlabeled records, the tool provides a user interface to label those records, and updates the set of existing wrappers with a new wrapper that is generated based upon the labeling operation.
The tool thus generates wrappers for individual records of a webpage, that is, at the record-level, rather than a wrapper for the entire webpage. Further, the technology employs a labeling strategy, referred to as wrapper-assisted labeling, that utilizes previously generated wrappers to label additional records of a partially labeled page, or some or all of the records of a wholly new page, when applicable.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards a labeling tool that generates wrappers for records of a webpage, that is, at the record-level (in contrast to conventional wrapper induction methods that operate at the page-level). Further, the technology employs a labeling strategy, referred to as wrapper-assisted labeling, that utilizes previously generated wrappers to label additional records of a partially labeled page, or some or all of the records of a wholly new page.
It should be understood that any of the examples herein are non-limiting. Indeed, a particular implementation having a labeling tool with a certain user interface is described; however, this is only one example. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and webpage processing technology in general.
Turning to
As described herein, instead of labeling via wrappers at the page level, record-level wrappers are used, which avoid repetitive labeling on the same page of records having identical or very similar structures. To this end, a Document Object Model (“DOM”) tree 106 is used to model a webpage in a known manner. However, as described herein, the wrappers are trained at a record level instead of generating page-level wrappers. More specifically, when a user finishes labeling one record of a page that is being labeled 108 (via a user interface 110 as described below), the tool 102 and its logic 103 extract a sub-DOM tree (e.g., S1, S2 or S3 in
Thus, when a generated wrapper is used to extract data, the wrapper does so at the record level, instead of the whole page. In other words, a wrapper is able to match a sub-DOM tree corresponding to a record of a page, then extract data from the matched sub-tree. When multiple wrappers match the same sub-tree, the tool 102 automatically selects a wrapper with the best/closest match (minimal distance to matched sub-tree) to perform the extraction.
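The selection of the closest-matching wrapper may be illustrated by the following minimal sketch. The description does not specify how the distance between a wrapper and a sub-tree is computed, so the sketch assumes a simple measure based on comparing root-to-node tag paths. The `(tag, children)` tree model, the `Wrapper` class, the `distance` function and the `max_distance` threshold are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Assumption: a sub-DOM tree is modeled as a (tag, [children]) pair.


def flatten(node, prefix=""):
    """Return the root-to-node tag paths of a tree in document order."""
    tag, children = node
    path = f"{prefix}/{tag}"
    paths = [path]
    for child in children:
        paths.extend(flatten(child, path))
    return paths


@dataclass
class Wrapper:
    name: str
    template: tuple  # the sub-DOM tree this (hypothetical) wrapper was induced from


def distance(wrapper, subtree):
    """Assumed distance: 0.0 for identical path lists, 1.0 for nothing in common."""
    a, b = flatten(wrapper.template), flatten(subtree)
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def select_wrapper(wrappers, subtree, max_distance=0.5):
    """Among wrappers that match within max_distance, pick the closest one."""
    scored = [(distance(w, subtree), w) for w in wrappers]
    matching = [(d, w) for d, w in scored if d <= max_distance]
    if not matching:
        return None  # no applicable wrapper; the record needs manual labeling
    return min(matching, key=lambda pair: pair[0])[1]


# Example: two wrappers match the record, and the closer one is chosen.
record = ("div", [("h2", []), ("span", []), ("img", [])])
candidates = [
    Wrapper("near", ("div", [("h2", []), ("span", [])])),
    Wrapper("exact", ("div", [("h2", []), ("span", []), ("img", [])])),
]
best = select_wrapper(candidates, record)
```

In this sketch a wrapper whose template is too far from the sub-tree is treated as not matching at all, which corresponds to the case in which no applicable wrapper exists for a record.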
Turning to another aspect, wrapper-assisted labeling uses previously generated wrappers to predict labels of similar records. This is accomplished by performing data extraction using existing wrappers 112, including any of those previously generated while labeling the current page. Wrapper-assisted labeling is advantageous because it can avoid repeatedly labeling records with the same or similar internal structure, and thus reduces the overall labeling effort.
The amount of labeling effort that record-level wrapper-assisted labeling can save can be formalized by considering the labeling of a group of pages from the same template, in which the total number of records is Nr. According to the records' internal structures, all Nr records can be virtually partitioned into Ns subsets. Records in each subset have the same internal structure. Then, with a wrapper-assisted labeling strategy, the user only needs to label one representative record for each subset, and use the wrapper that is generated from that labeling operation to automatically assign labels for the rest of the records. As a result, the maximal number of records that need to be labeled is Ns. In general, Ns<Nr; for example, statistical analysis of seven well-known online shopping websites shows that the ratio of Ns to Nr is approximately 1.00 to 9.58.
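The relationship between Ns and Nr may be illustrated by the following minimal sketch, which partitions records by internal structure and counts the subsets. The `(tag, children)` record model and the function names are assumptions made for illustration, and "same internal structure" is taken here to mean an identical tree of tags.

```python
from collections import defaultdict

# Assumption: each record is a sub-DOM tree modeled as (tag, [children]).


def structure_signature(node):
    """Hashable summary of a record's internal structure (tags only, no text)."""
    tag, children = node
    return (tag, tuple(structure_signature(child) for child in children))


def partition_by_structure(records):
    """Group record indices into subsets that share the same internal structure."""
    subsets = defaultdict(list)
    for index, record in enumerate(records):
        subsets[structure_signature(record)].append(index)
    return list(subsets.values())


def labeling_effort(records):
    """Return (Ns, Nr): records to label manually at most, and records in total."""
    n_s = len(partition_by_structure(records))
    n_r = len(records)
    return n_s, n_r
```

With five records of which three share one structure and two share another, `labeling_effort` returns `(2, 5)`: at most two records need manual labels, and the wrappers generated from them can label the remaining three.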
To help describe the operation of the tool 102,
As described above, given a new page, the labeling process may use existing wrappers by detecting applicable wrappers. More particularly, as represented by step 204 of
In the event that no applicable wrappers exist for a given record, the user needs to assign labels (as detected by step 206) to each unlabeled record. In the labeling mode of the tool, assigning labels to a record can be accomplished with relatively few mouse clicks, as illustrated in
As shown in
Also shown in the menu 336 is an identifier (ID “0”) before each attribute name for the selected record, which is used to align labeled attributes with a record ID. The ID distinguishes each record from the others. An ID controller mechanism 338 (e.g., text box and up and down arrows) allows selection of a record ID to label. When finished labeling a record, the user can click on the Plus (“+”) button 340 at the right of the ID controller mechanism 338. In this example, when clicked, the system advances to label the next record, e.g., the ID prefix will change from 0 to 1 in the menu 336 and in the mechanism 338. In addition, for already labeled (e.g., automatically labeled) records, to correct a record such as a record having ID 6, the user may input a 6 (e.g., by typing or using the up and down buttons) in the text box of the ID controller mechanism 338, which will also change the ID prefix shown in the menu 336.
After a single record is labeled, it can be used to generate or update the underlying wrappers. This process is automatically done by the tool 102 when a user clicks “Update Wrappers” 342. This is represented in
As described above, the wrapper-assisted labeling strategy uses any previously generated wrappers to predict labels of similar records within a webpage, as represented by step 210. Referring to the examples of
In many cases, records in the same page are very similar, even to the point of being identical in terms of internal DOM tree structure. Therefore, labeling as little as one record can generate a wrapper, which can in turn automatically label the remaining similar records without further manual labor. In general, only relatively few records need to be labeled to generate the wrappers needed to complete the labeling of the other records of a webpage.
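The way a single labeled record can label its siblings may be illustrated by the following minimal sketch. It assumes records modeled as `(tag, text, children)` triples and a wrapper that stores, for each attribute, the child-index path to the labeled node. The wrapper here applies only to records with an identical tag structure, which is a simplification of the "same or similar" matching described above. All function names are hypothetical.

```python
# Assumption: a record is a sub-DOM tree modeled as (tag, text, [children]).


def signature(node):
    """Tag-only structure of a record, ignoring text content."""
    tag, _text, children = node
    return (tag, tuple(signature(child) for child in children))


def text_at(node, path):
    """Follow a path of child indices and return the text of the node reached."""
    for index in path:
        node = node[2][index]
    return node[1]


def induce_wrapper(record, labels):
    """Build a wrapper from one labeled record; labels maps attribute -> path."""
    return {"signature": signature(record), "paths": dict(labels)}


def apply_wrapper(wrapper, record):
    """Return predicted labels, or None if the wrapper does not match the record."""
    if signature(record) != wrapper["signature"]:
        return None
    return {attr: text_at(record, path) for attr, path in wrapper["paths"].items()}


def assist_labeling(records, wrappers):
    """Label every record some wrapper matches; report the rest as unlabeled."""
    labeled, unlabeled = {}, []
    for index, record in enumerate(records):
        for wrapper in wrappers:
            predicted = apply_wrapper(wrapper, record)
            if predicted is not None:
                labeled[index] = predicted
                break
        else:
            unlabeled.append(index)
    return labeled, unlabeled
```

For example, labeling the name and price nodes of one product record and calling `induce_wrapper` yields a wrapper that `assist_labeling` can apply to every other record with the same structure, leaving only structurally different records for manual labeling.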
However, particularly in the early stages of labeling, the existing wrappers may not be able to correctly predict all labels. Some records may have wrongly assigned labels, or some of the attributes of a record may not be labeled, that is, the record is not completely labeled. For example, in
Steps 212 and 214 allow the modification of wrong or incomplete labels assigned by the underlying wrappers. In general, the process of modifying labels of a record is very similar to that of assigning new labels. Step 216 allows any remaining unlabeled records to be labeled (by returning to step 206).
Step 218 represents determining whether the pages of the set have been labeled. If not, the process returns to step 202 to label the next page.
When the pages have been labeled, any new wrappers that have been generated are ready to be used in conjunction with previously existing wrappers, e.g., to label another set of webpages. The labeled records may be used as desired, e.g., exported to a program or other entity (such as a storage medium).
As can be seen, the labeling task for one page can be accomplished by labeling a new record, updating one or more wrappers, applying the wrappers to other records if possible, revising labels as needed, and repeating until complete. Any generated wrappers may then be used in labeling the next page and so on. Once labeling is complete, these and/or other pages may have their data extracted via the wrappers.
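The overall loop summarized above may be sketched as follows, under simplifying assumptions. Records are modeled as `(tag, text, children)` triples, a wrapper is keyed by exact tag structure, and the manual labeling step is represented by a hypothetical `ask_user` callback that returns attribute-to-path mappings. The revision of wrong or incomplete labels is omitted for brevity, and none of the names below come from the described tool.

```python
# Assumption: a record is a sub-DOM tree modeled as (tag, text, [children]).


def signature(node):
    """Tag-only structure of a record, used here as the wrapper lookup key."""
    tag, _text, children = node
    return (tag, tuple(signature(child) for child in children))


def extract(paths, record):
    """Read each attribute's text by following its child-index path."""
    result = {}
    for attribute, path in paths.items():
        node = record
        for index in path:
            node = node[2][index]
        result[attribute] = node[1]
    return result


def label_page(records, wrappers, ask_user):
    """Label one page, asking the user only when no existing wrapper applies.

    wrappers maps structure signature -> {attribute: path} and is updated in
    place, so wrappers generated on this page are reused on later pages.
    Returns (labels per record index, number of manually labeled records).
    """
    labels, manual_count = {}, 0
    for index, record in enumerate(records):
        key = signature(record)
        if key not in wrappers:
            wrappers[key] = ask_user(record)  # manual labeling, then wrapper update
            manual_count += 1
        labels[index] = extract(wrappers[key], record)
    return labels, manual_count


def label_pages(pages, ask_user):
    """Label a set of pages in order, carrying the wrappers forward."""
    wrappers = {}
    results = [label_page(records, wrappers, ask_user) for records in pages]
    return wrappers, results
```

In this sketch the number of calls to `ask_user` equals the number of distinct record structures across all pages, so a later page whose records match existing wrappers requires no manual labeling.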
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 510 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 510 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 510. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532. A basic input/output system 533 (BIOS), containing the basic routines that help to transfer information between elements within computer 510, such as during start-up, is typically stored in ROM 531. RAM 532 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 520. By way of example, and not limitation,
The computer 510 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 510 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 580. The remote computer 580 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 510, although only a memory storage device 581 has been illustrated in
When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570. When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet. The modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560 or other appropriate mechanism. A wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 510, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 599 (e.g., for auxiliary display of content) may be connected via the user interface 560 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 599 may be connected to the modem 572 and/or network interface 570 to allow communication between these systems while the main processing unit 520 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.