The present invention relates to information extraction techniques, and more specifically, to improving the selection of a set of pages to be annotated, by a human, from a site of structurally similar pages, in order to improve the robustness and recall of information extraction learning.
The Internet is a worldwide system of computer networks and is a public, self-sustaining facility that is accessible to tens of millions of people worldwide. The most widely used part of the Internet is the World Wide Web, often abbreviated “www” or simply referred to as just “the web”. The web is an Internet service that organizes information through the use of hypermedia. Various markup languages such as, for example, the HyperText Markup Language (“HTML”) or the eXtensible Markup Language (“XML”), are typically used to specify the contents and format of a hypermedia document (e.g., a web page). In this context, a markup language document may be a file that contains source code for a particular web page. Typically, a markup language document includes one or more pre-defined tags with content enclosed between the tags or included as attributes of the tags.
Today, a plethora of web portals and sites are hosted on the Internet in diverse fields like e-commerce, boarding and lodging, and entertainment. The information presented by any particular web site is usually presented in a uniform format to give a uniform look and feel to the web pages therein. The uniform appeal is usually achieved by using scripts to generate the static content and structure of the web pages, and a database is used to provide the dynamic content. The information presented by such a web page is generally found at visually strategic locations on the page. Thus, extracting information from web pages requires identifying the areas on the pages where information is presented, and extracting and indexing the relevant information. Information extraction from such sites becomes important for applications, such as search engines, requiring extraction of information from a large number of web portals and sites.
In their most generic form, information extraction techniques are called wrappers or structural templates. Two non-limiting examples of information extraction techniques are rule-based extraction and statistical machine-learning extraction. In order to extract information from a particular set of structurally-related web pages, referred to as a site or cluster, a wrapper generally learns a set of extraction rules based on the structural characteristics of the web pages in the site. These structural characteristics are identified through the use of training pages, which are a subset of web pages in the subject site that are annotated by humans and then input to the wrapper. Selection of training pages is sometimes called sampling, and the training pages themselves are sometimes called samples.
Some information extraction systems select random pages for annotation, or base the selection of pages on human judgment. Samples chosen at random do not guarantee coverage of all structural variations in the cluster of related pages and may submit redundant sample pages for human annotation, incurring extra annotation cost. Human-based page selection is non-trivial, cumbersome, error-prone, subject to omissions, and does not guarantee the selection of appropriate samples because visually similar pages might differ in their underlying structural representation. Also, human-based sampling can be expensive because a human can spend a lot of time reviewing the pages in a cluster in order to select representative pages of the cluster.
To annotate a sample page, a human inspects the page and manually identifies areas of the page having attributes of interest. Those attributes identified by a human to be interesting are called key attributes. The wrappers use the information provided by human annotations to identify trends in the placement of certain kinds of information presented by the web pages of a site. Extraction rules are generally derived from these identified trends. Annotations are costly because of the time that must be spent in order for a human to annotate a set of training pages.
Although many web sites are script-generated, the web pages of a web site can vary in their structure because of optional, disjunctive, extraneous, or styling sections. If small but important structural variations are not annotated by a human to identify the structural variations, the wrappers may fail to extract required attributes from pages having such variations. Thus, there is a need to annotate pages in a site that are representative of the variations in structure in the pages of the site while keeping the cost of human annotation to a minimum.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
The recall of a wrapper, which is the ability of the wrapper to accurately extract information from all of the pages in a site, mainly depends upon the representativeness of the structure of the annotated pages in the training set input to the wrapper. For example, Site A might consist of data structures ‘a’, ‘b’, ‘c’, ‘d’, and ‘e’. If a training set input to a wrapper for Site A consists of a single annotated page representing only data structure ‘a’, then the wrapper would have a low recall because the wrapper would only be able to recognize structure ‘a’ in the rest of the pages of Site A, and would be ignorant of structures ‘b’ through ‘e’. However, if an annotated page representing structures ‘b’ through ‘e’ were added to the training set for Site A, then the wrapper would have a very high recall because the wrapper would recognize all of the structures in the pages of Site A. For a further example, Site B might contain structures ‘a’, ‘b’, and ‘c’, and also structural variations of structure ‘c’: ‘c1’, and ‘c2’. A structural variation in a site is the visual presentation of the same type of information, i.e., the information represented in structure ‘c’, using different underlying structures on different pages of the site, i.e., structures ‘c’, ‘c1’, and ‘c2’. In order to have maximum recall, pages representing structures ‘a’, ‘b’, ‘c’, ‘c1’, and ‘c2’ should be represented in the pages of the training set for Site B. Thus, it would be advantageous to increase wrapper recall by presenting to humans for annotation those pages that are most structurally representative of the cluster of pages from which information is to be extracted. The problem of choosing which pages to present to humans for annotation is called the page sampling problem.
In one embodiment of the invention, a site is a set of structurally similar pages. In another embodiment of the invention, passive sampling is used to identify, from a site, a subset of pages which, if included in the training set of a wrapper, would maximize the recall of that wrapper. This subset of pages, identified by passive sampling, is ordered by the recall addition of the respective pages, such that the first page is the most representative page of the site. When annotated in order, each annotated page adds the maximum amount of recall to the wrapper. Thus, pages are presented for human annotation in order, starting with the most structurally representative page that includes the most interesting attributes. After the most representative page, subsequent sample pages are presented that represent most of the structural variations in the site that have not yet been presented for human annotation, thus ensuring maximum recall. Once the samples required for the training set for a site have been selected using the above method, or no more pages are required to represent all of the unique structures in the site, the first page is surfaced, or presented, to a human for annotation.
In another embodiment of the invention, the page sampling problem is mapped to the set-cover problem. The set-cover problem states that, given an input of several sets containing some elements in common, the goal is to select a minimum number of these sets such that the selected sets contain all of the elements that are contained in any of the sets in the input. One solution to the set-cover problem is the greedy solution where a set is selected to be part of the solution if the set contains a maximum number of elements not covered by sets already selected to be part of the solution, i.e., uncovered elements. In the context of mapping the page sampling problem to the set-cover problem, a “set” is a document in a site from which information is to be extracted, and an “element” is a structure in a document. Given that the documents in a site have some structures in common, the principles of the set-cover problem can be used to select a minimum number of sample documents from the site that cover all of the unique structures in the site, thus improving recall with minimum human annotation cost. Implementing the greedy solution in the context of the page sampling problem, a document is selected to be in the solution set if the document represents the maximum number of structures not covered by documents already in the solution set. However, unlike the classical set-cover solution, the solution set of the page sampling problem is ranked based on representativeness of the documents in the solution set such that the top documents in the solution represent most of the unique, representative structures having higher importance based on the content associated with those structures.
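The greedy set-cover selection described above can be sketched in Python as follows; the function and variable names are illustrative and do not appear in the specification. Each page is modeled as the set of unique structures it contains, and pages are chosen in order of the number of not-yet-covered structures they contribute.

```python
def greedy_sample(pages):
    """Greedy set-cover over a site's pages.

    pages: dict mapping a page id to the set of unique structures
    (e.g., XPaths) found in that page. Returns an ordered list of page
    ids; each selected page covers the maximum number of structures
    not covered by pages selected before it.
    """
    uncovered = set().union(*pages.values())
    remaining = dict(pages)
    selection = []
    while uncovered and remaining:
        # Pick the page covering the most still-uncovered structures.
        best = max(remaining, key=lambda p: len(remaining[p] & uncovered))
        if not remaining[best] & uncovered:
            break  # no remaining page adds coverage
        selection.append(best)
        uncovered -= remaining.pop(best)
    return selection

site = {'p1': {'a', 'b', 'c'}, 'p2': {'a'}, 'p3': {'d', 'e'}}
samples = greedy_sample(site)
# selects 'p1' first, then 'p3'; 'p2' is redundant because its only
# structure is already covered by 'p1'
```

In this toy site, two pages suffice to cover all five structures, which is the cost-saving behavior the mapping to set-cover is intended to achieve.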
In another embodiment of the invention, active sampling is used to increase wrapper recall even further. In active sampling, the list of documents to be annotated is actively refined after each human input, using passive sampling techniques in conjunction with information derived from interesting attributes identified by human annotation and information gleaned from the structure of the annotated data region. Thus, redundant samples brought to light by the human annotations are eliminated from the sample list and the list is reordered based on the representativeness of the samples still in the sample list, which improves the potential recall added by subsequently annotated pages.
As such, passive sampling and active sampling can be used to optimize human annotation cost and improve the extraction recall. Passive sampling is invoked in the absence of human annotations and is expected to select a minimal, ordered, representative list of samples. Active sampling can optionally be invoked once human annotations are available for the first page in the sample list produced by passive sampling, in order to refine and reorder the sample list, based on the annotations provided.
Passive sampling can be used to aid in selecting those pages of a site that will add maximum recall to a wrapper while using minimum human input. In one embodiment of the invention, in the absence of human annotation, web pages can be ranked based on a structural representativeness score of each page. The structures in a web page are represented by the various XPaths found in the page, and the representativeness score of a particular page is based at least in part on an analysis of the XPaths found both in the particular page, and in the other pages of the site.
XPath is a language that describes a way to locate and process items in XML documents by using an addressing syntax based on a path through the logical structure of the document, and has been recommended by the World Wide Web Consortium (W3C). The specification for XPath can be found at http://www.w3.org/TR/XPath.HTML, and the disclosure thereof is incorporated by reference as if fully disclosed herein. Also, the W3C tutorial for XPath can be found at http://www.w3schools.com/XPath/default.asp, and the disclosure thereof is incorporated by reference as if fully disclosed herein. Herein, references to an “XPath,” or “path” refer to an attributed XPath of a leaf node, unless explicitly stated otherwise, for purposes of explanation. However, a person of ordinary skill in the art will understand that the embodiments of the invention can be implemented using XPaths of any form. In one embodiment of the invention, the definition of an attributed XPath of a particular item in a document is taken to be the set of nodes found in the path to the particular item from the root of the document's Document Object Model (DOM) tree, including the name of each node and the attribute list of each node, inclusive of the root and the particular item. The attributes in each node are ordered alphabetically in an attributed XPath.
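As an illustration of the attributed-XPath definition above, the following sketch builds such a path over a DOM parsed with Python's xml.etree.ElementTree. The lower-case tag names, the targeting of an element rather than a text node, and the rendering of each step are illustrative choices, not mandated by the text; the alphabetical ordering of attribute names follows the definition.

```python
import xml.etree.ElementTree as ET

def attributed_xpath(root, target):
    """Return the attributed XPath of `target`: the chain of nodes from
    the DOM root to the target, each step listing the node's attribute
    names in alphabetical order (a sketch of the definition above)."""
    def walk(node, path):
        step = "/<%s%s>" % (node.tag,
                            "".join(" " + a for a in sorted(node.attrib)))
        path = path + step
        if node is target:
            return path
        for child in node:
            found = walk(child, path)
            if found:
                return found
        return None
    return walk(root, "")

html = ET.fromstring(
    '<html><body><table border="1" width="2"><tr><td width="3">text'
    '</td></tr></table></body></html>')
td = html.find('.//td')
print(attributed_xpath(html, td))
# /<html>/<body>/<table border width>/<tr>/<td width>
```

Note that the `border` and `width` attributes of the table appear in alphabetical order in the path, regardless of their order in the source markup.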
For example,
In another embodiment of the invention, the values of the “class” attributes of the nodes in a path are included in an attributed XPath because such class information can be used in classifying the type of the subject node. “Class” is one of the core HTML attributes and allows authors of web pages to define specific types of a given element. Thus, in this embodiment of the invention, the attributed path of text node 101 includes the value of the “class” attribute in the “table” node, as follows: /<HTML>/<body>/<table border, class=“product_id”, width>/<tr>/<td width>/<#TEXT>.
Because an attributed XPath is an unnumbered XPath, the attributed XPaths found in a particular web page are not necessarily unique. For example,
As previously stated, the problem of selecting pages for a wrapper's training set can be solved using ideas from the conventional set-cover problem, which is an optimization problem that is NP-Hard and has several approximate solutions. The greedy approximate solution, implemented in one embodiment of the invention, works by selecting and annotating the most representative page of the site based on a representativeness score. The representativeness score of a page is a function of (a) the frequency with which an XPath occurs in a particular page of the site, (b) the frequency with which an XPath occurs among the various pages of the site, and (c) the co-occurrence of an XPath with content presented by the pages of the site. Thus, in one embodiment of the invention, the first page selected to be annotated for a training set has the highest representativeness score. Subsequently, the second most representative page is selected to be annotated by recomputing the representativeness score for each page in the site except the first page, ignoring XPaths present in the first page, and selecting the page having the highest score based on the recomputation, and so on.
In an example process for passive sampling illustrated by
In one embodiment of the invention, the representativeness score of a page is computed based at least in part on the term frequency of each XPath for each web page in the site (XPath-TF), determining the document frequency for every XPath in the site (XPath-DF), and determining the importance of each XPath in the site (XPath-Imp).
One embodiment of the invention computes structural information in terms of XPath term frequency (XPath-TF), which is the number of times a particular XPath occurs in a particular web page of the site. In the calculation of XPath-TF denoted TF(Xij), the subject XPath is denoted Xi, and the subject web page is denoted Pj. Thus, TF(Xij) represents the number of times Xi appears in page Pj.
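A minimal sketch of the XPath-TF computation, assuming each page is represented as the list of attributed XPaths of its leaf nodes (the names and data layout are illustrative):

```python
from collections import Counter

def xpath_tf(page_xpaths):
    """TF(Xij): the number of times each XPath Xi occurs in one page Pj.

    page_xpaths: list of (possibly repeating) attributed XPaths
    observed in a single page."""
    return Counter(page_xpaths)

page = ['/html/body/div/p', '/html/body/div/p', '/html/body/h1']
tf = xpath_tf(page)
# tf['/html/body/div/p'] == 2; tf['/html/body/h1'] == 1
```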
A high XPath-TF for an XPath in a web page will generally boost the overall representativeness score of the page because a high number of a particular XPath in a page increases the chance that the page covers most of the informative attributes associated with that XPath, and including such a page in the training set of a wrapper would increase the robustness of the wrapper learning. Furthermore, a wrapper learning process will encounter positive candidates and a variety of negative candidates for each key piece of information in a site, and a page having a higher XPath-TF might cover a majority of the negative candidates. It is beneficial to include such a web page in the training set because information on negative candidates also leads to a more robust wrapper learning. Thus, for a particular XPath, a page with a higher XPath-TF value for the particular XPath will be given preference over a page with lower XPath-TF for the particular XPath.
Another embodiment of the invention computes structural information in terms of XPath document frequency (XPath-DF). The document frequency of an XPath, Xi, is denoted DF(Xi), and signifies the number of pages in a particular site that contain Xi. The XPath-DF of a particular XPath indicates the representativeness of the XPath itself, and a page's representativeness score is directly proportional to the representativeness of each XPath present in the page. For example, Site A might have three structural variations for the key attribute “Title” across the pages of the site. As a non-limiting example of a structural variation for a particular attribute, the pages of a site might be inconsistent with respect to the XPath at which the particular attribute is found. Thus, in the case of Site A, the attribute “Title” is associated with X1, X2, and X3 on various different pages. If X1 has the highest XPath-DF of the three variations associated with the attribute “Title,” then the pages containing X1 should be given preference over the pages containing X2 and X3. This preference is because an annotation of X1 will be informative about more pages in Site A than an annotation of X2 or X3. In other words, pages including X1 will provide a higher recall than the other pages in the site with respect to the attribute “Title.” Thus, preference of pages including X1 will aid in achieving maximum recall with minimum annotations with respect to the attribute “Title.” Outlier pages, such as a frequently asked questions page in a product page cluster, generally have very low XPath-DF and hence may get a low page representativeness score, either pushing the outlier page to the bottom of the sample list or eliminating the page.
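XPath-DF can be sketched analogously, counting pages rather than occurrences; the representation of the site as a dict from page id to the set of XPaths on that page is a hypothetical layout chosen for illustration.

```python
def xpath_df(site_pages):
    """DF(Xi): the number of pages in the site that contain each XPath.

    site_pages: dict of page id -> set of XPaths present in that page.
    Returns a dict of XPath -> document frequency."""
    df = {}
    for xpaths in site_pages.values():
        for x in set(xpaths):
            df[x] = df.get(x, 0) + 1
    return df

site = {'p1': {'x1', 'x2'}, 'p2': {'x1'}, 'p3': {'x1', 'x3'}}
df = xpath_df(site)
# DF('x1') == 3, so pages containing 'x1' would be preferred over
# pages containing only the rarer variants 'x2' or 'x3'
```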
Yet another embodiment of the invention computes structural information in terms of the importance of an XPath (XPath-Imp). Web pages are structured to contain not only informative content like product information in a shopping domain, or job information in a job domain, but also content like navigation panels and copyright information. A navigation panel and other such content is considered to be mere noise from an information extraction point of view because the information presented by a navigation panel is presented for the purpose of navigating though pages of the site, and not because the information is particularly informative.
Any particular instance of an XPath is associated with a particular content item displayed to a viewer upon display of the document in which the XPath occurs. For example,
In order to differentiate between informative and noisy XPaths and to assign XPaths differently weighted importance scores accordingly, it is assumed that, in a particular web site, noisy XPaths share common structure and content, while informative XPaths differ in actual content and/or structure. Thus, the importance score of a particular XPath, Xi, is defined in the following Eq. 1:

Imp(Xi) = [ (1/|T|) · Σt∈T ( DF(Xi, t)/N ) ]^(−1)  (Eq. 1)
where t denotes a particular content item; DF(Xi, t) denotes the number of documents containing both Xi and t together; T denotes a set of unique content items associated with XPath Xi; and N denotes the number of documents in the subject site that have not yet been annotated, which is a subset of the total M pages in the subject site.
Eq. 1 measures the average of the fraction of times each content item, t, is associated with a particular XPath, Xi. Eq. 1 then inverts the average to get the importance score for XPath, Xi. Thus, Eq. 1 assigns a low importance score to Xi if the XPath has common content across pages, i.e., is a noisy XPath. This technique effectively downplays noisy portions of Web pages. Conversely, Eq. 1 assigns a higher importance score to Xi if the XPath has distinct content across the pages of a site because such a diversity of content associated with an XPath indicates that the XPath belongs to an informative region of a document.
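The importance score can be sketched directly from this description. The data layout, a dict per page mapping each XPath to the set of content items it carries, is an assumption made for illustration:

```python
def xpath_importance(xpath, pages):
    """Importance score of `xpath` per the description of Eq. 1.

    pages: list of dicts, one per unannotated page, each mapping an
    XPath to the set of content items it carries; N = len(pages).
    A path whose content repeats across pages (noise, e.g. a
    navigation panel) averages near 1 and scores low; a path with
    page-specific content averages low and scores high."""
    n = len(pages)
    # T: unique content items associated with the XPath across the site.
    contents = set()
    for page in pages:
        contents |= page.get(xpath, set())
    if not contents:
        return 0.0
    # Average over t of DF(Xi, t)/N, then invert.
    avg = sum(
        sum(1 for page in pages if t in page.get(xpath, set())) / n
        for t in contents
    ) / len(contents)
    return 1.0 / avg

noisy = [{'/x': {'Home'}}, {'/x': {'Home'}}, {'/x': {'Home'}}]
informative = [{'/x': {'A'}}, {'/x': {'B'}}, {'/x': {'C'}}]
# noisy: every page shows 'Home', avg = 1.0, importance = 1.0 (low)
# informative: each item appears on 1 of 3 pages, importance = 3.0 (high)
```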
As previously stated, information regarding the XPaths, or structures, of the pages of a site is used to produce representativeness scores for each document in the site. To produce a representativeness score for a document, the information for the document and the site are input into a document ranking formula. The problem of finding representativeness scores for the documents of a site is similar to the problem of ranking documents according to each document's relevance to a given query, as with search engines. Therefore, a formula used to rank documents based on a search query can be modified and used to produce representativeness scores.
The Okapi BM25 measure is one of the popular measures to compute document relevance in the context of query searches. Okapi BM25 is a ranking function based on a probabilistic retrieval framework that is used to rank documents matching a given query according to the relevance of each document to the given query. As with many ranking functions for search queries, the relevance of a document is determined by BM25 using the term frequency of the query terms in the document, the document frequency of the query terms, and the length of the document. In this context, a query term's term frequency (TF) indicates the number of times the query term occurs in a particular document, and a query term's document frequency (DF) indicates the number of documents out of the set of documents being searched that contain the query term. Thus, given a long query Q, containing keywords {q1, . . . , qn}, the BM25 relevance score of a document Dj is determined according to Eq. 2:

Score(Dj, Q) = Σi=1..n log(N/DFi) · [ (k1+1)·TFij / ( k1·((1−b) + b·(Lj/Lavg)) + TFij ) ] · [ (k3+1)·TFiq / ( k3 + TFiq ) ]  (Eq. 2)
where N denotes the total number of documents in the document collection being queried; DFi denotes the document frequency of the query term qi; TFij denotes the term frequency of query term qi in document Dj; TFiq denotes the term frequency of qi in long query Q, which indicates how many times qi appears in long query Q; Lj denotes the length of document Dj; and Lavg denotes the average document length in the document collection being queried.
The term k1 is defined as a tuning parameter (0 ≤ k1 ≤ ∞) that calibrates the document term frequency scaling. In other words, adjusting k1 adjusts the importance placed on the quantity of a query term in a document. A k1 value of zero corresponds to a binary model (no term frequency) that detects only the presence of a query term in a document and places no importance on the number of times the query term occurs in the document. A large k1 value corresponds to using raw term frequency, which places a higher weight on documents containing more of the query term. Also, b is defined to be a tuning parameter (0 ≤ b ≤ 1) that determines the scaling of the query term by the length of the particular document. If b=1, then the term weight is fully scaled by document length, and if b=0, then there is no length normalization. Finally, k3 is defined as a tuning parameter that calibrates the term frequency scaling of the query.
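The classic measure and its tuning parameters can be sketched as follows. The k1, b, and k3 defaults are common choices from the information retrieval literature, not values prescribed by the text, and log(N/DF) is one standard idf variant:

```python
import math

def bm25_score(query_terms, doc_terms, df, n_docs, doc_len, avg_len,
               k1=1.2, b=0.75, k3=8.0):
    """Okapi BM25 relevance of one document to a (long) query.

    query_terms: the query keywords, repeats allowed; doc_terms: the
    terms of the document; df: term -> document frequency; doc_len and
    avg_len: this document's length and the collection average."""
    score = 0.0
    for q in set(query_terms):
        tf_d = doc_terms.count(q)        # TFij
        tf_q = query_terms.count(q)      # TFiq
        if tf_d == 0 or q not in df:
            continue
        idf = math.log(n_docs / df[q])   # inversely proportional to DF
        norm = k1 * ((1 - b) + b * doc_len / avg_len) + tf_d
        score += idf * ((k1 + 1) * tf_d / norm) \
                     * ((k3 + 1) * tf_q / (k3 + tf_q))
    return score

df = {'rare': 1, 'common': 9}
# a document matching a rare query term outscores one matching a
# common term, because classic BM25 weights terms inversely to DF
s_rare = bm25_score(['rare'], ['rare'], df, 10, 1.0, 1.0)
s_common = bm25_score(['common'], ['common'], df, 10, 1.0, 1.0)
```

This inverse weighting by document frequency is precisely what the modified measure below reverses.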
The Okapi BM25 measure works well in an information retrieval framework for computing the relevance score of a document, given a query. For such a formula to correctly compute representativeness scores in the context of the document sampling problem of the embodiments of the invention, some parameters must be changed or removed. In the context of the classic Okapi BM25 measure, the score of a document is inversely proportional to the document frequency of the query term. However, in the context of the embodiments of this invention, the scoring function should consider the representativeness score of a document to be proportional to XPath-DF and XPath-Imp, as opposed to inversely proportional as with the classic BM25. Also, with the classic BM25 measure, the query's term frequency scaling parameter, k3, is required because the long query might contain repeating terms. However, the “query” in the context of the embodiments of this invention consists of all unique XPaths, and the tuning parameter k3 is not required. Thus, the modified BM25 measure to determine the representativeness score of documents in a site is represented in Eq. 3 below:
Eq. 3 receives as input both a particular document Dj to score and the set, XS, of all unique XPaths not in a document already selected for human annotation. Thus, if no documents have been selected for annotation, XS represents the set of all unique XPaths in the subject site comprising the collection of all N documents. Lj denotes the length of document Dj, in terms of the XPaths of the document, i.e., the number of uncovered XPaths present in the document. Also, Lavg denotes the average document length of all N documents in terms of XPaths, i.e., the average number of uncovered XPaths per document of the site.
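A sketch of the modified measure follows. Because the exact form of Eq. 3 is not reproduced here, the way XPath-DF and XPath-Imp enter the score (as direct multiplicative factors) is an assumed weighting that merely respects the stated proportionality; the k3 factor is dropped, and document length is measured in uncovered XPaths, as described above.

```python
def representativeness(doc_xpaths, uncovered, df, imp, n_docs,
                       avg_len, k1=1.2, b=0.75):
    """Representativeness score of one document, in the spirit of the
    modified BM25 of Eq. 3. Unlike classic BM25, the score grows with
    XPath-DF and XPath-Imp (multiplied in directly -- an assumed
    weighting), and the query-side k3 term is dropped because the
    "query" XS is the set of unique uncovered XPaths.

    doc_xpaths: list of XPaths in the document (with repeats, for TF);
    uncovered: set of unique XPaths not yet covered by selected pages;
    df: XPath -> document frequency; imp: XPath -> importance score."""
    live = [x for x in doc_xpaths if x in uncovered]
    doc_len = len(set(live))  # Lj: length in uncovered XPaths
    score = 0.0
    for x in set(live):
        tf = live.count(x)
        norm = k1 * ((1 - b) + b * doc_len / avg_len) + tf
        score += (df[x] / n_docs) * imp.get(x, 1.0) * ((k1 + 1) * tf / norm)
    return score

df = {'a': 3, 'b': 1}
# a document carrying more uncovered, higher-DF XPaths scores higher
s_rich = representativeness(['a', 'a', 'b'], {'a', 'b'}, df, {}, 3, 2.0)
s_poor = representativeness(['b'], {'a', 'b'}, df, {}, 3, 2.0)
```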
As explained with respect to the flowchart of
The modified Okapi BM25, as explained, enables a greedy solution to the page sampling problem because the formula calculates the representativeness score for each of a set of documents from a site based on the set of unique XPaths present in the documents, from which the most representative document can be identified by its score, i.e., the highest representativeness score. In one embodiment of the invention, if more than one document has the same maximum score, then the tie is broken by selecting the first document with the maximum score. With reference to
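The overall greedy loop, including the first-document tie break, can be sketched as follows. The scorer is passed in as a callable; the simple overlap count used below is a stand-in for the modified BM25 representativeness measure, and all names are illustrative.

```python
def passive_sample(site, score):
    """Ordered sample list: repeatedly score every unselected page
    against the still-uncovered XPaths and take the first page with
    the maximum score, as described above.

    site: list of (page_id, set_of_xpaths) in site order;
    score: callable (xpaths, uncovered) -> numeric score."""
    uncovered = set()
    for _, xp in site:
        uncovered |= xp
    remaining = list(site)
    ordered = []
    while uncovered and remaining:
        scores = [score(xp, uncovered) for _, xp in remaining]
        best = max(scores)
        if best == 0:
            break
        i = scores.index(best)        # ties: first page with the max score
        page_id, xp = remaining.pop(i)
        ordered.append(page_id)
        uncovered -= xp
    return ordered

overlap = lambda xp, uncovered: len(xp & uncovered)  # stand-in scorer
site = [('p1', {'a', 'b'}), ('p2', {'a', 'b'}), ('p3', {'c'})]
order = passive_sample(site, overlap)
# 'p1' wins the tie with 'p2'; 'p3' covers the remaining structure
```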
In one embodiment of the invention, active sampling is used to refine the sample list produced by the embodiments of the passive sampling technique by utilizing information derived from human annotations. For example, the data attributes on an annotated page that are not identified in the human annotations are revealed to be uninteresting. As a further example, the spatial regions of a document that are annotated by humans, or the least common ancestor of the XPaths annotated by humans, are revealed to be interesting. Also, information on attributes annotated in every human-annotated document from a site can be used to identify trends in the pages of the site. Thus, after each page is annotated by a person, the sample list is actively refined based on the information provided by the annotations.
In another embodiment of the invention, information on key attributes derived from human annotations is utilized to refine the list of unique uncovered XPaths used, in the passive sampling technique, to calculate representativeness scores. Human annotations generally consist of identifications of interesting attributes on a page. For example, page 600 in
Using this new information, the active sampling technique recalculates the representativeness score for each document in site “autos.yahoo.com” that has not yet been annotated. This recalculation is done according to the passive sampling technique, as illustrated in
In one embodiment of the invention, individual XPaths are identified as uninteresting based on human annotations. If a particular content item in a particular page goes unannotated, and the information for only one product is presented by the page, then the particular item is identified as uninteresting. For example, in the context of the web pages illustrated by pages 600 and 700, page 700 represents only one product, and therefore, user ratings 705 is identified as uninteresting because it was not annotated. Thus, the XPath corresponding to user ratings 705 is removed from consideration when recomputing the representativeness scores of the documents in Ya because this attribute 705 is uninteresting. This refinement of the list of XPaths considered in calculating representativeness scores according to the embodiments of the passive sampling technique ensures that the representativeness score of a document is not boosted based on the presence of uninteresting attributes in the document.
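This pruning of uninteresting XPaths can be sketched as a simple set refinement; the function name and data layout are hypothetical:

```python
def drop_uninteresting(uncovered_xpaths, annotated_page_xpaths,
                       annotated_xpaths, single_product=True):
    """Remove XPaths revealed as uninteresting by a human annotation.

    If the annotated page presents information for a single product,
    any of its XPaths that were NOT annotated are deemed uninteresting
    and are removed from the set of uncovered XPaths used when
    recomputing representativeness scores."""
    if not single_product:
        return set(uncovered_xpaths)
    uninteresting = set(annotated_page_xpaths) - set(annotated_xpaths)
    return set(uncovered_xpaths) - uninteresting

# uncovered site XPaths; XPaths on the annotated page; XPaths annotated
refined = drop_uninteresting({'a', 'b', 'c'}, {'a', 'b'}, {'a'})
# 'b' was present but unannotated on a single-product page, so it is
# dropped; 'c' was not on the annotated page and remains uncovered
```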
In yet another embodiment of the invention, interesting spatial regions of an annotated document can be identified based on the location of the annotations on the document. For example, annotated page 700 is delineated into spatial regions by an automatic region identifier, as shown on page 900 of
In this embodiment of the invention, active sampling recalculates the representativeness score of each of the unannotated documents in the subject site using the information on interesting spatial regions. Specifically, each document of the set of documents in the subject site that has not yet been annotated is evaluated to identify the spatial regions in the document. If the document, i.e., page 1000 of
If a document, i.e., page 1100 of
In one embodiment of the invention, a spatial region in an unannotated document is identified as corresponding to an interesting spatial region of an annotated document through the use of Least Common Ancestor (LCA). In this embodiment of the invention, the LCA of XPaths corresponding to annotated attributes is computed. If the LCA of XPaths of the annotated attributes is found in the unannotated document, then the XPaths corresponding to the LCA in the unannotated document are considered to be in an interesting spatial region. In another embodiment of the invention, visual information about an annotated spatial region is gathered, i.e., x- and y-coordinates, height, width, etc., and an unannotated document is searched to determine if the document has a corresponding spatial region based on the gathered visual information and annotated XPaths.
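One way to sketch the LCA computation is over slash-delimited XPaths, where the LCA is the longest shared step-by-step prefix; the plain string representation is an illustrative simplification of attributed XPaths.

```python
def lca(paths):
    """Least common ancestor of a set of slash-delimited XPaths: the
    longest path prefix shared, step by step, by all of them."""
    split = [p.strip('/').split('/') for p in paths]
    common = []
    for steps in zip(*split):
        if all(s == steps[0] for s in steps):
            common.append(steps[0])
        else:
            break
    return '/' + '/'.join(common)

annotated = ['/html/body/div/table/tr/td',
             '/html/body/div/table/tr/th',
             '/html/body/div/span']
region = lca(annotated)
# region == '/html/body/div'; XPaths under this prefix in an
# unannotated document would be treated as an interesting region
```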
Another embodiment of the invention identifies mandatory attributes among the pages of a site and decides whether to include a particular page in the list of sample pages to be annotated based on the known mandatory attributes, as illustrated by
In this embodiment of the invention, those documents that contain all of the XPaths corresponding to the identified mandatory key attributes are removed from the list of sample documents to be annotated because it is likely that nothing more can be learned from documents with all of the mandatory attributes. However, if a document is apparently missing an XPath for a mandatory attribute, then the document is surfaced for human annotation because a missing mandatory attribute is indicative of an unknown structural variation having to do with the missing mandatory attribute. Annotating such a document will likely add to what is known about the structure of the site, especially with respect to the missing mandatory attribute. Therefore, in step 1202 of
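The mandatory-attribute refinement described above can be sketched as a partition of the candidate list; names and the dict layout are hypothetical.

```python
def refine_by_mandatory(candidates, mandatory_xpaths):
    """Split unannotated candidate documents by mandatory attributes.

    candidates: dict of doc id -> set of XPaths in that document.
    Documents containing the XPath of every identified mandatory
    attribute are dropped from the sample list (likely nothing more to
    learn); documents missing one are kept for scoring and surfacing,
    since the gap hints at an unknown structural variation."""
    keep, drop = [], []
    for doc_id, xpaths in candidates.items():
        if mandatory_xpaths <= xpaths:
            drop.append(doc_id)   # all mandatory XPaths present
        else:
            keep.append(doc_id)   # a mandatory XPath is missing
    return keep, drop

cands = {'p5': {'m1', 'm2', 'x'}, 'p6': {'m1', 'y'}}
keep, drop = refine_by_mandatory(cands, {'m1', 'm2'})
# keep == ['p6'] (missing 'm2'), drop == ['p5'] (has both)
```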
With respect to identification of interesting spatial regions, the computation of representativeness scores is restricted to XPaths in the spatial regions identified to be interesting. Thus, if a particular document has one or more spatial regions identified as interesting, then mandatory attributes are only sought in those interesting spatial regions. If the interesting spatial regions of a document do not contain the XPath of each mandatory attribute identified for the site, and the document has additional XPaths occurring inside of the interesting spatial regions, then the document is considered to have a mandatory attribute with a different XPath than the XPath that has been previously identified as associated with the missing mandatory attribute. Such documents are scored using the active sampling technique, and the document with the highest score is surfaced to a human for annotation.
For active sampling to be effective, the first document selected by the embodiments of the passive sampling technique ideally covers the majority of the key attributes in the subject site because the embodiments of the active sampling technique consider annotated attributes to refine the sample list. If the pages being annotated are not in order of representativeness, then active sampling may detrimentally ignore regions of a document that contain interesting information.
According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
For example,
Computer system 1400 also includes a main memory 1406, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 1402 for storing information and instructions to be executed by processor 1404. Main memory 1406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1404. Such instructions, when stored in storage media accessible to processor 1404, render computer system 1400 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 1400 further includes a read only memory (ROM) 1408 or other static storage device coupled to bus 1402 for storing static information and instructions for processor 1404. A storage device 1410, such as a magnetic disk or optical disk, is provided and coupled to bus 1402 for storing information and instructions.
Computer system 1400 may be coupled via bus 1402 to a display 1412, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 1414, including alphanumeric and other keys, is coupled to bus 1402 for communicating information and command selections to processor 1404. Another type of user input device is cursor control 1416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1404 and for controlling cursor movement on display 1412. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Computer system 1400 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 1400 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1400 in response to processor 1404 executing one or more sequences of one or more instructions contained in main memory 1406. Such instructions may be read into main memory 1406 from another storage medium, such as storage device 1410. Execution of the sequences of instructions contained in main memory 1406 causes processor 1404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media” as used herein refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 1410. Volatile media includes dynamic memory, such as main memory 1406. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 1404 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 1400 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 1402. Bus 1402 carries the data to main memory 1406, from which processor 1404 retrieves and executes the instructions. The instructions received by main memory 1406 may optionally be stored on storage device 1410 either before or after execution by processor 1404.
Computer system 1400 also includes a communication interface 1418 coupled to bus 1402. Communication interface 1418 provides a two-way data communication coupling to a network link 1420 that is connected to a local network 1422. For example, communication interface 1418 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1418 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1418 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 1420 typically provides data communication through one or more networks to other data devices. For example, network link 1420 may provide a connection through local network 1422 to a host computer 1424 or to data equipment operated by an Internet Service Provider (ISP) 1426. ISP 1426 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 1428. Local network 1422 and Internet 1428 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 1420 and through communication interface 1418, which carry the digital data to and from computer system 1400, are example forms of transmission media.
Computer system 1400 can send messages and receive data, including program code, through the network(s), network link 1420 and communication interface 1418. In the Internet example, a server 1430 might transmit a requested code for an application program through Internet 1428, ISP 1426, local network 1422 and communication interface 1418.
The received code may be executed by processor 1404 as it is received, and/or stored in storage device 1410, or other non-volatile storage for later execution.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
This application is related to U.S. patent application Ser. No. 12/030,301, filed on Feb. 13, 2008, entitled “ADAPTIVE SAMPLING OF WEB PAGES FOR EXTRACTION”, the entire content of which is incorporated by reference for all purposes as if fully disclosed herein. This application is related to U.S. patent application Ser. No. 12/346,483, filed on Dec. 30, 2008, entitled “APPROACHES FOR THE UNSUPERVISED CREATION OF STRUCTURAL TEMPLATES FOR ELECTRONIC DOCUMENTS”, the entire content of which is incorporated by reference for all purposes as if fully disclosed herein.