The rapid growth of the World Wide Web is making web forums (also called bulletin or discussion boards) an important data resource on the Web. With millions of users' contributions, a wealth of highly valuable information on various topics has accumulated in these forums. As a result, recent years have witnessed increased research efforts to leverage information extracted from forum data to build various web applications.
For most web applications, the fundamental first step is to fetch data pages from web sites distributed across the Internet via web crawling and to extract structured data from those unstructured pages. For forum pages represented in Hypertext Markup Language (HTML) format, extraction involves removing useless HTML tags and noisy content such as advertisements. Structured data on web forum sites includes, for example, post title, post author, post time, and post content. However, automatically extracting structured data is not a trivial task, due to both complex page layout designs and unrestricted user-created posts, and this has become a major hindrance to efficiently using web forum data. Moreover, different forum sites usually employ different templates.
In general, web data extraction approaches can be classified into two categories: template-dependent and template-independent. Template-dependent methods, as the name implies, utilize a wrapper as an extractor for a set of web pages generated from the same layout template. Template-independent methods usually treat data extraction as a segmentation problem and employ probabilistic models to integrate richer semantic features and sophisticated human knowledge.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The web forum data extraction technique described herein is a template-independent approach specifically designed for structured data extraction on web forums. To provide more robust and accurate extraction performance, the technique incorporates both page-level information and site-level knowledge. To do this, in one embodiment, the technique determines what kinds of page objects a forum site has, which object a given page belongs to, and how different page objects are connected with each other. This information can be obtained by reconstructing the sitemap of the target forum. A sitemap is a directed graph in which each vertex represents one page object and each arc denotes a linkage between two vertices. The technique can automatically identify list, post, and user-profile vertices in most forum sitemaps. In one embodiment of the technique, the web forum data extraction technique collects three kinds of evidence for data extraction: 1) inner-page features, which cover both semantic and layout information on an individual page; 2) inter-vertex features, which describe linkage-related observations; and 3) inner-vertex features, which characterize interrelationships among pages in one vertex. Finally, the technique employs Markov Logic Networks (MLNs) to combine all of these types of evidence (e.g., features) statistically for inference. By integrating all of the kinds of evidence and learning their importance, MLNs can handle uncertainty and tolerate imperfect and contradictory knowledge in order to extract the desired data.
In the following description of embodiments of the disclosure, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the disclosure.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of the web forum data extraction technique, reference is made to the accompanying drawings, which form a part thereof, and which show by way of illustration examples by which the web forum data extraction technique described herein may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
1.0 Web Forum Data Extraction Technique.
In the following sections, some background information on web data extraction, the operating environment, and definitions of terms for the web forum data extraction technique are provided. An overview of the technique then follows, along with details and exemplary embodiments.
1.1 Background
In general, web data extraction approaches can be classified into two categories: template-dependent and template-independent. Template-dependent methods, as the name implies, utilize a wrapper as an extractor for a set of web pages generated from the same layout template. More specifically, a wrapper is usually represented in the form of a regular expression or a tree structure. Such a wrapper can be manually constructed, semi-automatically generated by interactive learning, or even discovered fully automatically. Most web data extraction approaches utilize the structure of the Document Object Model (DOM) tree of a typical HTML page. However, for web forums, different forum sites usually employ different templates or wrappers. Even forums built with the same forum software have various customized templates. Additionally, most forum sites periodically update their templates to provide an improved user experience. Therefore, the cost of both generating and maintaining wrappers for so many (perhaps tens of thousands of) forum templates is extremely high, which makes wrapper-based extraction impractical in real applications. Furthermore, wrapper-based methods also suffer from the noisy and unrestricted data in forums.
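To make the brittleness of wrappers concrete, the following minimal sketch shows a regular-expression wrapper tied to one hypothetical template. It is purely illustrative; the markup, class names, and field names are invented and are not part of the technique described herein.

    import re

    # A hypothetical page fragment produced by one specific forum template.
    html = (
        '<div class="post"><a class="author" href="/user/42">alice</a>'
        '<span class="time">2009-03-01 12:34</span>'
        '<div class="body">Hello world</div></div>'
    )

    # The wrapper encodes the template's exact structure; renaming a class
    # or reordering fields silently breaks it.
    wrapper = re.compile(
        r'<a class="author"[^>]*>(?P<author>[^<]+)</a>'
        r'<span class="time">(?P<time>[^<]+)</span>'
        r'<div class="body">(?P<content>[^<]+)</div>'
    )

    m = wrapper.search(html)
    if m:
        print(m.groupdict())
        # {'author': 'alice', 'time': '2009-03-01 12:34', 'content': 'Hello world'}

Because every template, and every template update, requires its own such pattern, wrapper maintenance scales poorly across tens of thousands of forum templates.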
To provide a more general solution for web data extraction, template-independent methods have been proposed. These approaches generally treat data extraction as a segmentation problem and employ probabilistic models to integrate richer semantic features and sophisticated human knowledge. Template-independent methods therefore have little dependence on specific templates. In practice, however, existing template-independent methods depend only on features inside an individual page and infer each input page separately for extraction. For most applications, page-level information is sufficient and single-page inference is practical. For forum data extraction, however, page-level information alone is not enough to deal with both the complex layout designs and the unrestricted user-created posts in web forums.
1.2 Operating Environment and Definitions
To facilitate the following discussions, the operating environment of web forum sites and associated definitions are briefly explained.
In forum data extraction, one usually needs to extract information from several kinds of pages, such as list pages and post pages, each of which may correspond to one kind of data object. Pages of different objects are linked with each other. First, for most forums such linkages are usually statistically stable, which can support some basic assumptions and provide additional types of evidence for data extraction. For example, if a link points to a user-profile page, the anchor text of that link is very likely an author name. Second, the interrelationships among pages belonging to the same object can help verify misleading information found in some individual pages. For example, although user-submitted HTML code on some post pages may introduce ambiguities in data extraction, joint inference across multiple post pages can help an extractor distinguish such noise. The linkages and interrelationships, both of which depend on site-structure information beyond a single page, are called site-level knowledge herein.
1.2.1 Sitemap. A sitemap is a directed graph consisting of a set of vertices and the corresponding links. Each vertex represents a group of forum pages which have similar page layout structure; and each link denotes a kind of linkage relationship between two vertices.
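As a minimal sketch of this definition (vertex names and URLs are invented for illustration), a sitemap can be represented as a directed graph in a few lines of Python:

    from collections import defaultdict

    class Sitemap:
        """Directed graph: each vertex groups pages with similar layout;
        each arc records a linkage relationship between two vertices."""
        def __init__(self):
            self.vertices = {}            # vertex id -> list of page URLs
            self.arcs = defaultdict(set)  # source vertex id -> target vertex ids

        def add_page(self, vertex_id, url):
            self.vertices.setdefault(vertex_id, []).append(url)

        def add_arc(self, src, dst):
            self.arcs[src].add(dst)

    sm = Sitemap()
    sm.add_page("list", "http://forum.example.com/board/1")
    sm.add_page("post", "http://forum.example.com/thread/99")
    sm.add_arc("list", "post")  # list pages link to post pages
    print(dict(sm.arcs))        # {'list': {'post'}}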
1.2.2 List Page. For users' convenience, a well-organized forum site consists of a tree-like directory structure containing topics (commonly called threads) at the lowest level and posts inside threads. For example, the tree of the exemplary forum is a four-level structure shown in the dashed rectangle in the accompanying figure. Pages at the non-leaf levels of this tree, which present lists of boards or threads, are called list pages.
1.2.3 Post Page. Pages at the leaf nodes of the tree are called post pages 110, which contain the detailed information of user posts. Each post usually consists of fields such as post author, post time, and post content, which are the goal of data extraction.
1.2.4 Exemplary Forum Data Extraction Definition. One can formally define the problem of web forum data extraction for one exemplary embodiment of the web forum data extraction technique as follows: given the pages of a target forum site, identify the HTML elements corresponding to the structured fields of interest (e.g., list record, list title, post record, post author, post time, and post content) and extract their contents.
1.3 Overview
One high-level exemplary schematic of the web forum data extraction technique is illustrated in the accompanying figure, which depicts three main stages: sitemap recovery (block 202), feature extraction (block 204), and joint inference (block 206).
In one embodiment of the technique, the goal of the first block 202 is to automatically estimate the sitemap structure of a target forum site 208 (e.g., one from which data is sought to be extracted) using a few sampled pages 210. In practice, it was found that sampling around 2000 pages is enough to reconstruct the sitemap of most forum sites. Pages with similar layout structures are clustered into groups (vertices). Then, a link is established between two vertices if a page in the source vertex has an out-link pointing to a page in the target vertex. (For purposes of explanation, for a given link, the page which contains the link is called the source page, and the page which the link navigates to is called the target page. The vertex which the source page belongs to is called the source vertex, and the vertex which the target page belongs to is called the target vertex.) Each link is described by both a Uniform Resource Locator (URL) pattern and a location (the region where the corresponding out-links are located). For example, a URL may consist of several tokens, and different URLs may share similar tokens; these shared tokens are called the URL pattern, and one can use this pattern to describe the relations among the URLs. Finally, since some long lists or long threads may be divided into several individual pages connected by page-flipping links, the web forum data extraction technique can archive them together by detecting the page-flipping links and treating all entries on the connected pages as a single page. (Generally, a page-flipping link is a link that connects the continuation pages of a single list or thread.) This greatly facilitates the subsequent data extraction.
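The following sketch illustrates the overall idea under strong simplifying assumptions: pages are clustered by an exact-match layout signature, and URL patterns are formed by generalizing numeric tokens. All URLs, DOM paths, and values are invented, and a real implementation would use approximate layout similarity rather than exact matching.

    import re
    from collections import defaultdict
    from urllib.parse import urlparse

    def layout_signature(tag_paths):
        # Crude stand-in for layout clustering: pages with identical sets
        # of DOM paths fall into the same vertex.
        return frozenset(tag_paths)

    def url_pattern(url):
        # Generalize a URL by replacing numeric tokens, so that /thread/99
        # and /thread/7 share the pattern /thread/<num>.
        return re.sub(r"\d+", "<num>", urlparse(url).path)

    # Sampled pages: URL, DOM paths, and out-links (all values invented).
    pages = [
        {"url": "http://f.example.com/board/1",   "paths": ("body/table/tr",), "links": ["http://f.example.com/thread/99"]},
        {"url": "http://f.example.com/board/2",   "paths": ("body/table/tr",), "links": ["http://f.example.com/thread/7"]},
        {"url": "http://f.example.com/thread/99", "paths": ("body/div.post",), "links": []},
        {"url": "http://f.example.com/thread/7",  "paths": ("body/div.post",), "links": []},
    ]

    # 1) Cluster sampled pages into vertices by layout signature.
    sig_ids, vertex_of = {}, {}
    for p in pages:
        sig = layout_signature(p["paths"])
        vertex_of[p["url"]] = sig_ids.setdefault(sig, "v%d" % len(sig_ids))

    # 2) Establish an arc when a page in a source vertex links to a page in
    #    a target vertex; describe the arc by the target URL pattern.
    arcs = defaultdict(set)
    for p in pages:
        for link in p["links"]:
            if link in vertex_of:
                arcs[vertex_of[p["url"]]].add((vertex_of[link], url_pattern(link)))

    print(vertex_of)   # board pages cluster into v0, thread pages into v1
    print(dict(arcs))  # {'v0': {('v1', '/thread/<num>')}}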
The second block depicts the feature extraction that takes place (block 204). In one embodiment of the technique, there are three kinds of features, according to their generation source: (1) inner-page features, which leverage the relations among the elements inside a page, such as the size and location of each element, alignment relations, inclusion relations among elements, and the sequence order of elements; (2) inter-vertex features, which are generated based on the above site-level knowledge. Links with similar functions usually navigate to the same vertex on the sitemap; for example, a list title usually navigates to the vertex containing post pages. The technique can determine the function of each link based on its location, which is a very helpful feature for assigning the correct labels to the corresponding elements; and (3) inner-vertex features. For pages in a given vertex, records with the same semantic labels (title, author, etc.) should be presented in the same location on those pages. The technique employs such features to improve the extraction results for pages in the same vertex.
Once the above-described features are obtained, in one embodiment, the technique utilizes Markov Logic Networks (MLNs) to model the aforementioned relational data and combine these features effectively. The web forum data extraction technique uses a joint inference model 206 to predict the location of the desired data structures (e.g., post title, post author, post time, and post content). Markov Logic Networks provide a general probabilistic model for relational data, and MLNs have been applied to joint inference under different scenarios, such as segmentation of citation records and entity resolution. By jointly inferring over the pages inside one vertex, the web forum data extraction technique can integrate all three feature types and compute a maximum a posteriori (MAP) assignment of the query predicates. This assignment can be used to extract the desired data and optionally store it in a database 218.
1.4 Details and Exemplary Embodiments
An overview of the technique having been provided, in this section, the details of the above-described steps of various embodiments of the web forum data extraction technique are described. The details include information on Markov Logic Networks (MLNs), as well as the specifics of the features used in extracting data.
1.4.1 Markov Logic Networks-Mathematical Description
In one embodiment of the web forum data extraction technique, such as, for example, the embodiment shown in the accompanying figure, Markov Logic Networks (MLNs) are employed to combine the collected evidence for joint inference. An MLN is a set of weighted first-order logic formulas, where each formula F_i has an associated weight w_i.
An MLN can be viewed as a template for constructing Markov Random Fields. With a set of formulas and constants, an MLN defines a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by

P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) ) = (1/Z) Π_i φ_i(x_{(i)})^{n_i(x)}   (1)

where Z is a normalization constant, n_i(x) is the number of true groundings of F_i in x, x_{(i)} is the state (truth values) of the atoms appearing in F_i, and φ_i(x_{(i)}) = e^{w_i}.
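To make Eq. (1) concrete, the following toy computation (with a single invented formula and weight, not drawn from the technique itself) enumerates all worlds of a two-element model and normalizes by Z:

    import itertools, math

    # One formula F with weight w: HasPostLink(e) => IsTitleNode(e).
    # HasPostLink is fixed evidence; IsTitleNode is unknown.
    elements = [0, 1]
    has_post_link = {0: True, 1: False}   # invented evidence
    w = 1.5

    def n_true(world):
        # n(x): number of true groundings of F in world x. A grounding is
        # false only when the antecedent holds and the conclusion fails.
        return sum(1 for e in elements if (not has_post_link[e]) or world[e])

    worlds = [dict(zip(elements, v))
              for v in itertools.product([False, True], repeat=len(elements))]
    Z = sum(math.exp(w * n_true(x)) for x in worlds)   # partition function
    for x in worlds:
        print(x, round(math.exp(w * n_true(x)) / Z, 3))

Worlds that satisfy more groundings of the formula receive exponentially more probability mass; here, the worlds labeling element 0 as a title node are preferred.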
Eq. (1) defines a generative MLN model; that is, it defines the joint probability of all the predicates. In one embodiment of the web forum data extraction technique for forum page segmentation, the evidence predicates and the query predicates are known a priori. Thus, the technique employs a discriminative MLN. Discriminative models have the great advantage of incorporating arbitrary useful features and have shown great promise compared to generative models. The web forum data extraction technique partitions the predicates into two sets: the evidence predicates X and the query predicates Q. Given an instance x, the discriminative MLN defines a conditional distribution as follows:

P(q | x) = (1/Z_x(w)) exp( Σ_{F_i ∈ F_Q} Σ_{g_j ∈ G_i} w_i g_j(q, x) )   (2)
where F_Q is the set of formulas with at least one grounding involving a query predicate, G_i is the set of ground formulas of the i-th first-order formula, and Z_x(w) is the normalization factor. g_j(q, x) is binary and equals 1 if the j-th ground formula is true and 0 otherwise.
With the conditional distribution in Eq. (2), web data extraction becomes the task of computing the maximum a posteriori (MAP) assignment of the query predicates q and extracting data from this assignment q*:

q* = arg max_q P(q | x)   (3)
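For a toy query this MAP assignment can be found by brute-force enumeration, as in the sketch below; the evidence values and weight are invented, and practical MLN systems use approximate algorithms such as MaxWalkSAT rather than enumeration:

    import itertools

    # Evidence TimeFormat(e) and one weighted rule:
    # TimeFormat(e) => IsTimeNode(e). Since Z_x(w) does not depend on q,
    # maximizing P(q | x) reduces to maximizing the weighted count of
    # satisfied ground formulas.
    time_format = {0: True, 1: False}   # invented evidence
    w = 2.0

    def score(q):
        return sum(w for e in q if (not time_format[e]) or q[e])

    assignments = [dict(zip(time_format, v))
                   for v in itertools.product([False, True], repeat=len(time_format))]
    q_star = max(assignments, key=score)
    print("q* =", q_star)   # q* = {0: True, 1: False}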
In one embodiment of the web forum data extraction technique, the technique mainly focuses on extracting the following six objects: list record, list title, post record, post author, post time, and post content. The atomic extraction units are HTML elements. Thus, in the MLN model, the technique defines the corresponding query predicates q as IsListRecord(i), IsTitleNode(i), IsPostRecord(i), IsAuthorNode(i), IsTimeNode(i), and IsContentNode(i), respectively, where i denotes the i-th element. The evidence x comprises the features of the HTML elements. In a discriminative MLN model as defined in Eq. (2), the evidence x can be arbitrary useful features. In one embodiment of the technique, the features include three types: inner-page features (e.g., the size and location of each element), inter-vertex features (e.g., the linkage relations among vertices of the sitemap), and inner-vertex features (e.g., the alignment of elements across pages of the same vertex). With these predefined features, the technique in one embodiment employs rules, expressed as formulas in the MLN (e.g., a post record element must contain post author, post time, and post content nodes, among others), to define interrelationships between objects. These formulas represent relationships among HTML elements. With these formulas, the resultant MLN can effectively capture the mutual dependencies among different extractions and thus achieve a globally consistent joint inference.
Note that in the above general definition, the technique can treat all of the HTML elements identically when formulating the query and evidence predicates. However, in practice, HTML elements show obviously different and non-overlapping properties. For example, the elements at the leaves of a DOM tree are quite different from the inner nodes: only an element at a leaf node can be a post author or a post time, and only an inner element can be a list record or a post record. Thus, the technique can group these elements into three non-overlapping groups, as will be discussed in more detail later. This can be implemented in an MLN model by defining the groups as different types. In this way, the web forum data extraction technique can significantly reduce the number of possible groundings when the MLN performs inference. This prior grouping knowledge also reduces ambiguity in the model and thus achieves better performance.
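A quick back-of-the-envelope computation illustrates the effect; the element counts are invented:

    # Groundings of a binary predicate such as InnerAlign(i, i') over 100
    # elements, with and without typing.
    n_text, n_hyper, n_inner = 40, 30, 30
    n_total = n_text + n_hyper + n_inner

    untyped = n_total ** 2                        # any element pairs with any element
    typed = n_text**2 + n_hyper**2 + n_inner**2   # pairs stay within one group
    print(untyped, typed)                         # 10000 vs. 3400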
1.4.2 Features
The following paragraphs describe the categories of DOM elements and the three types of features (inner-page, inter-vertex, and inner-vertex) used, in one embodiment of the web forum data extraction technique, to train a joint inference model and to identify the desired data.
1.4.2.1 Categories of Features
To accelerate the training and inference process, the DOM tree elements are divided into the following three categories according to their attributes. These include text elements, hyperlink elements and inner elements.
(a) Text element (t). Text elements always act as leaves in DOM trees and ultimately contain all of the extracted information. Plain-text information such as post time is identified from this kind of element during data extraction.
(b) Hyperlink element (h). Hyperlink elements correspond to hyperlinks in a web page, which are marked with anchor tags (e.g., <a>) in HTML files. Web pages inside a forum are connected to each other through hyperlinks; for example, list pages and post pages are linked together by the hyperlinks of post titles, pointing from the former to the latter. Inside a forum site, some desired information, such as post title and post author, is always enveloped in hyperlink elements.
(c) Inner element (i). All other elements inside a DOM tree, besides text elements and hyperlink elements, are defined as inner elements or inner nodes. In practice, list records, post records, and post contents are always enclosed in inner elements.
In one embodiment of the MLN model, the web forum data extraction technique treats the above three kinds of elements (text, hyperlink, and inner) separately to accelerate the training and inference process. In the following paragraphs, the three kinds of elements are represented as t, h, and i, respectively. The corresponding features are listed in Table 1.
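A minimal sketch of this three-way categorization, using Python's standard html.parser, is shown below. It is deliberately simplified (no full DOM construction, and every non-anchor tag is treated as an inner element); the markup is invented:

    from html.parser import HTMLParser

    class ElementTyper(HTMLParser):
        # Classify nodes into the three categories used by the technique:
        # text (t), hyperlink (h), and inner (i) elements.
        def __init__(self):
            super().__init__()
            self.types = []
        def handle_starttag(self, tag, attrs):
            self.types.append(("h" if tag == "a" else "i", tag))
        def handle_data(self, data):
            if data.strip():
                self.types.append(("t", data.strip()))

    p = ElementTyper()
    p.feed('<div><a href="/user/1">alice</a><span>2009-03-01</span></div>')
    print(p.types)
    # [('i', 'div'), ('h', 'a'), ('t', 'alice'), ('i', 'span'), ('t', '2009-03-01')]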
1.4.2.2 Inner-Page Features
Inner-page features leverage the relations among elements inside a page and are listed in Table 1. These features correspond to block 212 of block 204 in the accompanying figure.
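As one illustration of an inner-page observation, the sketch below computes a simple IsRepeatNode-style signal: an element whose DOM path repeats many times on a page is likely one record of a list. The paths and the threshold are invented:

    from collections import Counter

    dom_paths = [
        "body/table/tr", "body/table/tr", "body/table/tr",
        "body/div.header",
    ]
    counts = Counter(dom_paths)
    is_repeat = {path: c >= 3 for path, c in counts.items()}
    print(is_repeat)
    # {'body/table/tr': True, 'body/div.header': False}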
1.4.2.3 Inter-Vertex Features
The inter-vertex features are generated based on site-level knowledge. In a sitemap, the pages inside a given vertex usually have similar functions, as shown in the accompanying figure. For example, the predicates HasPostLink(i, h) and HasAuthorLink(i, h), used in the formulas below, record whether an element contains a hyperlink that navigates to the vertex of post pages or to the vertex of user-profile pages, respectively.
1.4.2.4 Inner-Vertex Features
In general, for different pages of the same vertex in the sitemap of a forum, the records with the same semantic labels (title, author, etc.) should be presented at the same DOM path. In one embodiment, the technique employs these alignment features to further improve the results within a set of pages belonging to the same template. These features can be leveraged for the three kinds of elements i, h, and t, respectively, and are represented as InnerAlignIV(i, i′), HyperAlignIV(h, h′), and TextAlignIV(t, t′). These features are also listed in Table 1 and correspond to block 216 of block 204 in the accompanying figure.
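A minimal sketch of this alignment evidence follows; the DOM paths and texts are invented. Two elements on different pages of the same vertex are taken to align when they occupy the same DOM path:

    page_a = {"body/div.thread/h1/a": "Thread title A", "body/div.sidebar/a": "Ad"}
    page_b = {"body/div.thread/h1/a": "Thread title B", "body/div.footer/a": "Help"}

    aligned = [(path, page_a[path], page_b[path])
               for path in page_a.keys() & page_b.keys()]
    print(aligned)
    # [('body/div.thread/h1/a', 'Thread title A', 'Thread title B')]

If a title node has been confidently identified on one page, alignment propagates that label to the matching element on the other page, even when that page is noisy.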
1.4.3 Formulas
In this section, the formulas used in the two models, for list pages and post pages respectively, in one embodiment of the technique are described in detail.
1.4.3.1 Formulas of List Page
In one embodiment of the web forum data extraction technique, it is assumed that list records are inner nodes and that list titles are contained in hyperlink nodes. In order to extract them accurately, one embodiment of the technique introduces rules that are expressed as the following formulas. There are two kinds of rules, which express the relations among the query predicates and the evidence. The relations for list record and list title are shown in the accompanying figure.
(1) Formulas for identifying a list record. A list record usually contains a list-title link and appears repeatedly. One can also identify a list record if a candidate element is aligned with a known list record inside a page 302, or is aligned with a known list record in another page 304 of the same vertex. This is shown in the accompanying figure and captured by the following formulas:
∀i, ContainPostLink(i) ∧ IsRepeatNode(i) ⇒ IsListRecord(i)   (4)
∀i, i′, IsListRecord(i) ∧ InnerAlign(i, i′) ⇒ IsListRecord(i′)   (5)
∀i, i′, IsListRecord(i) ∧ InnerAlignIV(i, i′) ⇒ IsListRecord(i′)   (6)
(2) Formulas for identifying a list title. A list title usually contains a link to a vertex of post pages and is contained in a list record. Formula (8) is useful when site-level information is not available. It is also possible to identify a list title if a candidate element is aligned with a known list title inside a page 306, or is aligned with a known list title in another page 308 of the same vertex. This is also shown in the accompanying figure:
∀i, h, IsListRecord(i) ∧ HasPostLink(i, h) ⇒ IsTitleNode(h)   (7)
∀i, h, IsListRecord(i) ∧ HasLongestLink(i, h) ⇒ IsTitleNode(h)   (8)
∀h, h′, IsTitleNode(h) ∧ HyperAlign(h, h′) ⇒ IsTitleNode(h′)   (9)
∀h, h′, IsTitleNode(h) ∧ HyperAlignIV(h, h′) ⇒ IsTitleNode(h′)   (10)
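The following sketch grounds formulas (4) and (7) on invented evidence, treating them as hard rules for readability; in the technique they are soft, weighted rules whose conclusions the MLN weighs against the other formulas:

    # Per-element evidence (invented element ids and values).
    contain_post_link = {"tr1": True, "tr2": True, "hdr": False}
    is_repeat_node    = {"tr1": True, "tr2": True, "hdr": False}
    has_post_link     = {("tr1", "a1"): True, ("tr2", "a2"): True}

    # Formula (4): ContainPostLink(i) AND IsRepeatNode(i) => IsListRecord(i)
    is_list_record = {i: contain_post_link[i] and is_repeat_node[i]
                      for i in contain_post_link}

    # Formula (7): IsListRecord(i) AND HasPostLink(i, h) => IsTitleNode(h)
    is_title_node = {h: is_list_record[i]
                     for (i, h), linked in has_post_link.items() if linked}

    print(is_list_record)  # {'tr1': True, 'tr2': True, 'hdr': False}
    print(is_title_node)   # {'a1': True, 'a2': True}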
1.4.3.2 Formulas of Post Page
A post record and post content should be contained in inner nodes, while a post author should be contained in a hyperlink node, and a post time always appears in a text node in time format. One embodiment of the technique can identify the desired information by inferring these predicates, using established rules that describe the required elements according to their own evidence. The relations among post record, post author, post time, and post content, respectively, are also drawn in the accompanying figures.
(1) Formulas for identifying a post record. A post record usually contains a post author link and a post time node, and appears repeatedly. The technique will also identify a post record if a candidate element is aligned with a known post record inside a page 402, or is aligned with a known post record in another page 404 of the same vertex. This is shown in the accompanying figure:
∀i, ContainAuthorLink(i) ∧ ContainTimeNode(i) ∧ IsRepeatNode(i) ⇒ IsPostRecord(i)   (11)
∀i, i′, IsPostRecord(i) ∧ InnerAlign(i, i′) ⇒ IsPostRecord(i′)   (12)
∀i, i′, IsPostRecord(i) ∧ InnerAlignIV(i, i′) ⇒ IsPostRecord(i′)   (13)
(2) Formulas for identifying a post author. A post author usually contains a link to the vertex of profile pages and is contained in a post record. The technique identifies a post author if a candidate element is aligned with a known post author inside a page 406, or is aligned with a known post author in another page 408 of the same vertex. This is also shown in the accompanying figure:
∀i, h, IsPostRecord(i) ∧ HasAuthorLink(i, h) ⇒ IsAuthorNode(h)   (14)
∀h, h′, IsAuthorNode(h) ∧ HyperAlign(h, h′) ⇒ IsAuthorNode(h′)   (15)
∀h, h′, IsAuthorNode(h) ∧ HyperAlignIV(h, h′) ⇒ IsAuthorNode(h′)   (16)
(3) Formulas for identifying a post time. A post time usually contains time-format content, and the post times on a page are sorted in ascending or descending order. The technique will also identify a post time if a candidate element is aligned with a known post time inside a page 502, or is aligned with a known post time in another page 504 of the same vertex. This is shown in the accompanying figure:
∀t, UnderSameOrder(t) ⇒ IsTimeNode(t)   (17)
∀t, t′, IsTimeNode(t) ∧ TextAlign(t, t′) ⇒ IsTimeNode(t′)   (18)
∀t, t′, IsTimeNode(t) ∧ TextAlignIV(t, t′) ⇒ IsTimeNode(t′)   (19)
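As an illustration of the evidence behind these formulas, the sketch below checks for time-format content and for ascending order across a page's candidate nodes; the time format string and sample values are assumptions:

    from datetime import datetime

    def parse_time(text, fmt="%Y-%m-%d %H:%M"):   # format is an assumption
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            return None

    candidates = ["2009-03-01 12:34", "2009-03-01 12:50", "by alice"]
    times = [t for t in (parse_time(c) for c in candidates) if t is not None]

    # UnderSameOrder-style check: post times in a thread should be monotonic.
    is_sorted = all(a <= b for a, b in zip(times, times[1:]))
    print(len(times), "time-like nodes; ordered:", is_sorted)
    # 2 time-like nodes; ordered: True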
(4) Formulas for identifying post content. Post content is usually a descendant of a post record and does not contain a post time or a post author. The technique identifies post content if a candidate element is aligned with known post content inside a page 506, or is aligned with known post content in another page 508 of the same vertex. This is also shown in the accompanying figure:
∀i, i′, IsRepeatNode(i) ∧ HasDescendant(i, i′) ∧ ContainLongText(i′) ∧ ¬ContainTimeNode(i′) ∧ ¬ContainHyperLinkAuthor(i′) ⇒ IsContentNode(i′)   (20)
∀i, i′, IsContentNode(i) ∧ InnerAlign(i, i′) ⇒ IsContentNode(i′)   (21)
∀i, i′, IsContentNode(i) ∧ InnerAlignIV(i, i′) ⇒ IsContentNode(i′)   (22)
The overview and details of various implementations of the web forum data extraction technique having been discussed, the next sections provide exemplary embodiments of processes and an architecture for employing the technique.
An exemplary process 600 employing the web forum data extraction technique is shown in the accompanying figure.
The web forum data extraction technique is designed to operate in a computing environment. The following description is intended to provide a brief, general description of a suitable computing environment in which the web forum data extraction technique can be implemented. The technique is operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices (for example, media players, notebook computers, cellular phones, personal data assistants, voice recorders), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Device 900 also contains communications connection(s) 912 that allow the device to communicate with other devices and networks. Communications connection(s) 912 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Device 900 may have various input device(s) 914 such as a display, a keyboard, mouse, pen, camera, touch input device, and so on. Output device(s) 916 such as speakers, a printer, and so on may also be included. All of these devices are well known in the art and need not be discussed at length here.
The web forum data extraction technique may be described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and so on, that perform particular tasks or implement particular abstract data types. The web forum data extraction technique may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should also be noted that any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.