Web pages provide an inexpensive and convenient way to make information available to consumers. However, as multimedia content, embedded advertising, and online services become increasingly prevalent in modern web pages, the web pages themselves have become substantially more complex. For example, in addition to their main content, many web pages display auxiliary content such as background imagery, advertisements, navigation menus, and links to additional content.
It is often the case that owners or consumers of web pages wish to utilize or adapt only a portion of the information presented in a web page. For instance, a user may desire to print a physical copy of an internet article without reproducing any of the irrelevant content on the web page containing the article. Similarly, an owner of a web page may wish to adapt a web page into another document, such as a marketing brochure, without including content in the web page that is superfluous to the new document. Such uses of only a portion of the content presented in a web page can require tedious effort on the part of a user to distinguish among the different types of content on the web page and retrieve only the desired content.
The accompanying drawings illustrate various embodiments of the principles described herein and are a part of the specification. The illustrated embodiments are merely examples and do not limit the scope of the claims.
Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
The present specification discloses various methods, systems, and devices for segmenting a web page into coherent functional blocks. The methods, systems, and devices disclosed in the present specification accomplish this goal by parsing the web page into a plurality of coherent and collectively exhaustive nodes, calculating at least one matrix of affinity values between the separate nodes, and clustering the nodes into functional areas based on the at least one matrix of affinity values.
The web page segmentation process described herein segments a web page into a number of meaningful functional or logical blocks. These functional blocks can be advantageously used to, for example, extract only the content from a web page that is useful to a specific application. In additional or alternative examples, the functional blocks may be advantageously used to preserve the visual continuity of content when reformatting or applying a new layout to the web page.
As used in the present specification and in the appended claims, the term “web page” refers to a document that can be retrieved from a server over a network connection and viewed in a web browser application.
As used in the present specification and in the appended claims, the term “node” refers to one of a plurality of coherent units into which the entire content of a web page has been partitioned.
As used in the present specification and in the appended claims, the term “collectively exhaustive,” as applied to a node, refers to the property wherein all such nodes for a particular web page comprise in their sum the totality of content displayed on that web page.
As used in the present specification and in the appended claims, the term “coherent,” as applied to a node, refers to the characteristic of having content only of the same type or property.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present systems and methods. It will be apparent, however, to one skilled in the art that the present systems and methods may be practiced without these specific details. Reference in the specification to “an embodiment,” “an example” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least that one embodiment, but not necessarily in other embodiments. The various instances of the phrase “in one embodiment” or similar phrases in various places in the specification are not necessarily all referring to the same embodiment.
The principles disclosed herein will now be discussed with respect to illustrative systems, devices, and methods for segmenting content in a web page.
Referring now to
The web page segmentation device (105) of the present example is a computing device configured to retrieve the web page (110) hosted by the web page server (115) and divide the web page (110) into multiple coherent, functional blocks. In the present example, this is accomplished by the web page segmentation device (105) requesting the web page (110) from the web page server (115) over the network (120) using the appropriate network protocol (e.g., Internet Protocol (“IP”)). Illustrative processes of segmenting the web page content will be set forth in more detail below.
To achieve its desired functionality, the web page segmentation device (105) includes various hardware components. Among these hardware components may be at least one processing unit (125), at least one memory unit (130), peripheral device adapters (135), and a network adapter (140). These hardware components may be interconnected through the use of one or more busses and/or network connections.
The processing unit (125) may include the hardware architecture necessary to retrieve executable code from the memory unit (130) and execute the executable code. The executable code may, when executed by the processing unit (125), cause the processing unit (125) to implement at least the functionality of retrieving the web page (110) and semantically segmenting the web page (110) into coherent functional blocks according to the methods of the present specification described below. In the course of executing code, the processing unit (125) may receive input from and provide output to one or more of the remaining hardware units.
The memory unit (130) may be configured to digitally store data consumed and produced by the processing unit (125). The memory unit (130) may include various types of memory modules, including volatile and nonvolatile memory. For example, the memory unit (130) of the present example includes Random Access Memory (RAM), Read Only Memory (ROM), and Hard Disk Drive (HDD) memory. Many other types of memory are available in the art, and the present specification contemplates the use of any type(s) of memory (130) in the memory unit (130) as may suit a particular application of the principles described herein. In certain examples, different types of memory in the memory unit (130) may be used for different data storage needs. For example, in certain embodiments the processing unit (125) may boot from ROM, maintain nonvolatile storage in the HDD memory, and execute program code stored in RAM.
The hardware adapters (135, 140) in the web page segmentation device (105) are configured to enable the processing unit (125) to interface with various other hardware elements, external and internal to the web page segmentation device (105). For example, peripheral device adapters (135) may provide an interface to input/output devices to create a user interface and/or access external sources of memory storage. Peripheral device adapters (135) may also create an interface between the processing unit (125) and a printer (145) or other media output device. For example, in embodiments where the web page segmentation device (105) is configured to generate a document based on functional blocks extracted from the web page's content, the web page segmentation device (105) may be further configured to instruct the printer (145) to create one or more physical copies of the document.
A network adapter (140) may provide an interface to the network (120), thereby enabling the transmission of data to and receipt of data from other devices on the network (120), including the web page server (115).
Referring now to
In the example of
In the present example, a URL (201) for a web page is received by a web page receiving module (205). For example, the web page receiving module (205) may perform the functions of fetching the web page from its server and rendering the web page to determine a layout of the content in the web page. The URL (201) may be specified by a user of the web page segmentation device (105,
Certain properties are desirable for the nodes resulting from the decomposition of the web page. The nodes should be atomic; in other words, the nodes should never have to be broken up into smaller pieces. The nodes should also be collectively exhaustive such that all nodes collectively contain all of the content visible in the web page. It is also very desirable that each node be coherent (i.e., contains content of the same property) and mutually exclusive (i.e., no two nodes contain the same content).
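The collectively exhaustive and mutually exclusive properties lend themselves to a mechanical check. The following is a minimal sketch only, assuming each node is represented as a set of content-item identifiers; the function name `check_partition` and this representation are hypothetical and are not part of the described device.

```python
from itertools import combinations


def check_partition(page_items, nodes):
    """Return (exhaustive, exclusive) for a candidate decomposition.

    page_items: set of identifiers for every visible content item (assumed).
    nodes: list of sets, one per node, holding the items that node contains.
    """
    # Collectively exhaustive: the union of all nodes equals the page content.
    covered = set().union(*nodes) if nodes else set()
    exhaustive = covered == set(page_items)
    # Mutually exclusive: no two nodes share a content item.
    exclusive = all(a.isdisjoint(b) for a, b in combinations(nodes, 2))
    return exhaustive, exclusive
```

A decomposition that passes both checks partitions the visible content; atomicity and coherence depend on the content itself and are not tested by this sketch.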
Many methods of decomposing web page content into nodes having the above properties are available or pending development. Any suitable method of decomposing web page content into such nodes is commensurate with the scope of the present specification. Decomposition criteria (215) may be provided to the decomposition module (210) to effect a desired method of web page decomposition.
One such method of decomposing a web page into nodes having the above properties is through the analysis of a hierarchical tree structure in a Document Object Model (DOM) of the web page. The DOM tree structure of the web page may be inherent to or generated from the Hypertext Markup Language (HTML) or other web document from which the web page is rendered. Thus, in certain embodiments the decomposition criteria (215) provided to the decomposition module (210) may be that a node is a leaf node in the DOM tree where:
An affinity matrix computation module (220) may calculate one or more matrices in which a numeric representation of the “affinity” between any two nodes of the web page is given. As used in the present specification and in the appended claims, the “affinity” between two nodes is a measure of the probability that the two nodes are interdependent or related to the same subject matter. In certain embodiments, multiple affinity matrices may be created for the nodes, in which each affinity matrix relies on a different criterion for calculating node affinity. These matrices may then be combined into a composite affinity matrix that specifies a composite affinity value for each possible pair of nodes from the web page.
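One possible way of combining several affinity matrices into a composite matrix is a weighted average, sketched below. The specification does not prescribe a combination rule; the weighted average, the function name `combine_affinity_matrices`, and the `weights` parameter are assumptions made for illustration.

```python
def combine_affinity_matrices(matrices, weights=None):
    """Weighted average of equally sized square matrices (lists of lists)."""
    if not matrices:
        raise ValueError("at least one affinity matrix is required")
    if weights is None:
        weights = [1.0] * len(matrices)  # assumption: equal weighting by default
    if len(weights) != len(matrices):
        raise ValueError("one weight is required per matrix")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    n = len(matrices[0])
    composite = [[0.0] * n for _ in range(n)]
    for matrix, weight in zip(matrices, weights):
        for i in range(n):
            for j in range(n):
                # Each criterion contributes in proportion to its weight.
                composite[i][j] += weight * matrix[i][j] / total
    return composite
```

Normalizing by the sum of the weights keeps the composite values in the same range as the inputs, so a single clustering threshold remains meaningful.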
Possible criteria for calculating the affinity between two different nodes include, but are not limited to, a Euclidean or block distance between the two nodes in the rendered web page; a distance between the two nodes in the DOM tree; the respective hierarchical levels of the two nodes in the DOM tree; a degree of horizontal alignment between the two nodes in the rendered web page; a degree of vertical alignment between the two nodes in the rendered web page; a number of other nodes displayed between the two nodes in the rendered web page; a difference in type between the two nodes (e.g., image, text (HTML heading1, heading2, paragraph), embedded content); a degree of difference in font size of text present in the two nodes; a difference in the number of characters in text present in the two nodes; a degree of difference in visual appearance (e.g., using one or more histograms of color, intensity, edge orientation, or magnitude); a difference in node size; and a degree of overlap or enclosure between the two nodes.
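A few of the above criteria may be sketched as follows. This is an illustrative sketch under stated assumptions: the `Node` record, its fields, the exponential mapping from distance to affinity, the `scale` and `tolerance` parameters, and the use of matching left edges as an alignment measure are all hypothetical choices, not formulas given in the specification.

```python
import math
from collections import namedtuple

# Hypothetical node record: bounding box in rendered pixels plus font size.
Node = namedtuple("Node", ["x", "y", "width", "height", "font_size"])


def center(node):
    return (node.x + node.width / 2, node.y + node.height / 2)


def distance_affinity(a, b, scale=200.0):
    """Map Euclidean distance between node centers to (0, 1]; closer is higher."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.exp(-math.hypot(ax - bx, ay - by) / scale)


def font_size_affinity(a, b):
    """Ratio of smaller to larger font size; 1.0 means identical sizes."""
    larger = max(a.font_size, b.font_size)
    if larger == 0:
        return 1.0
    return min(a.font_size, b.font_size) / larger


def left_edge_alignment(a, b, tolerance=5.0):
    """1.0 when the left edges line up within the tolerance, else 0.0."""
    return 1.0 if abs(a.x - b.x) <= tolerance else 0.0


def affinity_matrix(nodes, criterion):
    """Evaluate one criterion for every ordered pair of nodes."""
    return [[criterion(a, b) for b in nodes] for a in nodes]
```

Calling `affinity_matrix(nodes, distance_affinity)` and `affinity_matrix(nodes, font_size_affinity)` on the same node list yields one matrix per criterion, each with values between 0 and 1.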
A functional area clustering module (225) then performs clustering on the nodes based on the one or more affinity matrices. One simple method of doing so is to derive a connectivity map between the nodes based on one or more predetermined or adaptively computed thresholds (230). In other words, if the measured affinity between two nodes is higher than a predetermined or adaptively computed threshold, the two nodes are “connected.” Groups of interconnected nodes are then clustered together to create functional blocks, thereby completing the segmentation of the web page.
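The threshold-based connectivity clustering described above may be sketched as a connected-components computation. The sketch below assumes a symmetric affinity matrix given as a list of lists and uses a union-find structure, which is one of several equivalent ways to group interconnected nodes; the function name `cluster_nodes` is hypothetical.

```python
def cluster_nodes(affinity, threshold):
    """Group node indices whose pairwise affinity exceeds the threshold.

    Two nodes are "connected" when affinity[i][j] > threshold; each group of
    interconnected nodes (a connected component) becomes one functional block.
    """
    n = len(affinity)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumption: the matrix is symmetric, so only the upper triangle is read.
    for i in range(n):
        for j in range(i + 1, n):
            if affinity[i][j] > threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    blocks = {}
    for i in range(n):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())
```

Because connectivity is transitive, two nodes with low direct affinity may still land in the same block when a chain of strongly related nodes links them.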
It can be important to determine the appropriate clustering threshold (230) to achieve satisfactory segmentation results. In certain embodiments, the clustering threshold (230) may be based on the type of the web page and the application of the segmentation. Alternatively, a peak value of the distribution of the affinities may be chosen as the threshold (230) for each web page. The threshold may therefore adapt to the web page and be flexible on many different types of web pages.
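An adaptive threshold taken from the peak of the affinity distribution may be sketched as follows. The specification does not state how the peak is located; the fixed-width histogram, the `bins` parameter, and the function name `peak_affinity_threshold` are assumptions made for illustration.

```python
def peak_affinity_threshold(affinity, bins=10):
    """Return the center of the most populated histogram bin of affinities.

    Only off-diagonal entries of the upper triangle are considered, since a
    node's affinity with itself carries no information about clustering.
    """
    n = len(affinity)
    values = [affinity[i][j] for i in range(n) for j in range(i + 1, n)]
    if not values:
        raise ValueError("at least two nodes are required")

    low, high = min(values), max(values)
    if high == low:
        return low  # all pairs have the same affinity; no distribution to bin

    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # Clamp so the maximum value falls in the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1

    peak = counts.index(max(counts))  # first bin with the highest count
    return low + (peak + 0.5) * width
```

The returned value can be passed directly as the clustering threshold, so the threshold is recomputed for each web page rather than fixed in advance.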
In certain embodiments, one or more additional modules (not shown) may be present in the functionality (200) of the web page segmentation device (105,
For example, the web page segmentation device (105,
This process may be performed automatically in response to an automatic or user-generated trigger. Thus, in certain embodiments a user may instruct a computer to print a web page containing an article of interest by pressing a print button. The computer may segment the web page into functional blocks as described above, and then determine which of those blocks is most relevant to the article of interest using user-generated or automatically obtained keywords. The computer may then automatically generate a document incorporating only those functional blocks that are believed to be components of the article itself (e.g., as distinguished from advertisements, navigation information, background images, irrelevant embedded content, etc.) and print the document.
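The selection of relevant functional blocks using keywords may be sketched as follows. This is a hedged illustration only: the keyword-density score, the `min_score` cutoff, and the function names are hypothetical, and the specification does not define how relevance is measured.

```python
import re


def keyword_score(text, keywords):
    """Fraction of words in the block text that match one of the keywords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    wanted = {keyword.lower() for keyword in keywords}
    hits = sum(1 for word in words if word in wanted)
    return hits / len(words)


def select_relevant_blocks(blocks, keywords, min_score=0.05):
    """Return indices of blocks whose keyword density meets the cutoff.

    blocks: list of strings, one per functional block (assumed to hold the
    concatenated text of the nodes clustered into that block).
    """
    return [
        index
        for index, text in enumerate(blocks)
        if keyword_score(text, keywords) >= min_score
    ]
```

The selected indices identify the blocks to be carried into the generated document; blocks without text, such as images, would need a different relevance signal.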
In other examples, the web page segmentation device (105,
Referring now to
The preceding description has been presented only to illustrate and describe embodiments and examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form disclosed. Many modifications and variations are possible in light of the above teaching.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2010/000523 | 4/19/2010 | WO | 00 | 9/15/2012 |