Web pages provide an inexpensive and convenient way to make information available to consumers. However, as multimedia content, embedded advertising, and online services become increasingly prevalent in modern web pages, the web pages themselves have become substantially more complex. For example, in addition to their main content, many web pages display auxiliary content such as background imagery, advertisements, navigation menus, and links to additional content.
It is often the case that owners or consumers of web pages wish to utilize or adapt only a portion of the information presented in a web page. Such uses of only a portion of the content presented in a web page can require tedious effort on the part of a user to distinguish among the different types of content on the web page and retrieve only the desired content. Automatic selection of the main content in web pages can eliminate extraneous or undesired content and significantly streamline a number of workflows. For instance, a user may desire to print a physical copy of an internet article without reproducing any of the irrelevant content on the web page containing the article. Similarly, an owner of a web page may wish to adapt a web page into another document, such as a marketing brochure, without including content in the web page that is superfluous to the new document. Additionally, a user may wish to display only the most relevant web content on a computing device with a limited screen size. Other applications which may benefit from automatic selection of the main content in web pages include: search, information retrieval, information management, archiving, and other applications.
The accompanying drawings illustrate various embodiments of the principles described herein and are a part of the specification. The illustrated embodiments are merely examples and do not limit the scope of the claims.
Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
The present specification discloses various methods, systems, and devices for automatically selecting the main part of a web page. As discussed above, there are many applications where automatically selecting the main part of a web page can be advantageous. For purposes of explanation, the specification uses the illustrative example of selecting the main part of a web page to enhance the printing of the web page. Currently, when a web page is printed, it includes a variety of contents. For example, in addition to the main content, many web pages display content such as background imagery, advertisements, navigation menus, headers/footers, and links to additional content. Some of the contents may be printworthy, but the user may not want to print some or all of the auxiliary contents. Ideally, the algorithm automatically selects only the main content and presents it to the user for printing.
There are a number of challenges in automatic selection of main content in web pages. For example, web pages vary widely by content type. Common types of web pages include: news, shopping, blog, map, and recipe web pages. The web page layouts also vary widely across the different types of web pages. The web pages also include a variety of content, including text, images, video, and Flash. To effectively select the main content in web pages, the algorithm determines not only a relative ordering of importance of content but also an absolute determination of whether content can be categorized as main content. According to one illustrative example, the algorithm determines the block, area, or areas of the web page which contain the main content.
As used in the present specification and in the appended claims, the term “web page” refers to a document that can be retrieved from a server over a network connection and viewed in a web browser application.
As used in the present specification and in the appended claims, the term “segment” refers to one of a plurality of coherent units into which the entire content of a web page has been partitioned.
As used in the present specification and in the appended claims, the term “coherent,” as applied to a segment, refers to the characteristic of having content with the same type or property.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present systems and methods. It will be apparent, however, to one skilled in the art that the present systems and methods may be practiced without these specific details. Reference in the specification to “an embodiment,” “an example” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least that one embodiment, but not necessarily in other embodiments. The various instances of the phrase “in one embodiment” or similar phrases in various places in the specification are not necessarily all referring to the same embodiment.
Referring now to
The web page analysis device (105) of the present example is a computing device configured to retrieve the web page (110) hosted by the web page server (115) and divide the web page (110) into multiple coherent, functional blocks. In the present example, this is accomplished by the web page analysis device (105) requesting the web page (110) from the web page server (115) over the network (120) using the appropriate network protocol (e.g., Internet Protocol (“IP”)). Illustrative processes for automatic selection of the main content in web pages are set forth in more detail below.
To achieve its desired functionality, the web page analysis device (105) includes various hardware components. Among these hardware components may be at least one processing unit (125), at least one memory unit (130), peripheral device adapters (135), and a network adapter (140). These hardware components may be interconnected through the use of one or more buses and/or network connections.
The processing unit (125) may include the hardware architecture necessary to retrieve executable code from the memory unit (130) and execute the executable code. The executable code may, when executed by the processing unit (125), cause the processing unit (125) to implement at least the functionality of retrieving the web page (110) and analyzing the web page (110) for automatic selection of its main content according to the methods of the present specification described below. In the course of executing code, the processing unit (125) may receive input from and provide output to one or more of the remaining hardware units.
The memory unit (130) may be configured to digitally store data consumed and produced by the processing unit (125). The memory unit (130) may include various types of memory modules, including volatile and nonvolatile memory. For example, the memory unit (130) of the present example includes Random Access Memory (RAM), Read Only Memory (ROM), and Hard Disk Drive (HDD) memory. Many other types of memory are available in the art, and the present specification contemplates the use of any type(s) of memory in the memory unit (130) as may suit a particular application of the principles described herein. In certain examples, different types of memory in the memory unit (130) may be used for different data storage needs. For example, the processing unit (125) may boot from ROM, maintain nonvolatile storage in the HDD memory, and execute program code stored in RAM.
The hardware adapters (135, 140) in the web page analysis device (105) are configured to enable the processing unit (125) to interface with various other hardware elements, external and internal to the web page analysis device (105). For example, peripheral device adapters (135) may provide an interface to input/output devices to create a user interface and/or access external sources of memory storage. Peripheral device adapters (135) may also create an interface between the processing unit (125) and a printer (145) or other media output device. For example, in examples where the web page analysis device (105) is configured to generate a document based on functional blocks extracted from the web page's content, the web page analysis device (105) may be further configured to instruct the printer (145) to create one or more physical copies of the document.
A network adapter (140) may provide an interface to the network (120), thereby enabling the transmission of data to and receipt of data from other devices on the network (120), including the web page server (115).
The root element in this DOM tree is the Content element (210) which has six sub-trees (209): Banner (215); Header (220); MainCol (225); AdCol (230); Reviews (235); and Footer (240). For purposes of illustration, subelements (250-285) are shown only for the MainCol sub-tree (225). Dashed lines extending to the right of the other sub-trees show the continuation of the sub-trees with elements which are not illustrated in
The MainCol sub-tree (225) has two elements, LeftCol (250) and RightCol (255), at the next hierarchal level. LeftCol (250) has two elements at the lowest hierarchal level (257): MainImg (260) and SimRec (265). The RightCol (255) has four elements at the lowest hierarchal level (257): Rating (270), Descr (275), Ingred (280), and Prep (285). The elements at the lowest hierarchal level (257) are also called leaf nodes.
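By way of illustration only, the MainCol sub-tree (225) and its leaf nodes can be sketched as a small tree data structure. The `Node` class and the `leaf_nodes` helper are hypothetical names introduced for this sketch and are not part of the described system; a real DOM element carries many more properties.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified DOM element: a name plus its child elements (hypothetical)."""
    name: str
    children: list[Node] = field(default_factory=list)


def leaf_nodes(node: Node) -> list[Node]:
    """Return the leaf nodes under `node` in document order."""
    if not node.children:
        return [node]
    leaves: list[Node] = []
    for child in node.children:
        leaves.extend(leaf_nodes(child))
    return leaves


# The MainCol sub-tree (225) with the element names used in the text.
main_col = Node("MainCol", [
    Node("LeftCol", [Node("MainImg"), Node("SimRec")]),
    Node("RightCol", [Node("Rating"), Node("Descr"), Node("Ingred"), Node("Prep")]),
])

leaf_names = [n.name for n in leaf_nodes(main_col)]
```

In this sketch, `leaf_names` lists the six elements at the lowest hierarchal level (257) in left-to-right order.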
In this example, the MainCol (225) sub-tree contains the “main content” which a user would typically print or archive for further reference. The MainCol (225) contains a left column (250) and a right column (255). In left column (250), an image of the dish is shown in the MainImg element (260). Similar recipes are shown below the image in the SimRec element (265). The right column (255) includes an overall rating for the dish (270), a description of the dish (275), ingredients of the dish (280), and preparation instructions (285). These elements (260-285) may have a number of additional subelements.
The main content (290) is shown as a dashed box around a number of elements including the MainImg element (260), SimRec element (265), overall rating for the dish (270), ingredients of the dish (280), and preparation instructions (285). Not included in the main content are the Banner (215), Header (220), AdCol (230), Reviews (235) and Footer (240). A visual separator (292) divides the header (220) from the main content (290).
Selecting the print-worthy content area can be largely divided into three steps. The first step is the web page segmentation (305) which divides the web page into several coherent areas. The second step is the block importance computation (330) which calculates the importance score for each block or area. The third step is the extraction (335) which outputs the most print-worthy area given the segmentation and the block importance results. Details for each step are described in the following subsections.
The web page segmentation (305) divides the web page into coherent areas where each area has a meaningful function in the document. Examples of meaningful function include but are not limited to title and header. In general, the web page segmentation (305) uses a bottom-up approach. To do this, the atoms collection module (310) first divides the web page into many basic elements called atoms. The atoms are basic elements of the web page which generally cannot be broken up into smaller pieces. The atom generation is collectively exhaustive because it includes all useful contents and mutually exclusive because there is no spatial overlap between atomic elements.
According to one illustrative example, the atoms can be thought of as leaf nodes in the DOM tree (200,
In summary, the atoms collection module (310) gathers all the visible and useful leaf nodes by crawling the DOM tree (200,
The affinities between the atoms are then calculated by the affinities computation module (315). The affinities (or distances) are computed between all the atoms collected by the atoms collection module (310). The underlying idea is to measure how “similar” the two atoms are in many different ways and then judge how likely it is for the two atoms to be merged or belong to one area/block. By using a wide variety of characteristics/dimensions to calculate affinities, the affinities computation becomes more robust and accurate.
According to one illustrative example, there are tens of affinity dimensions. For example, there may be 60 or more affinity dimensions used by some affinities computation modules (315). Each of the affinity dimensions may be classified into one of the following categories: i) geometric, ii) DOM structure, iii) tag type, and iv) style. One example of geometric affinity is the Euclidean distance between the spatial locations of the two atoms. The larger this distance is, the less likely the two atoms are to be clustered together. Another example is horizontal/vertical overlap between atoms and whether they are aligned horizontally or vertically. One example of DOM structure affinity is the distance one needs to traverse in the DOM tree (200,
Visual separations in the web pages are detected by the visual separator detection module (325). The term “visual separations” refers to the division of web pages into multiple parts by lines or frames. We name such lines visual separator lines. Frames are included in the visual separators because a frame is composed of two horizontal lines and two vertical lines. Visual separator detection (325) computes the presence and the locations of visual separators in web pages. Such lines provide indications as to how the web page should be segmented. For example, an area needs to be divided further if a strong visual divider cuts across the area. We employ several methods to detect visual separators. First, elements with certain tags are identified as visual separators. Examples of such tags include <HR> and <TEXTAREA>. Second, HTML elements with border properties can be examined. These HTML elements are marked as visual separators if the corresponding borders are wider than zero. Third, a DOM node's background color may be different from its parent DOM node's background color. If the difference is bigger than a threshold, then the four borders of this DOM node are taken as visual separators. Fourth, the visual separator detection module (325) detects tiny images that are repeated many times, since visual separators are often generated in such a way. The results from these and other methods can be appropriately merged to avoid counting the same line multiple times. Once the visual separators are located, they are encoded into the affinity values between the atoms. If a visual separator is present between two atoms, then the affinity values between those atoms are very low, making it very difficult for them to be clustered into one segment.
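The first three detection methods described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the visual separator detection module (325): the `Element` class, the `SEPARATOR_TAGS` set, the RGB distance measure, and the `COLOR_THRESHOLD` value are all hypothetical, and the fourth method (repeated tiny images) and the merging of duplicate lines are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

SEPARATOR_TAGS = {"hr"}   # assumption: tags treated as separators outright
COLOR_THRESHOLD = 60.0    # assumption: illustrative RGB distance threshold


@dataclass
class Element:
    """A simplified rendered DOM node (hypothetical)."""
    tag: str
    border_widths: tuple[float, float, float, float] = (0, 0, 0, 0)  # top, right, bottom, left
    background: tuple[int, int, int] = (255, 255, 255)               # RGB
    parent: Optional[Element] = None


def color_distance(a: tuple[int, int, int], b: tuple[int, int, int]) -> float:
    """Euclidean distance between two RGB colors (one possible measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def is_visual_separator(el: Element) -> bool:
    """Apply the tag, border, and background-color checks in turn."""
    if el.tag.lower() in SEPARATOR_TAGS:
        return True
    if any(width > 0 for width in el.border_widths):
        return True
    if el.parent is not None:
        return color_distance(el.background, el.parent.background) > COLOR_THRESHOLD
    return False
```

A fuller treatment would return the location of each separator line rather than a single flag, so that the lines could be encoded into the affinity values between the atoms on either side.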
The atoms are then clustered based on various affinity values by the atoms clustering module (320). Similar atoms are clustered into segments by examining their affinity values and selectively clustering the atoms with high affinities. The atoms clustering module (320) uses a variety of information including the DOM and the visual representation of the page rather than relying only on a few aspects of the web page. While clustering can be performed by globally examining all the affinity values, a computationally simpler approach is to use composite affinities formed from various linear combinations of affinity values where the weights are determined heuristically. In some examples, the weights, combinations, and other parameters can be obtained from a training data set.
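As an illustration, a composite affinity formed as a linear combination of per-dimension affinity values might look like the sketch below. The dimension names and the weight values are hypothetical placeholders chosen for this example; the specification states only that the weights are determined heuristically or obtained from a training data set.

```python
from __future__ import annotations


def composite_affinity(values: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted linear combination of per-dimension affinity values.

    Dimensions without a value contribute nothing to the sum.
    """
    return sum(weight * values.get(name, 0.0) for name, weight in weights.items())


# Hypothetical weights, one per affinity category.
weights = {"geometric": 0.4, "dom_structure": 0.3, "tag_type": 0.2, "style": 0.1}

# Hypothetical affinity values for one pair of atoms, each in [0, 1].
pair_values = {"geometric": 1.0, "dom_structure": 0.5, "tag_type": 0.0, "style": 1.0}

score = composite_affinity(pair_values, weights)
```

A single scalar per pair of atoms keeps the subsequent clustering step simple, at the cost of discarding the per-dimension detail.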
The atoms are clustered into segments by merging the atoms whose affinities are above a certain threshold. Note that the threshold is not pre-determined but computed adaptively based on the input. The threshold is chosen such that a small increase in its value results in the largest decrease in the number of segments. Additionally or alternatively, additional constraints such as minimum and maximum bounds may limit the total number of segments. These thresholds can be selected to reflect the spatial characteristics of the web page design.
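The merging step and an adaptive choice of threshold can be sketched as shown below. This is one possible reading rather than the described implementation: the sketch treats affinity as a similarity, merges pairs with a union-find structure, and picks the candidate threshold that sits just before the largest change in segment count. The function names and the use of a fixed candidate list are assumptions, and the minimum and maximum bounds on segment count are omitted.

```python
from __future__ import annotations


def cluster(n_atoms: int, affinity: dict[tuple[int, int], float],
            threshold: float) -> list[list[int]]:
    """Merge atoms whose pairwise affinity is above `threshold` (union-find)."""
    parent = list(range(n_atoms))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for (a, b), value in affinity.items():
        if value > threshold:
            parent[find(a)] = find(b)

    groups: dict[int, list[int]] = {}
    for i in range(n_atoms):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def pick_threshold(n_atoms: int, affinity: dict[tuple[int, int], float],
                   candidates: list[float]) -> float:
    """Return the candidate just before the largest change in segment count.

    Requires at least two candidate thresholds.
    """
    ordered = sorted(candidates)
    counts = [len(cluster(n_atoms, affinity, t)) for t in ordered]
    jumps = [abs(counts[i + 1] - counts[i]) for i in range(len(ordered) - 1)]
    best = max(range(len(jumps)), key=jumps.__getitem__)
    return ordered[best]


# Hypothetical composite affinities for four atoms.
affinity = {(0, 1): 0.9, (2, 3): 0.85, (1, 2): 0.2}
threshold = pick_threshold(4, affinity, [0.1, 0.5, 0.95])
segments = cluster(4, affinity, threshold)
```

With these illustrative values, the segment count changes most between the candidates 0.5 and 0.95, so the sketch settles on 0.5 and groups atoms 0 and 1 into one segment and atoms 2 and 3 into another.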
Web page segmentation is further described in PCT App. No. PCT/CN2010/000523, entitled “Segmenting a Web Page into Coherent Functional Blocks,” to Suk Hwan Lim et al., filed Apr. 19, 2010, which is incorporated herein by reference in its entirety.
After the web page segmentation (305), the block importance score for each segment is computed by the block importance module (330). The importance of a segment is determined by many factors/features. The score of each feature is calculated and the scores are then combined using appropriate weighting values to obtain the final block importance score. These weights can be derived from a training data set or pre-defined by rules.
The following features are illustrative examples of features which can be used to calculate importance scores for the various segments in a web page.
Following the web page segmentation (305) and the block importance (330) calculation, the main content is selected based on the segmented blocks and their importance scores by the extraction module (335). In one example, the extraction algorithm (335) selects only a single sub-tree in the DOM tree of the original web page. This constraint is based on the observation that the main content area in most pages can be represented by one sub-tree. This additional constraint allows the extraction algorithm (335) to be more robust and stable.
As is shown in
Most web pages contain headers, footers or sidebars, which do not contribute to and are not part of the main content area. Consequently, the approximate main area detection (340) identifies and deletes these superfluous sub-trees from the DOM tree and other data to form a stripped-down web page. The stripped-down web page is a generous estimate of what portions of the web page may contain the main content area. This estimate is performed by computing features similar to those described above, but for the sub-trees instead of segments. Due to the mixture of content within a sub-tree (rather than the homogenous content for each segment), this method works well in determining the non-relevant content which should be filtered out of the web page.
The stripped-down web page and/or DOM tree is then passed to the best sub-tree computation module (345). In an alternative example, the entire web page is passed into the best sub-tree computation module (345) through the block importance module (330). The best sub-tree computation module (345) calculates the main content area (350). Where the stripped-down web page is used, all the remaining sub-trees in the stripped-down web page are considered as candidates for the main content node. Where the entire web page is passed through the block importance module (330) to the best sub-tree computation module (345), all of the sub-trees in the web page are considered as candidates. Final scores are computed for each candidate sub-tree. The final score for each sub-tree is calculated by multiplying the importance score of the sub-tree and its area score.
In order to compute the final importance score for the sub-tree, all the segments that spatially intersect with the sub-tree are found. Since each segment has a block importance score computed by the block importance module (330), the weighted average of the block importance scores can be calculated. The weights are proportional to the areas that intersect between the segments and the candidate sub-tree.
The area score is a function of the area or the size of the candidate sub-tree and reflects the prior knowledge of the desired size of the print-worthy content. This function can be modified to shape the behavior of main content selection. For example, the desired size of print-worthy content may be represented by a range of sizes, ratios of width to height, or other method. The desired size may be determined based on a number of factors, including the type of web page, printer settings, printer media sizes, user preferences, and other factors. The desired size of print-worthy content is used to penalize overly large or overly small candidate sub-trees whose selection would be detrimental to the user experience of web page printing.
The final score for each sub-tree is then calculated by combining the importance score and the area score for each candidate sub-tree. The candidate sub-tree with the highest score is then selected as the main content (350) for printing.
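The scoring of candidate sub-trees can be sketched as follows. The area-weighted average of block importance scores and the multiplication by the area score follow the description above; the particular shape of `area_score` (full credit between two page-area fractions, falling off linearly outside them), the `Rect` type, and all numeric values are assumptions made purely for illustration.

```python
from __future__ import annotations

from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding box in page coordinates (hypothetical)."""
    x: float
    y: float
    w: float
    h: float


def intersection_area(a: Rect, b: Rect) -> float:
    width = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    height = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(width, 0.0) * max(height, 0.0)


def importance(subtree: Rect, segments: list[tuple[Rect, float]]) -> float:
    """Average of segment scores, weighted by overlap with the sub-tree."""
    overlaps = [intersection_area(subtree, rect) for rect, _ in segments]
    total = sum(overlaps)
    if total == 0:
        return 0.0
    return sum(o * score for o, (_, score) in zip(overlaps, segments)) / total


def area_score(subtree: Rect, page_area: float, lo: float = 0.2, hi: float = 0.8) -> float:
    """Assumed prior: penalize sub-trees covering too little or too much of the page."""
    fraction = subtree.w * subtree.h / page_area
    if fraction < lo:
        return fraction / lo
    if fraction > hi:
        return max(0.0, (1.0 - fraction) / (1.0 - hi))
    return 1.0


def best_subtree(candidates: dict[str, Rect], segments: list[tuple[Rect, float]],
                 page_area: float) -> str:
    """Return the name of the candidate with the highest final score."""
    return max(
        candidates,
        key=lambda name: importance(candidates[name], segments)
        * area_score(candidates[name], page_area),
    )


# Illustrative 100 x 100 page: header, main column, ad column, footer.
segments = [
    (Rect(0, 0, 100, 10), 0.1),
    (Rect(0, 10, 70, 80), 0.9),
    (Rect(70, 10, 30, 80), 0.2),
    (Rect(0, 90, 100, 10), 0.1),
]
candidates = {
    "Content": Rect(0, 0, 100, 100),
    "MainCol": Rect(0, 10, 70, 80),
    "AdCol": Rect(70, 10, 30, 80),
}
selected = best_subtree(candidates, segments, page_area=100 * 100)
```

In this sketch the root candidate is rejected by the area score because it covers the whole page, and the ad column loses on importance, which leaves the main column as the selected sub-tree.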
To provide a concrete example, the content selection algorithm (302) shown in
In a first step, the web page segmentation module (305,
The affinities between the atoms are then calculated by the affinities computation module (315,
A variety of additional affinities can also be calculated. For example, the vertical or horizontal alignment of the atoms can be determined. The affinities computation module may analyze the atoms in the header (220,
Additionally, the affinity computation module may determine that the Descr element (275,
Visual separations in the web pages are detected by the visual separator detection module (325,
The atoms are then clustered based on various affinity values by the atoms clustering module (320,
After the web page segmentation (305,
Following the web page segmentation (305,
The approximate main area detection module (340,
This stripped-down web page and DOM tree is then passed to the best sub-tree computation module (345,
Scores are computed for the review sub-tree (235,
In sum, the content selection algorithm and system described above are effective in automatically selecting the main content from a wide variety of web pages. As discussed above, the selection of the main content of web pages can facilitate a number of workflows. For instance, a user may desire to print a physical copy of an internet article without reproducing any of the irrelevant content on the web page containing the article. In another example, the user may wish to scrape the main content from the web page to form a clip. The clip is then combined with other data to form a composite document. Other applications which may benefit from automatic selection of the main content in web pages include search, information retrieval, information management, and archiving.
The preceding description has been presented only to illustrate and describe embodiments and examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form disclosed. Many modifications and variations are possible in light of the above teaching.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2010/001157 | 7/30/2010 | WO | 00 | 3/25/2013 |