Web pages located on the World Wide Web and accessed via the Internet include a variety of content including text, images, and other forms of multimedia. These web pages are often divided into multiple portions or regions by horizontal lines, vertical lines, and frames. These lines are separator lines.
When viewed in terms of web page design, content located within the different regions of the web page defined by the separator lines has different semantic meanings (i.e., the relationships of characters or groups of characters to their meanings, independent of the manner of their interpretation and use) or document functions (e.g., a portion of an article or a sidebar). Being able to detect separator lines within web pages is useful in subsequent processing of a web page including, for example, web page printing, block level based web page searching, web page segmentation, and many other applications.
The accompanying drawings illustrate various embodiments of the principles described herein and are a part of the specification. The illustrated embodiments are merely examples and do not limit the scope of the claims.
Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
Web pages provide an inexpensive and convenient way to make information available to its consumers. However, as the inclusion of multimedia content, embedded advertising, and online services becomes increasingly more prevalent in modern web pages, the web pages themselves have become substantially more complex. For example, in addition to their main content, many web pages display auxiliary content such as background imagery, advertisements, or navigation menus, and links to additional content. Web pages are often divided into multiple parts or segments by horizontal lines, vertical lines, and frames.
The detection of these visual separators can assist in a number of web page operations. For example, owners or consumers of web pages may wish to utilize or adapt only a portion of the information presented in a web page. The visual separators may assist in automatically defining segments contained in a web page. Once the content of the web page is divided into segments, the segments which contain the desired information can be identified and the remainder of the segments discarded. For instance, a user may desire to print a physical copy of an internet article without reproducing any of the irrelevant content on the web page containing the article. Visual separators can be one indicator which allows for the print-worthy content to be segmented from other information such as advertisements, headers, footers, or other extraneous information. Visual separators could be used in a variety of other applications such as porting web pages to mobile devices with limited screen sizes, clipping web content for inclusion into a composite document, search, information retrieval, information management, archiving, and other applications.
There are a number of challenges in correctly and automatically identifying visual separators from web page code. For example, web pages vary widely by content type. Common types of web pages include: news, shopping, blog, map, and recipe web pages. The web page layouts also vary widely across the different types of web pages. The web pages also include a variety of content, including text, images, video and flash. To effectively extract visual separators from the web page code, the visual separator algorithm uses a number of techniques, including: identification of DOM tag names which denote visual separation, analysis of border properties, detection of color differences, and identification of image repetition.
As used in the present specification and in the appended claims, the term “web page” refers to a document that can be retrieved from a server over a network connection and viewed in a web browser application. The term “visual separator” refers to an element or arrangement of elements in a web page which graphically partition a web page into coherent segments. As used in the present specification and in the appended claims, the term “coherent,” as applied to a web page segment, refers to the characteristic of having content/functionality of the same type or property.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present systems and methods. It will be apparent, however, to one skilled in the art that the present systems and methods may be practiced without these specific details. Reference in the specification to “an embodiment,” “an example” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least that one embodiment, but not necessarily in other embodiments. The various instances of the phrase “in one embodiment” or similar phrases in various places in the specification are not necessarily all referring to the same embodiment.
Referring now to
The web page analysis device (105) of the present example is a computing device configured to retrieve the web page (110) hosted by the web page server (115) and identify visual separators within the web page (110) using code analysis. In the present example, this is accomplished by the web page analysis device (105) requesting the web page (110) from the web page server (115) over the network (120) using the appropriate network protocol (e.g., Internet Protocol (“IP”)). Illustrative processes for automatic selection of the main content in web pages are set forth in more detail below.
To achieve its desired functionality, the web page analysis device (105) includes various hardware components. Among these hardware components may be at least one processing unit (125), at least one memory unit (130), peripheral device adapters (135), and a network adapter (140). These hardware components may be interconnected through the use of one or more busses and/or network connections.
The processing unit (125) may include the hardware architecture necessary to retrieve executable code from the memory unit (130) and execute the executable code. The executable code may, when executed by the processing unit (125), cause the processing unit (125) to implement at least the functionality of retrieving the web page (110) and analyzing the web page (110) for automatic selection of its main content according to the methods of the present specification described below. In the course of executing code, the processing unit (125) may receive input from and provide output to one or more of the remaining hardware units.
The memory unit (130) may be configured to digitally store data consumed and produced by the processing unit (125). The memory unit (130) may include various types of memory modules, including volatile and nonvolatile memory. For example, the memory unit (130) of the present example includes Random Access Memory (RAM), Read Only Memory (ROM), and Hard Disk Drive (HDD) memory. Many other types of memory are available in the art, and the present specification contemplates the use of any type(s) of memory (130) in the memory unit (130) as may suit a particular application of the principles described herein. In certain examples, different types of memory in the memory unit (130) may be used for different data storage needs. For example, in certain embodiments the processing unit (125) may boot from ROM, maintain nonvolatile storage in the HDD memory, and execute program code stored in RAM.
The hardware adapters (135, 140) in the web page analysis device (105) are configured to enable the processing unit (125) to interface with various other hardware elements, external and internal to the web page analysis device (105). For example, peripheral device adapters (135) may provide an interface to input/output devices to create a user interface and/or access external sources of memory storage. Peripheral device adapters (135) may also create an interface between the processing unit (125) and a printer (145) or other media output device. For example, in embodiments where the web page analysis device (105) is configured to generate a document based on functional blocks extracted from the web page's content, the web page analysis device (105) may be further configured to instruct the printer (145) to create one or more physical copies of the document.
A network adapter (140) may provide an interface to the network (120), thereby enabling the transmission of data to and receipt of data from other devices on the network (120), including the web page server (115).
The mark up language is often used in combination with a variety of other protocols which extend its capabilities. For example, HTML often uses Document Object Model (DOM) trees, hierarchies and elements. DOM is a cross-platform and language independent convention for representing and interacting with web page elements in mark up languages. HTML and DOM are also combined with style sheet languages such as Cascading Style Sheets (CSS) which describe the presentation semantics of a document written in the markup language.
In the illustrative visual separator algorithm (202) shown in
The DOM tree is then traversed to generate a DOM node list (step 224). Each node is then analyzed to detect visual separators by a DOM node analysis engine (234). The node analysis may include a number of steps including: tag name analysis (step 224); border property analysis (step 254); detecting background color differences (step 264); and recognizing image repetition (step 274). Each of these steps in detecting visual separators is discussed in greater detail below.
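By way of illustration only, the generation of a DOM node list can be sketched in Python using the standard library HTML parser. The class name NodeListBuilder, the function build_node_list, the partial VOID_TAGS set, and the sample markup are hypothetical names introduced for this sketch and are not part of the described system. The sketch assumes well-formed markup in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag (a partial, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NodeListBuilder(HTMLParser):
    """Collects (depth, tag, attributes) tuples in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        # Record the node at the current depth, then descend if it can have children.
        self.nodes.append((self.depth, tag, dict(attrs)))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Void elements never increased the depth, so they must not decrease it.
        if tag not in VOID_TAGS:
            self.depth -= 1


def build_node_list(html):
    """Traverse the markup and return a flat node list in document order."""
    builder = NodeListBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes


sample = '<div id="Content"><h4>Title</h4><hr><p>Body</p></div>'
node_list = build_node_list(sample)
```

Each entry of the resulting list could then be passed in turn to the individual analysis steps. A production system would more likely obtain the node list from a browser's own DOM rather than from a streaming parser.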
Tag name analysis (step 224) includes recognizing HTML tags which directly create visual separators. For example, the HTML tag <hr> creates a horizontal line in an HTML page. This horizontal line is a visual separator. Another example is the HTML tag <textarea> which defines a multi-line text area which can hold an unlimited number of characters. The size of a textarea can be specified by row or column attributes or through CSS height and width properties. The edges/borders of an area defined by <textarea> can represent a visual separation between the text and the surrounding elements. According to one embodiment, the tag name analysis is designed to identify HTML tags which directly create one or more horizontal or vertical visual separators.
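One possible form of the tag name analysis is sketched below in Python. The SEPARATOR_TAGS table, the TagNameAnalyzer class, and the find_tag_separators function are assumptions made for this sketch. Only the two tags named above are listed, and an actual embodiment may recognize a different or larger set.

```python
from html.parser import HTMLParser

# Tags assumed, for this sketch, to directly create visual separators.
SEPARATOR_TAGS = {"hr": "horizontal line", "textarea": "box border"}


class TagNameAnalyzer(HTMLParser):
    """Records every start tag whose name appears in SEPARATOR_TAGS."""

    def __init__(self):
        super().__init__()
        self.separators = []

    def handle_starttag(self, tag, attrs):
        if tag in SEPARATOR_TAGS:
            self.separators.append({"tag": tag, "kind": SEPARATOR_TAGS[tag]})


def find_tag_separators(html):
    """Return the separator-creating tags found in the markup, in document order."""
    analyzer = TagNameAnalyzer()
    analyzer.feed(html)
    analyzer.close()
    return analyzer.separators


found = find_tag_separators(
    "<p>intro</p><hr><textarea rows='2'></textarea><p>end</p>"
)
```

The lookup is by tag name alone, which keeps this step inexpensive. Determining the on-page position of each separator would require layout information that this sketch does not attempt to compute.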
Border property analysis (step 254) recognizes visual separators which are created by HTML border properties which are wider than zero. For example, the following code represents a DOM “div” element which uses a CSS <border> property to surround text with a dotted orange border which is two pixels wide.
The <border> CSS property used in this example is a flexible command which can be used to create a wide variety of borders which surround or partially surround images, text or other elements. Because commands such as the <border> CSS property create horizontal or vertical lines, patterns, or whitespaces, they can be analyzed to identify visual separations present in a web page. A variety of other commands and methods can also be used to create borders in web pages. The border property analysis may also be configured to detect visual separators which are directly or indirectly created by a wide variety of borders and border commands. The border property analysis then outputs the visual separators which correspond to the borders which have widths greater than zero pixels.
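A minimal sketch of the border width test is given below in Python. It assumes the border is declared through the border shorthand in an inline style string and that the width is given in pixels. The function names border_width_px and has_border_separator and the example style string are hypothetical. Keyword widths such as thin, per-side properties such as border-top, and style sheets are not handled.

```python
import re


def border_width_px(style):
    """Return the pixel width declared by the `border` shorthand, or 0.0 if none is found."""
    for declaration in style.split(";"):
        name, _, value = declaration.partition(":")
        if name.strip().lower() != "border":
            continue
        # Look for a length such as "2px" or "1.5px" anywhere in the shorthand value.
        match = re.search(r"(\d+(?:\.\d+)?)px", value)
        if match:
            return float(match.group(1))
    return 0.0


def has_border_separator(style):
    """A border counts as a visual separator only if it is wider than zero pixels."""
    return border_width_px(style) > 0


# Illustrative inline style for a two pixel wide dotted orange border.
example_style = "border: 2px dotted orange; padding: 4px"
example_width = border_width_px(example_style)
```

In practice the computed style reported by a rendering engine would be a more reliable source of border widths than the raw declaration text.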
Background color differences can also be used to identify visual separators (step 264). According to one illustrative embodiment, the background colors of various DOM nodes are compared with the background colors of adjacent or parent DOM nodes. If the difference in background colors is greater than a threshold value, the transition between the backgrounds is interpreted as a visual separation. The threshold value may be a predetermined value or may be dynamically determined from characteristics of the web page being analyzed.
The visual separations created by differences in background color are typically located along the transition between the different adjacent backgrounds. For example, the following web code defines a DOM header element “h4” which has a white background color: h4 {background-color: white;}. Similarly, a DOM paragraph element “p” can be defined which has a blue background called out in hexadecimal notation: p {background-color: #1078E1;}. If the header and paragraph are adjacent to each other, the different background colors will create a visual separation between the two elements.
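The comparison of two background colors against a threshold may be sketched as follows. The specification does not state how the color difference is measured. Euclidean distance in RGB space, the default threshold of 50, the small NAMED_COLORS table, and the function names are all assumptions made for this sketch.

```python
# Minimal named-color table for the sketch; real CSS defines many more names.
NAMED_COLORS = {"white": (255, 255, 255), "black": (0, 0, 0)}


def parse_color(value):
    """Convert a named color or a #RRGGBB string into an (r, g, b) tuple."""
    value = value.strip().lower()
    if value in NAMED_COLORS:
        return NAMED_COLORS[value]
    if value.startswith("#") and len(value) == 7:
        return tuple(int(value[i:i + 2], 16) for i in (1, 3, 5))
    raise ValueError(f"unsupported color: {value!r}")


def color_difference(a, b):
    """Euclidean distance between two colors in RGB space (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(parse_color(a), parse_color(b))) ** 0.5


def is_color_separator(a, b, threshold=50.0):
    """Treat the transition as a visual separation when the difference exceeds the threshold."""
    return color_difference(a, b) > threshold


# The white header and blue paragraph backgrounds from the example above.
header_vs_paragraph = is_color_separator("white", "#1078E1")
```

A perceptual color space could be substituted for RGB if differences that are more closely aligned with human vision are desired.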
Small background images within a webpage can form visual separators by repetition in horizontal or vertical directions. By analyzing the web code for repetition of small background images (step 274) these visual separators can be identified.
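One way such repetition might be recognized from the web code is sketched below. The sketch assumes that the repetition is expressed through the CSS background-repeat property and that the pixel dimensions of the background image are already known. The function name repetition_separator, the max_tile_px limit of 16 pixels used to decide that an image is small, and the sample declarations are hypothetical.

```python
def repetition_separator(declarations, image_size, max_tile_px=16):
    """Classify a repeated background image as a horizontal or vertical separator.

    declarations: dict mapping CSS property names to values for one DOM node.
    image_size: (width, height) of the background image in pixels.
    Returns "horizontal", "vertical", or None.
    """
    if "background-image" not in declarations:
        return None
    # CSS repeats background images in both directions by default.
    repeat = declarations.get("background-repeat", "repeat").strip().lower()
    width, height = image_size
    # A thin image tiled along x forms a horizontal line; along y, a vertical line.
    if repeat == "repeat-x" and height <= max_tile_px:
        return "horizontal"
    if repeat == "repeat-y" and width <= max_tile_px:
        return "vertical"
    return None


dotted_rule = repetition_separator(
    {"background-image": "url(dot.png)", "background-repeat": "repeat-x"},
    image_size=(4, 2),
)
```

Images tiled in both directions are deliberately ignored here, since they more often form a filled background than a line.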
As visual separators which are derived from the node properties are identified and output by the DOM node analysis engine (234), they are added to a visual separator list (284) as shown by the arrows on the right side of the flowchart. The DOM node analysis engine (234) repeats the analysis for each node.
In some embodiments, visual separators generated by different methods are also added to the visual separator list (284). For example, visual separators can be extracted from a rendered image of the web page. Techniques and examples of extracting visual separators from images of a web page are further discussed in PCT App. No. PCT/CN2010/______ attorney docket number 201001634, entitled “Detecting Separator Line in a Web Page,” to Suk Hwan Lim et al., filed on Jul. XX, 2010, which is incorporated herein by reference in its entirety.
After the visual separator list is assembled, visual separators with one or more coinciding attributes are merged (step 294) by a merge module. For example, if both a border and a background result in the identification of two overlying visual separators, the two visual separators may be merged to form a single visual separator. In some embodiments, intersecting separators may be merged to form more distinct boundaries within the web page. For example, if horizontal and vertical separators intersect, the two separators could be merged to form a portion of a rectangle. In some embodiments, the visual separators may not actually overlie each other, but may be parallel and adjacent to each other. These visual separators could also be merged. The merging of redundant visual separators (step 294) results in a final visual separator list (296) which represents the detected graphic divisions of the web page.
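The merging of overlying or parallel and adjacent separators can be illustrated with the following Python sketch. The Separator data structure, the merge_separators function, and the two pixel tolerance are assumptions introduced here. The merging of intersecting horizontal and vertical separators into a portion of a rectangle is not covered.

```python
from dataclasses import dataclass


@dataclass
class Separator:
    orientation: str  # "h" for horizontal, "v" for vertical
    position: float   # y coordinate for "h", x coordinate for "v"
    start: float      # extent along the line, in pixels
    end: float


def merge_separators(separators, tolerance=2.0):
    """Merge same-orientation separators that lie within `tolerance` pixels and overlap in extent."""
    merged = []
    ordered = sorted(separators, key=lambda s: (s.orientation, s.position, s.start))
    for sep in ordered:
        for kept in merged:
            same_line = (kept.orientation == sep.orientation
                         and abs(kept.position - sep.position) <= tolerance)
            overlaps = sep.start <= kept.end and kept.start <= sep.end
            if same_line and overlaps:
                # Extend the kept separator to cover both extents.
                kept.start = min(kept.start, sep.start)
                kept.end = max(kept.end, sep.end)
                break
        else:
            # No match found: keep a copy so the input list is not modified.
            merged.append(Separator(sep.orientation, sep.position, sep.start, sep.end))
    return merged


candidates = [
    Separator("h", 100, 0, 300),   # e.g. from a border
    Separator("h", 101, 0, 300),   # e.g. from a background color change
    Separator("v", 50, 0, 400),
]
final_list = merge_separators(candidates)
```

In this example the two nearly coincident horizontal separators collapse into one, while the vertical separator is retained unchanged.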
The root element in this DOM tree is the Content element (210) which has six sub-trees (209): Banner (215); Header (220); MainCol (225); AdCol (230); Reviews (235); and Footer (240). For purposes of illustration, subelements (250-285) are shown only for the MainCol sub-tree (225). Dashed lines extending to the right of the other sub-trees show the continuation of the sub-trees with elements which are not illustrated in
The MainCol sub-tree (225) has two elements, LeftCol (250) and RightCol (255), at the next hierarchal level. LeftCol (250) has two elements at the lowest hierarchal level (257): MainImg (260) and SimRec (265). The RightCol (255) has four elements at the lowest hierarchal level (257): Rating (270), Descr (275), Ingred (280), and Prep (285). The elements at the lowest hierarchal level (257) are also called leaf nodes.
The visual separator algorithm (202,
The algorithm continues by analyzing the AdCol element (230), which creates a column on the right hand side of the web page that contains advertisements. The algorithm recognizes a number of borders and an <hr> tag which produces a horizontal dividing line (221). The algorithm next analyzes the MainCol element (225;
The algorithm analyzes the Reviews element (235) and recognizes that it has a background color which is substantially different from the backgrounds of the surrounding elements of the web page. For example, the algorithm may compare the background color of the Reviews element (235) to that of its parent node, Content (210). Because the web page area of child nodes is typically encompassed by that of a parent node, the comparison of background colors between child and parent nodes can be particularly effective.
After determining the difference between background colors, this difference is compared to a threshold. If the difference is greater than the threshold, the algorithm adds appropriate visual separators to the visual separator list (step 284). If the difference is less than the threshold, the algorithm determines that no visual separators should be added to the visual separator list.
As discussed above, the threshold value can be determined in a number of ways. A first method for determining the threshold value may be to set a predetermined level for the color difference that creates a visual separator. A more contextual approach to determining the threshold value is to analyze the web page to calculate the threshold. For example, the threshold value may be determined by examining background color differences between parent and child nodes across the whole web page. If the range of differences across the web page is small, the threshold will be correspondingly low. If the range of differences is large, the threshold will be correspondingly large. This adapts the threshold to the visual context in the web page and allows for more accurate determinations of visual separators based on background colors.
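The contextual approach can be sketched as follows. The specification states only that the threshold tracks the range of parent and child background color differences across the web page. The particular formula used here (a fixed fraction of that range), the fraction of 0.5, the fallback value of 50, and the function name adaptive_threshold are assumptions chosen for illustration.

```python
def adaptive_threshold(differences, fraction=0.5, fallback=50.0):
    """Derive a color difference threshold from the differences observed on one page.

    differences: parent/child background color differences for the whole page.
    fraction: assumed share of the observed range used as the threshold.
    fallback: assumed predetermined threshold used when no differences exist.
    """
    if not differences:
        return fallback
    spread = max(differences) - min(differences)
    # A small range yields a low threshold; a large range yields a large one.
    return fraction * spread


muted_page = adaptive_threshold([0, 10, 20])
vivid_page = adaptive_threshold([0, 100, 200])
```

With these inputs a page having only subtle background changes receives a lower threshold than a page having strongly contrasting backgrounds, so the same relative change is detected on both.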
The merging of redundant visual separators (step 294,
For purposes of illustration, the horizontal and vertical visual separators are not shown as being joined at the corners in
In sum, the visual separator algorithm and system described above are effective in automatically extracting visual separators from web code such as HTML, DOM, and CSS code elements. The visual separator detection effectively leverages the web page HTML content, such as tag names, tag properties, color differences, and image repetition. The use of this information provides detection results which are accurate and meaningful. Further, this HTML based approach can be performed quickly and with minimal memory requirements.
The preceding description has been presented only to illustrate and describe embodiments and examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form disclosed. Many modifications and variations are possible in light of the above teaching.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2010/075580 | 7/30/2010 | WO | 00 | 1/24/2013 |