With the development of search engines and related technologies, information in web pages has become readily accessible to users. However, not all parts of a web page are useful. Some sections may meet users' needs, while others, such as advertisements and side bars, do not. Although users have their personal preferences, there are still some sections in a web page that most users find valuable.
The accompanying drawings illustrate various examples of various aspects of the present disclosure. It will be appreciated that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. It will be appreciated that in some examples one element may be designed as multiple elements or that multiple elements may be designed as one element. In some examples, an element shown as an internal component of another element may be implemented as an external component and vice versa.
FIGS. 9(a) and 9(b) show the recommendation results for the same web pages by the original smart print and by a method of the present disclosure, respectively.
A typical way to detect valuable sections in a web page is based on its structural features, which is also referred to as a page-based detection method. In this type of method, page segmentation is an essential pre-processing step, in which a page is divided into sections and each section is given a different weight based on some features. These page segmentation algorithms can partition a page into several regions of different importance. A document object model (DOM)-based method to extract useful information from the HTML document of a web page has been proposed. A DOM is a cross-platform and language-independent convention for representing and interacting with objects in various markup language documents. Aspects of the DOM, such as its elements, may be addressed and manipulated. An element is an individual component of the particular markup language used. A DOM-tree renders these elements as nodes within a tree. A node may also correspond to a small unit of data that resides on a web page, which is also referred to as a section in this disclosure. The DOM-based method parses the DOM tree of a web page instead of its raw HTML document. As a result, the time and storage consumed by HTML parsing decrease significantly.
Building on the DOM-based approach, some vision-based segmentation and block importance learning algorithms have been developed. Besides the DOM tree structure, a vision-based algorithm also takes visual cues into consideration and can compute the importance of a region or block from its spatial and content features. Such methods can weight the importance of each block effectively, but the resulting notion of importance is not always reasonable, since it comes from the style of the web page rather than from the needs of users.
Another method to extract meaningful articles from web pages has also been developed, in which the DOM tree and visual features are used to divide pages and extract the article needed by the user from text nodes. Compared with algorithms that use all the text nodes in the DOM tree, this method tries to partition those nodes into several text segments. Then, by finding an optimized subsequence of text nodes in those segments, it can recommend a continuous and valuable article to users. In this way, the extracted articles can avoid the influence of irrelevant information such as advertisements or auxiliary information. Such a method can provide a good experience to users when they need automatic extraction of text articles, but it is limited to pages with a large amount of text content, such as news pages, encyclopedia entries, etc.
Another DOM and visual based method has been developed to detect print-worthy content in a web page. Unlike the previous article extraction methods, this method does not focus only on text sections, but can also select other kinds of sections such as images. This method divides web pages and calculates the importance weight of each block from the DOM tree and visual features. The process of print-worthy section recommendation normally has three steps: web page segmentation, block importance calculation, and extraction. In the segmentation step, a web page is divided into its smallest elements, and these elements are then clustered into blocks or areas based on the affinities computed between elements. After the page is partitioned into reasonable blocks, the importance of each block is calculated, wherein importance is determined by the visual features of the blocks, and blocks that are highlighted, contain few hyperlinks and are located high on the page are given a high importance weight. Finally, recommended sections are extracted by computing the best subtree, i.e. the one that has the highest weight score. Following this strategy, useful sections in many kinds of pages can be extracted. But it still has some shortcomings: first, visual features may not reflect customers' opinions, since the weighting comes from personal experience; second, it cannot adapt to some pages very well, for example, if the text in the page is very long, the algorithm will ignore an article located at the bottom; third, it does not have an automatic process to adjust recommendation results through the feedback of users.
In examples of the present disclosure, instead of those page-based methods, generally accepted valuable sections in a public web page are detected based on a user log. Compared with the page-based methods, the log-based method presented herein can obtain more precise and reasonable valuable sections.
In the following, certain examples according to the present disclosure are described in detail with reference to the drawings.
With reference to
The system 100 may include a server 102, and one or more client computers 104, in communication over a network 106. As illustrated in
The network 106 may be a local area network (LAN), a wide area network (WAN), or another network configuration. The network 106 may include routers, switches, modems, or any other kind of interface device used for interconnection. The network 106 may connect to several client computers 104. Through the network 106, several client computers 104 may connect to the server 102. The client computers 104 may be structured similarly to the server 102.
The server 102 may have other units operatively coupled to the processor 108 through the bus 110. These units may include tangible, machine-readable storage media, such as storage 122. The storage 122 may include any combination of hard drives, read-only memory (ROM), random access memory (RAM), RAM drives, flash drives, optical drives, cache memory, and the like. Storage 122 may include a receiving unit 124 and a detecting unit 126. The receiving unit 124 may receive an input webpage in which valuable sections are to be detected. The web page may be accessed using the network 106. The detecting unit 126 detects valuable sections in the input webpage based on a user log of a reference webpage associated with the input webpage, wherein the reference webpage can be either the same webpage as the input webpage or one or more webpages similar to the input webpage. A user log indicates the previous usage history of a webpage by one or more users and may comprise the path, in the DOM-tree that represents the webpage, of a section within the webpage that was accessed (including clipped or printed) by the user(s). Each section or block in the page corresponds to a path in the DOM-tree, which is stored as an XPath in the user log. For example, the XPath HTML/BODY/DIV[1] denotes a path in the DOM-tree that begins with the HTML tag and ends with the first DIV tag in the subtree of the BODY tag. Such user logs can be stored in a log database (not shown) in the storage 122.
Although not shown in
With reference to
With reference to
With reference to
Different people may select different valuable sections in the same page, but there are still some sections that most users consider to be useful. The goal of log synthesizing is to find those commonly acknowledged useful sections and present them to users. Log synthesizing may return a set of XPaths that represents users' common view of the valuable sections. To calculate such common sections, a similarity measure between XPaths needs to be defined first. According to an example, a tag edit distance is used to measure the similarity between two XPaths.
The tag edit distance is an extension of the edit distance. A tag in an XPath is regarded as the basic element, and the XPath is divided into tags at each '/'. When calculating a tag edit distance, only the update and insert operations are used, because other operations such as delete may result in the loss of tag-related information. Two XPaths are compared tag by tag. If two tags are equal, the comparison proceeds to the next tag; otherwise one tag is updated to make them equal, or a new tag is inserted at the end of the shorter XPath if it has no tag to compare with. In the end, the two XPaths are identical, and the number of operations needed in this process is obtained. For example, assuming that there are two XPaths, XPath1: HTML/BODY/DIV[1] and XPath2: HTML/BODY/DIV[2]/DIV[1], in order to change XPath1 to XPath2, the DIV[1] tag in XPath1 should be updated and a DIV[1] should be inserted at the end of XPath1. The number of operations needed is 2. This number is defined herein as an example of the tag edit distance between two XPaths.
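The tag-by-tag comparison described above can be illustrated with a minimal Python sketch. The function names `split_tags` and `tag_edit_distance` are hypothetical and are introduced here only for illustration; the sketch assumes that the tags of the two XPaths are aligned by position and that every missing trailing tag costs one insert.

```python
def split_tags(xpath: str) -> list[str]:
    """Split an XPath such as 'HTML/BODY/DIV[1]' into its tags."""
    return [tag for tag in xpath.split("/") if tag]


def tag_edit_distance(xpath_a: str, xpath_b: str) -> int:
    """Count the update and insert operations that make two XPaths equal."""
    tags_a, tags_b = split_tags(xpath_a), split_tags(xpath_b)
    # One update for every aligned position whose tags differ.
    updates = sum(1 for a, b in zip(tags_a, tags_b) if a != b)
    # One insert for every tag the shorter XPath lacks at its end.
    inserts = abs(len(tags_a) - len(tags_b))
    return updates + inserts


# The two XPaths from the example: one update plus one insert.
print(tag_edit_distance("HTML/BODY/DIV[1]", "HTML/BODY/DIV[2]/DIV[1]"))  # 2
```

Because delete operations are excluded, the distance under these assumptions reduces to the number of mismatched aligned tags plus the difference in length.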
A webpage has record sets of several users {R1, R2 . . . Rn}, and each user selects several sections in the page, which are represented as XPaths in a user log Ri={x1, x2 . . . xn}. As shown in block 401 of
Where Tdistance is the tag edit distance between the jth XPath in the intersection set Xi, denoted Xij, and the tth XPath in the subtraction set of Xu and Xi, |Xij| is the number of tags in Xij, |Xi| is the number of XPaths in the intersection set Xi, Xst is the tth XPath in the subtraction set, and |Xst| is the number of tags in Xst. Here, the subtraction set is used instead of the union set because the intersection set is a subset of the union set, so the minimal distance would always be 0 if the XPaths in the intersection set were not removed from the union set.
According to the above formula, a similarity score can be calculated for every page that appears repeatedly in the log. According to an example of the disclosure, a threshold τ can be set for the similarity measure. If Similarity(Xi, Xu)>τ, the intersection set Xi is recommended to the user, because the XPaths in the intersection reflect most users' idea of the valuable sections and the XPaths in the subtraction set are only slight adjustments of the common valuable sections. If Similarity(Xi, Xu)<τ, users have significantly different ideas about which sections are valuable, so no recommendation should be made from the log; instead, a page-based tool can be used to select valuable sections, as shown in block 306 of
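The role of the subtraction set can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the intersection set is taken to contain the XPaths selected by every user and the union set those selected by at least one user, the helper names are hypothetical, and the similarity formula itself is not reproduced. The sketch only shows why the minimal tag edit distance collapses to 0 when it is measured against the union set.

```python
def tag_edit_distance(xpath_a, xpath_b):
    """Updates for mismatched aligned tags plus inserts for missing tags."""
    tags_a, tags_b = xpath_a.split("/"), xpath_b.split("/")
    updates = sum(1 for a, b in zip(tags_a, tags_b) if a != b)
    return updates + abs(len(tags_a) - len(tags_b))


def synthesize_sets(records):
    """Return (intersection, union, subtraction) for per-user XPath sets.

    Assumption: intersection = XPaths chosen by all users,
    union = XPaths chosen by any user.
    """
    intersection = set.intersection(*records)
    union = set.union(*records)
    return intersection, union, union - intersection


def min_distance(xpath, candidates):
    """Smallest tag edit distance from xpath to any candidate (None if empty)."""
    return min((tag_edit_distance(xpath, c) for c in candidates), default=None)


# Hypothetical record sets of three users for the same page.
records = [
    {"HTML/BODY/DIV[1]", "HTML/BODY/DIV[2]/DIV[1]"},
    {"HTML/BODY/DIV[1]", "HTML/BODY/DIV[2]"},
    {"HTML/BODY/DIV[1]"},
]
x_i, x_u, x_s = synthesize_sets(records)
for xpath in sorted(x_i):
    print(xpath, min_distance(xpath, x_u), min_distance(xpath, x_s))
```

For the common XPath HTML/BODY/DIV[1], the minimal distance to the union set is 0 because the XPath itself is a member, whereas the minimal distance to the subtraction set is 1, which carries information about how far the other selections deviate.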
With reference to
For a new-coming page, there is no previous record in the user log, so it is impossible to recommend valuable sections in this page to a user by log synthesizing alone. According to an example of the present disclosure, a weighted tag tree based method is proposed to recommend valuable sections by leveraging the user log of similar web pages. A set of XPaths, one for each section in the new-coming page, is first generated, as shown in block 501. Then, a weighted tag tree is generated based on the XPaths of the similar webpages in the user log, as shown in block 502 and described in detail below.
Since the detection of similar web pages is not the focus of this disclosure, we suppose that a set of similar pages {Ps1, Ps2, . . . , Psn} for a new-coming page Pnew has been obtained. Then a weighted tag tree is constructed from the selected records in this similar page set, wherein “selected” means that a user selected a section as a valuable section. These records are converted into a tree by the following process. Since all XPaths begin with the tag “HTML”, “HTML” is set as the root of the tree. Then each selected XPath is scanned, and each tag of the XPath is set as a child of its previous tag; if the same tag already exists in the same position, the count of that node is incremented by one, and this count is used as the weight of the node. That is, the weight of each tag in the weighted tag tree is the number of times that the tag appears at the same position in all the paths constituting the weighted tag tree. For example, there are 4 selected XPaths:
1: HTML/BODY/DIV[0]/DIV/H1[0]
2: HTML/BODY/DIV[0]/DIV[1]
3: HTML/BODY/DIV[0]/H1[0]
4: HTML/BODY/DIV[1]
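The construction just described can be sketched in Python as follows. The class `TagNode` and the functions `build_weighted_tag_tree` and `dump` are hypothetical names chosen for this illustration, and the sketch assumes that "the same position" means the same tag under the same parent node.

```python
from dataclasses import dataclass, field


@dataclass
class TagNode:
    tag: str
    weight: int = 0                      # times this tag occurred at this position
    children: dict = field(default_factory=dict)


def build_weighted_tag_tree(xpaths):
    """Merge the selected XPaths into one tree rooted at HTML."""
    root = TagNode("HTML")
    for xpath in xpaths:
        tags = xpath.split("/")
        if tags[0] != root.tag:          # every XPath is expected to start at HTML
            continue
        root.weight += 1
        node = root
        for tag in tags[1:]:
            # Reuse the child if the tag already exists here, else create it.
            node = node.children.setdefault(tag, TagNode(tag))
            node.weight += 1
    return root


def dump(node, depth=0):
    """Return an indented listing of the tree, one 'tag (weight)' per line."""
    lines = [f"{'  ' * depth}{node.tag} ({node.weight})"]
    for child in node.children.values():
        lines.extend(dump(child, depth + 1))
    return lines


selected = [
    "HTML/BODY/DIV[0]/DIV/H1[0]",
    "HTML/BODY/DIV[0]/DIV[1]",
    "HTML/BODY/DIV[0]/H1[0]",
    "HTML/BODY/DIV[1]",
]
print("\n".join(dump(build_weighted_tag_tree(selected))))
```

Under these assumptions, HTML and BODY each receive a weight of 4 because all four XPaths pass through them, HTML/BODY/DIV[0] receives 3, and every other node receives 1.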
The resulting weighted tag tree of these XPaths is shown in
After the weighted tag tree is constructed, a valuable section is detected from the new-coming page based on a comparison between the weighted tag tree and each of the XPaths in the set generated for the new-coming page, as shown in block 503. Specifically, detecting a valuable section based on this comparison includes: letting each XPath go through the weighted tag tree; summing the weights of the nodes that are passed by the XPath as a score of the XPath; and detecting a valuable section in the webpage based on the value of the score.
For example, a new coming page has the following XPath sequences:
1: HTML/BODY/DIV[0]/DIV[0]/H1[0]
2: HTML/BODY/DIV[1]/DIV[2]
3: HTML/BODY/DIV[0]/DIV[0]/DIV[1]/P1[1]
4: HTML/BODY/DIV[0]/DIV[1]
Let them go through the weighted tag tree shown in
Once the score of each XPath is calculated, a valuable section in the webpage can be detected based on the scores. For example, a section the score of whose XPath is the highest or sections whose scores are higher than a predefined threshold can be detected and recommended to the user.
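The unadjusted scoring and selection steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the tree is stored as nested dictionaries, an XPath is assumed to stop collecting weight at the first tag that has no matching node in the tree, and the names `build_tree`, `raw_score` and `detect` are hypothetical.

```python
def build_tree(xpaths):
    """Weighted tag tree as nested dicts: tag -> [weight, children]."""
    tree = {}
    for xpath in xpaths:
        level = tree
        for tag in xpath.split("/"):
            node = level.setdefault(tag, [0, {}])
            node[0] += 1                 # count the tag at this position
            level = node[1]
    return tree


def raw_score(tree, xpath):
    """Sum the weights of the nodes the XPath passes through."""
    score, level = 0, tree
    for tag in xpath.split("/"):
        if tag not in level:
            break                        # assumption: stop where the path leaves the tree
        weight, level = level[tag]
        score += weight
    return score


def detect(tree, xpaths, threshold=None):
    """Return the top-scoring XPaths, or all XPaths scoring above threshold."""
    scores = {x: raw_score(tree, x) for x in xpaths}
    if threshold is None:
        best = max(scores.values())
        return [x for x in xpaths if scores[x] == best]
    return [x for x in xpaths if scores[x] > threshold]


tree = build_tree([
    "HTML/BODY/DIV[0]/DIV/H1[0]",
    "HTML/BODY/DIV[0]/DIV[1]",
    "HTML/BODY/DIV[0]/H1[0]",
    "HTML/BODY/DIV[1]",
])
new_page = [
    "HTML/BODY/DIV[0]/DIV[0]/H1[0]",
    "HTML/BODY/DIV[1]/DIV[2]",
    "HTML/BODY/DIV[0]/DIV[0]/DIV[1]/P1[1]",
    "HTML/BODY/DIV[0]/DIV[1]",
]
for x in new_page:
    print(raw_score(tree, x), x)
print(detect(tree, new_page))
```

Under this stopping rule the four example XPaths receive raw sums of 11, 9, 11 and 12. These are unadjusted values computed by the sketch, not figures taken from the disclosure, and a different matching rule would give different sums.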
However, if we simply sum the weights of the nodes that are passed by an XPath into the score of this XPath, the result is that the longer an XPath is, the higher its score is. Therefore, according to an example of the present disclosure, the score can be adjusted based on at least one of the following factors: the number of nodes in the weighted tag tree, the average length of the paths that constitute the weighted tag tree, and the length of the XPath that goes through the weighted tag tree. According to an example, the score can be adjusted according to the following formula:
Wherein Scorenode is the count stored in a node, Lengthaverage is the average length of the XPaths that constitute the weighted tag tree, and LengthXPath is the length of the XPath that goes through the weighted tag tree.
Through this adjustment, the closer the length of an XPath is to the average length, the smaller its penalty is. In this way, the scores of long XPaths and of XPaths whose lengths are close to the average length of the XPaths in the weighted tag tree can be adjusted. This is a reasonable adjustment because few valuable sections in a webpage are very big or very small; that is to say, the recommended XPath should be neither too long nor too short, but of an appropriate length. After adjustment, the scores are changed as follows:
Then, for example, the third and fourth XPaths can be detected as valuable sections and recommended to the user.
With reference to
In this example, in addition to an XPath of a section that was previously selected by a user (i.e. the user selected this section as a valuable section) in the DOM-tree, the user log further includes an XPath of a section that was previously de-selected by a user (i.e. the user considered this section a useless or low-value section) in the DOM-tree that represents the webpage. The result of the recommendation would be more meaningful if these low-value sections were removed from the results of the detection at block 503. As shown in block 504, the sections that are frequently de-selected by users are found based on the user log. According to an example, the number of times each XPath was de-selected is counted, and the sections whose counts exceed a predetermined threshold are retrieved as representing low-value sections. Then, as shown in block 505, these sections are removed from the valuable sections detected in block 503.
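The counting and removal steps of blocks 504 and 505 can be sketched in Python as follows. The function names, the flat list used for the de-selection log, and the sample data are assumptions made only for this illustration.

```python
from collections import Counter


def frequently_deselected(deselected_log, threshold):
    """XPaths whose de-selection count exceeds the threshold (block 504)."""
    counts = Counter(deselected_log)
    return {xpath for xpath, n in counts.items() if n > threshold}


def remove_low_value(detected, deselected_log, threshold):
    """Drop frequently de-selected XPaths from the detected sections (block 505)."""
    low_value = frequently_deselected(deselected_log, threshold)
    return [xpath for xpath in detected if xpath not in low_value]


# Hypothetical data: one entry per de-selection event recorded in the user log.
deselected_log = [
    "HTML/BODY/DIV[1]/DIV[2]",
    "HTML/BODY/DIV[1]/DIV[2]",
    "HTML/BODY/DIV[1]/DIV[2]",
    "HTML/BODY/DIV[0]/DIV[1]",
]
detected = ["HTML/BODY/DIV[0]/DIV[1]", "HTML/BODY/DIV[1]/DIV[2]"]
print(remove_low_value(detected, deselected_log, threshold=2))
```

With a threshold of 2, only the XPath that was de-selected three times is treated as low value and removed, while the XPath de-selected once remains among the detected sections.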
Some experiments are carried out by using the primary smart print tool as reference to evaluate the above described process.
With reference to
The non-transitory, computer-readable medium 800 may correspond to any typical storage device that stores computer-implemented instructions, such as programming code or the like. For example, the non-transitory, computer-readable medium 800 may include one or more of a non-volatile memory, a volatile memory, and/or one or more storage devices. Examples of non-volatile memory include, but are not limited to, electrically erasable programmable read only memory (EEPROM) and read only memory (ROM). Examples of volatile memory include, but are not limited to, static random access memory (SRAM), and dynamic random access memory (DRAM). Examples of storage devices include, but are not limited to, hard disks, compact disc drives, digital versatile disc drives, and flash memory devices.
A processor 802 generally retrieves and executes the computer-implemented instructions stored in the non-transitory, computer-readable medium 800 for detecting valuable sections on a web page. At block 804, a receiving module may receive an input webpage from which valuable sections therein may be detected. At block 806, a detecting module may detect valuable sections in the input webpage based on a user log of a reference webpage associated with the input webpage, as described above.
From the above description of the implementation mode, the above examples can be implemented by hardware, software or firmware, or a combination thereof. For example, the various methods, processes, modules and functional units described herein may be implemented by a processor (the term processor is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, or programmable gate array, etc.). The processes, methods and functional units may all be performed by a single processor or split between several processors. They may be implemented as machine readable instructions executable by one or more processors. Further, the teachings herein may be implemented in the form of a software product. The computer software product is stored in a storage medium and comprises a plurality of instructions for making a computer device (which can be a personal computer, a server or a network device, etc.) implement the method recited in the examples of the present disclosure.
The figures are only illustrations of an example, wherein the modules or procedures shown in the figures are not necessarily essential for implementing the present disclosure. Moreover, the sequence numbers of the above examples are only for description, and do not indicate that one example is superior to another.
Those skilled in the art can understand that the modules in the device in the example can be arranged in the device in the example as described in the example, or can be alternatively located in one or more devices different from that in the example. The modules in the aforesaid example can be combined into one module or further divided into a plurality of sub-modules.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2012/000569 | 4/28/2012 | WO | 00 | 7/31/2014 |