The present disclosure relates to web pages, and more specifically to a method and system for determining relevant text in a web page.
Web pages contain a lot of information, and this information is often rendered in various forms. For example, web pages often contain areas of text, sidebars, advertisements, user-generated comments, etc. Sometimes, it is important to determine the relevant text of the web page, or the text associated with the subject matter described by the web page, and separate out the rest of the clutter in the web page, such as the web page advertisements, sidebars, etc. Therefore, there remains a need to determine the relevant text of a web page.
The determination of relevant text in a web page can facilitate the determination of topic(s) or category(ies) associated with the web page. For example, to determine a category or topic of a particular web page, an algorithm needs to analyze the relevant text of the web page and ignore the rest of the clutter (e.g., advertisements) present in the web page.
In one aspect, a computing device receives a web page and locates text elements in the web page, where each text element includes a set of one or more characters or symbols. For each text element found, the computing device assigns a weight value to the text element. The computing device then stores the text from the text element in a relevant text storage if the weight value for the text element is above a threshold weight.
In one embodiment, the storing of the text of each text element includes the following: if the weight value for a text element is below the threshold weight, the text of that text element is still stored when the similarity score between its layout or content and that of a text element having a weight value above the threshold weight exceeds a threshold similarity score.
In one embodiment, for each text element, the computing device determines the size of the text element when rendered. In one embodiment, the assigning of the weight value further includes assigning the weight value based on the size of the text element. The assigning of the weight value can include assigning the weight value based on a position of the text element in the web page.
In one embodiment, the locating of text elements in the web page further comprises using the Document Object Model (DOM) standard created by the World Wide Web Consortium (W3C) and implemented in all major web browsers to locate text nodes and their parent elements. In one embodiment, the text nodes are stored in a text node array and the parent elements are stored in a parent element array. In one embodiment, the storing of the text from the text element to the relevant text storage further includes storing the text from the text elements marked as relevant by weight and one or more text elements adjacent to those text elements which satisfy certain layout conditions.
The assigning of the weight value for each parent element can also include calculating the weight value from:
w=a/(1+((n*d)/T))
where w=the weight value of a current element, a=area of the current element, n=current index of the current element in the parent element array, d=a “drag” coefficient, and T=total number of elements in the parent element array. The calculating of the threshold weight can include calculating the threshold weight from:
wc=wavg*(T/c)
where wc=the threshold weight, wavg=the average weight value of the DOM elements in the document, T=total number of elements in the parent element array, and c=a weight average coefficient.
In one embodiment, each parent element is marked as potentially relevant if the weight of the parent element is above the threshold weight. In one embodiment, the parent element array is sorted in descending order by weight before comparing the weight of each parent element to the threshold weight.
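The two formulas and the marking step above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `ParentElement` class, the function names, the zero-based index n, and the sample values for the drag coefficient d and the weight average coefficient c are all assumptions, since the disclosure gives no concrete values.

```python
from dataclasses import dataclass


@dataclass
class ParentElement:
    """Hypothetical record for one entry of the parent element array."""
    area: float            # rendered area of the element (a)
    weight: float = 0.0    # weight value (w), filled in below
    relevant: bool = False


def assign_weights(elements, drag):
    """Apply w = a / (1 + (n * d) / T), with n taken as a zero-based index."""
    total = len(elements)
    for n, element in enumerate(elements):
        element.weight = element.area / (1 + (n * drag) / total)


def threshold_weight(elements, c):
    """Apply wc = wavg * (T / c); returns 0.0 for an empty array."""
    total = len(elements)
    if total == 0:
        return 0.0
    w_avg = sum(element.weight for element in elements) / total
    return w_avg * (total / c)


def mark_relevant(elements, drag, c):
    """Weigh every element, then mark those above the threshold weight."""
    assign_weights(elements, drag)
    wc = threshold_weight(elements, c)
    # Visit elements in descending order by weight, as described above.
    for element in sorted(elements, key=lambda e: e.weight, reverse=True):
        element.relevant = element.weight > wc
    return wc
```

With areas of 1000, 8000, 8000 and 500 and the assumed values d=4 and c=4, the second and third elements are marked, and the third receives a lower weight than the second despite having the same area, because it sits later in the array.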
After the comparison is made, the parent element array can be sorted in ascending order by a first node index value. In one embodiment, the computing device determines one or more of: whether the sum total of text for all text elements that share each parent element has fewer than a predetermined number of characters; whether a left edge of a previous parent element and a left edge of the current parent element match; and whether the space between the top of the current parent element and the bottom of the previous parent element is less than a maximum allowed gap. In one embodiment, the text from the text elements corresponding to the current parent element is stored if any of the following holds: (1) the current parent element is marked as relevant; (2) the left edges of the previous parent element and the current parent element match, the space between the top of the current parent element and the bottom of the previous parent element is less than a maximum allowed gap, and the ratios computed for the current parent element and the previous parent element are similar; or (3) the left edges of the next parent element and the current parent element match, the space between the bottom of the current parent element and the top of the next parent element is less than a maximum allowed gap, and the ratios computed for the current parent element and the next parent element are similar.
These and other aspects and embodiments will be apparent to those of ordinary skill in the art by reference to the following detailed description and the accompanying drawings.
In the drawing figures, which are not to scale, and where like reference numerals indicate like elements throughout the several views:
Embodiments are now discussed in more detail referring to the drawings that accompany the present application. In the accompanying drawings, like and/or corresponding elements are referred to by like reference numbers.
Various embodiments are disclosed herein; however, it is to be understood that the disclosed embodiments are merely illustrative of the disclosure that can be embodied in various forms. In addition, each of the examples given in connection with the various embodiments is intended to be illustrative, and not restrictive. Further, the figures are not necessarily to scale; some features may be exaggerated to show details of particular components (and any size, material and similar details shown in the figures are intended to be illustrative and not restrictive). Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the disclosed embodiments.
In one embodiment, the web page relevance module 140 is a browser plug-in that is downloaded from the server computer 110 to the computing device 105. The web page relevance module 140 can be downloaded as a stand-alone component or as a module embedded within another module. In one embodiment, the server computer 110 transmits the web page relevance module 140 to the computing device 105 after the server computer 110 receives a request for the module 140 from a user of the computing device 105.
For purposes of this disclosure (and as described in more detail below with respect to
The DOM defines the objects and properties of all document elements, and the methods (interface) to access them. According to the DOM, everything in an HTML document is a node. The DOM states that the entire document is a document node, every HTML element is an element node, the text in an HTML element is a text node, every HTML attribute is an attribute node, and every comment is a comment node. Each node is an object. The DOM nodes can be accessed via an object-oriented programming language such as JavaScript.
An example of a DOM (node) tree 300 associated with an HTML web page is shown in
In this example, the document object 310 is the parent of a root element <html> 315. The root element 315 is a parent to an Element <head> 320 and an Element <body> 325. The Element <head> 320 is the parent to an Element <title> 330, which is a parent to a Text node “My title” 335. The Element <body> 325 is a parent to an Element <a> 340 which has an attribute 345. The Element <a> 340 is a parent to a Text node “My link” 350. The Element <body> is also a parent to Element <h1> 355, which is a parent to a Text node “My header” 360.
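The tree in this example can be reproduced with any DOM implementation. The sketch below uses Python's standard `xml.dom.minidom` module, which is an XML parser rather than an HTML parser and works here only because the sample markup is well formed. The markup string and the `href` value are assumptions reconstructed from the description; the disclosure does not state what attribute 345 is.

```python
from xml.dom import Node, minidom

# Assumed markup matching the described tree; the href value is a placeholder.
PAGE = (
    "<html><head><title>My title</title></head>"
    '<body><a href="#">My link</a><h1>My header</h1></body></html>'
)


def text_nodes(node):
    """Yield text nodes in document order by walking child nodes depth-first."""
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


document = minidom.parseString(PAGE)
for text in text_nodes(document):
    # Each text node is reported together with its parent element.
    print(text.parentNode.tagName, "->", text.data)
# title -> My title
# a -> My link
# h1 -> My header
```

Attribute nodes do not appear among an element's child nodes, so the walk visits only the three text nodes corresponding to 335, 350 and 360.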
In step 202, the web page relevance module 140 gets the next text element in page sequential order. The text element is akin to elements 335, 350 and 360. The web page relevance module 140 determines whether there are any additional text nodes (step 204). If not, the process continues at step 218 (described below). If so, the web page relevance module 140 determines whether the text node contains only white space (step 206). If so, the process returns to step 202. If not, the current text element is added to a text element object list in step 208. The web page relevance module 140 then obtains a parent element for the text element in step 210.
The web page relevance module 140 determines if the parent element has already been added to the parent element list in step 212. If so, the process returns to step 202. If not, the web page relevance module 140 records the size and position of the parent element in step 214. The parent element is added to the parent element object list in step 216.
Referring again to step 204, if there are no more text nodes, the web page relevance module 140 calculates weights for each parent element and sorts the parent element list by weight (step 218 of
Referring again to step 224, if there are more parent nodes or if the maximum characters has been exceeded, the parent element list is sorted in sequential order of text node indexes (step 232 of
An embodiment of pseudocode to walk the DOM tree (
In one embodiment, two arrays are populated. One array contains the text nodes found in the web page 130. The second array contains the visible elements in the web page 130 that are above a specified cutoff threshold (i.e., elements whose position on the web page is above the cutoff threshold). In one embodiment, the specified cutoff threshold is 2000 pixels. In one embodiment, each entry in the element array contains the text found in the element, plus the node index values (indexes into the text node array) of the first and last text nodes known to be in that element; all text nodes at intervening index values are guaranteed to be in that same element.
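A hedged sketch of how the two arrays might be populated follows. Because Python has no rendering engine, the page is represented by hypothetical inputs: a list of (text, parent identifier) pairs in page order and a `layout` mapping from parent identifier to its top offset in pixels and a visibility flag. The class names, field names and the reading of "position" as the top offset are assumptions. The sketch also assumes that the text nodes of one element are contiguous in page order, which is the guarantee the text states for the index range.

```python
from dataclasses import dataclass

CUTOFF_PX = 2000  # cutoff threshold named in the text


@dataclass
class TextNodeEntry:
    text: str
    node_index: int


@dataclass
class ElementEntry:
    element_id: str
    text: str
    first_node_index: int
    last_node_index: int
    top: float


def collect(page_text_nodes, layout):
    """Build the text node array and the element array in one pass."""
    text_nodes = []
    elements = []
    seen = {}  # element_id -> ElementEntry already in the element array
    for text, parent_id in page_text_nodes:
        if not text.strip():
            continue  # skip text nodes that contain only white space
        index = len(text_nodes)
        text_nodes.append(TextNodeEntry(text, index))
        top, visible = layout[parent_id]
        if not visible or top >= CUTOFF_PX:
            continue  # element is hidden or positioned below the cutoff
        entry = seen.get(parent_id)
        if entry is None:
            entry = ElementEntry(parent_id, text, index, index, top)
            seen[parent_id] = entry
            elements.append(entry)
        else:
            # Same parent seen again: extend its text and its index range.
            entry.text += text
            entry.last_node_index = index
    return text_nodes, elements
```

Storing only the first and last node index keeps each element entry small while still allowing all of its text nodes to be recovered from the text node array by slicing.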
An embodiment of pseudocode to find the most relevant DOM elements (
In one embodiment, the weight formula has the effect of giving a higher weight to elements with a larger size on the web page 130. In one embodiment, the benefit of giving the higher weight to elements with a larger size declines as the element is positioned further down in the document. In one embodiment, the weight cutoff formula
wc=wavg*(T/c)
has the effect of giving the highest weight value to elements positioned higher in the web page (allowing the size of the document to have an effect on how stringent this restriction is). In one embodiment, when this pseudocode is executed, the DOM element array has some elements marked as being the elements that are most likely to be relevant for purposes of text extraction.
An embodiment of extracting text from the most relevant elements (
When the web page relevance module 140 executes the pseudocode above, the relevant text from the HTML document is obtained by scanning the DOM element objects, in order. The algorithm looks for:
1) elements marked as relevant by the pseudocode, or
2) elements adjacent to those elements marked as relevant if certain criteria (about left edges, gaps, and text-to-area ratios) are satisfied.
When such elements are found, the text nodes that are within the index limits of the element object are found and their text is added to the relevant text string. This process continues until all elements are examined, the maximum allowed number of characters is obtained, or a large gap is found between the current relevant element and the next relevant element (typically indicating a potential large gap between an article and its comment section).
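The scan and its three stopping conditions can be sketched as below. This is an illustrative reading only: the dictionary keys, the function name, joining text with single spaces, and truncating the result at the character limit are assumptions. For brevity the sketch handles only elements already marked as relevant and omits the adjacency criteria of item 2.

```python
def extract_relevant_text(elements, text_nodes, max_chars, max_gap):
    """Scan elements in page order and collect text from the relevant ones.

    elements: dicts with 'relevant', 'first', 'last', 'top' and 'bottom' keys
              (hypothetical field names), sorted by first node index.
    text_nodes: list of strings indexed by node index.
    """
    result = ""
    last_bottom = None
    for element in elements:
        if not element["relevant"]:
            continue
        # Stop at a large vertical gap between consecutive relevant elements,
        # such as the gap between an article and its comment section.
        if last_bottom is not None and element["top"] - last_bottom > max_gap:
            break
        # Take every text node within the element's index limits.
        for text in text_nodes[element["first"]:element["last"] + 1]:
            result = (result + " " + text.strip()).strip()
            if len(result) >= max_chars:
                return result[:max_chars]  # character limit reached
        last_bottom = element["bottom"]
    return result
```

Checking the gap before reading an element's text means that nothing from the far side of a large gap is ever added, which matches the intent of stopping before a comment section.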
In one embodiment, the text of a text element whose weight value is below the threshold weight is still stored in the relevant text storage when the similarity score between its layout or content and that of a text element having a weight value above the threshold weight exceeds a threshold similarity score. For example, if the content or layout of a first text element is similar to that of a second text element that is above the threshold weight, the first text element can be stored in the relevant text storage because of this similarity. In one embodiment, the similarity score is based on a comparison between the content and/or layout of the two text elements.
The DOM text node object 410 includes two records: a text record 440 and a node index record 450. The text record 440 is the text found in a text node of a web page 130. The node index 450 is an index of the text node found in the web page 130. This index corresponds to the indexes held in the element object, and is thus used to link the text node objects to the parent element objects. Although the objects 405, 410 are described as having records, any data structure can be used, such as arrays, lists, etc.
Execution of the above algorithm described in
Here is an example of the yield from an implementation of this algorithm for the Huffington Post example (from
Here is the yield from an implementation of the above algorithm for the Gizmodo example (from
Memory 904 interfaces with computer bus 902 so as to provide information stored in memory 904 to CPU 912 during execution of software programs such as an operating system, application programs, device drivers, and software modules that comprise program code, and/or computer-executable process steps, incorporating functionality described herein, e.g., one or more of process flows described herein. CPU 912 first loads computer-executable process steps from storage, e.g., memory 904, storage medium/media 906, removable media drive, and/or other storage device. CPU 912 can then execute the loaded computer-executable process steps. Stored data, e.g., data stored by a storage device, can be accessed by CPU 912 during the execution of computer-executable process steps.
Persistent storage medium/media 906 is a computer readable storage medium(s) that can be used to store software and data, e.g., an operating system and one or more application programs. Persistent storage medium/media 906 can also be used to store device drivers, such as one or more of a digital camera driver, monitor driver, printer driver, scanner driver, or other device drivers, web pages, content files, playlists and other files. Persistent storage medium/media 906 can further include program modules and data files used to implement one or more embodiments of the present disclosure.
For the purposes of this disclosure, a computer readable medium stores computer data, which data can include computer program code that is executable by a computer, in machine readable form. By way of example, and not limitation, a computer readable medium may comprise computer readable storage media, for tangible or fixed storage of data, or communication media for transient interpretation of code-containing signals. Computer readable storage media, as used herein, refers to physical or tangible storage (as opposed to signals) and includes without limitation volatile and non-volatile, removable and non-removable storage media implemented in any method or technology for the tangible storage of information such as computer-readable instructions, data structures, program modules or other data. Computer readable storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other physical or material medium which can be used to tangibly store the desired information or data or instructions and which can be accessed by a computer or processor.
Those skilled in the art will recognize that the methods and systems of the present disclosure may be implemented in many manners and as such are not to be limited by the foregoing exemplary embodiments and examples. In other words, functional elements being performed by single or multiple components, in various combinations of hardware and software or firmware, and individual functions, may be distributed among software applications at either the client or server or both. In this regard, any number of the features of the different embodiments described herein may be combined into single or multiple embodiments, and alternate embodiments having fewer than, or more than, all of the features described herein are possible. Functionality may also be, in whole or in part, distributed among multiple components, in manners now known or to become known. Thus, myriad software/hardware/firmware combinations are possible in achieving the functions, features, interfaces and preferences described herein. Moreover, the scope of the present disclosure covers conventionally known manners for carrying out the described features and functions and interfaces, as well as those variations and modifications that may be made to the hardware or software or firmware components described herein as would be understood by those skilled in the art now and hereafter.
While the system and method have been described in terms of one or more embodiments, it is to be understood that the disclosure need not be limited to the disclosed embodiments. It is intended to cover various modifications and similar arrangements included within the spirit and scope of the claims, the scope of which should be accorded the broadest interpretation so as to encompass all such modifications and similar structures. The present disclosure includes any and all embodiments of the following claims.
Number | Date | Country |
---|---|---|
20110252041 A1 | Oct 2011 | US |