A webpage may include hyperlinks to other webpages, for example, to a printer friendly version of the webpage. The printer friendly version may include text without additional graphics and other information tailored to webpage display. In some cases, it may be desirable to locate a printer friendly version of a webpage, such as to run automated information retrieval algorithms or to archive a webpage. A user may manually review a webpage to determine if it contains a link to a printer friendly version or other type of webpage.
The drawings describe example implementations. The drawings illustrate methods being performed in an example order, but the methods may also be performed in other orders. The following detailed description references the drawings, wherein:
In one example, a particular type of uniform resource locator (URL) may be identified from a webpage's source code and extracted. For example, a processor may evaluate webpage source code to automatically locate a uniform resource locator for a particular type of webpage, such as a uniform resource locator address of a printer friendly version of the webpage. The processor may analyze the webpage source code associated with uniform resource locators. For example, the processor may search for text displayed on a text link for a user to click to navigate to a webpage at the associated uniform resource locator. The webpage source code may be compared to keywords related to a particular type of uniform resource locator link. For example, text displayed for a link to a webpage at a uniform resource locator may be compared to a list of keywords, such as the phrases “print version” or “printer friendly,” that are likely to indicate a uniform resource locator associated with a printer friendly version of the webpage. The uniform resource locator associated with the text may be extracted. In some implementations, the uniform resource locator may be further processed to determine whether it is valid. Automatically detecting a uniform resource locator of a particular type of webpage may allow for information retrieval algorithms, archival algorithms, or other webpage processing algorithms to better navigate through webpages to access particular types of webpage links.
The processor 102 may be any suitable processor, such as a central processing unit (CPU), a semiconductor-based microprocessor, or any other device suitable for retrieval and execution of computer-readable instructions. In one implementation, the electronic device 100 includes logic instead of or in addition to the processor 102. As an alternative or in addition to fetching, decoding, and executing instructions, the processor 102 may include one or more integrated circuits (ICs) or other electronic circuits that comprise a plurality of electronic components for performing the functionality described below. In one implementation, the electronic device 100 includes multiple processors. For example, one processor may perform some functionality and another processor may perform other functionality.
The machine-readable storage medium 103 may be any suitable machine readable medium, such as an electronic, magnetic, optical, or other physical storage device that stores executable instructions or other data (e.g., a hard disk drive, random access memory (RAM), flash memory, storage disk(s), disk array(s), tape drive(s), volatile and/or non-volatile memory, compact disc(s) (CD), digital versatile disc(s) (DVD), floppy disk(s), read-only memory (ROM), programmable ROM (PROM), electronically-programmable ROM (EPROM), electronically-erasable PROM (EEPROM), optical storage disk(s), optical storage device(s), magnetic storage disk(s), magnetic storage device(s), cache(s), and/or any other physical storage device in which data is stored for any duration). The machine-readable storage medium 103 may be, for example, a computer readable non-transitory medium. The machine-readable storage medium 103 may include modules with instructions executable by the processor 102. For example, the machine-readable storage medium 103 may include a uniform resource locator identifying module 104, a uniform resource locator extracting module 105, and a uniform resource locator providing module 106. The uniform resource locator identifying module 104 may include instructions executable by the processor 102 to compare webpage source code to the keywords 101 to identify a portion of the webpage source code likely to contain a uniform resource locator of a particular type. The uniform resource locator extracting module 105 may include instructions executable by the processor 102 to extract a uniform resource locator from the identified portion of the webpage source code. The uniform resource locator providing module 106 may include instructions to provide the extracted uniform resource locator, such as by storing, transmitting, or displaying the uniform resource locator.
Beginning at 201, a processor identifies a portion of webpage source code based on a comparison of the webpage source code to a list of text associated with a type of webpage uniform resource locator. The processor may compare any suitable portion of the webpage source code to the list of text. In some cases, the processor may search the webpage source code for words or phrases found in the list of text associated with the type of uniform resource locator. The processor may identify a portion of the webpage associated with words or phrases located on the webpage that are found in the list of text.
Continuing to 202, the processor locates a uniform resource locator within the identified portion of the webpage source code. For example, the processor may look for characters indicating a uniform resource locator within the identified portion of the webpage. In some cases, the processor may perform validity checks on a located uniform resource locator.
Proceeding to 203, the processor provides the located uniform resource locator. For example, the processor may display, store, or transmit the uniform resource locator. In some cases, the processor accesses the webpage located at the uniform resource locator.
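As a non-limiting illustration, blocks 201 through 203 may be sketched in Python as follows. The keyword list, the regular expression, and the function names identify_portions, locate_url, and provide_urls are assumptions introduced for this example and are not taken from the description; the regular expression handles only simple anchor elements, and a complete implementation may use an HTML parser instead.

```python
import re

# Hypothetical keyword list for printer friendly links (assumption for illustration).
KEYWORDS = ("print version", "printer friendly")

# Matches a simple <a href=...>text</a> element, with or without quotes around
# the href value. Real-world HTML may require a full parser.
ANCHOR_RE = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def identify_portions(source):
    """Block 201: find anchor elements whose displayed text contains a keyword."""
    return [
        match
        for match in ANCHOR_RE.finditer(source)
        if any(keyword in match.group(2).lower() for keyword in KEYWORDS)
    ]


def locate_url(match):
    """Block 202: locate the URL within the identified portion."""
    return match.group(1)


def provide_urls(source):
    """Block 203: provide (here, simply return) the located URLs."""
    return [locate_url(match) for match in identify_portions(source)]
```

In this sketch, providing the uniform resource locator is reduced to returning it to the caller; storing, transmitting, or displaying it would replace the return value in provide_urls.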
Beginning at 301, a processor identifies a tag in a webpage's source code indicating a uniform resource locator on the webpage. The processor may be any suitable processor, such as a central processing unit (CPU). In one implementation, the processor is the processor 102.
The tag may be any suitable hyperlink tag, such as a tag associated with Hypertext Markup Language (HTML) or other webpage description languages. HTML and other tag based markup languages, such as Extensible Markup Language (XML), may include a tree structure of tags with information between tags where each beginning tag has a corresponding ending tag. The beginning tag may appear with a tag identifier between brackets < >. For example, a tag to begin a table may be <table>. An ending tag may appear with a tag identifier and a forward slash. For example, a tag to end a table may be </table>.
The processor may evaluate the webpage source code to search for any suitable tag indicating a uniform resource locator. For example, an <a> tag may indicate a link. An example <a href=www.test.com>Test</a> may indicate a link with the text “Test” displayed on the webpage where the link routes to the webpage located at the uniform resource locator www.test.com. The processor may search, for example, to locate an <a> tag within the webpage source code.
The processor may search for the type of tag, for example, by comparing the text in the webpage source code to a list of tags. In some implementations, the processor may search a tree structure of tags on the webpage. For example, a document object model (DOM) tree may have a tree structure representing the tags on the webpage. A tree structure may be limited to actual tags on the webpage. Using a tree structure may be beneficial because a hyperlink tag could be, for example, included in a comment or other text not representing an actual tag. In addition, in some cases, a search through a tag structure may be more efficient than searching through the webpage source code text.
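The difference between searching raw source text and searching actual tags may be illustrated with the following Python sketch. It uses the standard library html.parser module, which is an event-driven parser rather than a DOM tree; it is used here only as a stand-in because it likewise reports actual tags and passes comments to a separate handler. The class name AnchorCollector, the function anchor_hrefs, and the SAMPLE markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every actual <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Comments are delivered to handle_comment, so an <a> tag that only
        # appears inside a comment never reaches this method.
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def anchor_hrefs(source):
    collector = AnchorCollector()
    collector.feed(source)
    collector.close()
    return collector.hrefs


# Hypothetical markup: the first link is commented out, the second is real.
SAMPLE = '<!-- <a href="/old">Print</a> --><a href="/print">Print</a>'
```

A plain text search such as SAMPLE.count("<a ") counts both occurrences, while anchor_hrefs(SAMPLE) reports only the link outside the comment.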
The flow chart 300 is discussed in conjunction with
Referring back to
The list of text associated with a type of uniform resource locator may be a dictionary or table of words or phrases likely to be associated with the uniform resource locator, such as likely to be displayed to link to the webpage located at the uniform resource locator or likely to be in the webpage source code within a tag associated with the uniform resource locator. The list of text may be manually or automatically compiled. The type of uniform resource locator may be any suitable type. As an example, the type of uniform resource locator may be a uniform resource locator for a printer friendly version of a webpage, and the list of text may be “print”, “printer”, and “plain text”. The processor may compare webpage source code associated with the identified tag to the list of text. If the text correlates, such as by including the keywords, the processor may determine that the identified tag is likely to contain the type of uniform resource locator.
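One possible form of the comparison is sketched below in Python. The keyword list reuses the example list above; the function name matches_keywords and the choice of case-insensitive, whole-word matching are assumptions for illustration, since the description does not specify how the text is correlated with the list.

```python
import re

# Example list of text for printer friendly links, taken from the example above.
PRINTER_FRIENDLY_KEYWORDS = ["print", "printer", "plain text"]


def matches_keywords(text, keywords=PRINTER_FRIENDLY_KEYWORDS):
    """Return True if any keyword appears in the text as a whole word or phrase."""
    # Lowercase and collapse runs of whitespace so "Plain   Text" still matches.
    normalized = " ".join(text.lower().split())
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", normalized)
        for keyword in keywords
    )
```

Whole-word matching is one design choice among several: it avoids treating a word such as “Blueprints” as a match for “print”, at the cost of missing inflected forms such as “printing” unless they are added to the list.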
Referring back to
If no text is included within the <a> tag, at 404 the processor determines whether there is an image tag within the <a> tag. If not, the processor may continue to analyze the next <a> tag within the webpage source code. In some cases, the processor may analyze an image tag even if there is text within the <a> tag, such as if the keywords are not found within the text. If there is an image tag within the <a> tag, at 405 the processor determines whether a title attribute included within the image tag includes the keywords. The title attribute may include the title of the image displayed. A title may in some cases reflect the purpose of the image. For example, for <a href=www.test.com><img title=“Printer Friendly” src=“c:\printer.jpg” /></a>, the title “Printer Friendly” may be compared to a dictionary of keywords. If the keywords are included in the title, at 407, the processor may output the portion of the webpage source code associated with the <a> tag.
If the title does not include the keywords, at 406 the processor may compare the image alt attribute to the keywords. The alt attribute may indicate alternate text to be displayed if the image is not loaded onto the webpage. For example, <a href=www.test.com/printerfriendly><img title=“printer image” alt=“Print” src=“print.jpg” /></a> may indicate that if the print.jpg image is unable to load on the webpage, the text “Print” is displayed in place of the image. The processor may compare “Print” to the list of keywords. If the attribute value is not included in the list, the processor continues to analyze the next <a> tag. If the attribute is included in the list, at 407, the processor may output the portion of the webpage source code associated with the <a> tag.
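The sequence of checks described above (link text, then image title attribute, then image alt attribute) may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class name PrintLinkFinder, the helper has_keyword, the keyword tuple, and the use of simple substring matching are all introduced for illustration, and the sketch records the href value rather than the full portion of source code associated with the <a> tag.

```python
from html.parser import HTMLParser

# Hypothetical keyword list (assumption for illustration).
KEYWORDS = ("print", "printer friendly", "plain text")


def has_keyword(value):
    """True if value is non-empty text containing any keyword (case-insensitive)."""
    return bool(value) and any(keyword in value.lower() for keyword in KEYWORDS)


class PrintLinkFinder(HTMLParser):
    """Records the href of each <a> tag whose text or nested image matches."""

    def __init__(self):
        super().__init__()
        self.matches = []
        self._in_anchor = False
        self._href = None
        self._text = []
        self._images = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text, self._images = [], []
        elif tag == "img" and self._in_anchor:
            self._images.append(dict(attrs))

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or not self._in_anchor:
            return
        self._in_anchor = False
        # First compare the displayed link text to the keywords.
        if has_keyword("".join(self._text).strip()):
            self.matches.append(self._href)
            return
        # 404-406: otherwise check each nested image, title before alt.
        for image in self._images:
            if has_keyword(image.get("title")) or has_keyword(image.get("alt")):
                self.matches.append(self._href)  # 407: output the match
                return
```

To use the sketch, create a PrintLinkFinder, pass the webpage source code to its feed method, and read the matches attribute. An <a> tag with neither matching text nor a matching image attribute is skipped, and analysis continues with the next <a> tag.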
Referring back to
The processor may identify a uniform resource locator within the href value and determine whether the uniform resource locator includes a full path or a relative path. For example, the processor may determine that a uniform resource locator beginning with “/” includes a relative path. If the processor determines that the uniform resource locator includes a relative path, the processor may append the rest of the path to the uniform resource locator. For example, the uniform resource locator may be /printerfriendly, and the webpage may be www.test.com. The processor may update the uniform resource locator for output to be www.test.com/printerfriendly.
The processor may determine whether the domain of the uniform resource locator matches the domain of the webpage. For example, if the webpage is www.test.com and the uniform resource locator is www.computer.com/printerfriendly, the processor may determine that, because www.test.com and www.computer.com do not match, the uniform resource locator is likely to be invalid. The processor may look at a setting to see if the domain should be checked. Checking the domain may in some cases increase the likelihood that an identified uniform resource locator is a valid uniform resource locator of the particular type.
Additional types of processing may also be performed. The processor may perform fewer or more checks on the href value. In some cases, the processor performs more than one validation check of the uniform resource locator.
Block 303 of
At 503, the processor determines whether the uniform resource locator includes a full path. For example, the href value may include a relative path. At 504, if the processor determines that the href value does not include the full path, the processor appends the rest of the path to the href value so that it contains a full path.
At 505, the processor determines whether the domain of the uniform resource locator is the same as the webpage domain. If not, the processor may determine that the uniform resource locator is invalid and not output the uniform resource locator. If the uniform resource locator is found to have the same domain, at 506, the processor outputs the uniform resource locator.
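Blocks 503 through 506 may be sketched in Python using the standard library urllib.parse module. The function name validate_url and the check_domain parameter (standing in for the setting that controls whether the domain is checked) are assumptions for this example. The sketch also assumes that the webpage address includes a scheme such as http://, which the urljoin and urlparse functions need in order to recognize the domain.

```python
from urllib.parse import urljoin, urlparse


def validate_url(href, page_url, check_domain=True):
    """Return a full URL for href, or None if it fails the domain check."""
    if not href:
        return None
    # 503/504: a relative href is resolved against the webpage address;
    # an href that already contains a full path is returned unchanged.
    full_url = urljoin(page_url, href)
    # 505: optionally require the same domain as the webpage.
    if check_domain:
        if urlparse(full_url).netloc.lower() != urlparse(page_url).netloc.lower():
            return None
    # 506: output the uniform resource locator.
    return full_url
```

For example, validate_url("/printerfriendly", "http://www.test.com/article") resolves to http://www.test.com/printerfriendly, while an href pointing at www.computer.com is rejected unless check_domain is set to False.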
Referring back to
Beginning at 601, a processor determines the language of the webpage being analyzed. For example, the processor may analyze the webpage text or source code. In some cases, the processor may receive an indication of the webpage language from a user or other program. In one implementation, the processor looks at the Hypertext Transfer Protocol (HTTP) header of the webpage. For example, HTTP has a character encoding field that some web servers may set to tell the client browser which language encoding is used in the enclosed HTML file.
Continuing to 602, the processor selects a list of keywords based on the determined language. For example, there may be multiple sets of keywords, and the processor may select the keywords associated with the determined language. Moving to 603, the processor compares a portion of the webpage source code to the selected list of keywords. For example, an <a> tag in webpage code may be located, and text or attribute values associated with the tag may be compared to the selected list of keywords to determine if the portion of the webpage includes a uniform resource locator of a particular type. The uniform resource locator may then be extracted and provided.
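Blocks 601 and 602 may be sketched in Python as follows. As an assumption for this example, the language is read from the HTTP Content-Language response header rather than from the character encoding field mentioned above, because that header names the language directly. The keyword lists, the fallback to English, and the function names language_from_headers and select_keywords are likewise hypothetical.

```python
# Hypothetical per-language keyword lists (assumption for illustration).
KEYWORDS_BY_LANGUAGE = {
    "en": ["print", "printer friendly", "plain text"],
    "de": ["drucken", "druckversion"],
    "fr": ["imprimer", "version imprimable"],
}
DEFAULT_LANGUAGE = "en"


def language_from_headers(headers):
    """Block 601: derive a primary language code from HTTP response headers."""
    value = ""
    for name, header_value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "content-language":
            value = header_value
            break
    # "de-DE, en" -> first language tag "de-DE" -> primary subtag "de".
    primary = value.split(",")[0].strip().split("-")[0].lower()
    return primary or DEFAULT_LANGUAGE


def select_keywords(headers):
    """Block 602: select the keyword list for the determined language."""
    language = language_from_headers(headers)
    return KEYWORDS_BY_LANGUAGE.get(language, KEYWORDS_BY_LANGUAGE[DEFAULT_LANGUAGE])
```

The selected list may then be compared to the text or attribute values of an <a> tag at block 603 in the same manner as a single-language keyword list.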
Identifying and extracting a uniform resource locator of a particular type within a webpage's source code based on a comparison to keywords and phrases may allow a particular type of uniform resource locator to be automatically identified and extracted. The webpage at the extracted uniform resource locator may then be automatically accessed, such as by a computer program for conducting information retrieval or information archival.