This disclosure relates generally to web crawling and, more particularly, to methods and apparatus to automatically crawl the Internet using image analysis.
Web sites are increasingly turning to more visual and interactive layouts to attract consumers and promote a particular image. Such web sites utilize technologies such as “Flash”-based media, which has rich multimedia capabilities and can provide users with a pleasant visual experience, but does not have source code as readily available to a viewer of the web page as previous HTML-based pages. Players for “Flash”-based content may be embedded into a web page and call media information that is displayed in a web browser showing the web page.
The example systems, methods, apparatus, and articles of manufacture described herein are generally used to identify components in a web page using image analysis. The systems, methods, apparatus, and articles of manufacture may be implemented using a web crawler adapted to load web pages and collect information contained in the web page. The web crawler is also adapted to recognize human-recognizable information that is difficult or impossible to recognize using previous web crawling techniques. To accomplish this, an example web crawler renders a web page in a web browser to generate an image, and performs image analysis techniques to determine one or more location(s) within the image that may correspond to relevant or interesting information, depending on the application. When such location(s) or portion(s) of the web page have been identified, the example web crawler utilizes image analysis techniques to determine the type of web page component, such as a button, media content, a media player control, a hyperlink, a text area, an advertisement, an image, or other relevant web page components corresponding to such location(s) and/or portion(s).
In another example described herein, a web crawler is adapted to determine web page components and types of the web page components with assistance from hints. Such hints may be provided by a web page template or other sources. The web page template describes one or more locations in a rendered image of a corresponding web page where the web crawler may expect to find a web page component or a particular type of web page component. The web crawler loads the location(s) or hint(s) from the template and determines the type(s) of any component(s) found in the location(s).
The example web crawlers described herein are advantageously provided with the ability to recognize human-readable web page information that was unrecognizable by previous web crawlers. Web crawlers traditionally rely on the HTML source code of a web page to extract the information from the web page. However, because Flash-based web content does not have source code available, the most relevant or interesting information can sometimes be hidden from the web crawler. Such a traditional web crawler can determine that Flash content is present in the web page, but is unable to determine what is displayed, or any hyperlinks that may be present in the Flash content to other web pages. Further, the example web crawlers may crawl web pages having similar structures or layouts very efficiently by building or utilizing one or more template(s) corresponding to the web pages.
Web crawling, generally, is the use of a computer or other device to systematically and/or randomly load web pages from the Internet to obtain and/or update information. Some web crawlers are used for such purposes as indexing web pages for search purposes. Web crawling may also be used for identification of media content that is publicly available on the Internet via web sites such as YouTube.
The example web page 200 further includes several hyperlinks 206-220 that point to different web pages, respectively, and, thus, cause a browser to navigate to the corresponding web page when selected by the user. In contrast to the media player 202 and the button 204, the characteristic(s), name(s), and/or destination(s) of the hyperlinks 206-220 (e.g., uniform resource locators (URLs)) may be determined by the source code for the web page 200. For example, the hyperlink 218 is represented by HTML code:
A user sees the hyperlink 218 as a button on the web page 200 because the HTML code includes a call to an image to represent the hyperlink 218. In contrast, the media player 202 is called by the HTML code, and uses external data to generate the content visible and/or audible to the user. The content generated by this external data is not represented by HTML code.
While a browser can determine the destination of the hyperlinks 206-220 without selecting them and navigating to the respective targets, the browser is unable to determine the destination of the button 204 without selecting (i.e., activating) it.
The example image generator 302 is provided with a list 314, containing one or more web pages for the web crawler 102 to identify and/or collect information from. The web pages described in the list 314 may be listed as internet protocol (IP) addresses, uniform resource locators (URLs), or any other descriptive method to instruct the image generator 302 to load a particular web page. The list 314 may further include arguments or instructions identifying different content to be accessed at the same web page location. The image generator 302 receives a first URL corresponding to a web page 316 to load from the list 314. The image generator 302 calls the web page 316 and renders the web page 316 based on received web page information that describes the web page 316. Example web page information may include HTML code, XML code, Java applets, JavaScript, or any other computer-readable code that may be employed to generate some portion and/or all of a web page. The image generator 302 may be implemented using, for example, a commercially available web browser such as any version of Microsoft Internet Explorer, Mozilla Firefox, Apple Safari, or any other commercial web browser. Alternatively, an image generator 302 may be constructed to render the web page 316 in a convenient manner or to provide particular types of information or web page renderings based on the web page information describing the web page 316.
An example rendered web page 318 forms a human-recognizable image, such as would be displayed on a monitor to a user. The image generator 302 sends the rendered web page 318 to the image analyzer 304 (with or without actually displaying the page 318 on a display device). The image analyzer 304 applies image analysis techniques to the rendered web page 318 to identify one or more location(s) where web page components of interest may be found. The location(s) may be generated, for example, as coordinates of the image and/or ranges near a selected point within the image.
The rendered web page 318 and the location(s) determined by the image analyzer 304 are then sent to the component identifier 306. The component identifier 306 analyzes the rendered image on a location-by-location basis to identify whether there is actually a web page component in the defined area(s) of the location(s). If such a web page component is found, the component identifier 306 identifies a type of the web page component. The component identifier 306 stores the type of the web page component in the storage device 308 in association with additional information corresponding to the web page component, such as the content (e.g., text). For web page components such as media content (e.g., audio, video), the component identifier 306 is provided with a media identifier 322 to identify the media content. If the media identifier 322 is able to identify the media content (e.g., source of an audio/video clip, time segment within the source, owner of the clip, etc.), the media identifier 322 provides the component identifier 306 with the media information, which is stored in the storage device 308 in association with the web page component.
The example component identifier 306 of
Another example technique that may be used to determine the type of a web page component is an “action-reaction” technique. In some examples, the component identifier 306 receives the location(s) from the image analyzer 304. To attempt to determine the type of object associated with such a location, the component identifier 306 initiates an action within the location (e.g., by programmatically simulating an action such as a mouse click). After initiating the action, the component identifier 306 monitors the web page, the image generator 302, and/or an operating system running the image generator 302, to determine if the action results in a reaction. For example, in response to a mouse click event (e.g., an action) over an object, the object may respond by loading another web page, playing media content, selecting a check box or radio button, etc. If such a reaction occurs, the component identifier 306 records the reaction and uses the reaction to determine the type of object by, for example, accessing a look-up table that maps reactions to object types (e.g., the opening of a new web page may indicate a button, the opening of a media player may indicate media content, a change in volume may indicate a volume control, etc.). If there is no reaction, the component identifier 306 may initiate more of the same types of actions at different points within the subject location to thoroughly search the object, or the component identifier 306 may initiate other types of actions (e.g., keyboard events, mouse drags, right mouse clicks) to attempt to elicit a reaction. As noted, the component identifier 306 may determine the type of component in the location based at least in part on reactions. Once all actions within a location are attempted without any reaction, then the component identifier 306 may identify the object in the location as a nonfunctional type, such as text or images.
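The action-reaction technique described above can be illustrated with a minimal sketch. The reaction names, the action names, the grid spacing, and the `perform_action` hook are all assumptions made for illustration; the disclosure does not specify how actions are simulated or how reactions are encoded.

```python
# Hypothetical look-up table mapping observed reactions to component types,
# following the examples given in the text. The reaction names are assumed.
REACTION_TO_TYPE = {
    "page_loaded": "button",
    "media_player_opened": "media content",
    "volume_changed": "volume control",
}

# Hypothetical action names, tried in this order.
ACTIONS = ("left_click", "key_press", "mouse_drag", "right_click")


def probe_points(box, step):
    """Yield grid points inside box = (x1, y1, x2, y2), inclusive."""
    x1, y1, x2, y2 = box
    for y in range(y1, y2 + 1, step):
        for x in range(x1, x2 + 1, step):
            yield (x, y)


def classify_by_reaction(box, perform_action, step=10):
    """Try each action at each probe point until one causes a reaction.

    perform_action(action, point) is a caller-supplied (hypothetical) hook
    that simulates the action and returns a reaction name, or None if
    nothing happened.
    """
    for action in ACTIONS:
        for point in probe_points(box, step):
            reaction = perform_action(action, point)
            if reaction is not None:
                # Reactions missing from the table are reported as "unknown".
                return REACTION_TO_TYPE.get(reaction, "unknown")
    # Every action was attempted without any reaction.
    return "nonfunctional"
```

In this sketch, a location in which no action at any probe point produces a reaction is reported as nonfunctional, corresponding to text or images in the description above.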
In some cases, web sites may have a very similar layout or structure. It is desirable to determine the layout of a page to be analyzed because knowing such layouts can increase the efficiency of the web crawling process. The layout of a web page can be expressed in terms of a template of the web page. The template of the illustrated example preferably identifies location(s) where certain types (or any type) of web page component(s) can be expected to reside. A template is advantageous in crawling multiple web pages (e.g., the web pages 112-120 of
To generate a web page template 320, the component identifier 306 is provided with an instruction to generate a template based on identified web page components (e.g., a flag 323). The flag 323 may be set by, for example, a user to instruct the component identifier 306 to build and/or add to a web page template 320. In response to the instruction, the component identifier 306 stores the web page components identified by the image analyzer 304 in the storage device 308 as a web page template 320. An example template 320 is shown in
An example row 416 is populated by the component identifier 306 upon identifying a web page component. The component identifier 306 analyzes a portion of the rendered web page 318 that is identified by the image analyzer 304 as potentially containing a web page component. The component identifier 306 analyzes the portion (e.g., (100,25) to (200,30) of the rendered web page 318) and determines that a hyperlink is present that points to an address. The component identifier 306 stores the component type 404 (i.e., hyperlink), location coordinates 406-412, and target 414 (e.g., the pointed to address) in row 416 with an ID 402 of 1. The component identifier 306 identifies another component in another portion provided by the image analyzer 304, populates the example table 400 with the component information in row 418, and continues population of the table 400 for rows 420-426.
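As an illustration of how a row of the table 400 might be represented, the following sketch models a row with an ID, a component type, four location coordinates, and an optional target. The class name, field names, `add_row` helper, and the example address are hypothetical; only the kinds of fields and the coordinates (100,25) to (200,30) come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TemplateRow:
    """One row of a template table (field names are assumed)."""
    id: int
    component_type: str
    x1: int
    y1: int
    x2: int
    y2: int
    target: Optional[str] = None  # only meaningful for hyperlinks


def add_row(table: List[TemplateRow], component_type: str, box,
            target: Optional[str] = None) -> TemplateRow:
    """Append a row for a newly identified component; IDs count up from 1."""
    row = TemplateRow(len(table) + 1, component_type, *box, target)
    table.append(row)
    return row


table: List[TemplateRow] = []
# The address below is a placeholder, not a value from the disclosure.
first = add_row(table, "hyperlink", (100, 25, 200, 30), "http://example.com/page")
```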
The table 400 may be updated to increase the accuracy of the template by analyzing multiple similar web pages to determine which components are consistently present. For example, two similar web pages (e.g., web pages 116 and 118) may be analyzed sequentially to generate a template. First, the first web page 116 is analyzed by the image analyzer 304 and the component identifier 306 to generate a template (e.g., the template table 400 of
To use a page template 320, the web crawler 102 of the illustrated example is provided with a template reader 310. The template reader 310 generates hints from a page template 320. The page template 320 may be loaded into the template reader 310 from an external source or from the storage device 308. An example template reader 310 is a plug-in module of the type that interfaces with an image generator 302 to perform a function not native to the image generator 302. It should be noted that other implementations of a user-generated page template 320 are possible. For instance, an example page template 320 may be a user-generated Perl language script that is loaded into the template reader 310. In addition, the page template 320 may be a script or other format generated based on template data stored in the storage device 308.
The template reader 310 of the illustrated example provides hints to the hint analyzer 312 based on the page template 320. The hint analyzer 312 then determines hint locations (e.g., coordinates) and/or expected component types for each hint and provides the information to the component identifier 306.
Hints can be used to concentrate on one or more particular portions of the rendered web page 318, potentially without identification of the portion(s) by the image analyzer 304. One image analysis algorithm to analyze a portion defined by a hint is described in Equation 1:
wherein I denotes the web page image, T denotes a template, x and y denote the coordinates of the pixel being checked, and R denotes the result. The summation of Equation 1 is performed over the template and/or over the image patch x′=0 . . . w−1, y′=0 . . . h−1 (where w is the width and h is the height of the portion defined by the hint). Example image processing software is included in the OpenCV library. However, any image processing method, technique, and/or algorithm may be used in combination with, or as a substitute for, the above-described example techniques.
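Equation 1 does not appear in this text, so the following is only a sketch of one matching measure that fits the symbols defined above: a sum of squared differences between the template T and each patch of the image I, which is the measure computed by the squared-difference mode of OpenCV's template matching. Whether Equation 1 takes this exact form is an assumption, and the function names `match_sqdiff` and `best_match` are hypothetical. Images are plain lists of rows of grayscale values.

```python
def match_sqdiff(image, template):
    """Return R, where R[y][x] is the sum over x', y' of
    (T[y'][x'] - I[y + y'][x + x'])**2, with x' in 0..w-1 and y' in 0..h-1.
    """
    h, w = len(template), len(template[0])
    img_h, img_w = len(image), len(image[0])
    result = []
    for y in range(img_h - h + 1):
        row = []
        for x in range(img_w - w + 1):
            total = 0
            for ty in range(h):
                for tx in range(w):
                    diff = template[ty][tx] - image[y + ty][x + tx]
                    total += diff * diff
            row.append(total)
        result.append(row)
    return result


def best_match(image, template):
    """Return (x, y, score) of the position with the smallest difference."""
    best = None
    for y, row in enumerate(match_sqdiff(image, template)):
        for x, score in enumerate(row):
            if best is None or score < best[2]:
                best = (x, y, score)
    return best
```

With this measure a score of zero indicates an exact match, so a component identifier could compare the best score for a hinted portion against a threshold to decide whether the expected component is present.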
Modern commercial web pages (e.g., the web page 316) often contain some form of media content, which may be identified by a media identifier 322. The media identifier 322 is called by the component identifier 306 when a web page component is identified by the component identifier 306 as having a media type. The component identifier 306 provides the coordinate information in which the media content resides for the media identifier 322 to monitor and determine, for example, an audio and/or video signature. The audio and/or video signature(s) may then be compared by the media identifier 322 to a database of known audio/video signature(s) 324 to identify the media content. If the media content is identified, identification information (e.g., clip name, owner) is stored in the storage device 308 with the component type and the location information where the component is found. Alternatively or additionally, the signature(s) for the media content may be stored in the storage device 308 for later identification.
While an example manner of implementing the web crawler 102 of
The example instructions 500 of
If a page template 320 has not been loaded into the template reader 310 for use in analyzing the web page 316 (block 510), the image analyzer 304 and the component identifier 306 identify the web page components and corresponding types and locations of the web page components in the rendered web page 318 (block 512). Conversely, if a page template is loaded into the template reader 310 (block 510), the component identifier 306 identifies web page components, and the corresponding component types and locations based on the template (block 514).
When the component identifier 306 has identified the components in the rendered web page in block 512 or block 514, the component identifier 306 may call the media identifier 322 to identify any media content associated with components having types identified as media (block 516). The component identifier 306 and/or the image analyzer 304 further scrape the web page for data (block 518). Scraping the web page may include determining keywords from text, identifying advertisers and/or media content, and/or collecting other relevant or useful data and/or meta data. If there are more web pages in the web page target list 314 (block 520), control returns to block 504 to select another web page from the list. If there are no more web pages in the target list 314, the example instructions 500 end execution.
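The control flow of blocks 510 through 520 can be sketched as a simple loop over the target list. The function name and every callable passed in (`render`, `identify`, `identify_from_template`, `identify_media`, `scrape`) are hypothetical stand-ins for the image generator 302, the image analyzer 304, the component identifier 306, and the media identifier 322. Skipping pages that were already processed is an added assumption.

```python
def crawl(target_list, render, identify, identify_from_template,
          template=None, identify_media=None, scrape=None):
    """Process every page in target_list and return {url: record}."""
    records = {}
    index = 0
    # Index-based loop, so entries appended to the list during the crawl
    # are also visited.
    while index < len(target_list):                      # block 520
        url = target_list[index]
        index += 1
        if url in records:
            continue  # assumption: do not process the same page twice
        image = render(url)
        if template is None:                             # block 510
            components = identify(image)                 # block 512
        else:
            components = identify_from_template(image, template)  # block 514
        for comp in components:                          # block 516
            if comp["type"] == "media" and identify_media is not None:
                comp["media_info"] = identify_media(image, comp)
        data = scrape(image, components) if scrape else None  # block 518
        records[url] = {"components": components, "data": data}
    return records
```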
The image analyzer 304 analyzes the rendered web page 318 to detect one or more portions that may contain a web page component (block 602). Next, the component identifier 306 analyzes a portion (e.g., coordinates corresponding to a section of the rendered web page 318) to determine a type of the web page component (block 604). Example component types include a button, media content, a media player control, a hyperlink, a text area, an advertisement, an image, or other types of web page components. After identifying the component type (block 604), the component identifier 306 determines whether a template is to be constructed for the web page (block 606). As mentioned above, the component identifier 306 may be instructed to construct a web page template for the web page to assist in analyzing other web pages having a similar structure or layout. If the component identifier 306 is instructed to construct a page template (e.g., in the storage device 308), the component identifier 306 generates the template and/or adds the web page component type and location to the corresponding template in the storage device 308 (block 608). An example template in the template table 400 of
After the component identifier 306 adds the type and location of the component to the template (block 608), or if the component identifier 306 is not instructed to construct a template (block 606), the component identifier 306 determines whether the web page component is a hyperlink (block 610). If the component is a hyperlink (block 610), the component identifier 306 sends the hyperlink to the image generator 302 to add the target address to the web page list 314 to crawl (block 612). In some cases (such as hyperlinks in Flash media), the image generator 302 must begin to load the web page targeted by the hyperlink to ascertain the target address, which is then added to the web page list 314. In contrast, if the hyperlink is defined by HTML or other source code, the image generator 302 simply adds the target address to the web page list 314 by copying the address from the HTML code.
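The two ways of obtaining a hyperlink target described above can be sketched as follows. The dictionary layout of a component and the `load_and_report_address` hook are hypothetical, and skipping addresses that are already in the list is an added assumption that the description does not state.

```python
def resolve_target(component, load_and_report_address):
    """Return the target address of a hyperlink component.

    If the address was copied from source code, it is already stored in
    component["target"]. Otherwise (e.g., a link inside Flash media) the
    hypothetical load_and_report_address hook activates the link and
    reports the address the browser began to load.
    """
    target = component.get("target")
    if target is None:
        target = load_and_report_address(component)
    return target


def enqueue_hyperlink(target_list, component, load_and_report_address):
    """Add a hyperlink's target to the crawl list; return True if added."""
    if component.get("type") != "hyperlink":
        return False
    target = resolve_target(component, load_and_report_address)
    # Assumption: unresolved and already-listed addresses are skipped.
    if target is None or target in target_list:
        return False
    target_list.append(target)
    return True
```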
After adding the target address to the list (block 612), or if the component is not a hyperlink (block 610), the component identifier 306 stores the web page component type and location in the storage device 308 (block 614). The component identifier 306 then checks whether there are additional portions of the web page to analyze (i.e., portions identified by the image analyzer 304) (block 616). If there are additional portions to analyze (block 616), control returns to block 604 to analyze the next portion. If there are no remaining portions (block 616), the instructions 600 end and control advances to block 516 of
When the example instructions 700 begin, the template reader 310 loads a web page template 320 and the hint analyzer 312 translates the web page template 320 into hints (block 702). Next, the component identifier 306 receives a template hint (block 704). The hint may include, for example, information contained in one of the rows 416-428 of the example table 400 of
Once the component identifier 306 has determined a type of the web page component (block 706), the component identifier 306 compares the type that was determined with the type that is specified in the template, if one exists (block 708). If the component type determined by the component identifier 306 is not consistent with the component type specified in the template 320, the component identifier 306 removes the type corresponding to the component from the web page template 320. Alternatively, if the component identifier 306 determines that there is no component associated with the coordinates defined by the hint, the component identifier 306 removes the component from the web page template 320 altogether. As another alternative, the component identifier 306 may remove the hyperlink target 414 in the case that the component identifier 306 determines the type is a hyperlink with a first target and the template specifies a hyperlink with a second target (block 710).
After removing an inconsistent type and/or target from the template (block 710), or if the component is determined to be consistent with the template (block 708), the component identifier 306 stores the component type corresponding to the location in the storage device 308 (block 712). The component identifier 306 then determines whether the web page component is a hyperlink (block 714). If the web page component is a hyperlink, the component identifier 306 passes the hyperlink to the image generator 302 to add the hyperlink target address to the list of web page targets (block 716). As described above, the image generator 302 may be required to begin loading the web page targeted by the hyperlink to ascertain the target address if, for example, the hyperlink is within a Flash media component. The target address may also be determined by the source or other code of the web page.
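The template maintenance of blocks 708 and 710 can be sketched as a single reconciliation step per hint. The dictionary-based template layout, the function name, and the returned status strings are hypothetical, and the example addresses are placeholders.

```python
def reconcile(template, hint_id, detected):
    """Update one template entry against what was found on the page.

    template maps an ID to {"type": ..., "box": ..., "target": ...}.
    detected is None when no component exists at the hinted coordinates,
    otherwise a dict with a "type" and, for hyperlinks, a "target".
    """
    row = template[hint_id]
    if detected is None:
        # No component at the hinted location: drop the entry altogether.
        del template[hint_id]
        return "removed_component"
    if row.get("type") is not None and row["type"] != detected["type"]:
        # Inconsistent type: keep the location but clear the expected type.
        row["type"] = None
        return "removed_type"
    if (row.get("type") == "hyperlink" and row.get("target") is not None
            and row["target"] != detected.get("target")):
        # Same type, different hyperlink target: clear only the target.
        row["target"] = None
        return "removed_target"
    return "consistent"
```

In this sketch an entry whose type has been cleared still serves as a location hint for later pages, which corresponds to a hint that names a location without specifying an expected component type.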
After adding the target address to the list of web page addresses (block 716), or if the component type is not a hyperlink (block 714), the template reader 310, the hint analyzer 312, and/or the component identifier 306 determine whether there are additional hints in the template (block 718). If there are additional hints, control returns to block 704 and the component identifier 306 receives another hint. If there are no additional hints, the example instructions 700 end and control returns to block 516 of
The example processor system 800 may be, for example, a desktop personal computer, a notebook computer, a workstation or any other computing device. The processor 802 may be any type of processing unit, such as a microprocessor from the Intel® Pentium® family of microprocessors, the Intel® Itanium® family of microprocessors, and/or the Intel XScale® family of processors. The memories 804, 806 and 808 that are coupled to the processor 802 may be any suitable memory devices and may be sized to fit the storage demands of the system 800. In particular, the flash memory 808 may be a non-volatile memory that is accessed and erased on a block-by-block basis.
The input device 814 may be implemented using a keyboard, a mouse, a touch screen, a track pad, a barcode scanner, an image scanner, or any other device that enables a user to provide information to the processor 802.
The display device 816 may be, for example, a liquid crystal display (LCD) monitor, a cathode ray tube (CRT) monitor or any other suitable device that acts as an interface between the processor 802 and a user. The display device 816 as pictured in
The mass storage device 818 may be, for example, a hard drive or any other magnetic, optical, or solid state media that is readable by the processor 802.
The removable storage device drive 820 may, for example, be an optical drive, such as a compact disk-recordable (CD-R) drive, a compact disk-rewritable (CD-RW) drive, a digital versatile disk (DVD) drive or any other optical drive. It may alternatively be, for example, a magnetic media drive and/or a solid state universal serial bus (USB) storage drive. The removable storage media 824 is complementary to the removable storage device drive 820, inasmuch as the media 824 is selected to operate with the drive 820. For example, if the removable storage device drive 820 is an optical drive, the removable storage media 824 may be a CD-R disk, a CD-RW disk, a DVD disk or any other suitable optical disk. On the other hand, if the removable storage device drive 820 is a magnetic media device, the removable storage media 824 may be, for example, a diskette or any other suitable magnetic storage media.
The network adapter 822 may be, for example, an Ethernet adapter, a wireless local area network (LAN) adapter, a telephony modem, or any other device that allows the processor system 800 to communicate with other processor systems over a network. The external network 826 may be a LAN, a wide area network (WAN), a wireless network, or any type of network capable of communicating with the processor system 800. Example networks may include the Internet, an intranet, and/or an ad hoc network.
The example systems, methods, apparatus, and articles of manufacture described above are useful for a variety of data applications. For example, the example web crawler may be used to automatically crawl the Internet to build a library of media content. Such a library may then be used to generate digital signatures of the content for use in, for example, digital rights management. After determining the types of content that exist in a particular web page, the web page may be scraped for data, such as media content (e.g., audio, video), advertisements, text, or other types of useful data. By visually determining the types of web page components that are present, media such as Flash content can be identified and scraped.
Another example application for digital rights management includes automatically crawling the web to detect copyright infringement using an established library of digital signatures. The example web crawler may load or generate a template of a media web site such as YouTube, visually identify the components from web pages at the web site including media content, and compare the media content to a library of digital signatures to detect copyrighted content.
Although this patent discloses example systems including software or firmware executed on hardware, it should be noted that such systems are merely illustrative and should not be considered as limiting. For example, it is contemplated that any or all of these hardware and software components could be embodied exclusively in hardware, exclusively in software, exclusively in firmware or in any combination of hardware, firmware and/or software. Accordingly, while the above specification described example systems, methods, apparatus, and articles of manufacture, the examples are not the only way to implement such systems, methods, apparatus, and articles of manufacture. Therefore, although certain example systems, methods, apparatus, and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus and articles of manufacture fairly falling within the scope of the appended claims either literally or under the doctrine of equivalents.