Field of the Invention
Systems and methods consistent with the principles of the invention relate generally to information retrieval and, more particularly, to selecting an image to present in connection with search results relating to a news search.
Description of Related Art
The World Wide Web (“web”) contains a vast amount of information. Search engines assist users in locating desired portions of this information by cataloging web documents. Typically, in response to a user's request, a search engine returns links to documents relevant to the request.
Search engines may base their determination of the user's interest on search terms (called a search query) provided by the user. The goal of a search engine is to identify links to relevant results based on the search query. Typically, the search engine accomplishes this by matching the terms in the search query to a corpus of pre-stored web documents. Web documents that contain the user's search terms are considered “hits” and are returned to the user.
In the case of news documents, users may find it beneficial to see an image in association with the news documents. Oftentimes, however, news documents include multiple images, some of which may not be related to the topic of the news documents. This makes it difficult to automatically select appropriate images for the news documents.
According to one aspect consistent with the principles of the invention, a method includes identifying images associated with a document, filtering the images to create a set of candidate images, detecting captions associated with the candidate images, and selecting one of the candidate images to associate with the document based on the detected captions.
According to another aspect, a graphical user interface for display on a computer includes a search result comprising a cluster of news documents and an image associated with the cluster.
According to yet another aspect, a graphical user interface for display on a computer includes a search result comprising a news document and an image associated with the news document.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, explain the invention. In the drawings,
The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention.
Users who search news documents on a network, such as the Internet, may find it beneficial to view images that are associated with the news documents. Systems and methods consistent with the principles of the invention may provide images in association with news documents or clusters of news documents. The systems and methods may select the best image from a group of images to display in connection with a particular news document or cluster.
Clients 110 may include client entities. An entity may be defined as a device, such as a wireless telephone, a personal computer, a personal digital assistant (PDA), a laptop, or another type of computation or communication device, a thread or process running on one of these devices, and/or an object executable by one of these devices. Servers 120-140 may include server entities that gather, process, search, and/or maintain documents in a manner consistent with the principles of the invention. Clients 110 and servers 120-140 may connect to network 150 via wired, wireless, and/or optical connections.
In an implementation consistent with the principles of the invention, server 120 may include a search engine 125 usable by clients 110. Server 120 may crawl a corpus of documents (e.g., web pages), index the documents, and store information associated with the documents in a repository of crawled documents. Servers 130 and 140 may store or maintain documents that may be crawled by server 120. While servers 120-140 are shown as separate entities, it may be possible for one or more of servers 120-140 to perform one or more of the functions of another one or more of servers 120-140. For example, it may be possible that two or more of servers 120-140 are implemented as a single server. It may also be possible for a single one of servers 120-140 to be implemented as two or more separate (and possibly distributed) devices.
A “document,” as the term is used herein, is to be broadly interpreted to include any machine-readable and machine-storable work product. A document may include, for example, an e-mail, a web site, a file, a combination of files, one or more files with embedded links to other files, a news group posting, a blog, a web advertisement, etc. In the context of the Internet, a common document is a web page. Web pages often include textual information and may include embedded information (such as meta information, images, hyperlinks, etc.) and/or embedded instructions (such as JavaScript, etc.). A “link,” as the term is used herein, is to be broadly interpreted to include any reference to or from a document.
Processor 220 may include a conventional processor, microprocessor, or processing logic that interprets and executes instructions. Main memory 230 may include a random access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 220. ROM 240 may include a conventional ROM device or another type of static storage device that stores static information and instructions for use by processor 220. Storage device 250 may include a magnetic and/or optical recording medium and its corresponding drive.
Input device 260 may include a conventional mechanism that permits an operator to input information to the client/server entity, such as a keyboard, a mouse, a pen, voice recognition and/or biometric mechanisms, etc. Output device 270 may include a conventional mechanism that outputs information to the operator, including a display, a printer, a speaker, etc. Communication interface 280 may include any transceiver-like mechanism that enables the client/server entity to communicate with other devices and/or systems. For example, communication interface 280 may include mechanisms for communicating with another device or system via a network, such as network 150.
As will be described in detail below, the client/server entity, consistent with the principles of the invention, performs certain searching-related operations. The client/server entity may perform these operations in response to processor 220 executing software instructions contained in a computer-readable medium, such as memory 230. A computer-readable medium may be defined as a physical or logical memory device and/or carrier wave.
The software instructions may be read into memory 230 from another computer-readable medium, such as data storage device 250, or from another device via communication interface 280. The software instructions contained in memory 230 may cause processor 220 to perform processes that will be described later. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the principles of the invention. Thus, implementations consistent with the principles of the invention are not limited to any specific combination of hardware circuitry and software.
Server 120 may include a news crawling unit 310 and a news image processing unit 320 connected to a repository. The repository may include information associated with documents that were crawled and stored by, for example, news crawling unit 310. The repository may also store images associated with these documents either together with the documents or separate therefrom.
News crawling unit 310 may crawl a corpus of documents, such as the Internet, to identify news documents. For example, news crawling unit 310 may start with a set of addresses (e.g., uniform resource locators (URLs)), such as addresses associated with a set of news sources, and parse the documents associated with these addresses to identify links to other documents. News crawling unit 310 may then parse these other documents to identify links to yet other documents, and so on. News crawling unit 310 may use this information to fetch and index the news documents.
News crawling unit 310 may then extract addresses associated with candidate images from each of the crawled documents. For example, news crawling unit 310 may extract addresses (e.g., URLs) of all images from each of the crawled documents. For each image, news crawling unit 310 may store its associated address and other data, such as the image dimension, the parent addresses (e.g., URLs), the date on which the image was crawled, and the date on which the image was last modified. This “other” data may be determined from information, such as hypertext markup language (HTML) tags, in the source document (i.e., the document from which the image originated).
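By way of illustration only, the extraction of image addresses and associated data from an HTML source document might be sketched in Python as follows. The class name `ImageAddressExtractor`, the helper `extract_images`, and the record field names are assumptions made for this sketch and are not part of the description; the sketch relies only on the standard `html.parser` and `urllib.parse` modules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def _to_int(value):
    """Return the attribute as an int, or None when missing or non-numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


class ImageAddressExtractor(HTMLParser):
    """Collects the address and declared dimensions of every <img> tag."""

    def __init__(self, parent_url):
        super().__init__()
        self.parent_url = parent_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if not src:
            return
        self.images.append({
            # Resolve relative addresses against the parent document.
            "address": urljoin(self.parent_url, src),
            "parent": self.parent_url,
            "width": _to_int(attr.get("width")),
            "height": _to_int(attr.get("height")),
            "alt": attr.get("alt"),
        })


def extract_images(html, parent_url):
    """Return one record per image found in the given HTML text."""
    parser = ImageAddressExtractor(parent_url)
    parser.feed(html)
    return parser.images
```

In this sketch, a dimension that is not declared in the source document is recorded as `None`, which is consistent with the dimension information possibly remaining unknown until the image itself is crawled.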
News crawling unit 310 may also crawl the images based on their extracted addresses and store the images and other information relating to the images. For example, news crawling unit 310 may obtain temporal information and reference count information relating to the images. The temporal information may be useful for identifying “stock images” (i.e., images that are used in multiple news documents relating to the same topic). Stock images may qualify as good candidate images. The reference count information may be useful for identifying images that are linked by multiple news documents on the same host but not directly related to the topics of the news documents, such as images of columnists or news source related icons. Images with high reference counts may be determined to not make good candidate images.
News image processing unit 320 may process the images before and/or after the images are crawled by news crawling unit 310. For example, news image processing unit 320 may identify an initial set of candidate images from the images identified by news crawling unit 310. News image processing unit 320 may filter the set of candidate images based on additional information obtained by news crawling unit 310 during the image crawl.
News image processing unit 320 may select an image from the set of candidate images to associate with a particular news document or a particular cluster of news documents. The particular processing involved in selecting an image will be described in detail below. News image processing unit 320 may store an index that relates the candidate images to the news documents to which they have been associated.
Addresses of the images in each of the crawled documents may be identified and extracted (act 420). For each image, the associated address and other data, such as the image dimension, the parent addresses (e.g., URLs), the date on which the image was crawled, and the date on which the image was last modified, may be stored.
The images may then be processed to create a set of candidate images (act 430).
As described below, one or a combination of various filtering rules, criteria and thresholds may be used to select one or more candidate images. It should be appreciated, however, that aspects of the invention are not limited to any one or a combination of these filtering rules, criteria, or thresholds. Those skilled in the art will recognize from this description of exemplary embodiments that various modifications and alternative implementations are possible.
One exemplary filtering rule may separate candidate images from suspect images based on the shape of the images. For example, a candidate image should not have an irregular shape. Both dimensions of the candidate image should exceed a particular threshold (e.g., 60 pixels). An image with a dimension below the threshold may be identified as a suspect image. Also, a candidate image should have a moderate aspect ratio (e.g., no more than 3:1 or 1:3). In other words, the image should not be too wide or too tall. A threshold may be used to distinguish acceptable from unacceptable aspect ratios. An image whose aspect ratio falls outside the acceptable range may be identified as a suspect image.
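As a minimal sketch of the shape rule, assuming the exemplary values of 60 pixels and 3:1 given above, the check might be written in Python as follows. The function name `has_regular_shape` and the constant names are hypothetical.

```python
MIN_DIMENSION = 60       # pixels; exemplary threshold from the description
MAX_ASPECT_RATIO = 3.0   # exemplary limit, i.e. no more than 3:1 or 1:3


def has_regular_shape(width, height,
                      min_dimension=MIN_DIMENSION,
                      max_aspect_ratio=MAX_ASPECT_RATIO):
    """Return True when the image passes the shape rule, False when suspect."""
    # Both dimensions should exceed the threshold.
    if width <= min_dimension or height <= min_dimension:
        return False
    # The longer side divided by the shorter side covers both 3:1 and 1:3.
    ratio = max(width, height) / min(width, height)
    return ratio <= max_aspect_ratio
```

Dividing the longer side by the shorter side lets a single threshold cover images that are too wide as well as images that are too tall.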
Another exemplary filtering rule may separate candidate images from suspect images based on their file formats. For example, a candidate image should have a proper image file format, such as the joint photographic experts group (jpeg) format, graphics interchange format (gif), tagged image file format (tiff), portable document format (pdf), bitmap (bmp) format, portable network graphics (png) format, and possibly other common image formats. Images whose formats are not considered proper image file formats may be identified as suspect images.
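One possible sketch of the file format rule follows. It assumes, purely for illustration, that the format is inferred from the file extension in the image address; the description does not state how the format is determined. The name `has_proper_format` and the inclusion of the `jpg` and `tif` spellings are likewise assumptions.

```python
import posixpath
from urllib.parse import urlparse

# Formats named in the description, plus common alternate extensions
# (the alternate spellings are an assumption of this sketch).
PROPER_FORMATS = {"jpeg", "jpg", "gif", "tiff", "tif", "pdf", "bmp", "png"}


def has_proper_format(address, proper_formats=PROPER_FORMATS):
    """Return True when the address ends in a proper image file extension."""
    path = urlparse(address).path          # drops any query string
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    return extension in proper_formats
```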
Yet another exemplary filtering rule may separate candidate images from suspect images based on whether they include links. For example, a candidate image should not include a link, such that if the image is clicked it will lead to a document with which it is associated. Images that include links are often advertisements and, therefore, may be identified as suspect images.
A further exemplary filtering rule may separate candidate images from suspect images based on where the images are hosted. For example, a candidate image should be hosted by the same organization that hosted the source news document. Images from different domains (e.g., cnn.cojp.com), but associated with the same organization (e.g., cnn.com), may be identified as candidate images. Images from other organizations tend to be advertisements and, therefore, may be identified as suspect images.
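The hosting rule might be sketched as follows. The table `ORGANIZATION_DOMAINS`, which maps an organization to the domains it operates, is a hypothetical data structure introduced for this sketch, and the domain names in it are placeholders; the description does not specify how domains are associated with organizations.

```python
from urllib.parse import urlparse

# Hypothetical table associating an organization with its domains.
ORGANIZATION_DOMAINS = {
    "examplenews": ("examplenews.com", "examplenews.co.jp"),
}


def organization_of(address, organization_domains=ORGANIZATION_DOMAINS):
    """Return the organization for an address, or its host when unknown."""
    host = (urlparse(address).hostname or "").lower()
    if not host:
        return None
    for organization, domains in organization_domains.items():
        for domain in domains:
            if host == domain or host.endswith("." + domain):
                return organization
    # Assumption: an unlisted host is treated as its own organization.
    return host


def hosted_by_same_organization(image_address, document_address):
    """Return True when the image and its source document share an organization."""
    image_org = organization_of(image_address)
    return image_org is not None and image_org == organization_of(document_address)
```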
There may be exceptions to these rules. Accordingly, good and bad lists may be formed. The good list may include information regarding images from third party cache services, such as Akamai, that may have file formats that are not considered proper file formats and images that include a link (or perhaps sources for which images with a link may be accepted). Images associated with the good list may be identified as candidate images. The bad list may include information regarding news sources that do not want their images shown and suspect images that have, for one reason or another, been previously identified as candidate images.
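The interaction between the filtering rules and the good and bad lists might be sketched as follows. The sketch assumes that the lists are keyed by host name and that the bad list takes precedence over the good list; neither assumption is stated in the description. The function name `classify_image` and its parameters are hypothetical.

```python
from urllib.parse import urlparse


def classify_image(address, passes_rules, good_hosts, bad_hosts):
    """Return "candidate" or "suspect" for an image address.

    passes_rules is the combined outcome of the filtering rules,
    computed elsewhere and passed in as a boolean.
    """
    host = (urlparse(address).hostname or "").lower()
    if host in bad_hosts:
        # Assumption: a bad-list entry overrides everything else.
        return "suspect"
    if host in good_hosts:
        # Good-list sources are accepted even when a rule would reject them.
        return "candidate"
    return "candidate" if passes_rules else "suspect"
```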
Image captions associated with the candidate images may be detected (act 520). An image caption may provide the best description of an image. It may also indicate whether the image is related to the topic of the source news document.
When parsing the news documents, information regarding the content of the news documents and the images may be recorded. For example, in the case of HTML documents, runs of continuous text within HTML tags may be collected together and called “text runs.” Each text run and each image may be labeled with the associated HTML table identifier and the HTML table cell identifier, if applicable. In addition, the alternative text for each image may also be recorded. The alternative text for an image may provide a textual alternative to the purpose of the image and may be displayed when the image is not being displayed.
For each image, the alternative text, when present, may be examined. The alternative text may be analyzed to determine whether it contains “poison” words, such as words identifying the author of the image or other words unrelated to the topic of the corresponding news document. When the alternative text does not contain poison words, it may be used as the caption of the image.
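A minimal sketch of the alternative text check follows. The particular poison words listed are placeholders chosen for illustration, and the name `caption_from_alt_text` is hypothetical; an actual list of poison words is not given in the description.

```python
import re

# Placeholder poison words; the description does not enumerate them.
POISON_WORDS = {"photo", "credit", "staff", "photographer", "logo"}


def caption_from_alt_text(alt_text, poison_words=POISON_WORDS):
    """Return the alternative text as a caption, or None when it is unusable."""
    if not alt_text or not alt_text.strip():
        return None
    words = set(re.findall(r"[a-z0-9']+", alt_text.lower()))
    if words & poison_words:
        return None
    return alt_text.strip()
```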
When the image does not include alternative text or it is determined that the alternative text should not be used as the image caption, it may be determined whether the image is located within a table. If the image is not located in a table, then the image may be identified as having no image caption because of the ambiguity between the image caption and the body of the news document. If the image is located in a table, however, then the text runs that are within the same table cell as the image may be considered as a candidate for the image caption. If there are no text runs within the same table cell as the image, then the text runs in the neighboring table cell (within a certain cell distance) may be considered as a candidate for the image caption.
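The table-based search for caption candidates might be sketched as follows. The sketch assumes that table cell identifiers are integers, so that a cell distance can be computed by subtraction, and that a cell distance of one is acceptable; both are assumptions. The names `TextRun`, `ImageRef`, and `caption_candidates` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextRun:
    text: str
    table_id: Optional[int]   # None when the run is not inside a table
    cell_id: Optional[int]


@dataclass
class ImageRef:
    address: str
    table_id: Optional[int]   # None when the image is not inside a table
    cell_id: Optional[int]


def caption_candidates(image: ImageRef, text_runs: List[TextRun],
                       max_cell_distance: int = 1) -> List[TextRun]:
    """Return the text runs that may serve as the caption of the image."""
    # An image outside any table is treated as having no caption.
    if image.table_id is None or image.cell_id is None:
        return []
    same_table = [run for run in text_runs
                  if run.table_id == image.table_id and run.cell_id is not None]
    # Prefer text runs in the same table cell as the image.
    same_cell = [run for run in same_table if run.cell_id == image.cell_id]
    if same_cell:
        return same_cell
    # Otherwise fall back to neighboring cells within the allowed distance.
    return [run for run in same_table
            if abs(run.cell_id - image.cell_id) <= max_cell_distance]
```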
When determining whether to use an image caption candidate as the image caption, it may be determined whether the number of candidate text runs exceeds a threshold. For example, when the number of candidate text runs exceeds the threshold, there is a chance that these text runs are not associated with the image, but instead are part of the body of the news document. In this case, the text runs may not be used as the image caption.
It may also be determined whether the candidate text runs are too bulky. For example, the average length of the text runs and/or the largest length of the text runs may be analyzed to determine whether they are below a certain threshold. When the average length of the text runs and/or the largest length of the text runs exceed the threshold, there is a chance that these text runs are not associated with the image, but instead are part of the body of the news document. In this case, the text runs may not be used as the image caption.
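The checks on the number and bulk of the candidate text runs might be combined as in the following sketch. The threshold values and the name `usable_as_caption` are hypothetical; the description does not give particular values.

```python
def usable_as_caption(text_runs, max_runs=3,
                      max_average_length=120, max_longest_length=200):
    """Return True when the candidate text runs look like a caption.

    text_runs is a list of strings. All thresholds are illustrative.
    """
    if not text_runs:
        return False
    # Too many runs suggests body text rather than a caption.
    if len(text_runs) > max_runs:
        return False
    lengths = [len(run) for run in text_runs]
    # Runs that are too bulky also suggest body text.
    if sum(lengths) / len(lengths) > max_average_length:
        return False
    if max(lengths) > max_longest_length:
        return False
    return True
```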
An image score may be generated for each candidate image (act 530). In one implementation, the score is based on one or more factors from the group including the image size, a distance to the title of the news document, and an overlap between the image caption and the news document centroid (i.e., the collection of words most representative of the news document).
With regard to the first factor, the relative size of an image in terms of area with respect to the largest image size for the same source document may be determined and used as a scoring factor. For an HTML document, the image size may be determined from the “img” tag associated with the image. If there is no img tag in the document, then the image may receive a zero score for this factor. With regard to the second factor, the distance from the title of the document to the image may be determined. The larger this distance, the more likely it is that the image is not related to the topic of the document. With regard to the third factor, it may be determined how many times the words in the image caption appear in the body of the document. The more hits the image caption has in the body of the document, the more likely it is that the image is related to a topic of the document. In other implementations, other techniques may be used to determine whether the image caption is related to a topic of the document.
In one implementation, these factors are used to generate a score for an image. According to one exemplary implementation, an image score may be determined as follows:
Image Score=C_size*(relative size of the image)+C_title_distance/(distance from title)+C_centroid_hit*(number of document centroid hits),
where C_size may refer to a coefficient associated with the size factor, C_title_distance may refer to a coefficient associated with the distance-from-title factor, and C_centroid_hit may refer to a coefficient associated with the centroid hit factor.
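The exemplary score might be computed as in the following sketch. The coefficient defaults, the tokenization, and the treatment of a title distance below one are assumptions made for illustration; the function names `relative_size`, `centroid_hits`, and `image_score` are hypothetical.

```python
import re
from collections import Counter


def relative_size(width, height, largest_area):
    """Area of the image relative to the largest image in the same document."""
    if width is None or height is None or largest_area <= 0:
        return 0.0   # zero score for this factor when the size is unknown
    return (width * height) / largest_area


def centroid_hits(caption, body):
    """Count how many times the words of the caption appear in the body."""
    body_counts = Counter(re.findall(r"[a-z0-9]+", body.lower()))
    caption_words = set(re.findall(r"[a-z0-9]+", caption.lower()))
    return sum(body_counts[word] for word in caption_words)


def image_score(size, title_distance, hits,
                c_size=1.0, c_title_distance=1.0, c_centroid_hit=1.0):
    """C_size*size + C_title_distance/distance + C_centroid_hit*hits."""
    # Assumption: distances below 1 are clamped to avoid division by zero.
    distance = max(title_distance, 1)
    return c_size * size + c_title_distance / distance + c_centroid_hit * hits
```

For instance, with a relative size of 0.5, a title distance of 4, three centroid hits, and coefficients of 2.0, 8.0, and 1.5, the three terms are 1.0, 2.0, and 4.5, for a score of 7.5.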
The candidate images may be stored in a log file with their corresponding source documents. In one implementation, the candidate images are sorted by their scores in descending order. This log file may permit the images that are later returned by a crawl to be merged with their corresponding source documents.
Returning to
The best document level image may then be selected (act 450). According to one implementation, the best document level image is selected after the crawl because some candidate images may not be reachable and dimension information may not be known for some candidate images prior to the crawl. Unreachable candidate images may be discarded. For example, a predefined timeout period may be set for the image crawl. If a candidate image is unreachable at the end of this timeout period, then it may be discarded.
The dimension of each candidate image that was successfully fetched during the crawl may be analyzed again. A candidate image that has an irregular shape may be discarded. As described above, both dimensions of the candidate image should exceed a particular threshold (e.g., 60 pixels). A candidate image with a dimension below the threshold may be discarded. Also, a candidate image that does not have a moderate aspect ratio (e.g., no more than 3:1 or 1:3) may be discarded. In other words, the candidate image should not be too wide or too tall. As described above, a threshold may be used to distinguish acceptable from unacceptable aspect ratios. A candidate image whose aspect ratio falls outside the acceptable range may be discarded.
An image histogram of reference counts may also be constructed to filter out columnist images and news source related icons. The histogram may be useful for identifying images that are linked by multiple news documents on the same host but not directly related to the topics of the news documents, such as columnist images and news source related icons. Candidate images with high reference counts may be discarded.
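A histogram of reference counts might be built and applied as in the following sketch. The cutoff of five references and the names `reference_counts` and `drop_widely_referenced` are hypothetical, and the sketch assumes that an image is counted once per referencing document.

```python
from collections import Counter


def reference_counts(document_images):
    """Count, for each image address, the number of documents that link it.

    document_images maps a document address to the image addresses it uses.
    """
    counts = Counter()
    for images in document_images.values():
        counts.update(set(images))   # count each document at most once
    return counts


def drop_widely_referenced(candidates, counts, max_references=5):
    """Discard candidates whose reference count exceeds the cutoff."""
    return [address for address in candidates
            if counts[address] <= max_references]
```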
Additional filtering rules may be used to further filter the candidate images. For example, candidate images that contain text may be discarded. Candidate images that look more like clip-art, as opposed to photographs, may be discarded. Candidate images that are all the same color may be discarded. Other criteria may alternatively be used to filter out bad images.
The best document level image may be selected as the highest scoring candidate image of the remaining candidate images associated with a news document.
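Selection of the best document level image might be sketched as follows, assuming that each candidate is a record with hypothetical `address` and `score` fields and that the set of successfully fetched addresses is known once the image crawl has ended.

```python
def select_best_image(candidates, fetched_addresses):
    """Return the highest scoring candidate that was fetched, or None.

    candidates is a list of dicts with "address" and "score" keys.
    fetched_addresses is the set of addresses reachable before the timeout.
    """
    remaining = [c for c in candidates if c["address"] in fetched_addresses]
    if not remaining:
        return None
    return max(remaining, key=lambda c: c["score"])
```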
The best cluster level image may then be selected (act 460). A cluster is a collection of news documents relating to the same topic. Within a cluster, there might be multiple news documents that include images. According to one implementation, the best cluster level image may be determined based on the rank of the source news document within the cluster. For example, the higher the news document is ranked within the cluster, the more likely its image may be representative of the cluster.
The best cluster level image may also, or alternatively, be determined based on an overlap of an image caption and the cluster centroid. For example, it may be determined how many times the words in the image caption appear in the body of the documents in the cluster. The more hits the image caption has in the body of the documents, the more likely that the image is related to the topic of the cluster.
In one implementation, the rank of the source news document may be one factor and the amount of overlap between the image caption and cluster centroid may be another factor in generating an overall score for an image. In other implementations, one of these factors may be weighted more heavily than the other. In yet other implementations, other factors may also be considered in generating the overall score.
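One way the two factors might be combined is sketched below. The description does not give a formula for the overall cluster level score, so the weighted sum used here, the reciprocal treatment of rank, the weights, and the names `cluster_image_score` and `best_cluster_image` are all assumptions.

```python
def cluster_image_score(document_rank, centroid_hits,
                        w_rank=1.0, w_overlap=1.0):
    """Combine document rank (1 is the top rank) and cluster centroid hits."""
    if document_rank < 1:
        raise ValueError("document_rank must be 1 or greater")
    # A higher-ranked document (smaller rank number) contributes more.
    return w_rank / document_rank + w_overlap * centroid_hits


def best_cluster_image(entries, w_rank=1.0, w_overlap=1.0):
    """Return the entry with the highest overall score, or None if empty.

    Each entry is a dict with "address", "document_rank", and
    "centroid_hits" keys.
    """
    if not entries:
        return None
    return max(entries, key=lambda e: cluster_image_score(
        e["document_rank"], e["centroid_hits"], w_rank, w_overlap))
```

Setting one weight to zero reduces the sketch to the single-factor implementations described above, and unequal weights correspond to weighting one factor more heavily than the other.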
A search may then be performed to identify news documents that are relevant to the search query (act 620). For example, a corpus or repository of news documents may be examined to identify news documents that include a term of the search query. The news documents may then be ranked according to one or more conventional ranking factors.
It may then be determined whether to present the search results as a list of news documents or as a list of clusters of news documents (act 630). This determination may be pre-established by search engine 125. For example, the search results may always initially be presented as a list of news documents or a list of clusters. The user may then be given the option of having the search results presented another way. Alternatively, the user may initially be given the option of specifying how the search results will be presented.
If the search results are to be presented as a list of clusters, the news documents (of the search results) may be formed into one or more clusters according to the topics to which they relate (act 640). Techniques for forming related documents into clusters are known in the art and, therefore, will not be discussed further. The clusters may then be ranked according to one or more conventional ranking factors. Images for the clusters may then be determined (act 650), as described above. If the search results are to be presented as a list of documents, images for the documents may be determined (act 660), as also described above.
The search results may then be presented to the user via a graphical user interface (act 670). For example, the search results may be presented as a list of links to news documents with their associated images. Alternatively, the search results may be presented as a list of clusters of news documents with their associated images.
Search engine 125 may perform a search of a repository or corpus for news documents that are relevant to the search query. There are many ways to determine document relevancy. For example, documents that contain one or more of the search terms of the search query may be identified as relevant. Documents that include a greater number of the search terms may be identified as more relevant than documents that include a fewer number of the search terms.
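The term-based relevance determination described above might be sketched as follows. The tokenization and the name `rank_by_query_terms` are assumptions for illustration; this is only one of the many ways relevancy may be determined.

```python
import re


def _terms(text):
    """Return the set of lowercase alphanumeric terms in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_query_terms(query, documents):
    """Return identifiers of relevant documents, most relevant first.

    documents maps a document identifier to its text. A document is
    relevant when it contains at least one search term, and more
    relevant when it contains a greater number of the search terms.
    """
    query_terms = _terms(query)
    scored = []
    for doc_id, text in documents.items():
        matched = len(query_terms & _terms(text))
        if matched:
            scored.append((matched, doc_id))
    # Sort by number of matched terms, breaking ties by identifier.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```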
Search engine 125 may then present the relevant news documents to the user as clusters. As shown in
As further shown in
Search engine 125 may perform a search of a repository or corpus for news documents that are relevant to the search query. Search engine 125 may then present the relevant news documents to the user as a list of documents. As shown in
As further shown in
Systems and methods consistent with the principles of the invention may present relevant images in association with news documents and clusters of news documents.
The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.
For example, while series of acts have been described with regard to
In one implementation, server 120 may perform most, if not all, of the acts described with regard to the processing of
Further, while described in the context of news searches, systems and methods consistent with the principles of the invention may be applicable to non-news searches, such as product searches.
It will also be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the principles of the invention is not limiting of the present invention. Thus, the operation and behavior of the aspects were described without reference to the specific software code—it being understood that one of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.
No element, act, or instruction used in the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise.
This application is a continuation of U.S. patent application Ser. No. 12/195,167, filed Aug. 20, 2008, which is a continuation of U.S. patent application Ser. No. 10/804,180, filed Mar. 19, 2004, the disclosures of which are incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 12195167 | Aug 2008 | US |
Child | 14288685 | | US |
Parent | 10804180 | Mar 2004 | US |
Child | 12195167 | | US |