The Internet provides a vast amount of resources that may be searched in a variety of ways, providing an Internet user with easy access to desired information. However, the same accessibility that makes the Internet such a valuable and useful tool also creates an environment that lends itself to unauthorized copying of information. Web crawlers continuously traverse the Internet to retrieve information for the purpose of, among other things, maintaining current information in a search engine index. As the Internet continues to develop, various standards are evolving that allow owners of websites to control web crawler access to information contained within their websites.
Unfortunately, a problem with the various evolving standards is that they provide the owner of a website (or publisher of content associated therewith) with too little flexibility. A website owner can either choose to allow a web crawler access to a particular content item, or choose to prevent the web crawler's access. This binary solution of allow versus prevent, however, has several limitations. For example, a website owner may include a number of images on a website and offer the images for sale. The owner may desire that the images appear as results of an image search on the Internet for advertisement purposes. The owner, however, may have reservations due to the pervasiveness of unauthorized copying on the Internet and the potentially detrimental effect copying will have on the value of the images. Because of these reservations, the owner will likely choose to disallow web crawlers from accessing images on the website and, in doing so, forgo a potentially lucrative advertising opportunity.
Embodiments of the present invention relate to computer readable media, systems, and methods for controlling search indexing. In embodiments, a search index control instruction is received and, if permitted, content pertaining to the received instruction is indexed and presented in accordance with the instruction. Search index control instructions may include, by way of example only, exclusionary instructions (e.g., excluding specified domains from linking to portions of the content associated with a website) and modification instructions (e.g., permitting indexing and presentation of content associated with a website but only in a modified form to reduce the risk of content theft). Facilitating control of search indexing in this way permits content owners and/or publishers to exercise increased flexibility in defining access to their content thus increasing the likelihood that they will permit their content to be indexed.
It should be noted that this Summary is provided to generally introduce the reader to one or more select concepts described below in the Detailed Description in a simplified form. This Summary is not intended to identify key and/or required features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The present invention is described in detail below with reference to the attached drawing figures, wherein:
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and/or “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
Embodiments of the present invention provide computer-readable media, systems, and methods for controlling search indexing. In various embodiments, one or more search index control instructions are received and content to which such instruction(s) pertain is indexed in accordance therewith. Further, in various embodiments, the content is presented in accordance with the one or more received instructions. While embodiments discussed herein refer to accessing web pages on the Web via the Internet, it will be understood by one of ordinary skill in the art that embodiments are not limited to the Internet. For example, other embodiments may access content via a private network.
Accordingly, in one aspect, the present invention is directed to one or more computer readable media having instructions embodied thereon that, when executed, perform a method for controlling search indexing. The method includes receiving a search index control instruction, and processing website content in accordance with the search index control instruction. The method further includes determining if indexing content to which such instructions pertain is permitted. If it is determined that indexing of the content to which the search index control instruction pertains is permitted, the respective content is indexed in accordance with the instruction. If permitted, the indexed content may be presented in accordance with the appropriate search index control instruction, for instance, in response to a search query.
In another aspect, the present invention is directed to a computerized system for controlling search indexing. The system includes a receiving component configured to receive at least one search index control instruction, a determining component configured to analyze the received search index control instruction to determine if indexing of content associated therewith is permitted, an indexing component configured to index content associated with the search index control instruction if it is determined that indexing thereof is permitted, and a database for storing the indexed content in association with the received search index control instruction.
In yet another aspect, the present invention is directed to a method for controlling search indexing. The method includes receiving a search index control instruction pertaining to content associated with at least a portion of a website, determining, based upon the search index control instruction, if indexing of the content to which it pertains is permitted, and if it is determined that indexing of the content to which the received search index control instruction pertains is permitted, indexing the content in accordance with the instruction.
Having briefly described an overview of embodiments of the present invention, an exemplary operating environment is described below.
Referring to the drawing figures in general, and initially to
Embodiments of the present invention may be described in the general context of computer code or machine-usable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types. Embodiments of the invention may be practiced in a variety of system configurations, including, but not limited to, hand-held devices, consumer electronics, general purpose computers, specialty computing devices, and the like. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in association with both local and remote computer storage media including memory storage devices. The computer useable instructions form an interface to allow a computer to react according to a source of input. The instructions cooperate with other code segments to initiate a variety of tasks in response to data received in conjunction with the source of the received data.
Computing device 100 includes a bus 110 that directly or indirectly couples the following elements: memory 112, one or more processors 114, one or more presentation components 116, input/output (I/O) ports 118, I/O components 120, and an illustrative power supply 122. Bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of
Computing device 100 typically includes a variety of computer-readable media. By way of example, and not limitation, computer-readable media may comprise Random Access Memory (RAM); Read Only Memory (ROM); Electrically Erasable Programmable Read Only Memory (EEPROM); flash memory or other memory technologies; CD-ROM, digital versatile disks (DVD) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, carrier wave or any other medium that can be used to encode desired information and be accessed by computing device 100.
Memory 112 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, nonremovable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical disc drives, and the like. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, and the like.
I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
Turning now to
Database 202 is configured to store content in accordance with at least one search index control instruction. In various embodiments, such content may include, without limitation, one or more images, one or more audio files, one or more multimedia files, other information associated with a website, and any combination thereof. Search index control instructions may include, by way of example only, one or more character strings included in a robots.txt file, one or more character strings included in source code of a website, and one or more character strings associated with shared information in a private network. In various embodiments, the database 202 is configured to be searchable for content according to the one or more index control instructions associated therewith. It will be understood and appreciated by those of ordinary skill in the art that the information stored in database 202 may be configurable and may include any information relevant to indexed content and/or search index control instructions. The content and/or volume of such information are not intended to limit the scope of embodiments of the present invention in any way. Further, though illustrated as a single, independent component, database 202 may, in fact, be a plurality of databases, for instance, a database cluster, portions of which may reside on a computing device associated with the server 204, on the user device 208, on another external computing device (not shown), or any combination thereof.
The user device 208 may be any type of computing device, such as computing device 100 described with reference to
The server 204 may be any type of computing device, such as computing device 100 described with reference to
The receiving component 212 is configured to receive at least one search index control instruction pertaining to content associated with a portion of a website. In various embodiments, by way of example, the receiving component 212 may receive a search index control instruction by traversing the Internet with a web crawler that automatically traverses the hypertext structure of the Internet. Without limitation, several algorithms may be used, alone or in combination, to optimize traversal in order to access as much of the vast information available on the Internet as possible. Web crawlers and web crawling algorithms are commonplace in various networking environments, and one of ordinary skill in the art would readily understand how to apply crawling algorithms to achieve more efficient web crawling. Accordingly, web crawlers and crawling algorithms are not further discussed herein.
The receiving component 212 may further retrieve information associated with at least one website, for instance, from an associated robots.txt file, source code, or sitemap, and analyze the information to locate one or more search index control instructions. A search index control instruction embodied in a website's robots.txt file provides the owner or publisher of content associated with a portion of a website with control over how such content may be used by a search engine. A search index control instruction embodied in the source code, e.g., an HTML file, associated with the website itself allows the owner or publisher of content associated with a website for which site control is not feasible (e.g., wherein one or more web pages are independently controlled) to permit access to content only in accordance with a specified instruction. Further, a search index control instruction embodied in the source code for a website may permit or exclude link access to certain portions of a website independently. A search index control instruction embodied in the sitemap of a website provides the owner or publisher of content associated with a site with the ability to include an overview of content associated with the website along with exclusion and/or modification instructions with regard to each content item.
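By way of illustration only, locating search index control instructions in a robots.txt file and in HTML source code might be sketched as follows. The description does not define a concrete syntax, so the directive names (`image-presentation`, `allow-linking-domain`) and the meta tag name (`search-index-control`) are hypothetical assumptions, as are the function and class names.

```python
from html.parser import HTMLParser

# Hypothetical directive names; the description does not define a syntax.
HYPOTHETICAL_DIRECTIVES = {"image-presentation", "allow-linking-domain"}


def parse_robots_instructions(robots_txt: str) -> list[tuple[str, str]]:
    """Return (directive, value) pairs for the hypothetical directives found."""
    found = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        name = name.strip().lower()
        if name in HYPOTHETICAL_DIRECTIVES:
            found.append((name, value.strip()))
    return found


class MetaInstructionParser(HTMLParser):
    """Collects content values of <meta name="search-index-control"> tags.

    The meta tag name is an assumption made for this sketch.
    """

    def __init__(self) -> None:
        super().__init__()
        self.instructions: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "search-index-control":
            self.instructions.append(attr_map.get("content") or "")


def parse_html_instructions(html: str) -> list[str]:
    """Return page-level instruction strings embedded in the HTML source."""
    parser = MetaInstructionParser()
    parser.feed(html)
    return parser.instructions
```

Ordinary robots.txt lines such as `User-agent` and `Disallow` are ignored by this sketch; only the assumed directives are collected.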
A search index control instruction may have various levels of scope as well as various functionality. In various embodiments, the search index control instruction may be a site level instruction configured to instruct the search index with regard to access to information on an entire site. For example, without limitation, a site level instruction may instruct a search index to only present a thumbnail image of every image associated with the entire site. In various other embodiments, the search index control instruction may be a page level instruction configured to instruct the search index with regard to a particular page within a website. For example, without limitation, a page level instruction may instruct a search index to only provide a short clip of every audio or multimedia file included within a single page. In yet other various embodiments, the search index control instruction may be a link level instruction configured to instruct the search index with regard to a particular link within a single page. For example, without limitation, a link level instruction may instruct a search index to only display the linked image with a border or character string superimposed over the image.
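By way of illustration only, the three levels of scope might be modeled as shown in the sketch below. The description does not state how conflicts between levels are resolved; the rule used here, that a narrower scope overrides a broader one, is an assumption, as are the class name, field names, and presentation labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instruction:
    scope: str         # "site", "page", or "link"
    target: str        # site host, page URL, or link URL the instruction covers
    presentation: str  # hypothetical labels, e.g. "thumbnail", "clip", "watermark"


# Assumption: link level overrides page level, which overrides site level.
SCOPE_PRIORITY = {"link": 3, "page": 2, "site": 1}


def resolve_instruction(instructions, site, page, link) -> Optional[Instruction]:
    """Pick the most specific instruction that applies to the given link."""
    targets = {"site": site, "page": page, "link": link}
    matching = [i for i in instructions if targets.get(i.scope) == i.target]
    if not matching:
        return None
    return max(matching, key=lambda i: SCOPE_PRIORITY[i.scope])
```

Under this assumed precedence, a site level instruction acts as a default that page level and link level instructions may refine.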
Further, in other various embodiments, the search index control instruction may be a domain instruction configured to specify one or more domains that are allowed to link to images on a particular website. For example, without limitation, msnbc.com may wish to allow msn.com to link to its images. When an Internet user searches for an image using an image search engine, an msnbc.com image appearing as a result might be associated with either msnbc.com or msn.com. If msnbc.com has provided a domain instruction included in a search index control instruction, however, the image search engine would not recognize unauthorized websites that link to an msnbc.com image. For instance, if cnn.com linked to the image without authorization in the domain instruction, the image search engine results page would not display the cnn.com link in association with an msnbc.com image.
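By way of illustration only, applying a domain instruction to the pages that link to an image might be sketched as follows. The function names are hypothetical, and treating subdomains (e.g., www.msnbc.com) as belonging to the named domain is an assumption not stated in the description.

```python
from urllib.parse import urlparse


def host_matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains (assumption)."""
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def filter_linking_pages(linking_urls, owner_domain, allowed_domains):
    """Keep only linking pages on the owner's domain or an authorized domain."""
    permitted = [owner_domain, *allowed_domains]
    kept = []
    for url in linking_urls:
        host = urlparse(url).hostname or ""
        if any(host_matches(host, d) for d in permitted):
            kept.append(url)
    return kept
```

With msnbc.com as the owner and msn.com as the only authorized domain, a linking page on cnn.com is dropped from the list, consistent with the example above.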
In various embodiments, the receiving component 212 may copy information from websites accessed during web crawling and store such information, in accordance with content to which such information pertains, for instance, in database 202.
The determining component 214 is configured to determine, in accordance with the received search index control instruction(s), if indexing of the content to which such received instruction(s) pertains is permitted. Indexing of content may be permitted if no search index control instructions are associated therewith or in circumstances wherein presentation of the content is permitted in accordance with one or more search index control instructions. As more fully described below, presentation of content may be permitted in association with a search index control instruction permitting any and all websites to link thereto, permitting only specified websites to link thereto, or permitting all but one or more specified websites to link thereto. The nature and extent to which presentation is permitted is stored in association with the indexed content, e.g., in database 202, through storage of the appropriate search index control instruction(s). If it is determined by determining component 214 that indexing of the content to which a received search index control instruction pertains is not permitted, such content is not indexed or stored and, accordingly, will not be retrieved in response to a search query (as more fully described below). However, in some embodiments, the search index control instruction disallowing indexing may be stored, if desired.
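By way of illustration only, the determination described above might be sketched as follows. The enumeration values, class name, and function names are hypothetical, and the rule that a single disallowing instruction blocks indexing when several instructions are present is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class LinkPolicy(Enum):
    ALLOW_ALL = "allow_all"                # any website may link to the content
    ALLOW_ONLY = "allow_only"              # only the listed domains may link
    ALLOW_ALL_EXCEPT = "allow_all_except"  # all but the listed domains may link
    DISALLOW = "disallow"                  # indexing is not permitted


@dataclass(frozen=True)
class ControlInstruction:
    policy: LinkPolicy
    domains: frozenset = frozenset()


def indexing_permitted(instructions) -> bool:
    """Permitted when there are no instructions or none of them disallows."""
    return all(i.policy is not LinkPolicy.DISALLOW for i in instructions)


def may_link(instruction: ControlInstruction, domain: str) -> bool:
    """Decide whether a given domain may be presented as linking to the content."""
    if instruction.policy is LinkPolicy.ALLOW_ALL:
        return True
    if instruction.policy is LinkPolicy.ALLOW_ONLY:
        return domain in instruction.domains
    if instruction.policy is LinkPolicy.ALLOW_ALL_EXCEPT:
        return domain not in instruction.domains
    return False  # DISALLOW
```

Storing the `ControlInstruction` alongside the indexed content would let the same check be repeated when a search query is later received.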
The indexing component 216 is configured to index content associated with at least one received search index control instruction if it is determined (by determining component 214) that indexing of such content is permitted. Indexed content may be retrieved and presented in accordance with any associated search index control instructions, for instance, if such content is determined to satisfy a search query, as more fully described below.
The query receiving component 218 is configured to receive at least one search query, e.g., from user input received at user device 208. Upon receipt of a search query, the searching component 220 is configured to search the database for indexed content that satisfies the search query. Upon locating indexed content that satisfies the search query, the determining component 214 is further configured to determine whether, in accordance with any search index control instructions which pertain to the satisfying content, presentation of the content in response to the search query is permitted. If it is determined that presentation is not permitted, the content is disregarded as a satisfying result to the search query. If, however, it is determined that presentation is permitted, such content is presented (e.g., displayed) by presentation component 210 of the user device 208 in accordance with any search index control instructions pertaining thereto.
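By way of illustration only, the query flow described above might be sketched with an in-memory list standing in for the database. The class name, field names, presentation labels ("full", "thumbnail", "none"), and the keyword-overlap matching rule are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedItem:
    url: str
    keywords: frozenset
    presentation: str  # hypothetical labels: "full", "thumbnail", or "none"


def search(index, query: str):
    """Return items sharing at least one keyword with the query (assumed rule)."""
    terms = set(query.lower().split())
    return [item for item in index if terms & item.keywords]


def results_for(index, query: str):
    """Search, then drop any item whose instruction forbids presentation."""
    results = []
    for item in search(index, query):
        if item.presentation == "none":
            continue  # disregarded as a satisfying result
        results.append((item.url, item.presentation))
    return results
```

Each returned pair carries the presentation mode so that a presentation component could render the item in the permitted form.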
It will be understood and appreciated by those of ordinary skill in the art that additional components not shown may also be included within any of system 200, database 202, server 204, and user device 208. Any and all such variations, and any combinations thereof, are contemplated to be within the scope of embodiments of the present invention.
Turning now to
Next, as indicated at block 312, website content is processed in accordance with the search index control instruction. By way of example, the search index control instruction may relate to an image within a website's content and the display of the image by other websites. In various embodiments, the image will be processed to prepare the image for indexing and modified presentation, as discussed in further detail herein. In various other embodiments, processed website content may include a multimedia file, a video file, an audio file, or any other information prepared for indexing and modified presentation.
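By way of illustration only, preparing an image for modified presentation as a thumbnail might be sketched as follows. The image is represented as a plain list of pixel rows, and nearest-neighbor sampling is used as a stand-in technique; neither is specified by the description, and the function names are hypothetical.

```python
def thumbnail_size(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale dimensions so the longer side is at most max_side, keeping aspect ratio."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(1.0, max_side / max(width, height))  # never enlarge the image
    return max(1, round(width * scale)), max(1, round(height * scale))


def downsample(pixels, max_side: int):
    """Nearest-neighbor downsample of a non-empty grid given as a list of rows."""
    height, width = len(pixels), len(pixels[0])
    new_w, new_h = thumbnail_size(width, height, max_side)
    return [
        [pixels[y * height // new_h][x * width // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]
```

A production system would more likely rely on an imaging library for resampling; the sketch only shows where a modification instruction could take effect before indexing.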
Next, as indicated at block 314, it is determined if indexing of content to which the received search index control instruction pertains is permitted. If it is determined that indexing is not permitted, such content is not indexed. This is indicated at block 316. If, however, it is determined that indexing of the content to which the received search index control instruction pertains is permitted, such content is indexed (e.g., utilizing indexing component 216 of
Next, as indicated at block 320, indexed content may be presented in accordance with the received search index control instruction, e.g., by presentation component 210 of
Turning now to
Next, as indicated at block 416, website content is processed in accordance with the search index control instruction as previously discussed with reference to
Turning now to
Next, as indicated at block 514, it is determined (for instance, utilizing determining component 214 of
Next, as indicated at block 518, a search query is received, e.g., by query receiving component 218 of
Subsequently, the indexed content is searched (for instance, utilizing searching component 220 of
In each of the exemplary methods described herein, various combinations and permutations of the described blocks or steps may be present and additional steps may be added. Further, one or more of the described blocks or steps may be absent from various embodiments. It is contemplated and within the scope of the present invention that the combinations and permutations of the described exemplary methods, as well as any additional or absent steps, may occur. The various methods are herein described for exemplary purposes only and are in no way intended to limit the scope of the present invention.
The present invention has been described herein in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.
From the foregoing, it will be seen that this invention is one well adapted to attain the ends and objects set forth above, together with other advantages which are obvious and inherent to the methods, computer-readable media, and systems. It will be understood that certain features and sub-combinations are of utility and may be employed without reference to other features and sub-combinations. This is contemplated by and within the scope of the claims.