Internet search engines find documents that are responsive to a query by comparing the content of the query to the content in various documents. Search engines may build an index using a web crawler that goes from page to page on the Internet and records the links on the page along with a description of document content. Once the index is built, it can be used to retrieve a document that matches a query.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Embodiments of the present invention generally relate to consolidating content found in multiple related documents (e.g., web pages) into a single synthetic search document for the purpose of presenting descriptions of the multiple documents to a search engine. The search engine may then search and index one document (i.e., the synthetic search document) instead of indexing each of the multiple documents. In one embodiment, the multiple documents are excluded from separate indexing by adding a meta or http header data tag to each of the multiple documents that indicates to a search engine the multiple documents are not to be indexed. In one embodiment, the multiple documents consolidated into the synthetic search document are related to each other. For example, the documents may be related based on association with a single user, a common subject matter, or combination of factors. Supplemental information that describes all of the related pages may be added to this synthetic search document without modifying any of the consolidated documents. A search engine may be programmed to understand the various meta data tags and take advantage of the supplemental information included in the synthetic documents. The synthetic search document includes subpart identifiers that allow a search engine to locate the document associated with the subpart identifier.
The present invention is described in detail below with reference to the attached drawing figures, wherein:
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and/or “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
Accordingly, in one embodiment, one or more computer-readable media having computer-executable instructions embodied thereon for performing a method of preparing a plurality of related documents to be searched by a search engine is provided. Each of the plurality of related documents is reachable by a unique identifier. The method includes, for each of the plurality of related documents, deriving a set of descriptive information that describes content in one of the plurality of related documents, thereby resulting in a plurality of descriptive information sets that includes a separate set of descriptive information for each of the plurality of related documents. The method also includes, for each of the plurality of related documents, generating a subpart identifier that contains navigation information that allows the search engine to navigate to an individual related document associated with the subpart identifier, thereby resulting in a plurality of subpart identifiers that includes a separate subpart identifier for each of the plurality of related documents. The subpart identifier does not contain a URL. The method further includes integrating the plurality of descriptive information sets and the plurality of subpart identifiers into a synthetic search document. The synthetic search document is a single document that contains multiple subparts. Each subpart includes an individual set of descriptive information paired with a single subpart identifier that contains the navigation information for an individual document from which the individual set of descriptive information is derived.
In another embodiment, one or more computer-readable media having computer-executable instructions embodied thereon for performing a method of locating information within a plurality of related documents is provided. Each of said plurality of related documents includes an ability to be separately reachable by a unique identifier. The method includes receiving a search query and determining that a set of descriptive information within a synthetic search document matches the search query. The synthetic search document is a single document that contains a subpart for each of the plurality of related documents, thereby forming a plurality of subparts. Each subpart includes an individual set of descriptive information that describes content in one related document and an associated subpart identifier that contains navigation information that allows a search engine to navigate to the one related document. The method also includes presenting search results that include a link to an individual document from which said set of descriptive information is derived by using the navigation information in an individual subpart identifier associated with the set of descriptive information to generate the link.
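By way of illustration only, and not limitation, the locating method described above might be sketched as follows. The `Subpart` class, the `search` function, the term-containment matching rule, and the `https://social.example/` base address are hypothetical and are not taken from this description; the sketch merely shows a query being matched against descriptive information and a link being generated from the associated subpart identifier.

```python
from dataclasses import dataclass


@dataclass
class Subpart:
    description: str  # descriptive information for one related document
    subpart_id: str   # navigation information (assumed here to be a relative path)


def search(subparts, query, base_url):
    """Return a link for every subpart whose description contains all query terms."""
    terms = query.lower().split()
    links = []
    for part in subparts:
        text = part.description.lower()
        if all(term in text for term in terms):
            # The link is generated from the navigation information in the subpart identifier.
            links.append(base_url + part.subpart_id)
    return links


subparts = [
    Subpart("beach vacation photo album", "user1/photos/album1"),
    Subpart("blog entry about hiking", "user1/blog"),
]
results = search(subparts, "beach photo", "https://social.example/")
```

A production search engine would likely use ranked retrieval rather than simple term containment; the containment test is used here only to keep the sketch short.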
In yet another embodiment, one or more computer-readable media having computer-executable instructions embodied thereon for performing a method of preparing a plurality of related web pages in a social networking web site to be searched by a search engine is provided. Each of the plurality of related web pages includes an ability to be separately reachable by a unique identifier. The method includes, for each of the plurality of related web pages in the social networking web site, deriving a set of descriptive information that describes content in one of the plurality of related web pages, thereby resulting in a plurality of descriptive information sets that includes a separate set of descriptive information for each of the plurality of related web pages. Each of the plurality of related web pages includes a common subject matter. The method further includes, for each of the plurality of related web pages, generating a subpart identifier that contains navigation information that allows the search engine to navigate to an individual related web page associated with the subpart identifier, thereby resulting in a plurality of subpart identifiers that includes a separate subpart identifier for each of the plurality of related web pages. The method further includes integrating the plurality of descriptive information sets and the plurality of subpart identifiers into a synthetic search document. The synthetic search document is a single document that contains multiple subparts. Each subpart includes an individual set of descriptive information paired with a single subpart identifier that contains the navigation information for an individual web page from which the individual set of descriptive information is derived.
The method further includes adding information to each of the plurality of related web pages that indicates to the search engine that each of the plurality of related web pages should not be individually indexed, thereby enabling the search engine to respond to a query by searching said synthetic search document rather than each of the plurality of related web pages.
Having briefly described an overview of embodiments of the present invention, an exemplary operating environment suitable for use in implementing embodiments of the present invention is described below.
Referring to the drawings in general, and initially to
The invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program components including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks, or implements particular abstract data types. Embodiments of the present invention may be practiced in a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, specialty computing devices, etc. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
With continued reference to
Computing device 100 typically includes a variety of computer-readable media. By way of example, and not limitation, computer-readable media may comprise Random Access Memory (RAM); Read Only Memory (ROM); Electrically Erasable Programmable Read Only Memory (EEPROM); flash memory or other memory technologies; CD-ROM, digital versatile disks (DVDs) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices; or any other medium that can be used to encode desired information and be accessed by computing device 100.
Memory 112 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, nonremovable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc. I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
Turning now to
Search engine 210 is a combination of hardware and software. The hardware aspect includes a computing device that includes a CPU, short-term memory, long-term memory, and one or more network interfaces. A network interface is used to connect to network 240. The network interface could be wired, wireless, or both. Software on the search engine 210 communicates with other computers connected to network 240. The software facilitates searching available documents, such as web pages, stored on the computers connected to the network. In one embodiment, the search engine builds an index that includes keywords describing the searched documents along with location information indicating how to locate the searched documents. For example, the location information may include a uniform resource locator (“URL”). The search engine may search the computers connected to the network using a web crawler that automatically opens the documents and analyzes the content. The web crawler may track the documents it visited.
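By way of illustration only, an index of the kind described above, pairing keywords with location information, might be sketched as follows. The `build_index` function, the whitespace tokenization, and the example URLs are assumptions made for this sketch and do not describe any particular search engine.

```python
from collections import defaultdict


def build_index(documents):
    """Map each keyword to the set of locations (URLs) whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


# Hypothetical documents that a web crawler might have opened and analyzed.
documents = {
    "https://example.com/a": "Vacation photos from Hawaii",
    "https://example.com/b": "Hawaii travel blog",
}
index = build_index(documents)
```

Looking up a keyword such as `index["hawaii"]` then yields the locations of every document in which that keyword was found.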
The search engine 210 may present a search document over network 240 that is capable of receiving search queries from users. The search engine 210 then identifies documents that match the query and transmits a page of search results back to the requesting user. The search engine includes a variety of computer-readable media and the ability to access and execute instructions contained on the media. The above description of hardware and software is illustrative only. Many other features of search engine 210 are not listed so as to not obscure embodiments of the present invention.
Web server 220 is a combination of hardware and software. The hardware aspect includes a computing device that includes a CPU, short-term memory, long-term memory, and one or more network interfaces. A network interface is used to connect to network 240. The network interface could be wired, wireless, or both. Software on the web server 220 communicates with other computers connected to network 240. The software facilitates transmitting requested web pages to a requesting computer device, such as client computing device 230. The web server 220 may store large numbers of web pages. The web pages hosted by the web server 220 may be searched and indexed by the search engine 210. The above description of hardware and software is illustrative only. Many other features of web server 220 are not listed so as to not obscure embodiments of the present invention.
It will be understood by those of ordinary skill in the art that networking architecture 200 is merely exemplary. While the search engine 210 and web server 220 are illustrated as single boxes, one skilled in the art will appreciate that they are scalable. For example, the web server 220 may in actuality include multiple boxes in communication. The single unit depictions are meant for clarity, not to limit the scope of embodiments in any form.
The client computing device 230 may be a type of computing device, such as device 100 described above with reference to
Network 240 may include a computer network or combination thereof. Examples of networks configurable to operate as network 240 include, without limitation, a wireless network, landline, cable line, digital subscriber line (“DSL”), fiber-optic line, local area network (“LAN”), wide area network (“WAN”), metropolitan area network (“MAN”), or the like. Network 240 is not limited, however, to connections coupling separate computer units. Rather, network 240 may also comprise subsystems that transfer data between servers or computing devices. For example, network 240 may also include a point-to-point connection, the Internet, an Ethernet, an electrical bus, a neural network, or other internal system.
Turning now to
Web page hierarchy 300 includes a homepage 305. The homepage 305 may be described as the root node of the web page hierarchy 300. All other web pages may be described as child nodes of homepage 305. The homepage 305 links to four user pages associated with users 1, 2, 3, and 4. The user pages may be home pages for a user's profile. The user pages include “user page 1” 310, “user page 2” 330, “user page 3” 340, and “user page 4” 350. “User page 1” 310 links to “photo homepage 1” 311. “Photo homepage 1” 311 links to “photo album 1” 314 and “photo album 2” 315. In an embodiment of the present invention, a photo homepage may include links to one or more photo albums that may include text describing the photo album. Photo albums include links to picture pages that may include text describing the pictures. “Photo album 1” 314 includes “picture 1” 316, “picture 2” 317, and “picture 3” 318. “Photo album 2” 315 includes “picture 4” 319, “picture 5” 320, and “picture 6” 321. “User page 1” 310 also includes a link to “friends info” page 312. “Friends info” page 312 may include identification information for one or more online friends. “User page 1” 310 also includes a link to “blog 1” 313. A blog may allow an authorized user to post entries that one or more other users may read and respond to.
“User page 2” 330 includes a link to “blog 2” 331. “Blog 2” 331 includes “blog entry 1” 332. “Blog entry 1” 332 is linked to “blog entry 2” 333. “Blog entry 2” 333 is linked to “blog entry 3” 334, which in turn is linked to “blog entry 4” 335, which is in turn linked to “blog entry 5” 336.
“User page 3” 340 is linked to “blog 3” 341 and “photo homepage 2” 342. “Photo homepage 2” 342 is linked to “photo album 3” 343. “Photo album 3” 343 is linked to “picture 8” 344, “picture 9” 345, “picture 10” 346, and “picture 11” 347.
“User page 4” 350 is linked to “photo homepage 3” 351, “blog 4” 352, and “contact info homepage” 353. A contact info homepage may include contact information for a user. “Photo homepage 3” 351 is linked to “photo album 4” 354, “photo album 5” 355, and “photo album 6” 356. “Photo album 4” 354 is linked to “picture 12” 357, “picture 13” 358, and “picture 14” 359. “Photo album 5” 355 is linked to “picture 15” 360 and “picture 16” 361. “Photo album 6” 356 is linked to “picture 17” 362, “picture 18” 363, “picture 19” 364, and “picture 20” 365. “Blog 4” 352 is linked to “blog entry 6” 366 and “blog entry 7” 367.
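By way of illustration only, the portion of web page hierarchy 300 headed by “user page 1” 310 might be represented as a mapping from each page to the pages it links to, and the pages under a given page collected by a depth-first walk. The `hierarchy` mapping and the `pages_under` function are hypothetical names chosen for this sketch.

```python
# Links in the portion of the hierarchy headed by "user page 1" 310.
hierarchy = {
    "user page 1": ["photo homepage 1", "friends info", "blog 1"],
    "photo homepage 1": ["photo album 1", "photo album 2"],
    "photo album 1": ["picture 1", "picture 2", "picture 3"],
    "photo album 2": ["picture 4", "picture 5", "picture 6"],
}


def pages_under(root, links):
    """Return the root page followed by all of its descendants, depth first."""
    pages = [root]
    for child in links.get(root, []):
        pages.extend(pages_under(child, links))
    return pages


related = pages_under("user page 1", hierarchy)
```

Under these assumptions, `related` holds the twelve pages headed by “user page 1” 310, which is one way the set of related documents to be consolidated could be enumerated.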
Turning now to
At step 410, a set of descriptive information that describes content in one of the plurality of related documents is derived. A set of descriptive information is derived for each of the plurality of related documents resulting in a plurality of descriptive information sets. The descriptive information sets include a separate set of descriptive information for each of the plurality of related documents. In one embodiment, the descriptive information includes text on one of the related documents. The descriptive information could include metadata associated with objects such as videos or photographs on or in a document. For example, a set of descriptive information including a photograph date, a photograph description, and photograph source may be derived from metadata associated with a photograph on a web page. Other text on the web page describing the photograph, such as a caption, may be included in the descriptive information. A set of descriptive information including the text in an article may be derived from a website posting an article. The set of descriptive information describes the document and may include portions of text, and other information from the document.
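By way of illustration only, step 410 might be sketched as follows for a page represented as a simple record. The `derive_descriptive_info` function and the field names (`text`, `photos`, `date`, `description`, `source`, `caption`) are assumptions made for this sketch; the description above does not prescribe any particular data layout.

```python
def derive_descriptive_info(page):
    """Collect page text and photograph metadata into one set of descriptive information."""
    info = {"text": page.get("text", ""), "photos": []}
    for photo in page.get("photos", []):
        info["photos"].append({
            "date": photo.get("date"),
            "description": photo.get("description"),
            "source": photo.get("source"),
            "caption": photo.get("caption"),  # other text on the page describing the photograph
        })
    return info


# Hypothetical picture page with one photograph and its associated metadata.
page = {
    "text": "Our trip to the coast",
    "photos": [{
        "date": "2008-07-04",
        "description": "Sunset over the pier",
        "source": "camera upload",
        "caption": "Best sunset of the trip",
    }],
}
info = derive_descriptive_info(page)
```

Applying the function to every related document yields the plurality of descriptive information sets, one per document.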
At step 420, a subpart identifier that contains navigation information that allows a search engine to navigate to an individual related document associated with the subpart identifier is generated. A subpart identifier is generated for each of the plurality of related documents. In one embodiment, a subpart identifier does not contain a URL. Thus, a plurality of subpart identifiers that includes a separate subpart identifier for each of the plurality of related documents is generated. The subpart identifier may provide navigation information to a document in general, or to a portion of a document. Thus, at the conclusion of steps 410 and 420 a set of descriptive information and a corresponding subpart identifier has been generated for each of the related documents.
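By way of illustration only, one possible form for a subpart identifier that does not contain a URL is a site-relative path, optionally with a fragment that points to a portion of a document. This form is an assumption made for the sketch and is not specified in this description; the function names and the `https://social.example/` address are likewise hypothetical.

```python
from urllib.parse import urlsplit


def make_subpart_identifier(unique_identifier):
    """Keep only the path and fragment so the identifier is not itself a URL."""
    parts = urlsplit(unique_identifier)
    identifier = parts.path.lstrip("/")
    if parts.fragment:
        # A fragment lets the identifier point to a portion of a document.
        identifier += "#" + parts.fragment
    return identifier


def resolve(subpart_identifier, site_root):
    """Rebuild a navigable address from the site root and a subpart identifier."""
    return site_root.rstrip("/") + "/" + subpart_identifier


album_id = make_subpart_identifier("https://social.example/user1/photos/album1")
entry_id = make_subpart_identifier("https://social.example/user1/blog#entry-3")
```

Under this assumption the search engine would combine the identifier with a known site root, as `resolve` does, in order to navigate to the individual related document.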
At step 430, the plurality of descriptive information sets and the plurality of subpart identifiers are integrated into a synthetic search document. A synthetic search document is a single search document that contains multiple subparts. Each subpart includes an individual set of descriptive information paired with a single subpart identifier that contains the navigation information for an individual document from which the individual set of descriptive information is derived. Each subpart corresponds to one of the related documents and includes a set of descriptive information and a subpart identifier.
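By way of illustration only, step 430 might be sketched as pairing each set of descriptive information with its subpart identifier inside a single document structure. The `build_synthetic_document` function and the key names are hypothetical; the serialized form of a synthetic search document is not specified in this description.

```python
def build_synthetic_document(info_sets, subpart_ids):
    """Pair each descriptive information set with its subpart identifier in one document."""
    if len(info_sets) != len(subpart_ids):
        raise ValueError("each descriptive information set needs exactly one subpart identifier")
    return {
        "subparts": [
            {"descriptive_information": info, "subpart_identifier": identifier}
            for info, identifier in zip(info_sets, subpart_ids)
        ]
    }


# Hypothetical outputs of steps 410 and 420 for two related documents.
info_sets = ["profile of user 1", "photo album of a beach vacation"]
subpart_ids = ["user1", "user1/photos/album1"]
synthetic_document = build_synthetic_document(info_sets, subpart_ids)
```

The length check reflects the one-to-one pairing described above: every subpart carries exactly one set of descriptive information and one subpart identifier.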
Each page in the hierarchy corresponds to a subpart in the synthetic search document 500. “User page 1” 310 corresponds with subpart 507. Subpart 507 includes a set of descriptive information 506 describing “user page 1” 310 and subpart identifier 508 that contains navigation information for “user page 1” 310. Subpart 511 corresponds with “photo homepage 1” 311. Subpart 511 includes a set of descriptive information 510 derived from “photo homepage 1” and a subpart identifier 512 with navigation information to “photo homepage 1” 311. Subpart 515 corresponds with “friend info page” 312. Subpart 515 includes a set of descriptive information 514 describing “friend info page” 312 and subpart identifier 516 that contains navigation information to “friend info page” 312. Subpart 519 corresponds to “blog 1” 313. Subpart 519 includes a set of descriptive information 518 describing “blog 1” 313 and subpart identifier 520 that includes navigation information for “blog 1” 313. Subpart 523 corresponds with “photo album page 1” 314. Subpart 523 includes a set of descriptive information 522 that describes “photo album page 1” 314 and a subpart identifier 524 that contains navigation information to “photo album page 1” 314. Subpart 527 corresponds with “photo album page 2” 315. Subpart 527 includes a set of descriptive information 526 describing “photo album page 2” 315 and subpart identifier 528 that includes navigation information to “photo album page 2” 315. Subpart 531 corresponds to “picture 1” 316. Subpart 531 includes a set of descriptive information 530 describing “picture page 1” 316 and subpart identifier 532 that contains navigation information to “picture page 1” 316. Subpart 535 corresponds to “picture page 2” 317. Subpart 535 includes a set of descriptive information 534 describing “picture page 2” 317 and subpart identifier 536 that has navigation information for “picture page 2” 317. Subpart 539 corresponds with “picture page 3” 318. 
Subpart 539 includes a set of descriptive information 538 describing “picture page 3” 318 and a subpart identifier 540 that contains navigation information for “picture page 3” 318. Subpart 543 corresponds with “picture page 4” 319. Subpart 543 includes a set of descriptive information 542 describing “picture page 4” 319 and a subpart identifier 544 with navigation information to “picture page 4” 319. Subpart 547 corresponds with “picture page 5” 320. Subpart 547 includes a set of descriptive information 546 describing “picture page 5” 320 and a subpart identifier 548 with navigation information to “picture page 5” 320. Subpart 551 corresponds with “picture page 6” 321. Subpart 551 includes a set of descriptive information 550 describing “picture page 6” 321 and a subpart identifier 552 that includes navigation information to “picture page 6” 321. Thus, synthetic search document 500 includes a set of descriptive information and a corresponding subpart identifier for each page in the hierarchy headed by “user page 1” 310.
Synthetic search document 500 also includes a header subpart 503 that includes metadata 502 and supplemental information 504. Metadata 502 may include information that identifies synthetic search document 500 to a search engine as a synthetic search document. Additional metadata information may also be included. Supplemental information 504 may include information that describes each of the documents described in synthetic search document 500. The supplemental information may be used to include additional information that describes the documents consolidated into synthetic search document 500 without modifying the underlying documents. For example, supplemental information 504 may indicate that the synthetic search document 500 is associated with a particular user. The supplemental information 504 may include buddy information indicating one or more buddies associated with user 1.
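By way of illustration only, a header subpart carrying metadata and supplemental information might be attached to a synthetic search document as follows. The `add_header` function, the key names, and the `"synthetic-search-document"` marker value are hypothetical and are chosen only for this sketch.

```python
def add_header(synthetic_document, user, buddies):
    """Return a copy of the document with a header subpart placed ahead of its subparts."""
    header = {
        # Metadata identifying the document to a search engine as a synthetic search document.
        "metadata": {"document_type": "synthetic-search-document"},
        # Supplemental information describing the consolidated documents as a whole.
        "supplemental_information": {"user": user, "buddies": list(buddies)},
    }
    return {"header": header, **synthetic_document}


body = {"subparts": [{"descriptive_information": "profile of user 1",
                      "subpart_identifier": "user1"}]}
document = add_header(body, "user 1", ["user 2", "user 3"])
```

Because the header lives only in the synthetic search document, the supplemental information is added without modifying any of the underlying documents.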
Returning now to
In one embodiment, related documents within a large group of documents are automatically determined to be related if they contain designated subject matter content. For example, web pages in a social networking site authored by a user and containing photographs could be identified as related. Embodiments of the present invention may be practiced by a website hosting a large number of pages, at least some of which are logically related. The host of the website may publish sitemaps for the synthetic search documents indicating a relationship between synthetic search documents and providing a guideline to a search engine.
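By way of illustration only, the automatic determination of related documents might be sketched as grouping pages by author when they contain designated content, here photographs. The `group_related` function and the `id`, `author`, and `has_photos` fields are assumptions made for this sketch.

```python
from collections import defaultdict


def group_related(pages):
    """Group page identifiers by author, keeping only pages that contain photographs."""
    groups = defaultdict(list)
    for page in pages:
        if page["has_photos"]:
            groups[page["author"]].append(page["id"])
    return dict(groups)


# Hypothetical pages from a social networking site.
pages = [
    {"id": "picture 1", "author": "user 1", "has_photos": True},
    {"id": "blog 1", "author": "user 1", "has_photos": False},
    {"id": "picture 4", "author": "user 1", "has_photos": True},
    {"id": "picture 8", "author": "user 3", "has_photos": True},
]
groups = group_related(pages)
```

Each resulting group could then be consolidated into its own synthetic search document.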
Turning now to
Turning now to
Turning now to
Turning now to
At step 1020, a subpart identifier that contains navigation information that allows a search engine to navigate to an individual web page associated with the subpart identifier is generated. A subpart identifier is generated for each of the plurality of related web pages. In one embodiment, a subpart identifier does not contain a URL. Thus, a plurality of subpart identifiers that includes a separate subpart identifier for each of the plurality of related web pages is generated. The subpart identifier may provide navigation information to a web page in general, or to a portion of a web page.
At step 1030, the plurality of descriptive information sets and the plurality of subpart identifiers are integrated into a synthetic search document. A synthetic search document is a single search document that contains multiple subparts. Each subpart includes an individual set of descriptive information paired with a single subpart identifier that contains the navigation information for an individual web page from which the individual set of descriptive information is derived. A synthetic search document has been described previously with reference to
At step 1040, information may be added to each of the plurality of related web pages that indicates to a search engine that each of the plurality of related web pages should not be individually indexed. This enables the search engine to respond to a query by searching the synthetic search document rather than each of the plurality of related web pages. Step 1040 helps prevent the search engine from indexing duplicate information. Duplicate indexing may be avoided using, for example, the meta tag or HTTP header methods described above.
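By way of illustration only, step 1040 might be sketched using the widely used robots `noindex` convention, expressed either as a meta tag in the page or as an `X-Robots-Tag` HTTP response header. This description does not name a specific tag, so the choice of that convention is an assumption, as are the `add_noindex_meta` function and the simple string search for `<head>`.

```python
NOINDEX_META = '<meta name="robots" content="noindex">'

# Equivalent signal sent as an HTTP response header instead of in the page body.
NOINDEX_HEADER = ("X-Robots-Tag", "noindex")


def add_noindex_meta(html):
    """Insert a robots noindex meta tag immediately after the opening <head> tag."""
    marker = "<head>"
    position = html.find(marker)
    if position == -1:
        raise ValueError("document has no <head> element")
    insert_at = position + len(marker)
    return html[:insert_at] + NOINDEX_META + html[insert_at:]


page = "<html><head><title>Picture 1</title></head><body></body></html>"
tagged = add_noindex_meta(page)
```

A fuller implementation would likely use an HTML parser so that `<head>` tags with attributes or different letter case are handled; the string search is used here only for brevity.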
The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.
From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages which are obvious and inherent to the system and method. It will be understood that certain features and sub-combinations are of utility and may be employed without reference to other features and sub-combinations. This is contemplated by and is within the scope of the claims.