Most current search engines use keyword-based searching to locate web pages or online information on the World Wide Web (Web). The search engines use web crawlers to traverse online web pages and categorize the web pages' content into inverted indexes. An inverted index is an index data structure that stores a mapping of keywords to online documents where the keywords have been located by a web crawler. An entry in an inverted index contains a keyword and a list of documents that contain the keyword of interest. When a user issues a query such as “dentists in Seattle Wash.” to the search engine, the search engine can quickly retrieve the list of online documents containing these four keywords by looking up the inverted index.
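By way of illustration only, and not limitation, the inverted index described above may be sketched as a mapping from each keyword to the set of documents containing it. The function name `build_inverted_index`, the document identifiers, and the sample text are hypothetical, and whitespace tokenization with lowercasing is an assumption made for brevity.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: keywords are whitespace-separated, case-insensitive tokens.
        for keyword in text.lower().split():
            index[keyword].add(doc_id)
    return dict(index)

# Hypothetical corpus standing in for pages visited by a web crawler.
documents = {
    "D1": "dentists in Seattle",
    "D2": "Seattle coffee shops",
    "D3": "dentists in Portland",
}

index = build_inverted_index(documents)
print(sorted(index["seattle"]))   # documents containing "seattle"
print(sorted(index["dentists"]))  # documents containing "dentists"
```

Looking up a keyword is then a single dictionary access, which is why the search engine can retrieve the document list quickly.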
Most keyword-based search engines operate on the assumption that the user intends to only find documents that contain all of the search terms. Conventional search engines answer submitted queries by locating documents containing every keyword submitted. This is typically referred to as “and-based searching.” When a user over-specifies a query by including unnecessary terms, however, a relevant document that is missing one or more of the extra terms will not be located. In the above example, the inverted index may only specify documents that include the keywords “dentists” and “Seattle” but not “in” and “Washington.” Consequently, the search engine will not return documents that do not include all four keywords.
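The limitation of and-based searching may be illustrated with a minimal sketch that intersects the document lists of every keyword. The index contents and the name `and_search` are hypothetical and serve only to show how an over-specified query excludes an otherwise relevant document.

```python
def and_search(index, query):
    """Return only the documents that contain every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    postings = [index.get(keyword, set()) for keyword in keywords]
    return set.intersection(*postings)

# Hypothetical inverted index: D2 mentions dentists and Seattle only.
index = {
    "dentists": {"D1", "D2"},
    "seattle": {"D1", "D2"},
    "in": {"D1"},
    "wash.": {"D1"},
}

strict = and_search(index, "dentists in Seattle Wash.")
loose = and_search(index, "dentists Seattle")
print(sorted(strict))  # D2 is dropped because it lacks "in" and "wash."
print(sorted(loose))
```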
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
One aspect of the invention is directed to locating web documents that satisfy a subset of the words in a search-engine query. Once a user submits the query to a search engine, the search engine parses the query into keywords and determines whether a subset of the keywords have been found by a web crawler in any online documents. To do so, the search engine may query the words against an inverted index of terms found by a web crawler and check the documents the terms were found in. Also, some keywords in the search-engine query may be designated as “non-relaxed” keywords. Non-relaxed keywords, if specified, must be included in any document identified as matching the query. The search engine returns the identified documents in a search-results list.
Another aspect of the invention is directed to a server configured to return the above search-results list. The server is configured to receive the search-engine query from the client computing device, parse the query into keywords, and search the inverted index to determine whether any documents contain the subset of keywords. The server may also be configured to only locate documents that also contain any non-relaxed keywords.
The present invention is described in detail below with reference to the attached drawing figures, wherein:
The subject matter described herein is presented with specificity to meet statutory requirements. The description herein, however, is not intended to limit the scope of this patent. Instead, it is contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the term “block” may be used herein to connote different elements of methods employed, the term should not be interpreted as implying any particular order among or between various steps herein disclosed.
In general, embodiments described herein are directed toward a search engine that creates a list of results for a search-engine query by identifying documents that include only a subset of the keywords submitted by a user. In one embodiment, once the user submits the search-engine query, the search engine checks an inverted index to locate documents that contain each separate keyword in the query. The identified documents for each word may then be compared to see if the documents contain any of the other keywords. Only documents containing a subset of the keywords are identified for the results list. The subset of keywords equals the total number of keywords (N) minus a given number (K) less than N, resulting in the subset equaling N−K words long. For example, if a query contained “Seattle dentists in Washington,” and K was equal to 1, documents would only have to include any three of the above words to be included on the results list. K can vary by any number and can be set either by an administrator of the search engine or by the search engine automatically using well-known heuristics. For the sake of clarity, N minus K is represented herein as N−K.
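By way of illustration only, the N−K matching described above may be sketched by counting, for each document, how many of the query keywords it contains. The function name `relaxed_search` and the index contents are hypothetical. The sketch assumes that a document containing more than N−K keywords (including all N) also qualifies, and that K is at least 1 and less than N.

```python
from collections import Counter

def relaxed_search(index, keywords, k):
    """Return IDs of documents containing at least N-K of the N keywords."""
    keywords = list(dict.fromkeys(keywords))  # drop repeated keywords, keep order
    n = len(keywords)
    if not 1 <= k < n:
        raise ValueError("K must satisfy 1 <= K < N")
    counts = Counter()
    for keyword in keywords:
        for doc_id in index.get(keyword, ()):
            counts[doc_id] += 1  # one more query keyword found in this document
    return {doc_id for doc_id, matched in counts.items() if matched >= n - k}

# Hypothetical inverted index for the example query.
index = {
    "seattle": {"D1", "D2", "D3"},
    "dentists": {"D1", "D2"},
    "in": {"D1", "D3"},
    "washington": {"D1", "D2"},
}

query = "Seattle dentists in Washington".lower().split()
results = relaxed_search(index, query, k=1)  # any three of the four words suffice
print(sorted(results))
```

In this hypothetical data, D1 contains all four keywords, D2 contains three, and D3 contains two, so only D1 and D2 satisfy the query with K equal to 1.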
In an alternative embodiment, the search engine may be configured to only search for web documents containing a lesser number of words (M) in a given query of N words, with M<N. For example, looking again at the above query, the search engine may be configured in this embodiment to search for documents that have any two or three of the words “Seattle,” “dentists,” “in,” and “Washington.” Thus, in this embodiment, any M words of the query may be matched across web documents.
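One possible way to realize the M-of-N matching of this alternative embodiment, offered only as a sketch, is to intersect the document lists for every combination of M keywords and merge the results. The name `match_any_m` and the index contents are hypothetical, and documents containing more than M of the keywords are assumed to qualify as well.

```python
from itertools import combinations

def match_any_m(index, keywords, m):
    """Return documents that contain every keyword of at least one M-word combination."""
    n = len(keywords)
    if not 1 <= m < n:
        raise ValueError("M must satisfy 1 <= M < N")
    matches = set()
    for combo in combinations(keywords, m):
        postings = [index.get(keyword, set()) for keyword in combo]
        matches |= set.intersection(*postings)  # documents holding all M words of this combination
    return matches

# Hypothetical inverted index.
index = {
    "seattle": {"D1", "D2", "D3"},
    "dentists": {"D1", "D2"},
    "in": {"D1", "D3"},
    "washington": {"D1", "D2"},
}

keywords = ["seattle", "dentists", "in", "washington"]
any_two = match_any_m(index, keywords, 2)
any_three = match_any_m(index, keywords, 3)
print(sorted(any_two), sorted(any_three))
```

The number of combinations grows quickly with N, so counting matched keywords per document may be preferable for long queries; the combination form is shown because it mirrors the wording "any M words of the query."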
A search-engine query, as discussed herein, refers to any keyword search of the Web by a search engine. Web-search queries may be initiated in any number of ways well known to those skilled in the art. For example, a user may enter keywords or phrases into a text field on a search engine's web page or into a text field of a web browser's tool bar. It will be apparent to those skilled in the art that numerous ways for initiating a search-engine query are also possible and need not be discussed at length herein. While embodiments discussed herein refer to accessing web pages via the Internet, other embodiments may access electronic documents via a private network.
In one embodiment, the present invention takes the form of a computer-program product that includes computer-useable instructions embodied on one or more computer-readable media. Computer-readable media include both volatile and nonvolatile media, removable and nonremovable media, and contemplate media readable by a database, a switch, and various other network devices.
By way of example, and not limitation, computer-readable media comprise computer-storage media. Computer-storage media, or machine-readable media, include media implemented in any method or technology for storing information. Examples of stored information include computer-useable instructions, data structures, program modules, and other data representations. Computer-storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory used independently from or in conjunction with different storage media, such as, for example, compact-disc read-only memory (CD-ROM), digital versatile discs (DVD), holographic media or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. These memory components can store data momentarily, temporarily, or permanently.
Having briefly described a general overview of the embodiments described herein, an exemplary operating environment is described below. Referring initially to
Embodiments may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a PDA or other handheld device. Generally, program modules including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types. Embodiments described herein may be practiced in a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, more specialty computing devices, etc. Embodiments described herein may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
With continued reference to
Computing device 100 typically includes a variety of computer-readable media. By way of example, and not limitation, computer-readable media may comprise Random Access Memory (RAM); Read Only Memory (ROM); Electronically Erasable Programmable Read Only Memory (EEPROM); flash memory or other memory technologies; CDROM, digital versatile disks (DVD) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, carrier wave or any other medium that can be used to encode desired information and be accessed by computing device 100.
Memory 112 includes computer-storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, nonremovable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, cache, optical-disc drives, etc. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc.
I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
Before proceeding further, a number of key words and phrases should be defined. As alluded to above, an “inverted index” is an index data structure that includes a mapping of keywords identified by a web crawler to online documents.
When embodiments described herein are applied, the inverted index is used by a search engine to identify documents containing keywords in a submitted search-engine query. Documents containing a subset of the keywords in the query are returned to the submitting user. For example, if the query contained keywords KW1-KW6 and the subset was set to N−1 words (i.e., only 5 of 6 words need to be in a document), only D2 would be returned.
Moreover, inverted indexes store locations of documents containing particular keywords. The inverted indexes may also be configured to store additional information relating to either the keyword or the documents. For keywords, the part of speech of an instance of the keyword may be stored—e.g., if the keyword was being used as a noun, verb, adjective, etc. Additionally, alternative spellings may also be stored for the keyword. Examples of the additional information that may be stored for the documents include, without limitation, document identifiers, document URLs, metadata, meta tags, or the like. One skilled in the art will appreciate that various data may be stored to designate particular keywords and documents; therefore, such data need not be discussed at length herein.
The inverted indexes described herein may be a record-level inverted index that contains a list of references to documents for each listed keyword or a word-level inverted index that contains the positions of each keyword within a document. Embodiments may also employ a hybrid of both types.
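The difference between the two index types may be illustrated with a minimal sketch that builds both from the same documents. The function name `build_indexes`, the sample documents, and the use of zero-based token positions are assumptions made for illustration.

```python
from collections import defaultdict

def build_indexes(documents):
    """Build a record-level and a word-level inverted index from the same documents."""
    record_level = defaultdict(set)                      # keyword -> {doc_id}
    word_level = defaultdict(lambda: defaultdict(list))  # keyword -> {doc_id: [positions]}
    for doc_id, text in documents.items():
        for position, keyword in enumerate(text.lower().split()):
            record_level[keyword].add(doc_id)
            word_level[keyword][doc_id].append(position)
    return record_level, word_level

# Hypothetical documents.
documents = {
    "D1": "sign of peace",
    "D2": "peace sign and peace talks",
}

record_level, word_level = build_indexes(documents)
print(sorted(record_level["peace"]))  # which documents contain "peace"
print(dict(word_level["peace"]))      # where "peace" occurs within each document
```

A hybrid could keep the record-level sets for fast membership tests and consult the word-level positions only when phrase matching is required.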
Keywords, as used herein, are not limited to natural language words. Additionally, keywords may include abbreviations, acronyms, numbers, names, and phrases. For example, a keyword may be “inc.,” “SMTP,” “40,” “John,” or “sign of peace.” While mention is made herein to actual words, any of the above can be used instead.
The term “documents” refers to actual documents, web pages, multimedia (e.g., audio, video, images), or the like that are searchable using a search engine. Documents may be located on networks (e.g., the Internet), within databases, or stored locally on a computing device (e.g., on a local drive, virtual hard drive, or other storage media).
“Relaxed searching” refers to searching for documents that match a subset of the total number of keywords submitted in a search-engine query. Using the terminology above, a subset, in relation to relaxed searching, comprises N−K keywords, with 1≦K<N. This type of searching is referred to as “relaxed,” because it does not require a document to contain all keywords in the search-engine query to be returned within a results list. The identified documents (i.e., those containing N−K keywords) can eventually be listed and presented to the user in a search-results list.
Components of the search-engine server 302 and the information databases 304 may include, without limitation, a processing unit, internal system memory, and a suitable system bus for coupling various system components, including one or more databases for storing information (e.g., files and metadata associated therewith). Each server typically includes, or has access to, a variety of computer-readable media.
While the search-engine server 302 is illustrated as a single box, one skilled in the art will appreciate that the search-engine server 302 is scalable. For example, the search-engine server 302 may actually include multiple servers operating various portions of the software described below. The single unit depictions are meant for clarity, not to limit the scope of embodiments in any form.
In operation, the search-engine server 302 hosts a search engine designed to receive queries from remote computing devices (such as the client computing device 300) and locate information on the Web or within a private network to satisfy the queries. A query is a request for documents on the Web that contain specific keywords or phrases. In some embodiments, the search engine executing on the search-engine server 302 uses continually updated inverted indexes, created by web crawlers, to quickly locate web pages satisfying a query. Once the web pages are located, their URLs are transmitted back to the client computing device 300 and displayed as hyperlinks. To access a located web page, a user need only select the corresponding hyperlink. One skilled in the art will appreciate that various other techniques exist for mining information on the Web.
Documents are stored on information databases 304 and accessible via the network 305 using a transfer protocol and relevant URL. The client computing device 300 may fetch a web page by requesting the URL using the transfer protocol. As a result, the web page can be downloaded to the client computing device 300 and stored in memory. The stored web page can then be read by a web browser and presented to a user.
The client computing device 300 may be any type of computing device, such as device 100 described above with reference to
The client computing device 300 may be equipped with a web browser. The web browser is a software application enabling a user to display and interact with information located on the Web. In an embodiment, the web browser communicates with the search-engine server 302 and the information databases 304 using a transfer protocol to fetch documents. Documents may be located by the web browser by sending the transfer protocol and the URL. The web browser can also render pages in a number of markup languages (e.g., hypertext markup language (HTML) and extensible markup language (XML)) and execute various scripting languages (e.g., SilverLight™, JavaScript, Flash, Visual Basic Scripting Edition (VBScript), or the like).
The user may navigate to the search engine's web site using the web browser. Once at the web site, the user can submit keywords to the search engine, and the client computing device 300, in turn, transmits the keywords to the search engine server 302. Of course, submitting a query to a search engine is more complicated; however, the communication of queries to waiting instances of a search engine will be readily apparent to those skilled in the art, and thus need not be discussed herein.
In one embodiment, the search engine server 302 receives the query and parses the query into one or more keywords. The search engine server 302 searches one or more inverted indexes for documents that contain N−K keywords. The located documents (i.e., those containing N−K words) are listed in a search-results list and transmitted by the search engine server 302 to the client computing device 300 for display to the user.
In one embodiment, the inverted index is prepared by web crawlers browsing documents stored in the information databases 304. The information databases 304 represent servers that are storing various online documents. For example, the information databases 304 may be hosting a web page comprising numerous online documents.
Network 305 may include any computer network or combination thereof. Examples of computer networks configurable to operate as network 305 include, without limitation, a wireless network, landline, cable line, fiber-optic line, local area network (LAN), wide area network (WAN), metropolitan area network (MAN), or the like. Network 305 is not limited, however, to connections coupling separate computer units. Rather, network 305 may also comprise subsystems that transfer data between servers or computing devices. For example, network 305 may also include a point-to-point connection, the Internet, an Ethernet, a backplane bus, an electrical bus, a neural network, or other internal system.
In an embodiment where network 305 comprises a LAN networking environment, components are connected to the LAN through a network interface or adapter. In an embodiment where network 305 comprises a WAN networking environment, components use a modem, or other means for establishing communications over the WAN, to communicate. In embodiments where network 305 comprises a MAN networking environment, components are connected to the MAN using wireless interfaces or optical fiber connections. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may also be used.
Moreover, communication across network 305 may require the illustrated devices to use a communications protocol. Examples of such protocols include, without limitation, the hypertext transfer protocol (HTTP), transmission control protocol/Internet protocol (TCP/IP), or the like. One skilled in the art will understand the various protocols that may be used to communicate across network 305; therefore, such protocols need not be discussed at length herein.
In another embodiment, certain keywords in the search-engine query may be designated not to be relaxed, meaning all retrieved documents must include the non-relaxed word. Taking the above example again, “Seattle” in the query “dentists in Seattle Wash.” may be specified not to be relaxed. Consequently, the inverted indexes are analyzed for documents that contain “Seattle” as one of the N−K terms. The following code, or a variant thereof, could be used to designate a non-relaxed keyword class.
And the following code or a variant thereof could be used to specify a non-relaxed word in a query.
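By way of illustration only, and without reproducing the listings referred to above, one hypothetical variant of a non-relaxed keyword designation is sketched below. The `Keyword` class, its `relaxed` flag, the `parse_query` and `relaxed_search` names, and in particular the leading "+" used to mark a non-relaxed word in a query are all assumptions introduced for this sketch, as are the index contents.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Keyword:
    text: str
    relaxed: bool = True  # False: every returned document must contain this keyword

def parse_query(query):
    """Hypothetical syntax: a leading '+' designates a non-relaxed keyword."""
    keywords = []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            keywords.append(Keyword(token[1:], relaxed=False))
        else:
            keywords.append(Keyword(token))
    return keywords

def relaxed_search(index, keywords, k):
    """Return documents with at least N-K keywords, including all non-relaxed ones."""
    n = len(keywords)
    if not 1 <= k < n:
        raise ValueError("K must satisfy 1 <= K < N")
    required = [kw.text for kw in keywords if not kw.relaxed]
    candidates = set().union(*(index.get(kw.text, set()) for kw in keywords))
    results = set()
    for doc_id in candidates:
        matched = sum(1 for kw in keywords if doc_id in index.get(kw.text, set()))
        has_required = all(doc_id in index.get(text, set()) for text in required)
        if matched >= n - k and has_required:
            results.add(doc_id)
    return results

# Hypothetical inverted index: D3 matches three keywords but lacks "seattle".
index = {
    "dentists": {"D1", "D2", "D3"},
    "in": {"D1", "D2", "D3"},
    "seattle": {"D1", "D2"},
    "wash.": {"D1", "D3"},
}

pinned = relaxed_search(index, parse_query("dentists in +Seattle Wash."), k=1)
unpinned = relaxed_search(index, parse_query("dentists in Seattle Wash."), k=1)
print(sorted(pinned), sorted(unpinned))
```

In this sketch the non-relaxed keyword counts as one of the N−K matched terms, consistent with the description above; D3 is excluded only when "Seattle" is designated as non-relaxed.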
In operation, a user accesses a web site for the search engine using a web browser 306 on the client computing device 300. The user may enter and submit a search-engine query A on the web site, which in turn transmits the search-engine query A to search engine server 302. In one embodiment, the front end 308 comprises a parser 312, which is software that splits the search-engine query A into individual keywords B. Alternatively, the parser 312 may split the search-engine query A into phrases of multiple keywords.
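The splitting performed by the parser 312 may be sketched as follows, by way of example only. The convention that double quotes group several words into a single phrase keyword is an assumption of this sketch, as are the function name `parse_query` and the lowercasing of keywords.

```python
import re

# A double-quoted phrase, or otherwise a run of non-whitespace characters.
_TOKEN = re.compile(r'"([^"]+)"|(\S+)')

def parse_query(query):
    """Split a query into keywords; quoted text is kept together as one phrase."""
    keywords = []
    for phrase, word in _TOKEN.findall(query):
        keywords.append((phrase or word).lower())
    return keywords

individual = parse_query("dentists in Seattle Wash.")
with_phrase = parse_query('dentists in "Seattle Wash."')
print(individual)
print(with_phrase)
```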
The keywords B are passed to one or more inverted indexes 314 on the back end 310. In one embodiment, the back end 310 traverses the entries in the inverted indexes 314 to attempt to locate the keywords. The inverted indexes 314 indicate documents 318 that contain the entries listed in the inverted indexes 314. As previously mentioned, each entry comprises a keyword (not to be confused necessarily with the keywords B) and all of the documents 318 in which the keyword has been located by a web crawler 316. Various information (e.g., document identifiers, URLs, internet protocol (IP) addresses, etc.) for each identified document 318 may be stored in the inverted indexes 314 in association with the keyword.
In one embodiment, the back end 310 searches the inverted indexes 314 for the keywords. In this embodiment, the back end 310 transfers a list of documents D that contain at least one of the keywords B. For example, documents D for keywords “dentists in Seattle Wash.” may include all the documents 318 containing “dentists,” “in,” “Seattle,” and “Washington.” In one embodiment, a relaxed aggregator 320, which is a portion of software executing on the back end 310, searches the documents D for documents that contain N−K keywords B (referred to as documents E).
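The two stages described above, gathering documents D and then narrowing them to documents E, may be sketched as follows. The function names `lookup_documents` and `relaxed_aggregate` and the index contents are hypothetical, and a document matching more than N−K keywords is assumed to qualify.

```python
def lookup_documents(index, keywords):
    """Documents D: every document containing at least one keyword, with its matches."""
    documents_d = {}
    for keyword in keywords:
        for doc_id in index.get(keyword, ()):
            documents_d.setdefault(doc_id, set()).add(keyword)
    return documents_d  # doc_id -> set of query keywords found in that document

def relaxed_aggregate(documents_d, n, k):
    """Documents E: the members of D that contain at least N-K of the keywords."""
    return {doc_id for doc_id, matched in documents_d.items() if len(matched) >= n - k}

# Hypothetical inverted index.
index = {
    "dentists": {"D1", "D2"},
    "in": {"D1", "D2", "D3"},
    "seattle": {"D1", "D3"},
    "washington": {"D1", "D2"},
}

keywords_b = ["dentists", "in", "seattle", "washington"]
documents_d = lookup_documents(index, keywords_b)
documents_e = relaxed_aggregate(documents_d, n=len(keywords_b), k=1)
print(sorted(documents_d), sorted(documents_e))
```

Keeping the matched keywords alongside each document in D means the aggregator does not need to consult the inverted indexes a second time.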
Documents E (i.e., documents with N−K keywords B) are passed to a results generator 322 on the front end 308. The results generator 322 creates a search-results list F that includes documents E, i.e., those containing N−K of keywords B. The documents may be prioritized within the list in various ways. For example, URLs for the most frequently accessed documents may be given priority on the list. Alternatively, geographically relevant results may be given priority, based on the geographic location of the client computing device 300 as determined, for example, by a reverse IP address lookup or a global positioning system (GPS) device. One skilled in the art will understand that other alternatives are also possible and need not be discussed at length herein. Eventually, the search-results list F is transmitted to the client computing device 300 and displayed to the user in the web browser 306.
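Prioritizing frequently accessed documents may be sketched, by way of example only, as a sort on a per-document access count. The function name `generate_results`, the access counts, and the example.com URLs are hypothetical placeholders; ties are broken by document identifier so the ordering is deterministic.

```python
def generate_results(documents_e, urls, access_counts):
    """Return URLs for documents E, most frequently accessed first."""
    ranked = sorted(
        documents_e,
        key=lambda doc_id: (-access_counts.get(doc_id, 0), doc_id),
    )
    return [urls[doc_id] for doc_id in ranked]

# Hypothetical data for three matching documents.
documents_e = {"D1", "D2", "D3"}
urls = {
    "D1": "https://example.com/d1",
    "D2": "https://example.com/d2",
    "D3": "https://example.com/d3",
}
access_counts = {"D1": 10, "D2": 250, "D3": 40}

results_list_f = generate_results(documents_e, urls, access_counts)
print(results_list_f)
```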
The back end 310 is also configured to operate a web crawler 316 for traversing documents 318 and updating the inverted indexes 314. New entries may be added, existing entries updated, or stale entries deleted. This web crawler 316 may operate on a parallel thread to the relaxed aggregator 320. One skilled in the art will understand web crawlers in detail; therefore, they need not be discussed at length herein.
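A minimal sketch of index maintenance on a parallel thread is given below, by way of illustration only. The `InvertedIndex` class, its method names, and the use of a single lock to keep crawler updates from interleaving with lookups are assumptions of this sketch rather than details taken from the embodiments.

```python
import threading

class InvertedIndex:
    """Keyword -> document IDs, safe to update while lookups run on another thread."""

    def __init__(self):
        self._postings = {}      # keyword -> set of doc IDs
        self._doc_keywords = {}  # doc ID -> keywords last seen in that document
        self._lock = threading.Lock()

    def upsert(self, doc_id, text):
        """Add a new document or update the entries of a re-crawled one."""
        new_keywords = set(text.lower().split())
        with self._lock:
            old_keywords = self._doc_keywords.get(doc_id, set())
            for keyword in old_keywords - new_keywords:
                self._discard(keyword, doc_id)
            for keyword in new_keywords - old_keywords:
                self._postings.setdefault(keyword, set()).add(doc_id)
            self._doc_keywords[doc_id] = new_keywords

    def remove(self, doc_id):
        """Delete stale entries for a document that no longer exists."""
        with self._lock:
            for keyword in self._doc_keywords.pop(doc_id, set()):
                self._discard(keyword, doc_id)

    def _discard(self, keyword, doc_id):
        # Caller must hold the lock.
        docs = self._postings.get(keyword)
        if docs is not None:
            docs.discard(doc_id)
            if not docs:
                del self._postings[keyword]

    def lookup(self, keyword):
        with self._lock:
            return set(self._postings.get(keyword, ()))

index = InvertedIndex()

def crawl(pages):
    for doc_id, text in pages.items():
        index.upsert(doc_id, text)

# Hypothetical crawl running on its own thread.
pages = {"D1": "dentists in Seattle", "D2": "Seattle coffee"}
crawler = threading.Thread(target=crawl, args=(pages,))
crawler.start()
crawler.join()

index.upsert("D2", "Portland coffee")  # existing entry updated
index.remove("D1")                     # stale entry deleted
print(index.lookup("seattle"), index.lookup("coffee"))
```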
Although the subject matter has been described in language specific to structural features and methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. For example, sampling rates and sampling periods other than those described herein may also be captured by the breadth of the claims.