The subject invention relates generally to computer systems and, more particularly, to systems and methods that employ machine learning techniques to rank and order search results from multiple search sources in order to provide a blended return of the results in terms of relevance to a search query.
Given the popularity of the World Wide Web and the Internet, users can acquire information relating to almost any topic from a large quantity of information sources. In order to find information, users generally apply various search engines to the task of information retrieval. Search engines allow users to find Web pages containing information or other material on the Internet or in internal databases that contain specific words or phrases. For instance, if users want to find information about a breed of horses known as Mustangs, they can type in “Mustang horses”, click on a search button, and the search engine will return a list of Web pages that include information about this breed. If a more generalized search were conducted, however, such as merely typing in the term “Mustang,” many more results would be returned, including, for example, results relating to horses as well as to automobiles associated with the same name.
There are many search engines on the Web, along with a plurality of local databases, where a user can search for relevant information via a query. For instance, AllTheWeb, AskJeeves, Google, HotBot, Lycos, MSN Search, Teoma, and Yahoo are just a few of many examples. Most of these engines provide at least two modes of searching for information: browsing the engine's own catalog of sites organized by topic, or performing a keyword search entered via a user interface portal at the browser. In general, a keyword search will find, to the best of a computer's ability, all the Web sites that have any information in them related to any key words or phrases that are specified in the respective query. A search engine site will provide an input box for users to enter keywords into and a button to press to start the search. Many search engines have tips about how to use keywords to search effectively. The tips are usually provided to help users define search terms more narrowly so that extraneous or unrelated information is not returned to clutter the information retrieval process. Thus, manual narrowing of terms saves users considerable time by mitigating the need to sort through several thousand sites when looking for specific information.
In addition to the type of query terms employed in a search, returned results from the search are often ranked according to a relevance determined by the search engine. Sometimes, non-relevant pages make it into the returned results, which may require a little more analysis of the results to find what users are looking for. Generally, search engines follow a set of rules or an algorithm to order search results in terms of relevance. One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. For instance, pages with the search terms appearing in the HTML title tag are often assumed to be more relevant than others to the topic. Search engines will also check to see if the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. One assumption is that any page relevant to the topic will mention those words from the beginning. Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a web page. Pages with a higher keyword frequency are often deemed more relevant than other web pages. Unfortunately, there is no standard for ranking documents across different search engines, so different search engine algorithms rank results inconsistently with one another.
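By way of illustration and not limitation, the following is a minimal sketch of the kind of location- and frequency-based ranking rule described above. The weights and function name are hypothetical and are not intended to reflect any particular search engine's actual algorithm.

```python
import re

def naive_relevance(page_title, page_body, query_terms, lead_chars=500):
    """Toy location/frequency score: terms in the title or near the top
    of a page count more than terms appearing later in the body."""
    title = page_title.lower()
    body = page_body.lower()
    lead = body[:lead_chars]
    total_words = max(len(re.findall(r"\w+", body)), 1)
    score = 0.0
    for term in (t.lower() for t in query_terms):
        occurrences = len(re.findall(re.escape(term), body))
        score += 3.0 * (term in title)      # title hit: strongest location signal
        score += 2.0 * (term in lead)       # appears near the top of the page
        score += occurrences / total_words  # overall term frequency
    return score
```

Because each engine chooses its own weights and rules for such a score, the resulting rankings are not directly comparable across engines, which is the inconsistency noted above.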
One problem with current searching techniques relates to how to compare, rank, and/or display information that may have been retrieved from multiple database sources. For instance, some users may desire to query two or more Internet search engines with the same query and then analyze the returned results from the respective queries. At the same time, the users may query a local or community database to determine what new information may have been generated on those sites. As can be appreciated, each site may return a plurality of results, wherein the results are ranked according to different standards per the respective sites. Consequently, it is difficult for users to determine the importance or relevance of returned information given the somewhat incompatible ranking standards that are employed by different search tools. Also, this type of searching and analysis can take particularly large amounts of time, both to sift through results from each site and to manually prioritize the information received, given that some sites or engines likely rank returned documents or information sources differently. Thus, in one case, a first search engine may place a result that is more important, given the nature of the query, farther down its list of returned results than a second search engine would.
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.
The subject invention relates to systems and methods that utilize machine learning techniques to analyze query results from multiple search sources in order to blend results across the sources in terms of relevance. In one aspect, one or more learning components (e.g., classifiers) are adapted to search engine databases to determine relevance of information residing on a respective database. The learning components can be trained from a plurality of factors such as query term frequency appearing in a database, how recently a term has been used, time considerations, the number of times a given term has been searched for on a given database, the number of document examinations requested from the database, other metadata considerations, and so forth. After training, the learning components can be employed as an overall scoring system that can be applied to multiple databases in view of a given query. For instance, a scoring or blending ratio can be determined and assigned to results from different databases or regions of a database indicating the relevance of information found therein. Upon determining the ratio, results returned from different sources can be automatically blended or mixed in display format according to the determined ratio or score. For instance, given a respective query, results from a first database may be determined to be relevant at a ratio of 2 to 1 relative to another database that is scored at 1 to 1. Thus, results can be automatically blended as output to the user; in this case, the first two search results would be shown from database 1, followed by one result from database 2, followed by two more results from database 1, and so forth. In this manner, results can be ranked consistently across search tools in order to mitigate the amount of time to find desired information and uncertainty in determining relevance of information from a given source. As can be appreciated, a plurality of blending ratios or scores can be determined.
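By way of illustration and not limitation, the training factors enumerated above could be gathered as a simple feature vector per query/database pair and supplied to a learning component. The field and function names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class QueryDatabaseFeatures:
    """Hypothetical feature vector for one (query, database) pair,
    mirroring the training factors enumerated above."""
    term_frequency: float             # how often the query terms appear in the database
    days_since_term_last_used: float  # recency / time consideration
    times_term_searched: int          # times the term has been searched on the database
    documents_examined: int           # document examinations requested from the database

def to_vector(f: QueryDatabaseFeatures) -> list[float]:
    # A learning component (e.g., a classifier) would consume vectors like this
    # during training to score a database's relevance for a given query.
    return [f.term_frequency,
            f.days_since_term_last_used,
            float(f.times_term_searched),
            float(f.documents_examined)]
```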
To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the following description and the annexed drawings. These aspects are indicative of various ways in which the invention may be practiced, all of which are intended to be covered by the subject invention. Other advantages and novel features of the invention may become apparent from the following detailed description of the invention when considered in conjunction with the drawings.
The subject invention relates to systems and methods that automatically combine or interleave received search results from across knowledge databases in a uniform and consistent manner. In one aspect, an automated search results blending system is provided. The system includes a search component that directs a query to at least two databases. A learning component is employed to rank or score search results that are received from the databases in response to the query. A blending component automatically interleaves or combines the results according to the rank in order to provide a consistent ranking system across differing knowledge sources and search tools. This enables searches over a variety of information types and providers, some coming from within a given search domain and some from outside it. Internally, for those searches that come from within, the search system utilizes multiple evidence factors to produce ranked retrieval. Automated combination of these multiple evidence factors results in what is referred to as “results blending,” that is, blending results that are received from disparate ranking systems in an adaptive manner. Thus, an adaptive interleaving approach is provided to blend search results, which in turn supports more refined machine learning approaches that can also be guided by user interaction data.
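By way of illustration and not limitation, the three components described above can be viewed as the following interfaces; the class and method names are hypothetical and merely sketch one possible decomposition.

```python
from typing import Protocol

class SearchComponent(Protocol):
    def search(self, query: str, database: str) -> list[str]:
        """Directs the query to a database and returns its raw result list."""

class LearningComponent(Protocol):
    def score(self, query: str, database: str) -> float:
        """Ranks/scores a database's results for the query (e.g., a classifier output)."""

class BlendingComponent(Protocol):
    def blend(self, scored_results: dict[str, tuple[float, list[str]]]) -> list[str]:
        """Interleaves results from all databases according to their scores."""
```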
As used in this application, the terms “component,” “system,” “engine,” “query,” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. Also, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems via the signal).
Referring initially to
After training, the learning components 110 can be employed as an overall scoring system that can be applied to multiple databases 120 based on a given query 130. For instance, a scoring or blending ratio can be determined and assigned to results from different databases 120 or regions of a database indicating the relevance of information found therein. Upon determining the ratio or score, results returned from different sources can be automatically blended or mixed in display format according to the determined ratio or score at the user interface 150. For instance, given a respective query, results from a first database 120 may be determined to be relevant at a ratio of 3 to 1 relative to another database that is scored at 2 to 1. Thus, results can be automatically blended as output by the blending component 160 for the user. In this case, the first three search results would be shown from database 1, followed by two results from database 2, followed by the next three results from database 1, and so forth. In this manner, results can be ranked consistently across search engines 140 and databases 120 in order to mitigate the amount of time to find desired information and uncertainty in determining relevance of information from a given source.
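A minimal sketch of the 3-to-2 interleaving pattern described above, for two databases, might look as follows; the function name and ratio values are illustrative only.

```python
def blend_by_ratio(results_db1, results_db2, ratio=(3, 2)):
    """Interleave two result lists in repeating blocks, e.g., 3 from the
    first database followed by 2 from the second, until both are exhausted."""
    take1, take2 = ratio
    blended, i, j = [], 0, 0
    while i < len(results_db1) or j < len(results_db2):
        blended.extend(results_db1[i:i + take1]); i += take1
        blended.extend(results_db2[j:j + take2]); j += take2
    return blended

# e.g., blend_by_ratio(["a1", "a2", "a3", "a4"], ["b1", "b2"])
# -> ["a1", "a2", "a3", "b1", "b2", "a4"]
```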
To illustrate some of the blending concepts described above, the following specific examples are described. In one case, to search for an answer to a problem, a user has different choices that may include a vendor database, the user's own computer (local content), a corporate website, a product website, an OEM website (e.g., Dell), newsgroups, and Internet search sites, to name but a few examples. Thus, the user would select a content provider to conduct a search for information, and may also need to search in multiple places. Currently, results from different search providers cannot be compared easily. One solution is to employ 1-1 interleaving of results that are received from the databases 120. This implies that each site is represented equally (e.g., the top result from site 1 is ranked with the top result from site 2, the second result from site 1 is ranked and displayed with the second result from site 2, and so forth).
In accordance with the subject invention, in addition to 1-1 ranking of results from disparate information sources, intelligent blending of results can be provided based on the learning components 110. As will be shown in test results below, there is value provided to users by employing intelligent blending of results over a 1-1 blending strategy. Thus, search results can be automatically presented from different content providers in a “blended” or combined format at the user interface 150. In one example, this includes providing a unified and ordered list of results at the user interface 150, regardless of where the information comes from or from which database 120.
To illustrate the basic outlines of blending, the following contrasts a 1-1 strategy with a blended results strategy. As will be shown below, search results using intelligent blending (with learning) provide a more relevant data presentation than search results using 1 to 1 interleaving. In a 1-1 interleaving strategy, results are interleaved, one from each provider in order. For instance:
Given providers a, b, c with result sets:
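Since the specific result sets are illustrated in the drawings, the following is a hypothetical sketch of the straight 1-1 interleave, assuming each provider returns a short ordered list of results.

```python
from itertools import chain, zip_longest

_SENTINEL = object()

def one_to_one_interleave(*provider_results):
    """Round-robin: top result from each provider, then the second result
    from each provider, and so forth, until every list is exhausted."""
    rounds = zip_longest(*provider_results, fillvalue=_SENTINEL)
    return [r for r in chain.from_iterable(rounds) if r is not _SENTINEL]

# Hypothetical result sets for providers a, b, c:
a = ["a1", "a2", "a3"]
b = ["b1", "b2", "b3"]
c = ["c1", "c2", "c3"]
print(one_to_one_interleave(a, b, c))
# -> ['a1', 'b1', 'c1', 'a2', 'b2', 'c2', 'a3', 'b3', 'c3']
```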
Rather than a straight 1-1 interleave approach, each data provider can be considered an “expert” in its own domain of knowledge as supported by the databases 120. This expertise can be exploited to influence intelligent blending as described above.
With intelligent blending, a weighted interleaving strategy is employed by the results blending component 160 in accordance with the learning component 110. In this case, data providers are automatically given a ranking using the numbers from a model and classifier (or other learning component) described in more detail below. For this example, given providers a, b, and c with result sets as follows:
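Since the concrete result sets and weights appear in the drawings, the following is only a hypothetical sketch of how the numbers produced by the model could be turned into per-provider block sizes for weighted interleaving; the probabilities and resolution value are illustrative.

```python
def weights_from_probabilities(probs, resolution=6):
    """Convert model-derived provider probabilities (hypothetical values)
    into small integer block sizes for a weighted interleave."""
    return [max(1, round(p * resolution)) for p in probs]

# e.g., a classifier scores providers a, b, c at 0.5, 0.33, 0.17 (illustrative):
print(weights_from_probabilities([0.5, 0.33, 0.17]))  # -> [3, 2, 1]
```

These integer weights could then drive a block interleave akin to the ratio example shown earlier, so that higher-weighted providers contribute more results per blending round.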
Referring briefly to
It is noted that various machine learning techniques or models can be applied by the learning components described above. The learning models can include substantially any type of system such as statistical/mathematical models and processes for modeling data and determining results, including the use of Bayesian learning, which can generate Bayesian dependency models, such as Bayesian networks, naïve Bayesian classifiers, and/or other statistical classification methodology, including Support Vector Machines (SVMs), for example. Other types of models or systems can include neural networks and Hidden Markov Models, for example. Although elaborate reasoning models can be employed in accordance with the present invention, it is to be appreciated that other approaches can also be utilized. For example, rather than a more thorough probabilistic approach, deterministic assumptions can also be employed (e.g., terms falling below a certain threshold amount at a particular web site may, by rule, be assigned a given score). Thus, in addition to reasoning under uncertainty, logical decisions can also be made regarding the term weighting and results ranking.
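By way of illustration and not limitation, one such learning component, a naïve Bayesian classifier over query terms, could be sketched as follows; the class, its methods, and the smoothing scheme are hypothetical, and class priors are omitted for brevity.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesProviderClassifier:
    """Toy naive Bayes model: estimates which data source a query most
    likely 'belongs' to from word counts observed in that source's logs."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # provider -> Counter of words
        self.totals = Counter()                  # provider -> total word count
        self.vocab = set()

    def train(self, provider, words):
        self.word_counts[provider].update(words)
        self.totals[provider] += len(words)
        self.vocab.update(words)

    def score(self, provider, query_words):
        """Log P(query | provider) with Laplace smoothing (priors omitted)."""
        v = len(self.vocab)
        return sum(
            math.log((self.word_counts[provider][w] + 1) / (self.totals[provider] + v))
            for w in query_words
        )

    def rank(self, query_words):
        """Providers ordered from most to least likely for the query."""
        return sorted(self.word_counts,
                      key=lambda p: self.score(p, query_words),
                      reverse=True)
```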
Turning now to
A unified display of all returned results is illustrated at 320. This includes display output of N results which are interleaved or combined according to M blending ratios, wherein N and M are positive integers, respectively. For instance, the first four results at the display 320 may be provided from computations that indicate a ratio of 4-1 for results received from a first database, whereas the next two results may be from a different database having a ratio determined at 2-1. Assuming two databases were employed in this example, the next four results would be listed from the first database, followed by the next two results from the second database, and so forth. In this manner, results can be blended across a plurality of sources and unified at the output display 320 to provide a consistent rank of relevance across the data sources. As noted above, a plurality of databases can be analyzed via learning components and as such, a plurality of results can be interleaved at the display 320 according to the weighted ranking described above.
Before proceeding, it is noted that the user interfaces described above can be provided as a Graphical User Interface (GUI) or other type (e.g., audio or video interface providing results). For example, the interfaces can include one or more display objects (e.g., icons, result lists) that can include such aspects as configurable icons, buttons, sliders, input boxes, selection options, menus, tabs and so forth having multiple configurable dimensions, shapes, colors, text, data and sounds to facilitate operations with the systems described herein. In addition, user inputs can be provided that include a plurality of other inputs or controls for adjusting and configuring one or more aspects of the subject invention. This can include receiving user commands from a mouse, keyboard, speech input, web site, browser, remote web service and/or other device such as a microphone, camera or video input to affect or modify operations of the various components described herein.
Proceeding to 410, one or more classifiers are associated with various data sites to be searched. As noted above, other types of machine learning can be employed in addition to classifiers. At 420, the respective classifiers are trained according to the terms appearing at the data sites. This can include a plurality of factors such as term frequency, location, time factors, and/or other considerations such as relationships to other terms or metadata appearing at the sites. At 430, queries having one or more terms are run at a given or selected data site. After submitting the query to the site, results from the query are scored at 440 via the classifier described at 410. This can include assigning a weight to each query term submitted to the site to determine data relevance or potential for knowledge at the selected site. Proceeding to 450, a determination is made as to whether or not to search a subsequent data site. If so, the process proceeds back to 430, runs the aforementioned query on the next data site, and scores the terms for the next site at 440. If all searches have been conducted for the respective data sites at 450, the process proceeds to 460.
At 460, the returned search results which have been scored for all the sites are blended or interleaved according to the scores assigned at 440. As noted above, blending can occur according to determined ratios for each scored data site. For instance, the top K results from a first site are displayed first in a blended results output, followed by the top L results from a second site, followed by the top M results from a third site, and so forth. The second-highest K results from the first site are then displayed, followed by the second-highest L results, followed by the second-highest M results, wherein this process continues until all results are displayed in a blended or interleaved manner. It is noted that if results from a given site are exhausted, the blending continues with the results remaining at the remaining sites in the proportioned ratios or ranking described above.
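Purely as a sketch of the overall methodology, the scoring loop and the proportioned blending could be expressed as follows; the function names are hypothetical, the act numbers in the comments correspond to those described above, and the classifier is assumed to expose a score method as sketched earlier.

```python
def blended_search(query, sites, classifier, search_fn, block_sizes):
    """Run the query at each selected site (430), score its results via the
    classifier (440), and interleave all result lists in proportioned blocks (460).
    block_sizes supplies one block size per site (e.g., the K, L, M above)."""
    assert len(block_sizes) >= len(sites)
    scored = []
    for site in sites:                                   # 450: loop over the data sites
        results = search_fn(query, site)                 # 430: run the query at the site
        weight = classifier.score(site, query.split())   # 440: score via the classifier
        scored.append((weight, results))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # higher-scored sites lead each round

    blended, cursors = [], [0] * len(scored)
    while any(cur < len(res) for cur, (_, res) in zip(cursors, scored)):
        for k, ((_, res), size) in enumerate(zip(scored, block_sizes)):
            blended.extend(res[cursors[k]:cursors[k] + size])
            cursors[k] += size                           # an exhausted site simply contributes nothing
    return blended
```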
In one specific example, training occurs at the query logs and content providers 530, wherein four different content providers include:
a) support.company.com
b) newsgroups.company.com
c) office.company.com (ISV content) and
d) support.company.com (OEM content)
The classifier 510 then determines the probability that a given query word (or phrase) originates from a particular provider. Testing 540 can include determining the efficacy of query/results blending which can include a graphical user interface (GUI) tool for producing queries and subsequently rating results received therefrom. Analysis tools 550 can include merging components, evaluation components, and measurement components that are employed to create a unified set of results or blended sets having measured results.
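As an illustrative sketch of how the probability that a given query word originates from a particular provider might be estimated from the query logs, the following uses relative frequencies with additive smoothing; the counts and the OEM label are hypothetical, and only the provider names listed above are taken from the description.

```python
from collections import Counter

def provider_probabilities(query_word, log_counts, alpha=1.0):
    """P(provider | word) estimated by relative frequency over the query logs,
    with additive smoothing. log_counts maps provider -> Counter of words."""
    raw = {p: log_counts[p][query_word] + alpha for p in log_counts}
    total = sum(raw.values())
    return {p: count / total for p, count in raw.items()}

# Hypothetical query-log counts for the four content providers listed above:
logs = {
    "support.company.com": Counter({"printer": 120, "fix": 80}),
    "newsgroups.company.com": Counter({"printer": 40, "fix": 60}),
    "office.company.com": Counter({"printer": 10}),
    "support.company.com (OEM)": Counter({"printer": 30, "fix": 20}),
}
print(provider_probabilities("printer", logs))
```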
Using a Blending Query component, queries were run using content from support.company.com mentioned above, wherein the queries were also arranged in a similar breakdown as described above. Then, each result was ranked at a given content provider described above. This process of running queries and ranking according to the probabilities shown at 700 is then repeated for each respective data site described above. After all sites have been ranked, in this example according to the query terms “fix printer”, all the rankings can be automatically merged into a blended set for results analysis.
With reference to
The system bus 918 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 8-bit bus, Industrial Standard Architecture (ISA), Micro-Channel Architecture (MSA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).
The system memory 916 includes volatile memory 920 and nonvolatile memory 922. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 912, such as during start-up, is stored in nonvolatile memory 922. By way of illustration, and not limitation, nonvolatile memory 922 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 920 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
Computer 912 also includes removable/non-removable, volatile/non-volatile computer storage media.
It is to be appreciated that
A user enters commands or information into the computer 912 through input device(s) 936. Input devices 936 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 914 through the system bus 918 via interface port(s) 938. Interface port(s) 938 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 940 use some of the same type of ports as input device(s) 936. Thus, for example, a USB port may be used to provide input to computer 912, and to output information from computer 912 to an output device 940. Output adapter 942 is provided to illustrate that there are some output devices 940 like monitors, speakers, and printers, among other output devices 940, that require special adapters. The output adapters 942 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 940 and the system bus 918. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 944.
Computer 912 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 944. The remote computer(s) 944 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 912. For purposes of brevity, only a memory storage device 946 is illustrated with remote computer(s) 944. Remote computer(s) 944 is logically connected to computer 912 through a network interface 948 and then physically connected via communication connection 950. Network interface 948 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).
Communication connection(s) 950 refers to the hardware/software employed to connect the network interface 948 to the bus 918. While communication connection 950 is shown for illustrative clarity inside computer 912, it can also be external to computer 912. The hardware/software necessary for connection to the network interface 948 includes, for exemplary purposes only, internal and external technologies such as, modems including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.
What has been described above includes examples of the subject invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the subject invention, but one of ordinary skill in the art may recognize that many further combinations and permutations of the subject invention are possible. Accordingly, the subject invention is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.