The subject invention relates generally to computer systems, and more particularly, relates to systems and methods that employ relevance classification techniques on a data log of previous search results to enhance the quality of current search engine results.
Given the popularity of the World Wide Web and the Internet, users can acquire information relating to almost any topic from a large quantity of information sources. In order to find information, users generally apply various search engines to the task of information retrieval. Search engines allow users to find Web pages on the Internet that contain specific words or phrases. For instance, if a user wants to find information about George Washington, the first president of the United States, the user can type in “George Washington first president”, click on a search button, and the search engine will return a list of Web pages that include information about this famous president. If a more generalized search were conducted, however, such as merely typing in the term “Washington,” many more results would be returned, such as results relating to geographic regions or institutions associated with the same name.
There are many search engines on the Web. For instance, AllTheWeb, AskJeeves, Google, HotBot, Lycos, MSN Search, Teoma, and Yahoo are just a few of many examples. Most of these engines provide at least two modes of searching for information: browsing the engine's own catalog of sites organized by topic, or performing a keyword search that is entered via a user interface portal at the browser. In general, a keyword search will find, to the best of a computer's ability, all the Web sites that have any information in them related to any key words and phrases that are specified. A search engine site will have a box for users to enter keywords into and a button to press to start the search. Many search engines have tips about how to use keywords to search effectively. The tips are usually provided to help users more narrowly define search terms so that extraneous or unrelated information is not returned to clutter the information retrieval process. Thus, manual narrowing of terms saves users a lot of time by helping to mitigate receiving several thousand sites to sort through when looking for specific information.
One problem with current searching techniques is the requirement of manual focusing or narrowing of search terms in order to generate desired results in a short amount of time. Another problem is that search engines operate the same for all users regardless of different user needs and circumstances. Thus, if two users enter the same search query they get the same results, regardless of their interests, previous search history, computing context, or environmental context (e.g., location, machine being used, time of day, day of week). Unfortunately, modern searching processes are designed for receiving explicit commands with respect to searches rather than considering these other personalized factors that could offer insight into the user's actual or desired information retrieval goals.
From Web search engines to desktop application utilities (e.g., help systems), users consistently utilize information and retrieval systems to discover unknown information about topics of interest. In some cases, these topics are prearranged into topic and subtopic areas. For example, “Yahoo” provides a hierarchically arranged predetermined list of possible topics (e.g., business, government, science, etc.) wherein the user will select a topic and then further select a subtopic within the list. Another example of predetermined lists of topics is common on desktop personal computer help utilities wherein a list of help topics and related subtopics are provided to the user. While these predetermined hierarchies may be useful in some contexts, users often need to search for/inquire about information that is hard to find by following the topic structures or is outside of and/or not included within these predetermined lists. Thus, search engines or other search systems are often employed to enable users to direct user-crafted queries in order to find desired information. Unfortunately, this often leads to frustration when many unrelated files are retrieved since users may be unsure of how to author or craft a particular query. This often causes users to continually modify queries in order to refine retrieved search results to a reasonable number of files. For those who are not familiar with computer techniques, this can be very difficult. As a result, they may not be able to find what they want.
As an example of this dilemma, it is not uncommon to type in a word or phrase in a search system input query field and retrieve several thousand files—or millions of web sites in the case of the Internet, as potential candidates. In order to make sense of the large volume of retrieved candidates, the user will often experiment with other word combinations to further narrow the list since many of the retrieved results may share common elements, terms or phrases yet have little or no contextual similarity in subject matter. This approach is inaccurate and time consuming for both the user and the system performing the search. Inaccuracy is illustrated in the retrieval of thousands if not millions of unrelated files/sites the user is not interested in. Time and system processing speed are also sacrificed when searching massive databases for possible yet unrelated files.
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.
The subject invention relates to systems and methods that employ data mining and learning techniques to facilitate efficient searching, retrieval, and analysis of information. In one aspect, a learning component such as a Bayesian classifier, for example, is trained from a log that stores information from a plurality of past user search activities. For instance, the learning component can determine whether certain returned results in the log are more or less relevant to users by analyzing implicit or explicit data within the logs, wherein such data indicates the relevance or quality of search results or a subset of results. In one specific example, it may be determined that, given a set of returned search results, users have dwelled on (e.g., spent more time with) certain types of results than on other types given the nature of the initial search query, indicating higher relevance. Over time, the learning component can be trained from the past search activities and employed as a run-time classifier with a search engine to filter or determine the most relevant results for a user's submitted query to the engine. In this manner, by automatically classifying results that are more likely relevant to a user, information search processes can be enhanced by mitigating the amount of time required for users to locate desired information.
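As a rough illustration of the run-time classifier idea, the following sketch trains a naive Bayes model over query terms from logged satisfied selections and uses it to rank results for a new query. The log tuples, result identifiers, and smoothing constant are hypothetical assumptions for this sketch; the specification does not prescribe this exact representation.

```python
import math
from collections import defaultdict

def train_runtime_classifier(log):
    """Accumulate per-result term counts from satisfied searches in a log.

    `log` is a list of (query, result_id, satisfied) tuples -- a
    hypothetical simplification of the search-log schema.
    """
    term_counts = defaultdict(lambda: defaultdict(int))  # result -> term -> count
    result_counts = defaultdict(int)                     # result -> satisfied-click total
    for query, result_id, satisfied in log:
        if not satisfied:
            continue                                     # train only on satisfied clicks
        result_counts[result_id] += 1
        for term in query.lower().split():
            term_counts[result_id][term] += 1
    return term_counts, result_counts

def rank_results(query, term_counts, result_counts, alpha=1.0):
    """Rank known results for `query` by naive Bayes score (add-one smoothing)."""
    total = sum(result_counts.values())
    vocab = {t for terms in term_counts.values() for t in terms}
    scores = {}
    for result_id, n in result_counts.items():
        score = math.log(n / total)                      # log prior for this result
        denom = sum(term_counts[result_id].values()) + alpha * len(vocab)
        for term in query.lower().split():
            score += math.log((term_counts[result_id][term] + alpha) / denom)
        scores[result_id] = score
    return sorted(scores, key=scores.get, reverse=True)

log = [
    ("change desktop background", "help:wallpaper", True),
    ("desktop wallpaper", "help:wallpaper", True),
    ("printer not working", "help:printer", True),
    ("install printer driver", "help:printer", True),
]
tc, rc = train_runtime_classifier(log)
# "help:wallpaper" ranks first for a background-related query on this toy log.
```

A production classifier would of course be trained over far larger logs and richer features, but the same prior/likelihood structure applies.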
Various analytical techniques can be employed to train learning components and facilitate future information retrieval processes. This can include analyzing the number of times users have actually selected a result to determine its relevance in view of a given query. Rather than requiring the user to provide explicit feedback as to relevance, implicit factors can be analyzed, such as how many times a particular result was opened, how much time was spent with a file linked to a result, or how far the user drilled down into a particular file. In this manner, relevance can be automatically determined without further burdening users to explicitly inform the system as to which results are relevant and which are not. Sequential analysis techniques can be applied to previously failed queries to automatically enhance future queries. Other relevance factors for refining future queries and resolving ambiguities include analyzing extrinsic or contextual data such as operating system version, the type of application used, hardware settings, and so forth. This can include incorporating variables such as seasonal or time-sensitive information into a query to facilitate returning more relevant results.
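One possible way to fold the implicit factors just mentioned (opens, dwell time, drill-down depth) into a single relevance signal is a simple weighted combination. The caps, weights, and function name below are illustrative assumptions rather than values from the specification.

```python
def implicit_relevance(open_count, dwell_seconds, drill_depth):
    """Combine implicit signals into a coarse relevance score in [0, 1].

    The caps and weights are illustrative assumptions, not values taken
    from the specification.
    """
    opened = min(open_count, 3) / 3.0          # repeated opens saturate quickly
    dwelled = min(dwell_seconds, 120) / 120.0  # two minutes counts as a full dwell
    drilled = min(drill_depth, 4) / 4.0        # how far the user navigated inside
    return round(0.4 * dwelled + 0.35 * opened + 0.25 * drilled, 3)

# A result opened twice with a 90-second dwell scores higher than one
# glanced at for 5 seconds:
assert implicit_relevance(2, 90, 1) > implicit_relevance(1, 5, 0)
```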
To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the following description and the annexed drawings. These aspects are indicative of various ways in which the invention may be practiced, all of which are intended to be covered by the subject invention. Other advantages and novel features of the invention may become apparent from the following detailed description of the invention when considered in conjunction with the drawings.
The subject invention relates to systems and methods that automatically learn data relevance from past search activities and apply such learning to facilitate future search activities. In one aspect, an automated information retrieval system is provided. The system includes a learning component that analyzes stored information retrieval data to determine relevance patterns from past user information search activities. A search component (e.g., search engine) employs the learning component to determine a subset of current search results based at least in part on the relevance patterns. Numerous variables can be processed in accordance with the learning component including search failure data, relevance data, implicit data, system data, application data, hardware data, contextual data such as time-specific information, and so forth in order to efficiently generate focused, prioritized, and relevant search results.
As used in this application, the terms “component,” “system,” “engine,” “query,” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. Also, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems via the signal).
Referring initially to
Classifiers (e.g., runtime classifiers) generated using machine learning techniques such as a Naive Bayesian model on end-user search data logs 120 can be employed together with an information retrieval (IR) component to form a highly relevant search engine. In one aspect, relevance data is determined from the log 120 by identifying user-satisfied search results to train runtime classifiers. Currently, some systems treat all clicks or selections on search results as satisfied by the user. Experiments show that only about ⅓ of the time when users select a result are they actually satisfied with the selection. Therefore, training only on “satisfied” clicks or selections will lead to optimized classifiers. To know whether a click is satisfied, users can be asked for their explicit feedback. However, in many situations, only a small percentage of users provide explicit feedback. To obtain feedback on all clicks, the system 100 can use clicks with explicit feedback to build another classifier that maps user behavior data (e.g., the time a user spent on a result, where they go from this result, some meta data on the result itself) to the explicit feedback. This classifier is referred to as a relevance classifier. The relevance classifier is then applied to the clicks/results for which users did not provide explicit feedback in order to infer their satisfaction. This technique provides high-quality data to train runtime classifiers.
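The relevance-classifier step described above can be sketched as follows: learn a decision rule from clicks that do carry explicit feedback, then apply it to infer satisfaction for clicks that do not. For brevity this sketch reduces the behavior features to a single dwell-time threshold; a deployed relevance classifier would use richer features and a fuller model.

```python
def learn_dwell_threshold(labeled):
    """Pick the dwell-time cutoff that best separates satisfied from
    unsatisfied clicks in explicitly labeled data.

    `labeled` is a list of (dwell_seconds, satisfied) pairs; a single
    threshold stands in for the richer behavior features in the text.
    """
    best_t, best_acc = 0, -1.0
    for t in sorted({d for d, _ in labeled}):
        acc = sum((d >= t) == sat for d, sat in labeled) / len(labeled)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def infer_satisfaction(threshold, unlabeled_dwells):
    """Label clicks lacking explicit feedback using the learned rule."""
    return [d >= threshold for d in unlabeled_dwells]

labeled = [(3, False), (5, False), (40, True), (65, True), (8, False), (55, True)]
threshold = learn_dwell_threshold(labeled)   # -> 40 for this toy data
inferred = infer_satisfaction(threshold, [2, 50, 70])
```

The inferred labels can then join the explicitly labeled clicks as training data for the runtime classifiers.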
During searches, when one query 160 does not provide satisfying results, a user may revise the query and resubmit it. They may repeat this process until a satisfying result is returned. Various data mining techniques can be employed, such as sequential analysis, to analyze user search log data 120 and link failed queries (queries that do not have satisfied results) to the satisfied results of their revised queries, and to include these linked data in the training data for the runtime classifiers of the learning component 110. When the new runtime classifiers are deployed on a search server, for instance, users receive satisfied results 150 on queries that were not satisfied by a conventional search engine that did not employ the classifiers, or by the earlier version of the search server (before deploying the new runtime classifiers).
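The sequential-analysis linking described above might be sketched as follows, assuming a session is an ordered list of (query, clicked_result, satisfied) events; the event format and field names are illustrative assumptions.

```python
def link_failed_queries(session):
    """Pair each failed query in a session with the satisfied result that
    the user's revised query eventually reached.

    `session` is an ordered list of (query, clicked_result, satisfied)
    events; the returned pairs can be added to the runtime-classifier
    training data.
    """
    pairs = []
    pending = []                       # failed queries awaiting a satisfied result
    for query, result, satisfied in session:
        if satisfied:
            pairs.extend((failed, result) for failed in pending)
            pending = []
        else:
            pending.append(query)
    return pairs

session = [
    ("print landscape", None, False),
    ("page orientation", None, False),
    ("print page sideways", "help:orientation", True),
]
linked = link_failed_queries(session)
# Both failed queries become training pairs with the result that finally
# satisfied the user.
```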
Other considerations include training runtime classifiers using only terms in query strings. However, the classifier can be enhanced by including extra input variables such as operating system version, application used, and hardware settings including whether a printer is linked or whether a digital camera is linked, for example. This extra information aids the runtime classifier in resolving potential ambiguities, thus providing improved result predictions. Still other predictions include providing query mapping for handling contextual data such as seasonal/time-sensitive contexts, for example. During query processing stages, seasonal/time-sensitive queries can be mapped to a version with time information, using lexical services in one instance. For example, when the time is close to 2005, “Calendar” can be mapped to “Calendar Calendar-2005”. This improves the chance that Calendar 2005 appears at the top of a result list in the relevance results 150.
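A minimal sketch of the seasonal query mapping might look like the following; the set of time-sensitive terms and the 60-day window are assumptions for illustration, and the specification attributes the real mapping to lexical services.

```python
import datetime

SEASONAL_TERMS = {"calendar", "planner", "holidays"}  # assumed keyword list

def map_seasonal_query(query, today, window_days=60):
    """Append a year-qualified variant of each time-sensitive term, so that
    (for example) "Calendar" near the end of 2004 becomes
    "Calendar Calendar-2005".
    """
    hits = [t for t in query.split() if t.lower() in SEASONAL_TERMS]
    if not hits:
        return query
    next_year_start = datetime.date(today.year + 1, 1, 1)
    if (next_year_start - today).days <= window_days:
        year = today.year + 1          # close to the new year: map forward
    else:
        year = today.year
    return query + " " + " ".join("{}-{}".format(t, year) for t in hits)
```

Passing the current date in explicitly keeps the mapping testable and lets the same logic serve any year, not just 2005.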
It is noted that various machine learning techniques or models can be applied by the learning component 110 to process the data log 120 over time. The learning models can include substantially any type of system such as statistical/mathematical models and processes for modeling users and determining results including the use of Bayesian learning, which can generate Bayesian dependency models, such as Bayesian networks, naïve Bayesian classifiers, and/or other statistical classification methodology, including Support Vector Machines (SVMs), for example. Other types of models or systems can include neural networks and Hidden Markov Models, for example. Although elaborate reasoning models can be employed in accordance with the present invention, it is to be appreciated that other approaches can also be utilized. For example, rather than a more thorough probabilistic approach, deterministic assumptions can also be employed (e.g., no dwelling for X amount of time on a particular web site may imply by rule that the result is not relevant). Thus, in addition to reasoning under uncertainty, logical decisions can also be made regarding the status, location, context, interests, focus, and so forth.
Learning models can be trained from a user event data store (not shown) that collects or aggregates contextual data from a plurality of different data sources. Such sources can include various data acquisition components that record or log user event data (e.g., cell phone, acoustical activity recorded by microphone, Global Positioning System (GPS), electronic calendar, vision monitoring equipment, desktop activity, web site interaction and so forth). It is noted that the system 100 can be implemented in substantially any manner that supports personalized query and results processing. For example, the system could be implemented as a server, a server farm, within client application(s), or more generalized to include a web service(s) or other automated application(s) that interact with search functions such as a user interface (not shown) for the search engine 140.
Proceeding to 210 of
At 240, new queries submitted by a user or system are analyzed by a search tool having a trained classifier operating therewith. This can include analyzing various contextual sources such as application data, hardware data, time data, seasonal data, calendar data, system data, file meta data, and so forth to further refine a respective query to produce relevant search results. At 250, search result subsets that have been determined from the trained classifiers and/or contextual data considerations are generated and provided to a user. This can include generating an output display via a user interface if desired. As can be appreciated, relevance results that have been generated in accordance with the present invention can be further analyzed (e.g., to provide further training to a classifier) and thus operate as nested opportunities for training or relevance refinement.
Turning to
To train the relevance classifier 300, a set of data is employed with both implicit feedback and explicit feedback at the result level, wherein each entry in the data set represents a result of a search (an entry can link to multiple interactions with the result from a user in a single search session, or a visit to an asset from a user browsing). The classifier is then used to infer the explicit feedback of a user on a result from implicit feedback when explicit feedback on the result is not available, for example. In one case, decision tree learning can be employed for the relevance classifiers 300, but other types of learning are also possible.
At 310, components for building and using the relevance classifier 300 are described as follows:
At 320, schema considerations for processing relevance classifiers are shown in the case of saving relevance classifiers in a data base. For example, generated relevance classifiers 300 can be loaded into a table in a database and subscribe to the following schema attributes such as: a ClassifierID (unique id), a GUID, a Classifier Name, a Description, a Status (active or inactive), a Scope (e.g., software version), other Version information, a Training Set Size, and Classifier (XML string). Another table can include User Relevance Factor storing the factors used by classifiers including UsedRelevanceFactorID (unique id), ClassifierID, and FactorTypeID.
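One concrete rendering of the database schema outlined above, using SQLite for illustration: the column names follow the attributes listed, while the types and constraints are assumptions not specified in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RuntimeClassifier (
    ClassifierID    INTEGER PRIMARY KEY,
    GUID            TEXT NOT NULL,
    ClassifierName  TEXT NOT NULL,
    Description     TEXT,
    Status          TEXT CHECK (Status IN ('active', 'inactive')),
    Scope           TEXT,            -- e.g., software version
    Version         TEXT,
    TrainingSetSize INTEGER,
    Classifier      TEXT             -- serialized model as an XML string
);
CREATE TABLE UsedRelevanceFactor (
    UsedRelevanceFactorID INTEGER PRIMARY KEY,
    ClassifierID          INTEGER REFERENCES RuntimeClassifier(ClassifierID),
    FactorTypeID          INTEGER
);
""")
conn.execute(
    "INSERT INTO RuntimeClassifier "
    "(GUID, ClassifierName, Status, TrainingSetSize, Classifier) "
    "VALUES (?, ?, ?, ?, ?)",
    ("0000-demo", "help-search-v1", "active", 125000, "<model/>"),
)
row = conn.execute(
    "SELECT ClassifierName FROM RuntimeClassifier WHERE Status = 'active'"
).fetchone()
```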
At 540, if the runtime classifier did not pass the evaluation at 530, this is indicated and processing proceeds to 550 for diagnostics. Otherwise, satisfaction with the runtime classifier is indicated (the system creates a final classifier for publishing at this time by combining the training set, regression set, and the internal diagnostics set). If the evaluation did not pass at 540, proceed to 550 and diagnose the classifier by providing the following information, after which a diagnostics report is created. The information includes a runtime classifier ID (the same date range as for training can be used here). At 560, read the diagnostics report and take actions to change the training data. Then, return to 510 to recreate a new runtime classifier. Note that the training data should be changed at this point. At 570, the runtime classifier is ready for publishing to the search engine for deployment. It is noted that in 500, some acts can be automated. Runtime classifiers and their meta data can be saved in a database shared by all the processes in 500.
Wuser*User_annotated_data∪Wauthor*Author_annotated_data
where Wuser is the weight given to each pair in the user annotated data 610, and Wauthor is the weight given to each pair in the author annotated data 620.
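The weighted union above can be realized as a frequency-weighted training multiset; the default weight values and the (query, asset) pair format below are placeholders for illustration.

```python
from collections import Counter

def combine_annotated_data(user_pairs, author_pairs, w_user=1.0, w_author=2.0):
    """Build a weighted training multiset of (query, asset) pairs.

    Implements the union expression above; the default weights are
    placeholders and would be tuned in practice.
    """
    weighted = Counter()
    for pair in user_pairs:
        weighted[pair] += w_user
    for pair in author_pairs:
        weighted[pair] += w_author
    return weighted

user = [("desktop background", "help:wallpaper")] * 3
author = [("desktop background", "help:wallpaper"), ("printer setup", "help:printer")]
weights = combine_annotated_data(user, author)
# Three user pairs at weight 1.0 plus one author pair at weight 2.0 gives
# the wallpaper pair a combined weight of 5.0.
```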
The system 800 provides an Application Programming Interface (API) 830 for a user interface (UI) component 840 and a command tool 850 for building a runtime classifier using a specified training set and saving the generated model into the Model Store 820. The system 800 shows the control flow and data flow inside a Model Builder component 860 and its interaction with other components. The Model Builder 860 processes a set of parameters defining the source of training data, then decides where and how to extract the training data. For end-user annotated queries from the Relevance Mart 810, its Data Reader extracts the raw data, and then the Event Constructor converts the raw data into events in the following format, as requested by the NaiveBayes classifier trainer: Asset_ID; Frequency; and Features.
Typically, features include query string terms; however, other types of features can be added. An event list 864 is passed to a NaiveBayes classifier trainer 870 (SparseNB) to generate a runtime classifier. A Data Writer 874 stores the generated classifier model to the Model Store 820 together with meta data information. The API 830 includes the following parameters: Data source: 3 possible values: user annotated queries, author annotated queries, or both; Catalog: a catalog for training the classifier; Date range: start date time and end date time for selecting training data; and Minimum prediction confidence. An event generator 880 converts raw data from a data reader 890. This includes converting to lower case (some cultures only) and phrase matching at the client side, as well as word breaking, stemming, query expansion, statistical spell checking, and noise-word removal at the server side, for example.
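The query-processing steps just listed (lower-casing, word breaking, stemming, noise-word removal) and the trainer's event format might be sketched as a toy event generator; the noise-word list and stemming rules are deliberately naive stand-ins for the real word breakers, stemmers, and spell checkers.

```python
NOISE_WORDS = {"a", "an", "the", "to", "how", "do", "i", "my"}  # assumed list

def normalize_query(query):
    """Lower-case, word-break, drop noise words, and crudely stem a query."""
    terms = []
    for token in query.lower().split():
        if token in NOISE_WORDS:
            continue
        if token.endswith("ing") and len(token) > 5:
            token = token[:-3]         # naive stemming: "printing" -> "print"
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]         # "printers" -> "printer"
        terms.append(token)
    return terms

def make_event(asset_id, frequency, query):
    """Emit one trainer event in the (Asset_ID; Frequency; Features) format."""
    return {"Asset_ID": asset_id, "Frequency": frequency,
            "Features": normalize_query(query)}
```

For example, make_event("help:printer", 12, "How do I fix my printers") keeps only the stemmed content terms "fix" and "printer" as features.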
With reference to
The system bus 918 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 8-bit bus, Industrial Standard Architecture (ISA), Micro-Channel Architecture (MSA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).
The system memory 916 includes volatile memory 920 and nonvolatile memory 922. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 912, such as during start-up, is stored in nonvolatile memory 922. By way of illustration, and not limitation, nonvolatile memory 922 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 920 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
Computer 912 also includes removable/non-removable, volatile/non-volatile computer storage media.
It is to be appreciated that
A user enters commands or information into the computer 912 through input device(s) 936. Input devices 936 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 914 through the system bus 918 via interface port(s) 938. Interface port(s) 938 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 940 use some of the same type of ports as input device(s) 936. Thus, for example, a USB port may be used to provide input to computer 912, and to output information from computer 912 to an output device 940. Output adapter 942 is provided to illustrate that there are some output devices 940 like monitors, speakers, and printers, among other output devices 940, that require special adapters. The output adapters 942 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 940 and the system bus 918. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 944.
Computer 912 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 944. The remote computer(s) 944 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 912. For purposes of brevity, only a memory storage device 946 is illustrated with remote computer(s) 944. Remote computer(s) 944 is logically connected to computer 912 through a network interface 948 and then physically connected via communication connection 950. Network interface 948 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).
Communication connection(s) 950 refers to the hardware/software employed to connect the network interface 948 to the bus 918. While communication connection 950 is shown for illustrative clarity inside computer 912, it can also be external to computer 912. The hardware/software necessary for connection to the network interface 948 includes, for exemplary purposes only, internal and external technologies such as, modems including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.
What has been described above includes examples of the subject invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the subject invention, but one of ordinary skill in the art may recognize that many further combinations and permutations of the subject invention are possible. Accordingly, the subject invention is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.