The search engine has become an indispensable tool for users seeking information from the World Wide Web (Web) or other databases. Maximizing user satisfaction with the search results returned in response to a search query is an important goal for any search engine. Understanding the intent behind a user's query, retrieving search results according to this intent, and organizing search result pages well can help a search engine improve user satisfaction. By discovering possible search intents (the intent or intention of the user when initiating a search) and associating these intents with a search query, search results can be improved.
Most users tend to submit short search queries. Sometimes users use short queries because they do not know how to describe what they want to know. Other times users enter short queries because they are broadly interested in a subject and are willing to browse related information. It is hard for a search engine to discern the intent of a user, especially for short queries.
Sometimes the user's intent can be manually inferred by a human being with prior knowledge of the subject being searched. Existing search engines usually define search intents manually, such as “travel” or “person name”, and then classify queries into those predefined intents. This is called query-to-intent classification. This kind of approach is limited by the breadth of the intents that editors define manually. For example, one search intent corresponding to a general concept, like “travel”, may cover a large number of queries but lose some specific aspects of a particular query, say “bellagio casino”, which should be precisely associated with an accommodation intent. Defining many specific intents, however, involves much human effort and significantly increases the difficulty of classifying queries into those intents. Machine learning of a user's search intent can also be challenging. This is particularly true for short queries because the information inferable from a short query is very limited.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The automatic search intent mining technique described herein automatically mines search intents for a group of queries. In one embodiment, the technique is based on the assumption that a group of queries may share some common intents which can be automatically extracted. The technique leverages knowledge of query log data in order to determine search intent. Query log data is usually collected by search engine companies and includes recorded historical queries submitted to a search engine by one or more users, together with the associated search results. A query log typically consists of a sequence of search actions, one per user query, each describing the following information: 1) the terms that compose the query, 2) the documents returned by the search engine, 3) the documents that have been clicked, 4) the rank of those documents in the list of search results (usually based on relevancy), 5) the date and time the search action/click took place, and 6) an anonymous identifier for each session, among other data.
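By way of illustration only, one search action of such a query log might be represented as shown in the following minimal sketch. The class name `SearchAction`, its field names, and the sample values are assumptions made for this example and are not taken from the disclosure; an actual query log may store additional or different data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class SearchAction:
    """One query-log entry; all names here are illustrative assumptions."""
    query_terms: List[str]      # 1) terms that compose the query
    returned_docs: List[str]    # 2) documents returned, in rank order
    clicked_docs: List[str]     # 3) documents that were clicked
    timestamp: datetime         # 5) when the search action took place
    session_id: str             # 6) anonymous session identifier

    def clicked_ranks(self) -> List[int]:
        # 4) 1-based rank of each clicked document in the result list
        return [self.returned_docs.index(d) + 1 for d in self.clicked_docs]


# Made-up sample entry, for illustration only.
action = SearchAction(
    query_terms=["bellagio", "casino"],
    returned_docs=["doc_a", "doc_b", "doc_c"],
    clicked_docs=["doc_c", "doc_a"],
    timestamp=datetime(2009, 1, 1, 12, 0, 0),
    session_id="session-001",
)
print(action.clicked_ranks())  # [3, 1]
```

In this sketch the rank of a clicked document is derived from its position in the returned list rather than stored separately; a log that records ranks directly would work equally well.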
The automatic search intent mining technique, in one embodiment, utilizes three kinds of information sources to mine intents for a group of queries: Web page content, Web page structure, and search engine query log data. In one embodiment of the technique, the three data sources are used separately to mine candidate search intents from each source. The candidate search intents extracted from each of the three sources are then integrated to form the final search intents. These search intents can be used to obtain better search results for subsequent queries.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of the automatic search intent mining technique, reference is made to the accompanying drawings, which form a part thereof, and which show by way of illustration examples by which the automatic search intent mining technique described herein may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
The following sections provide an overview of the automatic search intent mining technique, as well as exemplary processes and an architecture for employing the technique.
Once obtained, the three types of search intent candidates can then be integrated and common search intent candidates can be selected as the final search intents. These search intents can then, for example, be used to provide the user with additional or alternative search results or to focus a user's searching. Alternatively, the final search intents can be used to discover what subject matter users are searching for and to use such information to embed key phrases in Websites to attract users.
Thus, by discovering related information and related queries from the search query logs, the automatic search intent mining technique is able to leverage the knowledge of search engine users who have submitted these queries to help understand the input query.
It should be noted that although the technique can operate in fully automatic mode, it can also be used in a semi-automatic mode in one embodiment. For example, it can be employed with human editors to verify result quality. In this case, once the technique obtains a ranked list of intent candidates by applying the automatic search intent mining technique, human judges can be asked to check the candidates and to remove noisy/duplicate candidates or to add/delete some words from candidate phrases.
As shown in block 102, a group of search queries and associated query logs are input. A first set of search intent candidates for the group of search queries is mined by using Web page content of search results returned in response to the search queries, as shown in block 104. This generally involves extracting common concepts from the Web page content related to the group of input queries, and will be discussed in greater detail with respect to
An overview of one exemplary embodiment of the technique having been provided, additional details regarding the automatic search intent mining technique will be provided in the following paragraphs.
In one embodiment of the automatic search intent mining technique, mining intents using Web page content involves extracting common concepts from the content related to the queries in a group.
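The following sketch illustrates one possible way of extracting common concepts from Web page content. It is not the disclosed implementation. It assumes that the text of the pages returned for each query is already available, treats one- and two-word phrases as concepts, and ranks a phrase by the number of queries in the group whose pages contain it. The function names, the small stopword list, the `min_queries` threshold, and the step that discards phrases repeating the query's own words are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Set, Tuple

# Tiny illustrative stopword list (an assumption, not from the disclosure).
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to"}


def phrases(text: str, max_len: int = 2) -> Set[str]:
    """Return the word n-grams (n <= max_len) that contain no stopword."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOPWORDS for w in gram):
                found.add(" ".join(gram))
    return found


def common_concepts(pages_by_query: Dict[str, List[str]],
                    min_queries: int = 2) -> List[Tuple[str, int]]:
    """Rank phrases by the number of queries whose pages contain them."""
    counts = Counter()
    for query, pages in pages_by_query.items():
        seen = set()
        for page in pages:
            seen |= phrases(page)
        # Discard phrases made up only of the query's own words.
        query_words = set(query.lower().split())
        seen = {p for p in seen if not set(p.split()) <= query_words}
        counts.update(seen)  # each phrase counted once per query
    return [(p, c) for p, c in counts.most_common() if c >= min_queries]


# Made-up page text, for illustration only.
pages_by_query = {
    "bellagio": ["Book hotel rooms at the Bellagio"],
    "venetian": ["Venetian hotel rooms and suites"],
    "mgm grand": ["MGM Grand hotel deals"],
}
content_candidates = common_concepts(pages_by_query)
print(content_candidates[0])  # ('hotel', 3)
```

Counting each phrase once per query, rather than once per occurrence, keeps a single long page from dominating the ranking; other weighting schemes could be substituted.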
As shown in
In one embodiment of the automatic search intent mining technique, mining intents using Web page structure involves extracting common information from the Web pages related to queries in a group by using the HTML structure information of those pages.
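As a hedged illustration of how HTML structure information might be used, the sketch below collects the text found inside titles, headings, and links and keeps the strings that occur for more than one query in the group. The choice of tags, the class and function names, and the `min_queries` threshold are assumptions made for this example; the disclosure does not specify which structural elements are examined.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List, Tuple

# Tags treated as structurally significant (an illustrative assumption).
STRUCTURAL_TAGS = {"title", "h1", "h2", "h3", "a"}


class StructuralTextParser(HTMLParser):
    """Collect the text found inside titles, headings, and links."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a structural tag
        self.buffer = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in STRUCTURAL_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Normalize whitespace and case before storing the text.
                text = " ".join("".join(self.buffer).split()).lower()
                if text:
                    self.texts.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def structural_candidates(html_by_query: Dict[str, List[str]],
                          min_queries: int = 2) -> List[Tuple[str, int]]:
    """Rank structural strings by the number of queries they occur for."""
    counts = Counter()
    for pages in html_by_query.values():
        seen = set()
        for html in pages:
            parser = StructuralTextParser()
            parser.feed(html)
            parser.close()
            seen.update(parser.texts)
        counts.update(seen)  # each string counted once per query
    return [(t, c) for t, c in counts.most_common() if c >= min_queries]


# Made-up pages, for illustration only.
html_by_query = {
    "bellagio": ["<html><title>Bellagio</title>"
                 "<h2>Reservations</h2><p>Rooms</p></html>"],
    "venetian": ["<ul><li><a href='/r'>Reservations</a></li>"
                 "<li><a href='/d'>Dining</a></li></ul>"],
}
structure_candidates = structural_candidates(html_by_query)
print(structure_candidates)  # [('reservations', 2)]
```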
As shown in
In one embodiment of the automatic search intent mining technique, mining intents using search log data involves extracting common queries or sub-queries from a search query log which are related to the queries in a group.
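One simple way to picture this step is sketched below, under stated assumptions. A logged query is treated as related to an input query when it contains the input query as a contiguous sequence of words, and the remaining words are taken as the candidate. Both this relatedness test and the names `remainder`, `log_candidates`, and `min_queries` are assumptions made for illustration; the technique may relate logged queries to input queries in other ways, for example through shared clicks or sessions.

```python
from collections import Counter
from typing import List, Optional, Tuple


def remainder(logged: str, query: str) -> Optional[str]:
    """Return the words of `logged` left after removing `query`, or None."""
    lw, qw = logged.lower().split(), query.lower().split()
    for i in range(len(lw) - len(qw) + 1):
        if lw[i:i + len(qw)] == qw:
            rest = lw[:i] + lw[i + len(qw):]
            return " ".join(rest) if rest else None
    return None


def log_candidates(queries: List[str], logged_queries: List[str],
                   min_queries: int = 2) -> List[Tuple[str, int]]:
    """Rank remainders by the number of input queries they occur with."""
    counts = Counter()
    for query in queries:
        seen = set()
        for logged in logged_queries:
            rest = remainder(logged, query)
            if rest is not None:
                seen.add(rest)
        counts.update(seen)  # each remainder counted once per input query
    return [(r, c) for r, c in counts.most_common() if c >= min_queries]


# Made-up log entries, for illustration only.
logged_queries = [
    "bellagio hotel",
    "bellagio show tickets",
    "venetian hotel",
    "cheap venetian hotel",
    "mgm grand",
]
usage_candidates = log_candidates(["bellagio", "venetian"], logged_queries)
print(usage_candidates)  # [('hotel', 2)]
```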
In one embodiment of the automatic search intent mining technique, the technique integrates the candidate intents from all of the information/data sources discussed above and extracts the most common search intent candidates (e.g., key phrases) as the final intents of the queries in the group. One embodiment of the technique integrates all of the intent candidates obtained from the aforementioned three data/information sources based on frequency. In addition, the technique can associate a different frequency weight with each source. For example, the technique can give a weight of 2 to the candidates mined from Web page content, which means that if one candidate occurs 1 time in the Web content candidate set, the technique treats it as occurring 1*2=2 times while performing the integration. In one embodiment, by default, the technique assigns a weight of 1 to all of the candidates from all sources. In one embodiment, greater weight is given to sources that are more trusted.
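The weighted, frequency-based integration described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `integrate`, the source labels, the `top_k` cutoff, and the sample candidates are assumptions. Each occurrence of a candidate contributes the weight of its source, and any source without an explicit weight defaults to 1.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def integrate(candidates_by_source: Dict[str, List[str]],
              weights: Optional[Dict[str, float]] = None,
              top_k: int = 3) -> List[Tuple[str, float]]:
    """Sum weighted occurrences per candidate and return the top_k."""
    weights = weights or {}
    scores = Counter()
    for source, candidates in candidates_by_source.items():
        weight = weights.get(source, 1)  # default weight of 1 per source
        for candidate in candidates:
            scores[candidate] += weight
    return scores.most_common(top_k)


# Made-up candidate sets, for illustration only.
sources = {
    "content": ["hotel", "casino"],
    "structure": ["reservations", "hotel"],
    "query_log": ["hotel", "reservations"],
}

default_result = integrate(sources)
print(default_result)  # [('hotel', 3), ('reservations', 2), ('casino', 1)]

# With weight 2 for Web page content, "casino" (1 occurrence) counts as 1*2=2.
weighted_result = dict(integrate(sources, weights={"content": 2}))
print(weighted_result["casino"])  # 2
```

Raising the weight of a trusted source lets a candidate seen only in that source compete with candidates seen in several less trusted sources.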
It should be noted that while selecting search intent candidates from all sources generally yields better results, it is possible to select search intent candidates from only two sources, or even one source. The results depend on the quality of the different data sources as well as on the input queries. Additionally, even though only three specific information sources are discussed herein, one of ordinary skill in the art will realize that other types of information sources could also be integrated with the information sources discussed here to find the final search intents.
As shown in block 502, a group of queries and associated search query log data is input. Then, as shown in block 504, search intent candidates are extracted from at least one of search result content, search result structure, and search result usage data. For example, extracting a set of search intent candidates by using Web page content for search results returned in response to the group of search queries generally involves extracting common concepts from the Web page content related to the group of input queries. Similarly, mining search intent candidates for the group of search queries by using Web page structure for search results returned in response to the group of search queries generally involves extracting common information from the Web pages related to the group of input queries by using the HTML structure information of those pages. Additionally, mining search intent candidates for the group of search queries by using usage data from the query log data generally involves extracting common queries or sub-queries from the search query log which are related to the queries in the input group of queries. The candidate search intents extracted from any of the sources may be integrated to form a set of integrated search intent candidates, as shown in block 506. The most common search intent candidates are then extracted from the integrated search intent candidates as the final search intents (block 508). These final search intents can be used, for example, to assist in obtaining better search results for subsequent queries or to gather data on what users are searching for.
The automatic search intent mining technique is designed to operate in a computing environment. The following description is intended to provide a brief, general description of a suitable computing environment in which the automatic search intent mining technique can be implemented. The technique is operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices (for example, media players, notebook computers, cellular phones, personal data assistants, voice recorders), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Device 700 also can contain communications connection(s) 712 that allow the device to communicate with other devices and networks. Communications connection(s) 712 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Device 700 has a display device 722 and may have various input device(s) 714 such as a keyboard, mouse, pen, camera, touch input device, and so on. Output device(s) 716 such as a display, speakers, a printer, and so on may also be included. All of these devices are well known in the art and need not be discussed at length here.
The automatic search intent mining technique may be described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and so on, that perform particular tasks or implement particular abstract data types. The automatic search intent mining technique may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should also be noted that any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.