Web search engine providers are in a competitive business. To satisfy end-users, so as to gain new users and keep existing users from switching to a competing search engine, search providers need to continuously improve the quality of the search results returned to users.
To improve the quality of search results, search engine providers experiment with many new ideas to see which ones are effective. This includes devising new ways to re-rank the results of a specific query set, and new ways to build indexes, e.g., via new features. To experiment with a new idea, small datasets are typically used. However, in practice, many ideas that appear promising on small datasets, and/or in re-ranking experiments, turn out not to be effective in the production system, and indeed may make the quality of the search results worse.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which search-related experiments may be run on a full or partial snapshot copy of search engine data used in an actual production system. In one implementation, an experimental repository maintains page content and metadata synchronized from the actual data used by a production search engine. A metadata extraction subsystem processes documents and extracts features based upon the page content and metadata of the experimental repository to provide offline data. A snapshot experimentation subsystem runs experimental code related to web searches on the offline data, including to run experimental index building code to build an experimental index, and/or to run experimental search-related code, such as to rank search results according to experimental ranking code, to implement an experimental search strategy, and/or to generate experimental captions.
In one implementation, multiple users may run experiments simultaneously as scheduled by a master component of the snapshot experimentation subsystem. A front end component allows user interaction with the snapshot experimentation subsystem, e.g., to submit queries and obtain search results. In general, the use of multiple index units allows users to quickly apply newly extracted features into the index, e.g., by only needing to re-build part of the index.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards an experimental search system for facilitating large-scale, end-to-end search experiments which generate reliable experimentation results. As will be understood, the experimental search system provides reliable experiments so that search improvements can be reproduced in a production system, along with facilitating straightforward and flexible experimentation with respect to implementing experimental logic and/or performing various types of experiments. Further, in one implementation, multiple users can simultaneously and efficiently perform independent experiments without interfering with each other, although when desired, users can also share the resources (data and experimental logic) of other users. Also described are monitoring and debugging facilities, by which experimenters are able to monitor the progress of their experiments and obtain information from the system to determine the reasons for any errors.
It should be understood that any of the examples herein are non-limiting. Indeed, as one example, a particular implementation having various components and interfaces is described, however this is only one example. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and search technology in general.
Turning to
At another level shown in
With respect to offline experimentation, types of experiments that can be run include snapshot experiments and re-ranking experiments. In snapshot experiments, a snapshot of the crawled web pages is indexed, and an experimental search engine is built. In the experimental search engine, a query can be processed by the same process as in the production system (with the exception that the data/web pages are less fresh). This allows the effectiveness of experimental feature usage to be tested accurately. Re-ranking experiments refer to re-ordering, via a ranking function, the top-k results returned by a search engine for a list of queries, where k may be a relatively small number (e.g., 5, 10, 100) or a relatively large number (e.g., 1,000, 4,000, 10,000).
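By way of a non-limiting example, the following Python sketch illustrates the general shape of a re-ranking experiment; the baseline_search and experimental_score callables are hypothetical stand-ins for the baseline retrieval interface and the ranking function under test, and are not part of any actual system interface described herein.

```python
def rerank_top_k(query, k, baseline_search, experimental_score):
    """Fetch the baseline top-k results for a query and re-order them
    with an experimental ranking function."""
    results = baseline_search(query, k)          # assumed: list of (url, features)
    rescored = sorted(results,
                      key=lambda r: experimental_score(query, r[1]),
                      reverse=True)              # highest experimental score first
    return [url for url, _ in rescored]
```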
The WebStudio 102 synchronizes web pages from the document processing cluster 108 of the production pipeline and may maintain multiple page snapshots. Experimental users (e.g., developers or researchers) choose a snapshot and build an end-to-end search engine via WebStudio 102 for that snapshot. Users can customize major operations including document parsing, page classification, index building, index serving, and front-end processing in the end-to-end search engine, by adding their own experimental logic for testing ideas. Any non-experimental logic may be reused in conjunction with the experimental logic. As described below, as an end-to-end search system, the WebStudio 102 provides search interfaces, by which external systems (including people) can input queries and get search results.
By way of example, a re-ranking experimentation system 118 is able to input a query list and get search results from the WebStudio 102. These results may be evaluated, such as against test data, to see if the experimental re-ranking improved the quality of the search results.
The arrows labeled (1), (2), and (3) in
Turning to additional details, in one implementation of the WebStudio 102 as represented in
The metadata extraction (cluster) subsystem 222 extracts in-band and out-of-band document features, which are used in the index building and ranking experiments of the snapshot experimentation subsystem 224. In one implementation, the experimental repository subsystem 220 is the repository for storing and managing multiple web snapshots. The metadata extraction subsystem 222 is generally in charge of in-band document processing 230 (document parsing and classification), and out-of-band feature extraction 232 (including static-rank computation).
To perform these operations, an execution engine 234 of the metadata extraction cluster subsystem 222 may retrieve specific rows and columns from the data store 228 of the experimental repository 220, perform the processing, and write the extracted features (e.g., as new columns) back into the data store 228 of the experimental repository 220 through a snapshot management module.
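A minimal sketch of this read-process-write cycle follows; the store object and its read_rows/write_column methods are assumptions made for illustration, not the actual snapshot-management interface.

```python
def extract_feature(store, snapshot, input_columns, feature_fn, output_column):
    """Read selected columns for each page in a snapshot, compute a
    derived feature, and write it back as a new column."""
    for row_key, values in store.read_rows(snapshot, columns=input_columns):
        feature = feature_fn(values)             # e.g., a static-rank score
        store.write_column(snapshot, row_key, output_column, feature)
```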
The operations that are related to indexing and ranking are performed in the snapshot experimentation subsystem 224. Note that a feature transformation operation may be performed (block 240) to transform a document feature into a format (e.g. indexable document format) which can be directly indexed by the index builder 242. Feature transformation may be treated as a preprocessing step for index building.
The snapshot experimentation subsystem 224 allows users to run index building 242, index serving 244, and front-end processing tasks 246. The snapshot experimentation subsystem synchronizes snapshots from the experimental repository subsystem 220 and enables users (e.g., the WebStudio client 250) to perform indexing and ranking experiments based on the snapshots. A snapshot may be a full web snapshot, e.g., containing all the web pages indexed by the production system (e.g., on the order of twenty billion web pages), or possibly all the pages that have ever been in the document processing cluster, or a partial snapshot, e.g., containing some subset of web pages.
In performing experiments to test the effectiveness of a new feature, experimenters may first start an index building task (block 242), and a new index unit (described below) is generated from the pages in the snapshot. Then an index serving task 244 and a front-end task 246 are performed, from which new ranking results are retrieved and the effectiveness of the new feature is evaluated. An experimenter can also directly start from the index serving task 244 and a front-end task 246. For example, this may be done to perform ranking experiments without changing the index's data. As another example, an experimenter can use the existing generated index units (self-generated or shared from another experimenter) to compose a new experiment.
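The overall flow of such a feature experiment may be sketched as follows; the studio client and its task methods (build_index_unit, serve, evaluate) are hypothetical names for the tasks described above, not an actual API.

```python
def run_feature_experiment(studio, snapshot, feature_dll, queries, judgments):
    """Build an index unit with the experimental feature, serve it behind
    a front end, query it, and score the results."""
    unit = studio.build_index_unit(snapshot, feature_dll)  # index building task
    engine = studio.serve([unit])                          # index serving + front-end tasks
    results = {q: engine.search(q, k=10) for q in queries}
    return studio.evaluate(results, judgments)             # e.g., scored against test data
```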
In one implementation, the snapshot experimentation subsystem also hosts data probing and analysis tools for facilitating analysis and diagnosis.
Note that the amounts of data being processed are extremely large, and the experimental repository, metadata extraction, and snapshot experimentation subsystems are virtual clusters; they are not necessarily physical clusters. Different deployment options provide different possibilities for mapping subsystems to physical clusters. Factors to consider when choosing a deployment include system maintainability, performance (including time for synchronizing snapshots, index building speed, and query processing speed), and cost (e.g., based upon the number of machines required, inter-cluster and intra-cluster bandwidth requirements, and disk and memory requirements). Note that different deployment options may have different requirements for the number of machines (and also CPU, memory, and disk) in each physical cluster.
For example, in one example deployment option, two large physical clusters may be provided, namely an experimental repository-main cluster and a snapshot experimentation-main cluster. In this example, the metadata extraction cluster subsystem is located on the same physical cluster (on the order of thousands of machines, such as 2,000) with the experimental repository subsystem, which enables metadata extraction to be performed on the machine/server cluster where the data is stored. The full snapshot experimental subsystem may be a generally similar number of machines (e.g., 1,500). In addition, there may be smaller snapshot experimentation clusters for performing experiments based on partial snapshots (e.g., two such clusters of 50 machines each).
An alternative example deployment option is to have only one large physical cluster (e.g., comprising 3,500 machines), in which the metadata extraction cluster, experimental repository, and snapshot experimentation subsystems are located on the same physical cluster. Similar to the other deployment option, two small snapshot experimentation clusters (e.g., two such clusters of 50 machines each) may be provided to synchronize partial snapshots from the large physical cluster to facilitate experiments on a subset of data.
As can be readily appreciated, one advantage of a single physical cluster over multiple physical clusters is that the data transfer from the experimental repository to the snapshot experimentation subsystem is avoided. However, the performance of index building and query processing experiments may be reduced by the snapshot management operations performed on the same cluster. Thus, a consideration when choosing a deployment may be how many experiments are to be performed on full snapshots.
As described above, the experimental repository 220 stores the document contents and metadata used for doing offline experiments, as continuously or periodically synchronized from the online document processing cluster 108 in the production system. Data from other sources (e.g. query logs, manually crawled pages, evaluation results, and so forth) also may be included. Note that multiple snapshots of the web may be stored, because different experiments may need to retrieve different data snapshots, for example.
Interfaces of the experimental repository include those shown in
Output interfaces provide for outputting a full snapshot, e.g., all rows (web pages, queries, and so forth) and columns (metadata) of a specific snapshot. Another output allows a caller to retrieve rows and columns according to specific filtering conditions and/or a list of URLs; by way of example, one caller may ask to retrieve all English web pages in a “Dec2008” snapshot, while another caller may ask to retrieve a random sample of 100 million product web pages from a “Sep2008” snapshot.
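As a non-limiting illustration of such filtering conditions, the following sketch models a filter object covering the two examples above; the field names are assumptions made for illustration and do not reflect the actual repository interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SnapshotFilter:
    snapshot: str                                  # e.g., "Dec2008"
    language: Optional[str] = None                 # e.g., "en"
    page_class: Optional[str] = None               # e.g., "product"
    sample_size: Optional[int] = None              # random sample; None means all rows
    columns: List[str] = field(default_factory=list)  # metadata columns to return

# All English web pages in the "Dec2008" snapshot:
english_pages = SnapshotFilter(snapshot="Dec2008", language="en")

# A random sample of 100 million product pages from "Sep2008":
product_sample = SnapshotFilter(snapshot="Sep2008", page_class="product",
                                sample_size=100_000_000)
```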
The metadata extraction subsystem 222 generates derived features for documents, sites, and/or queries. One such metadata extraction subsystem 222 may be dedicated to the generation of one feature (e.g., static-rank) or shared by multiple feature extraction tasks. The metadata extraction subsystem 222 may perform tasks such as HTML parsing, word breaking, site-map building, (advanced) anchor text extraction, pornographic content detection, product page classification, and so forth. Note however that many of these operations are also supported on the snapshot experimentation cluster, because their corresponding features can be extracted in an in-band way. This provides flexibility for user experiments.
The output interface provides callers with the extracted features. Other interfaces may be used to provide debugging information, e.g., by which users can determine the reason for a task failure, and to provide status and progress information of tasks (e.g. the number of pages processed).
In one implementation represented in
Example interfaces of the snapshot experimentation subsystem 224 are represented in
Other interfaces provide an interactive interface for search results, e.g., a search user interface that gets the top-k results given a list of queries (note that k can be arbitrarily set). Alternatively, batch processing may be performed via another interface, to provide a list of queries and get results for that list.
Still other interfaces are related to probing information for diagnosis, e.g., the anchor text of a page, the word-breaking results of a query, and so forth, and debugging information, e.g., when an indexing process fails, users can determine the reason. Status and progress data are likewise available via an interface in this example implementation.
A user (e.g., represented by the client 770) adds an experimental feature into the index by building a DLL (dynamic-link library), and submitting the DLL into the snapshot experimentation cluster via a snapshot experimentation client (e.g., the client 770). The user then attempts to invoke a customized index building service with the submitted DLL. Note that other users (e.g., represented by the client 772) may also submit their own DLLs and ask to initiate their index-building tasks respectively.
A snapshot experimentation master component (e.g., a machine) 774 receives the requests and determines how to schedule them. For example, it may decide to run two or more different users' tasks simultaneously, and hold another user's task (or multiple other tasks) in a pending state. The master component 774 sends the job scheduling information to the index building and serving nodes 776.
After receiving the job scheduling commands from the master component 774, each index building and serving node may first start the two simultaneous tasks in this example. After these tasks are accomplished, the task that was pending starts to run. Each user can view the execution status of their jobs via their respective client. Other schedules are feasible.
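A minimal sketch of this scheduling behavior follows, assuming a simple fixed-slot policy and hypothetical task objects with run() methods and name attributes; the master component's actual scheduling policy is not limited to this example.

```python
from concurrent.futures import ThreadPoolExecutor

def schedule_tasks(tasks, max_concurrent=2):
    """Run up to max_concurrent index-building tasks at once; remaining
    tasks stay pending until a slot frees up."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(t.run): t for t in tasks}
        return {t.name: f.result() for f, t in futures.items()}
```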
Via a client (represented by 772), the user may communicate with a front end component 780 that performs web user interface functions, as well as search results aggregation and caching. Users can also insert experimentation logic (via DLLs) into the snapshot experimentation subsystem, and start an index building task and/or start and stop a ranking service. Users may also monitor the status of index building tasks or ranking services.
To this end, a baseline index builder 882 builds a base unit U0, and a flexible indexer 884 builds the incremental index units U1-Um based upon user submitted DLLs 886, e.g., that specify which columns to use, a new feature set, and so forth. The flexible indexer 884 thus allows each user to experiment with index building relative to the base index, e.g., via one or more new features.
For ranking and other search-related experiments, a flexible ranker 888 allows user submitted DLLs 890 to perform the ranking, (although there may not be any such DLLs in a given experiment that only varies index building). For example, a user may submit an experimental ranking function with a new way to compute document scores, and/or a new search strategy for finding relevant pages. A new caption generator may similarly be uploaded for experimenting with its results. In this manner, different ranking experiments may be conducted instead of or in addition to index building experiments. A baseline retrieval engine 892 allows access to the page contents and metadata 880 and the baseline index unit U0 to provide search results for comparing against the experimental search results provided via the ranker 888.
The ranking module can load several index units together and make the index units that are served together appear to be one large index. It is transparent to users how the data is stored, e.g., whether in one index unit or in multiple index units. The flexible ranker provides interfaces for a user's ranking function to access index data. The served index units are generated from the same data set (web snapshot); however, the index units can be generated from different fields or features of the web pages. For example, the base unit may be generated from the full content of web pages, with other index units generated from the anchor text, click boost, and so forth. Index units can also be generated by different experimenters at different times. The flexible ranker automatically assembles these index units to form a virtual large index.
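By way of example, the following sketch shows how several index units might be served as one virtual index; the term-to-posting-list representation is an assumption made for illustration, not the actual index format.

```python
class VirtualIndex:
    """Serve several index units (e.g., base content, anchor text,
    click boost) as if they were one large index."""
    def __init__(self, units):
        self.units = units                       # each unit: term -> posting list

    def postings(self, term):
        """Merge the posting lists for a term across all served units,
        so a ranking function sees a single index."""
        merged = []
        for unit in self.units:
            merged.extend(unit.get(term, []))    # e.g., (doc_id, payload) entries
        return merged
```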
The flexible ranker also allows users to replace a feature in one index unit with a newly extracted feature in another index unit. For example, one user may implement an advanced anchor text extraction algorithm, extract a new "anchor text" feature, and build an index unit for the new feature. The user can then configure the flexible ranker to use the new anchor text in the newly generated index unit in place of the one in the original index unit (for example, the base index unit). Index units thus enable the experimenter to quickly test an idea. Further, because search data is very large, it otherwise takes a significant amount of time and resources to rebuild an index to add new data or extract new features. With the flexible ranker, the experimenter need only build the new data/features into a small index unit and let the flexible ranker combine (add or replace) it with the original index unit (e.g., the base index unit). The experimenter can also use others' index units to perform experiments.
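The replacement behavior may be sketched as follows, where the per-document feature lookup is an assumed interface made for illustration: a lookup prefers the overriding unit (e.g., the new anchor text unit) and falls back to the base unit.

```python
class FeatureOverlay:
    """Resolve a per-document feature, preferring a newly built index
    unit over the base unit."""
    def __init__(self, base_unit, overrides):
        self.base = base_unit                    # assumed: doc_id -> {feature: value}
        self.overrides = overrides               # feature name -> overriding unit

    def feature(self, doc_id, name):
        unit = self.overrides.get(name, self.base)
        return unit.get(doc_id, {}).get(name)
```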
As can be seen, there is provided a technology for conducting end-to-end experiments based on a full web document (web page) snapshot, along with mechanisms that allow experimenters to customize experimental search engines. This is in part accomplished via a flexible index building and serving mechanism that allows building/modifying an index incrementally and efficiently, e.g., by splitting the index into multiple index units that can each be built/modified independently, greatly saving index building time. The retrieval engine can use one or more index units to produce a combined experimentation search engine.
Experimenters can share their experiments with others, including index units and related code, in order to enable easy collaboration on search engine experimentation. Further, experimenters can insert DLLs to implement and add various types of experimentation logic into the system.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 910 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 910 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 910. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 930 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 931 and random access memory (RAM) 932. A basic input/output system 933 (BIOS), containing the basic routines that help to transfer information between elements within computer 910, such as during start-up, is typically stored in ROM 931. RAM 932 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 920. By way of example, and not limitation,
The computer 910 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 910 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 980. The remote computer 980 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 910, although only a memory storage device 981 has been illustrated in
When used in a LAN networking environment, the computer 910 is connected to the LAN 971 through a network interface or adapter 970. When used in a WAN networking environment, the computer 910 typically includes a modem 972 or other means for establishing communications over the WAN 973, such as the Internet. The modem 972, which may be internal or external, may be connected to the system bus 921 via the user input interface 960 or other appropriate mechanism. A wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 910, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 999 (e.g., for auxiliary display of content) may be connected via the user interface 960 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 999 may be connected to the modem 972 and/or network interface 970 to allow communication between these systems while the main processing unit 920 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.
The present application is related to United States patent application Ser. No. ______ (attorney docket no. 327768.01), entitled “Flexible Indexing and Ranking for Search,” filed concurrently herewith and hereby incorporated by reference.