This invention relates in general to the field of string association. More particularly, this invention relates to finding associations between short text strings.
There are a number of applications where short text strings need to be conceptually linked to (or mapped to) other short text strings. For example, in classifier training, there is a need to associate queries from a query log to tasks or intent descriptions. In search situations, it may be desirable to associate additional metadata with search terms. If the strings to be matched are sufficiently long, word overlaps between the strings could be used to determine if they are related. However, if the strings are short, it can be very difficult to recognize possible relationships or associations needed to create a mapping between the strings. This is a result of insufficient information contained in the strings themselves, through which associations can be recognized and mappings can be created.
Previously, human annotators, skilled in the relevant technical field, have been used to create the mappings between the strings. This can be a slow and labor intensive process. In classifier training, for example, human annotators, for each given task, manually select queries that they find related to the task. Given that there may exist hundreds of tasks and thousands of queries, it is difficult for annotators to keep all the tasks and queries in mind and to do a consistent job of annotation. In addition, because of human cognitive limitations, the process can be error-prone and inconsistent. In order to reduce error, multiple annotators can work on the same query to task mapping. However, given the complexity of the field and the level of knowledge required by the annotators, the use of multiple human annotators can be very expensive.
In view of the foregoing, there is a need for systems and methods that overcome the limitations and drawbacks of the prior art.
A semi-automated system is used to generate candidate mappings between two sets of short strings, which can then be reviewed by annotators. A sufficiently large set of files, preferably related to the two sets of strings, is chosen. Each string from the two sets of strings is searched for in the large set of files. Each file that matches a string is presumed to be related to that string, and can provide additional information and context about the string that is used to generate the candidate mappings between the two sets of strings. Specifically, any two strings that both match at least a certain number of the same files are presumed to be related, and are mapped together. These candidate mappings can then be checked by annotators.
Rather than having the annotators generate the candidate mappings, as in the prior art, the annotators may act as reviewers of the candidate mappings of the present invention. They do not have to keep in mind all the strings from each set; they need only verify whether the candidate mappings appear meaningful (i.e., are appropriate). This is a less error-prone and much faster process. Since the candidate mappings are generated automatically, they are far more consistent. Thus, annotating data in accordance with the present invention will be much cheaper and result in higher overall mapping quality. In addition, this method will work with strings in any language.
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments that proceeds with reference to the accompanying drawings.
Task 202 and query 101 are mapped to a set of text files, shown in
More particularly,
As shown at 120, query 101 is mapped to several files (represented as space 120) in search space 110. To determine the mapping, each file in search space 110 is desirably text searched for query 101. In order to text search a file, the file is desirably scanned or searched for occurrences of the word or term that query 101 represents. The text searching can be done using any system, method, or technique known in the art for searching files for text strings. Any file that results in a match is presumably related to query 101, and can provide further information regarding the meaning of query 101. A match can be an exact match; for example, the word or term appears exactly in the text of the file. The match can also be a partial match, where only part of the word or term is found in the file. In addition, more sophisticated searching methods can be used to find matches, such as considering common misspellings or morphological variants (e.g. ‘run’, ‘ran’, ‘running’ as alternates for ‘runs’) for the searched terms. Any system, technique, or method known in the art for matching text strings can be used.
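By way of illustration only, the exact, variant-aware, and partial matching described above might be sketched as follows. The function names `count_matches` and `partial_match`, the sample text, and the choice of whole-word, case-insensitive matching are assumptions made for this sketch; any text matching technique known in the art could be substituted.

```python
import re

def count_matches(text, term, variants=()):
    """Count whole-word, case-insensitive occurrences of term or any listed variant."""
    forms = [term, *variants]
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def partial_match(text, term):
    """Return True if at least one word of a (possibly multi-word) term appears in text."""
    words = set(re.findall(r"\w+", text.lower()))
    return any(w in words for w in re.findall(r"\w+", term.lower()))

# Hypothetical file content used only for this example.
doc = "She runs daily. Running clears the mind, and she ran a marathon."

exact = count_matches(doc, "runs")  # exact form only
with_variants = count_matches(doc, "runs", variants=("run", "ran", "running"))
```

In this sketch the variant list is supplied by the caller; a fuller implementation might derive it from a stemmer or a table of common misspellings.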
This information can then be used to generate a candidate mapping. The set of matching files is shown on
Whether or not a particular matched file is related to query 101 depends on both the size of the search space 110 and the relatedness of the search space 110 to the query. If a very large search space is chosen, for example, the internet, it is conceivable that a match could be found between any two text strings. If a search space is chosen that is too small, too few matches may be found. Therefore, it is critical that the search space 110 be chosen carefully.
One method for ensuring that a given match is meaningful and to reduce coincidental matches is to only consider matches that achieve a ranking above a certain user determined ranking. The ranking can be generated using any system, method or technique known in the art for ranking returned matches for a particular search term. For example, the user determined ranking is desirably some number dependent on, related to or otherwise representing the number of times a searched term must appear in a file before that term will be considered to match that file. This number can be determined through experimentation, and adjusted depending on the number of files in the search space 110, as well as the number of files matched for any given search term.
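As a minimal sketch of such a frequency-based ranking, the following keeps only those files in which a term occurs at least a user-determined number of times. The name `files_matching`, the parameter `min_count`, and the sample files are hypothetical, and simple case-insensitive substring counting stands in for whatever ranking the search system actually provides.

```python
def files_matching(term, files, min_count=1):
    """Return {file name: count} for files where term appears at least min_count times."""
    matches = {}
    needle = term.lower()
    for name, text in files.items():
        count = text.lower().count(needle)  # non-overlapping substring occurrences
        if count >= min_count:
            matches[name] = count
    return matches

# Hypothetical search space of two small files.
files = {
    "a.txt": "Print setup. Print queue. Print driver.",
    "b.txt": "To print a page, press the button.",
}

# With a threshold of 2, the single incidental mention in b.txt is ignored.
frequent = files_matching("print", files, min_count=2)
```

Raising `min_count` trades recall for precision, which is why the text suggests tuning it by experiment against the size of the search space.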
For example, query 101 may appear in a particular file only one time, while it may appear in another file one hundred times. Intuitively, query 101 is more likely to be related to the file where it appears one hundred times than the file that it appears in only once. An embodiment can exploit this by only considering files that contain the query 101 greater than some user determined frequency or number of times. While this example discusses ranking search results based on the frequency of the search term appearing in a particular file, any other methods for ranking search results may be used. In addition, this ranking can be further used to rank proposed query to task mappings, as further discussed with respect to
As illustrated in
The relationship between the size of overlap 350 and the probability of a relationship existing between query 101 and task 202 can be used to rank or assign weights to a proposed mapping. As described further with respect to
As discussed above, human reviewers can be used to verify matches. These human reviewers are expensive and time consuming. Thus, it is desirable to minimize the time spent by humans in reviewing proposed matches. To this end, proposed matches can be ranked, and those matches that fall below a certain desirably user determined threshold can be eliminated. Thus, the match(es) will not be sent to human annotators to verify the match. The user determined threshold can be determined by an administrator depending on factors such as the number of proposed matches, and the number of files in the search space 110. An exemplary method is described in more detail with respect to
The ranked list of files from the sample set of files that match each of the tasks is inverted to give a list of each file and the weighted list of tasks matching that file. The list of queries and the matching files can be combined with the list of files and matching tasks to generate a weighted list of queries and matching tasks. While the exemplary embodiment is discussed with reference to tasks and queries, the method is applicable for creating a mapping between any sets of short strings.
More particularly, at 401, the file set is created. As previously discussed with respect to
At 405, an index is desirably created using the selected files. Indexing a set of files allows for the files to be quickly searched. An index entry for a file could comprise a list of every word contained in that file. A more sophisticated index might comprise the number of occurrences of each word in a file, allowing a match to be given a rank or likelihood that the match is meaningful. The more times a matched word appears in a file, the higher the likelihood that the file is related to the matched word. Similarly, a given file index can be improved through the use of text normalization, including the handling of spelling, morphological analysis, punctuation, and phrases. For example, common misspellings of words found in the files can be included in the index. In one embodiment, a standard operating system indexing service may be used to create the file index, but any system, method, or technique known in the art for creating an index on a group of files can be used.
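The more sophisticated index described above, which records the number of occurrences of each word in each file, might be sketched as follows. The names `build_index` and `search`, the lowercase alphanumeric tokenization, and the sample files are assumptions for illustration; this is not the operating system indexing service mentioned above, and it handles single-word terms only.

```python
import re
from collections import Counter, defaultdict

def build_index(files):
    """Map each normalized word to {file name: occurrence count}."""
    index = defaultdict(dict)
    for name, text in files.items():
        # Normalization here is only lowercasing and stripping punctuation.
        word_counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for word, count in word_counts.items():
            index[word][name] = count
    return index

def search(index, term, min_count=1):
    """Look up a single-word term; keep files that meet the count threshold."""
    hits = index.get(term.lower(), {})
    return {name: c for name, c in hits.items() if c >= min_count}

# Hypothetical file set.
files = {
    "f1": "Install the printer. Printer drivers ship with the printer.",
    "f2": "Connect to a network printer.",
}
index = build_index(files)
strong_hits = search(index, "Printer", min_count=2)
```

Because the counts are stored in the index, the same structure can serve both the task search and the query search without rescanning the files.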
At 408, each of the tasks is searched on the index of the files. A list containing the files that matched each of the tasks is desirably generated. Given the type of indexing used, the list of files matching each task can be ranked or given a confidence level indicating the quality of the match or the likelihood that it is accurate. The list of files can then be reduced by eliminating the matches below a (e.g., user determined) rank or confidence level. It is contemplated that any system, method, or technique known in the art for file searching can be used.
At 411, a new list, comprising an entry for each file in the file set and the associated tasks matching the file entry, is desirably generated by inverting or reversing the list comprising an entry for each task and the files that contained that task. Any rankings or confidence levels associated with each match are desirably preserved in the new list.
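The inversion at 411 can be illustrated with a short sketch that preserves the weight attached to each match. The function name `invert`, the nested-dictionary representation, and the sample weights are assumptions made for this example.

```python
from collections import defaultdict

def invert(mapping):
    """Turn {term: {file: weight}} into {file: {term: weight}}, keeping weights."""
    inverted = defaultdict(dict)
    for term, files in mapping.items():
        for file_id, weight in files.items():
            inverted[file_id][term] = weight
    return dict(inverted)

# Hypothetical task-to-file matches with illustrative confidence levels.
task_to_files = {
    "task 1": {5: 0.9, 10: 0.4},
    "task 2": {10: 0.7},
}
file_to_tasks = invert(task_to_files)
```

Files that matched no task simply do not appear in the inverted list.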
At 415, each of the queries is searched on the same index of the files as the tasks. A list containing the files that matched each of the queries is desirably generated. A rank or confidence level is desirably specified for each match. Similar to the task reduction set forth above, given the type of indexing used, the list of files matching each query can be reduced by eliminating the matches below a user determined rank or confidence level. Any system, method, or technique known in the art for file searching can be used.
At 417, the generated list containing the query to files mapping is desirably combined with the list containing the files to task mapping, creating the query to task mapping. In addition, as described further below with respect to
At 501, the mapping from the queries to the files is generated. Assume for the purposes of this example that there are three query terms 1-3, and fifteen text files 1-15. As shown, query 1 maps to files 3, 5, 10, and 15; query 2 maps to files 5 and 15; and query 3 maps to file 3. In this example, a particular query is found to map to a file when the query term appears at least once in the file.
As discussed with respect to
At 505, the mapping from the queries to the files is desirably inverted or reversed, providing a mapping from the files to the queries. As shown, file 3 maps to queries 1 and 3; file 5 maps to queries 2 and 1; file 10 maps to query 1; and file 15 maps to queries 2 and 1. Files 1, 2, 4, 6, 7, 8, 9, 11, 12, 13, and 14 are omitted because they did not match with any of the queries.
At 508, the mapping from the tasks to the files is generated. Assume for the purposes of this example that there are three task terms 1-3, and fifteen text files 1-15. As shown, task 1 maps to files 5 and 10; task 2 maps to files 3, 10, and 15; and task 3 maps to file 15.
At 511, the mapping from the tasks to the files is combined with the mapping from the files to the queries, creating a mapping from the tasks to queries. Each file can map to several different queries, and several different tasks. As a result, when the two mappings are combined, some tasks are shown to map to the same query multiple times. Rather than being redundant, the number of times a task matches with a particular query can provide insight into how good a match it is. As shown, task 1 maps to query 2 once and query 1 twice; task 2 maps to query 1 thrice, query 2 once and query 3 once; and task 3 maps to query 2 once and query 1 once.
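The worked example at 501 through 511 can be reproduced with a short sketch. The query and task mappings below are taken from the example above; the variable names and the use of a counter to tally duplicate matches are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Mappings from the example: three queries, three tasks, files numbered 1-15.
query_to_files = {1: [3, 5, 10, 15], 2: [5, 15], 3: [3]}
task_to_files = {1: [5, 10], 2: [3, 10, 15], 3: [15]}

# Step 505: invert the query-to-file mapping into a file-to-query mapping.
file_to_queries = defaultdict(list)
for query, files in query_to_files.items():
    for f in files:
        file_to_queries[f].append(query)

# Step 511: follow each task's files to the queries they contain,
# counting how many files each (task, query) pair shares.
task_to_queries = {
    task: Counter(q for f in files for q in file_to_queries.get(f, []))
    for task, files in task_to_files.items()
}
```

Each count equals the number of files that contained both the task term and the query term, so task 2 and query 1, which share files 3, 10, and 15, receive a count of three.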
At 515, a ranking or a confidence level for each mapping is generated. As shown, each task to query mapping is ranked by the number of duplicate matches found. Each duplicate mapping represents a file that contained both the query term and the task term. The greater the rank, the greater the chance that the mapping between the tasks and queries is meaningful.
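A minimal sketch of the ranking at 515 follows, using the duplicate counts from the example above. The function name `rank_mappings` and the threshold value of 2 are hypothetical; the threshold in practice is user determined.

```python
def rank_mappings(shared_file_counts, min_rank=1):
    """Return (task, query, rank) triples sorted by descending rank,
    dropping pairs whose rank falls below min_rank."""
    triples = [
        (task, query, rank)
        for task, queries in shared_file_counts.items()
        for query, rank in queries.items()
        if rank >= min_rank
    ]
    return sorted(triples, key=lambda t: (-t[2], t[0], t[1]))

# Duplicate-match counts from the example: {task: {query: shared files}}.
counts = {1: {1: 2, 2: 1}, 2: {1: 3, 2: 1, 3: 1}, 3: {1: 1, 2: 1}}

# Only mappings supported by at least two shared files go to the reviewers.
for_review = rank_mappings(counts, min_rank=2)
```

With this illustrative threshold, seven proposed mappings are reduced to two for human review.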
In addition to ranking by the number of duplicate matches, the ranking or confidence level for each mapping can be generated using any system, method, or technique known in the art for assigning weights or confidence levels to searched terms. For example, if the weights returned by the search system (the degree of match) are used, then in some cases a single overlap with a large weight may be more significant than a duplicate being found.
In order to save time and money spent on human review of the generated mappings, a user can filter the generated mappings based on some threshold. The reviewers examine each generated mapping in order to determine if a real relationship between the query and task exists, or if the match was just a coincidence or the result of a poor text file in the set of files. Because the review is an expensive process, done by those skilled in the art, it is desirable to minimize the number of mappings that are reviewed. To this end, the user desirably determines the minimum ranking that can be found between a task and a query before the mapping will be considered by the reviewers. In the example described with respect to
The selector component 602 is desirably used to select a set of files that can be used to create a mapping between a set of short query strings and a set of short task strings. Because the queries and tasks are short strings, there is little information through which a mapping can be generated. As described with respect to
The searcher component 605 is desirably used to search the selected text files for occurrences of the strings from the set of queries and the set of tasks. Each query and task is desirably text searched in the set of files. As discussed further with respect to
The first generator component 606 is desirably used to generate the mapping between the queries and the set of files. The generated mapping can comprise a list containing an entry for each query term, along with each file from the set of files that contains that query term. The generated mapping can be further refined by the first generator component 606, for a given term, by only adding files that achieved a certain rank or confidence level. For example, a given file that is found to match a particular query term by the searcher component 605 may have received a low weight, while another file that matches the query term may have received a very high weight. Presumably, the file with the high weight is more likely to be related to the query term than the file with the low weight. The first generator component 606 can add entries to the list where the file matches the query term with a weight or confidence level above a user specified amount. The first generator component 606 can be implemented in hardware, software, or a combination of both.
The second generator component 607 is desirably used to generate the mapping between the tasks and the selected files. The generated mapping can comprise a list containing an entry for each task term, along with each file from the set of files, that contains that task term. The generated mapping can be further refined by the second generator component 607, for a given term, by only adding files that contained the task term having a weight or confidence level above a certain user specified amount. This is described in greater detail with respect to the first generator component 606. The second generator component 607 can be implemented using hardware, software, or a combination of both.
The third generator component 611 is desirably used to generate the mapping between the set of short queries and the set of short tasks. The mapping is desirably generated by combining the mapping from the query terms to the file set with the mapping from the task terms to the file set. Each individual mapping between a query and a task corresponds with at least one file in the file set that contained both the query and the task term. Some query and task terms were matched or contained together in multiple files from the file set. The third generator component 611 can further refine the mapping by eliminating those query and task mappings whose terms appeared together in fewer than some determined threshold number of files. The threshold can be determined with reference to the total number of proposed mappings, or the size of the initial file set.
Similarly, the mapping between the query and task terms can be refined by creating a ranking or confidence level for each mapping based on underlying ranking or confidence level associated with the query to file mapping and the task to file mapping. Each matched query and task term has an associated weight or confidence level for both the underlying query to file mapping and the task to file mapping, as generated by the searcher component 605. A composite ranking can be generated for the query to task mapping by combining the two rankings. The third generator component 611 can eliminate those query and task mappings that receive a ranking below some determined threshold. The third generator component 611 can be implemented in hardware, software, or a combination of both.
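The text does not specify how the two underlying rankings are combined. Purely as an assumption for illustration, the sketch below multiplies the query-to-file weight by the task-to-file weight for each shared file and sums the products; the function name `composite_scores` and the sample weights are likewise hypothetical.

```python
from collections import defaultdict

def composite_scores(query_to_files, task_to_files):
    """Score each (query, task) pair by summing weight products over shared files.
    Inputs have the form {term: {file: weight}}."""
    scores = defaultdict(float)
    for query, q_files in query_to_files.items():
        for task, t_files in task_to_files.items():
            # Only files matched by both the query and the task contribute.
            for file_id in q_files.keys() & t_files.keys():
                scores[(query, task)] += q_files[file_id] * t_files[file_id]
    return dict(scores)

# Hypothetical weights as might be returned by a searcher component.
q = {"q1": {"a": 0.5, "b": 1.0}}
t = {"t1": {"a": 2.0, "b": 0.5}, "t2": {"c": 1.0}}
scores = composite_scores(q, t)
```

Under this combination rule a single shared file with large weights can outscore several shared files with small weights, and pairs with no shared file receive no score at all.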
The reviewer component 615 desirably determines which of the generated mappings between queries and tasks are meaningful, and desirably eliminates the mappings that are not meaningful. Human annotators acting as reviewers, desirably skilled with respect to the relevant subject of the query and task terms, can examine each mapping and eliminate a mapping if the query and task term do not appear to be related. This review can also be automated or computerized. In such cases, this reviewer component 615 can be implemented in hardware, software, or a combination of both.
Exemplary Computing Environment
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 710 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 710 and includes both volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 710. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 730 includes computer storage media in the form of volatile and/or non-volatile memory such as ROM 731 and RAM 732. A basic input/output system 733 (BIOS), containing the basic routines that help to transfer information between elements within computer 710, such as during start-up, is typically stored in ROM 731. RAM 732 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 720. By way of example, and not limitation,
The computer 710 may also include other removable/non-removable, volatile/non-volatile computer storage media. By way of example only,
The drives and their associated computer storage media provide storage of computer readable instructions, data structures, program modules and other data for the computer 710. In
The computer 710 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 780. The remote computer 780 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 710, although only a memory storage device 781 has been illustrated in
When used in a LAN networking environment, the computer 710 is connected to the LAN 771 through a network interface or adapter 770. When used in a WAN networking environment, the computer 710 typically includes a modem 772 or other means for establishing communications over the WAN 773, such as the internet. The modem 772, which may be internal or external, may be connected to the system bus 721 via the user input interface 760, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 710, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
As mentioned above, while exemplary embodiments of the present invention have been described in connection with various computing devices, the underlying concepts may be applied to any computing device or system.
The various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.
The methods and apparatus of the present invention may also be practiced via communications embodied in the form of program code that is transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as an EPROM, a gate array, a programmable logic device (PLD), a client computer, or the like, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code combines with the processor to provide a unique apparatus that operates to invoke the functionality of the present invention. Additionally, any storage techniques used in connection with the present invention may invariably be a combination of hardware and software.
While the present invention has been described in connection with the preferred embodiments of the various figures, it is to be understood that other similar embodiments may be used or modifications and additions may be made to the described embodiments for performing the same function of the present invention without deviating therefrom. Therefore, the present invention should not be limited to any single embodiment, but rather should be construed in breadth and scope in accordance with the appended claims.