This invention relates to information processing, and more particularly, to seed set expansion.
The Internet contains large amounts of information in unstructured or semi-structured form. Information extraction processes allow information contained within web pages to be made accessible, or at least more accessible, for machine processing. One use for information extraction processes might be analyzing a set of documents written in a natural language and populating a database with the extracted information.
The memory 14 can include a seed expansion component 20 to determine new members for a seed set 21 comprising a list of members of a class of named entities from the set of web pages associated with an organization. For example, the seed expansion component 20 can include a context-based extractor 22 to generate a set of context-based candidate members of the class of named entities as words found to be connected with an element from the seed set 21 via a contextual pattern. For example, contextual patterns signifying a relationship between two words can be determined from the set of web pages (e.g., obtained via the communications interface) and the seed set 21, and words connected to a member of the seed set by one of the determined patterns can be selected as candidate members. Each selected candidate member is then assigned a context confidence value. For example, the context confidence value can be assigned according to a confidence associated with the contextual pattern, with contextual patterns that occur frequently with semantically related word pairs within the set of web pages and infrequently with unrelated word pairs having a high confidence.
The seed expansion component 20 further comprises a list-based extractor 24 to generate a set of list-based candidate members from elements within lists found within the set of web pages. For example, the list-based extractor 24 can locate tags in a markup language, such as hypertext markup language (HTML), indicative of a list, and the elements of each list can be extracted. From these elements, the list-based extractor 24 can determine a set of list-based candidate members of the class of named entities. The list-based extractor 24 then determines a list confidence value for each candidate member. For example, the list confidence value for a given candidate can be assigned according to a confidence associated with a list or lists from which the candidate was extracted. The confidence associated with each list can be determined according to its degree of overlap in elements with other lists from the set of web pages.
A confidence arbitrator 26 receives the set of context-based candidate members from the context-based extractor 22 and the set of list-based candidate members from the list-based extractor 24, as well as their respective associated confidence values. The confidence arbitrator 26 determines an intersection set of candidate members that are present in both sets of candidate members. The confidence arbitrator 26 also determines a final confidence value for each member of the intersection set. For example, the confidence arbitrator 26 can calculate the final confidence value for each candidate member as a weighted linear combination of the two confidence values, or can select one of the two confidence values from the input sets of candidate members. In one example implementation, the lesser of the two confidence values is selected as the final confidence value.
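The arbitration described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names `arbitrate` and `weighted`, the dictionary representation of the candidate sets, and the sample entities and weights are all assumptions made for the example.

```python
def arbitrate(context_conf, list_conf, combine=min):
    """Return final confidences for candidates present in both inputs.

    context_conf, list_conf: dicts mapping candidate -> confidence.
    combine: two-argument function; min mirrors the example implementation.
    """
    common = context_conf.keys() & list_conf.keys()  # intersection set
    return {c: combine(context_conf[c], list_conf[c]) for c in common}


def weighted(w_context, w_list):
    """Build a weighted linear combination (the weights are illustrative)."""
    return lambda a, b: w_context * a + w_list * b


# Hypothetical candidate sets for a class of city names.
context = {"Oslo": 0.9, "Lima": 0.4, "Table": 0.7}
lists = {"Oslo": 0.6, "Lima": 0.8, "Cairo": 0.5}

final_min = arbitrate(context, lists)                      # lesser value
final_avg = arbitrate(context, lists, weighted(0.5, 0.5))  # linear mix
```

Taking the lesser value means a candidate scores highly only when both extractors support it, which favors precision over recall.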
The various candidate members and confidence values can then be provided to a candidate selector 28. The candidate selector 28 is programmed to select candidates for inclusion in the class of named entities according to their arbitrated confidence values. For example, the candidate selector 28 can select a predetermined number of highest ranked candidates in an ordinal ranking by confidence value or the candidate selector 28 can select all candidates having a confidence value greater than a threshold value for addition to the seed set 21. Once the candidate selector 28 has selected additions to the seed set, they can be provided to a user through a user interface 32 at an associated output device (e.g., including a display) 34.
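The two selection strategies mentioned above, a fixed number of top-ranked candidates or all candidates above a threshold, might look like the following. The function names and the sample values are hypothetical and serve only to illustrate the idea.

```python
def select_top_k(confidences, k):
    """Return the k candidates with the highest confidence, best first."""
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def select_above(confidences, threshold):
    """Return all candidates whose confidence exceeds the threshold."""
    return sorted(name for name, c in confidences.items() if c > threshold)


# Hypothetical arbitrated confidence values.
scores = {"Oslo": 0.6, "Lima": 0.4, "Cairo": 0.2}
top_two = select_top_k(scores, 2)
confident = select_above(scores, 0.3)
```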
At 52, a set of context-based candidate members of the class of named entities is extracted as words in the set of web pages that are connected with one of a set of known members of the class via a contextual pattern. For example, a plurality of contextual patterns that signify a relationship between two words can be determined from the set of web pages and the set of known members. Words connected to one of the set of known members by one of the determined patterns can be selected as candidate members. At 54, a context confidence value is determined for each of the set of context-based candidates. In one example implementation, the context confidence value can be assigned to a given candidate according to a confidence associated with the contextual pattern used to select the candidate. For example, contextual patterns that occur frequently with semantically related word pairs within the set of web pages and infrequently with unrelated word pairs will have a high confidence.
At 56, a plurality of lists, each having a plurality of elements, are identified within the set of web pages. For example, tags (e.g., <TD> and <LI>) in hypertext markup language (HTML) indicative of a list can be located, and the elements of the list can be extracted. At 58, a set of list-based candidate members of the class of named entities is extracted from the elements comprising the plurality of lists. At 60, a list confidence value is determined for each of the set of list-based candidate members. For example, the list confidence value for a given candidate can be assigned according to a confidence associated with a list or lists from which the candidate was extracted, with the confidence associated with each list being determined according to its degree of overlap in elements with other lists from the set of web pages.
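As a rough illustration of the list identification at 56, the sketch below collects the items of each HTML list using the `html.parser` module from the Python standard library. The names `ListExtractor` and `extract_lists` are hypothetical, and the sketch assumes that only `<ul>`, `<ol>`, and `<li>` tags mark lists; table cells such as `<TD>` would need analogous handling.

```python
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Collect the text of <li> items, grouped by enclosing <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.lists = []    # completed lists, each a list of strings
        self._open = []    # stack of lists currently being filled
        self._item = None  # text fragments of the <li> being read

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush_item()
            self._open.append([])
        elif tag == "li" and self._open:
            self._flush_item()  # tolerate an unclosed previous <li>
            self._item = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush_item()
        elif tag in ("ul", "ol") and self._open:
            self._flush_item()
            self.lists.append(self._open.pop())

    def handle_data(self, data):
        if self._item is not None:
            self._item.append(data)

    def _flush_item(self):
        # Store the pending item, with whitespace normalized, if non-empty.
        if self._item is not None and self._open:
            text = " ".join("".join(self._item).split())
            if text:
                self._open[-1].append(text)
        self._item = None


def extract_lists(html):
    """Return every list found in the HTML as a list of element strings."""
    parser = ListExtractor()
    parser.feed(html)
    parser.close()
    return parser.lists


page = "<ul><li>Oslo</li><li> Lima </li></ul><ol><li>Red</li></ol>"
found = extract_lists(page)
```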
At 62, an intersection set of candidates is identified as those words that are common to the set of context-based candidates and the set of list-based candidates. At 64, a final confidence value is determined for each member of the intersection set from the context confidence value and the list confidence value associated with the member. Any appropriate method for arbitrating between the confidence values can be used, including taking a weighted linear combination of the confidence values or selecting one of the context or list confidence values. In an example implementation, the lesser value of the two confidence values is selected as the final confidence value to reduce the likelihood of false positives.
At 66, a plurality of candidates are selected for inclusion in the class of named entities according to their associated confidence values. The selected candidates can be drawn from the context-based set of candidates, the list-based set of candidates, and the intersection set. For example, a predetermined number of highest ranked candidates in an ordinal ranking by confidence value can be selected. Alternatively, all candidates having a confidence value greater than a threshold value can be selected for addition to the class. The selected candidates are stored in memory at 68. The selected members can then be displayed to a user at an associated output device or used by another application.
The system 150 comprises a word pair generation component 152 to generate word pairs from an associated lexical database 154. One example of a lexical database 154 that can be used for this purpose for the English language is WordNet, and similar lexical databases exist for other languages. The word pair generation component 152 can generate a first set of word pairs comprising semantically related word pairs and a second set of word pairs comprising unrelated word pairs. For example, each of the set of related word pairs can be selected to be synonymous, and each of the set of unrelated word pairs can be randomly selected by the word pair generation component 152.
The sets of word pairs are provided to a contextual pattern extractor 156. The contextual pattern extractor 156 is programmed to evaluate the set of web pages associated with the organization to determine contextual patterns within the web pages that signify that two words are semantically related. To this end, each word pair can be used as a query on the set of web pages, and the contextual pattern extractor 156 extracts a plurality of word strings containing both words in the word pair. In an example implementation, the contextual pattern extractor 156 can limit the length of the word strings to a predetermined number of words, such as strings of three, four, or five words. Each word string, or more specifically the words other than the queried word pair within the word string, can be evaluated as a candidate pattern connecting the queried word pair. Accordingly, the contextual pattern extractor 156 produces a set of candidate patterns for each of the set of related word pairs and the set of unrelated word pairs.
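The extraction of candidate patterns for one word pair can be sketched as below. This is a simplified reading of the paragraph above: it assumes that a word string starts with one word of the pair and ends with the other, that the candidate pattern is the run of words between them, and that `X` and `Y` are placeholder symbols for the pair. The function name `candidate_patterns` and the tokenization by regular expression are likewise assumptions.

```python
import re
from collections import Counter


def candidate_patterns(text, pair, max_len=5):
    """Count patterns of the form 'X <middle words> Y' for a word pair.

    max_len bounds the whole word string, including both pair words.
    """
    tokens = re.findall(r"\w+", text.lower())
    first, second = (w.lower() for w in pair)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok not in (first, second):
            continue
        other = second if tok == first else first
        # Look ahead at most max_len - 1 tokens for the other word.
        for j in range(i + 1, min(i + max_len, len(tokens))):
            if tokens[j] == other:
                middle = tokens[i + 1:j]
                if middle:  # adjacent words carry no connecting pattern
                    counts["X " + " ".join(middle) + " Y"] += 1
                break
    return counts


sentence = "A car also called an automobile is fast"
patterns = candidate_patterns(sentence, ("car", "automobile"))
```

Running the same function over the related pairs and over the unrelated pairs yields the two sets of pattern counts that the subsequent confidence calculation compares.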
The sets of candidate patterns are then provided to a confidence value calculator 158 to provide a confidence value for each candidate pattern. To this end, a number of occurrences of a given candidate pattern connecting a word pair in the set of related word pairs can be compared to a number of occurrences of the candidate pattern connecting a word pair in the set of unrelated word pairs. Based on this comparison, an appropriate confidence value for the pattern can be determined, representing the likelihood that a pair of words connected by the pattern are semantically related. In one example, the confidence value is determined by calculating a chi-squared value for a pattern according to its occurrence with related and non-related word pairs, such that:
The confidence value calculator 158 can determine a confidence for each candidate pattern by comparing its associated chi-squared value to an appropriate chi-squared distribution. A set of candidate patterns are then selected for use as contextual patterns for locating members of the class of named entities. For example, a predetermined number of highest ranked candidate patterns in an ordinal ranking by confidence can be selected. Alternatively, all candidate patterns having a confidence value greater than a threshold value can be selected.
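One way to turn occurrence counts into a confidence is sketched below. The exact chi-squared expression used by the confidence value calculator 158 is not reproduced here, so this sketch substitutes the standard Pearson chi-squared statistic for a 2x2 contingency table (pattern versus other patterns, related versus unrelated pairs) and compares it to a chi-squared distribution with one degree of freedom, whose cumulative distribution function is erf(sqrt(x/2)). Returning zero for patterns that are not more frequent with related pairs is a further assumption, as are the function and parameter names.

```python
import math


def pattern_confidence(rel_hits, unrel_hits, rel_total, unrel_total):
    """Confidence that a pattern signals semantic relatedness.

    rel_hits / unrel_hits: occurrences of this pattern with related /
    unrelated word pairs.  rel_total / unrel_total: occurrences of all
    candidate patterns with related / unrelated word pairs.
    """
    a, b = rel_hits, unrel_hits
    c, d = rel_total - rel_hits, unrel_total - unrel_hits
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # Assumption: only patterns skewed toward related pairs earn confidence.
    if denom == 0 or a * d <= b * c:
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denom  # Pearson statistic, 2x2 table
    return math.erf(math.sqrt(chi2 / 2.0))   # chi-squared CDF, 1 d.o.f.


strong = pattern_confidence(30, 2, 100, 100)
neutral = pattern_confidence(10, 10, 100, 100)
```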
The selected contextual patterns are provided to a member selector 160 that scans the set of web pages associated with the organization to identify a set of candidate members. For example, the member selector 160 can locate each occurrence of the selected contextual patterns in the web pages in conjunction with a known member of the class of named entities, and a word connected to the known member by the contextual pattern can be extracted as a candidate class member. For each candidate class member, a confidence value can be determined for the element according to the contextual pattern or patterns that connect the element to one of the known members. For example, the candidate member can be assigned the confidence value associated with the contextual pattern connecting it to a known member.
Where the candidate member is found to be connected with a known class member by multiple patterns, the confidence values associated with the various contextual patterns can be arbitrated by an appropriate method. For instance, the confidence value arbitration can include taking a weighted linear combination of the confidence values or selecting one of the confidence values of the candidate members being arbitrated. In one implementation, the largest confidence value from the confidence values associated with the contextual patterns can be selected. The set of candidate members and their associated confidence values can then be recorded in memory for further analysis.
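The member selection and the largest-value arbitration described above might be sketched as follows. The sketch assumes single-word entities, represents each contextual pattern as the tuple of words that sits between the known member and the candidate, and accepts the known member on either side of the pattern. The name `find_candidates` and the sample patterns and confidences are hypothetical.

```python
import re


def find_candidates(text, seeds, pattern_conf):
    """Map each candidate word to the best confidence of a linking pattern.

    pattern_conf maps a tuple of middle words to that pattern's confidence.
    """
    tokens = re.findall(r"\w+", text.lower())
    seeds = {s.lower() for s in seeds}
    best = {}
    for middle, conf in pattern_conf.items():
        m = len(middle)
        for i in range(len(tokens) - m - 1):
            if tuple(tokens[i + 1:i + 1 + m]) != tuple(middle):
                continue
            left, right = tokens[i], tokens[i + m + 1]
            # The known member may appear on either side of the pattern.
            for seed, cand in ((left, right), (right, left)):
                if seed in seeds and cand not in seeds:
                    # Keep the largest confidence over all matching patterns.
                    best[cand] = max(best.get(cand, 0.0), conf)
    return best


text = "Oslo and Lima. Oslo or Lima. Cairo and Oslo."
candidates = find_candidates(text, {"Oslo"}, {("and",): 0.6, ("or",): 0.9})
```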
The extracted lists are then provided to a graph constructor 204 to construct a weighted directed graph in which each node represents one of the extracted lists. The nodes for each pair of lists containing at least two common elements are connected by two edges, with each edge having a corresponding weight representing a degree of overlap between elements of the lists represented by the two nodes connected by the edge. It will be appreciated that edges are directional, such that a first edge connecting a first node to a second node can have a different weight than a second edge connecting the second node to the first node. In one implementation, the weight, wij, for an edge connecting a first node, i, to a second node, j, can be calculated as:
C(n)=n*(n−1)/2.
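The graph construction can be sketched as below, with one important caveat: the weight expression itself is not reproduced in this text, so the sketch takes the weight as a pluggable function. The default, `assumed_weight`, is purely a hypothetical stand-in (the fraction of element pairs of list i that also occur in list j), chosen only because it is directional and uses C(n); it should not be read as the disclosed formula. All function names are assumptions.

```python
from itertools import combinations


def pair_count(n):
    """C(n) = n*(n-1)/2, the number of unordered pairs among n elements."""
    return n * (n - 1) // 2


def assumed_weight(list_i, list_j):
    """Hypothetical stand-in: share of list i's element pairs found in list j."""
    common = len(set(list_i) & set(list_j))
    return pair_count(common) / pair_count(len(set(list_i)))


def build_graph(lists, weight_fn=assumed_weight):
    """Return {(i, j): weight} with two directed edges per overlapping pair."""
    edges = {}
    for i, j in combinations(range(len(lists)), 2):
        # Only lists sharing at least two elements are connected.
        if len(set(lists[i]) & set(lists[j])) >= 2:
            edges[(i, j)] = weight_fn(lists[i], lists[j])
            edges[(j, i)] = weight_fn(lists[j], lists[i])
    return edges


sample = [["a", "b", "c"], ["a", "b"], ["a", "z"]]
graph = build_graph(sample)
```

In this example the edge from the shorter list to the longer one is heavier than the reverse edge, which illustrates why the two directions can carry different weights.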
A confidence calculator 206 processes the completed graph to calculate a confidence value for each list representing the likelihood that all of the elements on the list belong to a same class of entities. By way of example, a transition matrix can be used to calculate a normalized confidence for each list, with the transition matrix, T, being populated with each element, Ti,j, where Ti,j can be expressed as follows:
if the graph contains an edge, wj,1
Ti,j=0 otherwise.
Once the matrix is constructed, the transition matrix can be used to calculate raw confidence values via an iterative method. For example, the process begins with a set of initial raw confidence values for the n nodes provided as an n-element vector, t0, with each initial confidence value equal to 1/n. The initial confidence values are then iteratively refined using the transition matrix T, with each successive iteration being calculated as:
$$t_{i+1} = \alpha_B \, T \, t_i + (1 - \alpha_B) \, t_0 \qquad \text{(Eq. 3)}$$
where $\alpha_B$ is a decay factor, set in one example to 0.85.
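The iterative refinement of Eq. 3 can be sketched in plain Python as follows. The sketch applies the update exactly as written, multiplying the transition matrix by the current vector and mixing the result with the uniform initial vector. The function name, the nested-list matrix representation, and the sample matrix are assumptions, and the default of twenty iterations follows the example iteration count given in this description.

```python
def refine_confidences(T, alpha=0.85, iterations=20):
    """Iterate t_{i+1} = alpha * T * t_i + (1 - alpha) * t_0.

    T is an n x n matrix given as a list of rows; t_0 is uniform (1/n).
    """
    n = len(T)
    t0 = [1.0 / n] * n
    t = list(t0)
    for _ in range(iterations):
        t = [
            alpha * sum(T[r][c] * t[c] for c in range(n)) + (1 - alpha) * t0[r]
            for r in range(n)
        ]
    return t


# Hypothetical 2-node transition matrix whose columns each sum to one.
T = [[0.0, 0.5],
     [1.0, 0.5]]
raw = refine_confidences(T)
```

When every column of the matrix sums to one, as in this example, the entries of the vector keep summing to one at each step, so the raw values remain comparable across iterations.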
After a number of iterations, for example, twenty iterations, convergence can be achieved, and the resulting raw confidence values can be used to calculate a normalized confidence for each of the plurality of nodes. For a given set of raw confidence values, r1−rn, a normalized confidence value, ci, for each node can be calculated as:
The normalized confidence for each node represents the likelihood that each element on that list belongs to the class of named entities given that a known member of the class appears on the list. Accordingly, given an initial seed set, containing known members of the class of named entities, a set of elements from the plurality of lists that occur on a same list as one of the known members can be determined as candidate members of the class of named entities. For each candidate member, a confidence value can be determined for the element according to the list or lists that connect the element to one of the known members. For example, the candidate member can be assigned the confidence value associated with the list upon which it appears with a known member.
Where the candidate member appears with a known class member on multiple lists, the confidence values associated with the various lists can be arbitrated by an appropriate method, including taking a weighted linear combination of the confidence values or selecting one of the confidence values. In one implementation, the largest confidence value from the multiple lists can be selected. The set of candidate members and their associated confidence values can be recorded for further analysis.
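The assignment of list confidence values to candidates, including the selection of the largest value across multiple lists, might be sketched as follows. The function name `list_candidates`, the parallel-sequence representation of lists and their normalized confidences, and the sample data are assumptions for illustration.

```python
def list_candidates(lists, list_conf, seeds, combine=max):
    """Score elements that share a list with at least one known member.

    lists: sequence of element lists; list_conf: confidence of each list.
    combine: reduces the confidences of all qualifying lists for a
    candidate; max mirrors the implementation described above.
    """
    seeds = set(seeds)
    scores = {}
    for items, conf in zip(lists, list_conf):
        if seeds.isdisjoint(items):
            continue  # this list contains no known member
        for item in set(items) - seeds:
            scores.setdefault(item, []).append(conf)
    return {item: combine(confs) for item, confs in scores.items()}


lists = [["Oslo", "Lima", "Cairo"], ["Oslo", "Lima"], ["Red", "Blue"]]
confidences = [0.4, 0.7, 0.9]
scored = list_candidates(lists, confidences, {"Oslo"})
```

The third list has the highest confidence but contributes nothing, because none of its elements is a known member of the class.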
What have been described above are examples. It is, of course, not possible to describe every conceivable combination of components or methods, but one of ordinary skill in the art will recognize that many further combinations and permutations are possible. Accordingly, the invention is intended to embrace all such alterations, modifications, and variations that fall within the scope of this application, including the appended claims. Additionally, where the disclosure or claims recite “a,” “an,” “a first,” or “another” element, or the equivalent thereof, it should be interpreted to include one or more than one such element, neither requiring nor excluding two or more such elements.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN10/78595 | 11/10/2010 | WO | 00 | 5/7/2013 |