The following is stored in the disambiguation database: links to redirect pages, links to disambiguation pages, and, for each redirect and disambiguation page, the popularity, P, of the page and the page type, which, in one example, is either a person page or an organization page. Even a very large encyclopedia can be processed easily and quickly to create a disambiguation database. As will be disclosed below, the disambiguation database may be accessed to disambiguate entities in an efficient, accurate, and computationally non-intensive manner.
Briefly, the disambiguation database may be queried for ambiguous entities extracted from an article. Direct and indirect links for page matches in the disambiguation database are counted by accessing the pages of the encyclopedia pointed to by the matches. A score is computed from the direct and indirect links, and the score is adjusted according to the popularity P of the matches in the disambiguation database. The entity is then disambiguated, that is, a determination is made as to what type of entity it is (for example, a person or an organization) according to the score and the type of page matched in the disambiguation database.
Turning now to
The article comprises entities, at least some of which are ambiguous entities. Each entity is a single-word entity or a multi-word entity. One example of a single-word entity is “Bush”. One example of a multi-word entity is “George Walker Bush”. The multi-word entity comprises the phrase fragments “George Walker” and “Walker Bush”.
Entities are extracted from the article to determine a first entity type. In one embodiment, shown in
Next, referring to
Each entity is split into its constituent words. For example, “George Walker Bush” is split into the words “George”, “Walker”, and “Bush”. Next, each entity is compared with every other entity that comprises the same or a greater number of words. For example, “George Walker Bush” is a three-word entity and therefore is only compared against other entities having three or more words.
Next, compared entities are merged, that is, they are considered the same entity, if at least a subset of their words match and appear in the same order. Compared entities are also merged if the initial letters of at least a subset of their words match and appear in the same order. By way of example, for one article, the entity “George Bush” is merged with the entity “George Walker Bush”. By way of another example, the entity “George W. Bush” is merged with “George Walker Bush”, “G. W. Bush”, “G. Bush”, “W. Bush”, “G. W. B.”, “G. Walker Bush”, “Geo. W. Bush”, and the like.
Then a single entity is chosen as representative of the merged entities. The entity chosen is the entity having the longest name. For example, with reference to the preceding example, the single entity chosen is “George Walker Bush” since it is the longest entity. Thus combining (step 34) results in the selection of one representative entity for many entities that are likely the same.
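By way of illustration only, the splitting, comparing, merging, and selecting described above can be sketched in Python. The function names (token_matches, is_ordered_subset, merge_entities) are hypothetical and do not appear in the disclosure. The sketch assumes that a word ending in a period is an abbreviation and matches on its initial letter alone, and that each entity is compared against the entity of its group having the most words; other matching rules are possible.

```python
def token_matches(short_tok, long_tok):
    """Compare two words case-insensitively.

    Assumption: a word ending in "." is an abbreviation and is
    matched on its initial letter only.
    """
    a = short_tok.rstrip(".").lower()
    b = long_tok.rstrip(".").lower()
    if a == b:
        return True
    abbreviated = short_tok.endswith(".") or long_tok.endswith(".")
    return abbreviated and a[:1] == b[:1]


def is_ordered_subset(shorter, longer):
    """True if every word of `shorter` matches a word of `longer`, in order."""
    remaining = iter(longer)
    return all(any(token_matches(s, l) for l in remaining) for s in shorter)


def merge_entities(entities):
    """Group entities that are likely the same; key each group by its longest name."""
    # Entities with the most words come first, so each entity is only
    # compared against entities having the same or a greater number of words.
    ordered = sorted(set(entities), key=lambda e: (-len(e.split()), -len(e), e))
    groups = []
    for entity in ordered:
        for group in groups:
            if is_ordered_subset(entity.split(), group[0].split()):
                group.append(entity)
                break
        else:
            groups.append([entity])
    # The representative entity is the one having the longest name.
    return {max(group, key=len): group for group in groups}


merged = merge_entities(
    ["George Bush", "George Walker Bush", "G. W. Bush", "White House", "Tony Blair"]
)
for representative, members in merged.items():
    print(representative, "<-", members)
```

In this sketch, “George Bush” and “G. W. Bush” fall into the group represented by “George Walker Bush”, while “White House” and “Tony Blair” each remain in a group of their own.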
Referring to
Next, the disambiguation database is searched (step 38) for any disambiguation pages matching each extracted entity and entity alias. The search is case insensitive. If a matching page is a redirect page, then the page to which it redirects is followed and all of the outbound links from the followed redirect page are considered a match. If the matching page is a disambiguation page, then all of the outbound links from the matching disambiguation page are considered a match. Then, for each link considered a match, a list of links to other pages to which the matching page links is created (step 40).
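The search of step 38 and the list creation of step 40 can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual database layout: the dictionary db (keyed by lower-cased page title), the dictionary outbound (page title to outbound links), the field names "kind" and "target", and the redirect title “Dubya” are all hypothetical.

```python
def find_matches(name, db, outbound):
    """Return {matched page: pages that the matched page links to}.

    db       -- hypothetical index: lower-cased title -> entry
    outbound -- hypothetical map: page title -> list of outbound links
    """
    entry = db.get(name.lower())  # the search is case insensitive
    if entry is None:
        return {}
    if entry["kind"] == "redirect":
        # Follow the redirect and use the followed page's outbound links.
        source_page = entry["target"]
    else:
        # A disambiguation page: use its own outbound links.
        source_page = entry["title"]
    matches = outbound.get(source_page, [])
    # Step 40: for each match, list the pages to which it links.
    return {page: list(outbound.get(page, [])) for page in matches}


db = {
    "george bush": {"kind": "disambiguation", "title": "George Bush"},
    "dubya": {"kind": "redirect", "title": "Dubya", "target": "George Bush"},
}
outbound = {
    "George Bush": ["George W. Bush", "George H. W. Bush"],
    "George W. Bush": ["White House", "Tony Blair"],
    "George H. W. Bush": ["White House"],
}

print(find_matches("GEORGE BUSH", db, outbound))
```

An entity or alias with no entry in the index yields no matches in this sketch.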
Continuing, each entity and alias is scored (step 42). The score is computed based on the number of direct links and indirect links to matching pages for other entities and aliases. For example, “George Bush” and “White House” are aliases for different entities. In this example, assume both entities have one direct link to each other, that is, the “George Bush” entity page links to the “White House” entity page exactly one time. Also assume both entities have fifty links to a separate third page, that is, the entities link to each other fifty times, indirectly through the separate third page. For example, the third page may be a “Pentagon” entity page, even if “Pentagon” is not one of the extracted entities.
So, the score for an entity or alias pointing to a page A is computed as follows:
Then the score is adjusted (step 44) according to whether the title of the matching page and the entity name are an exact match. For example, the score is adjusted if both the entity name and the matching page name are “George W. Bush”. In one embodiment the score is adjusted as follows: Score(A) = Score(A) * 20.
Next, the highest scoring alias is selected (step 46). Therefore, the highest scoring alias is the representative name of the entity, and the matching page referenced by the alias is the representative page of the entity. Also, a unique identifier may optionally be assigned to the selected alias (step 48). For example, “George Walker Bush” may have an identifier 56700231. Thus any extracted entities named “George Walker Bush” are referenced to this identifier. So, later, if a better (higher scoring) name for the entity is found, for example “President George W. Bush”, the name can be changed while maintaining the referenced page.
So, as disclosed, a single page in the encyclopedia is found for each extracted entity by way of the disambiguation database. Since each entity can now reference exactly one encyclopedia page, the entity type is determined by checking the page type of the encyclopedia page as stored in the disambiguation database (step 50). In one example, the page type is either a person page or an organization page.
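The link counts in the preceding example can be illustrated with a short Python sketch. The sketch only counts links and does not combine the counts into a score. It assumes, hypothetically, that each page is represented as a Counter of outbound link targets, and that the number of indirect links through a third page is the smaller of the two pages' link counts to that third page; the disclosure does not state how indirect links are tallied.

```python
from collections import Counter


def direct_links(a, b, links):
    """Number of times page `a` links directly to page `b`."""
    return links.get(a, Counter())[b]


def indirect_links(a, b, links):
    """Links between `a` and `b` through shared third pages.

    Assumption: a shared third page contributes the smaller of the two
    pages' link counts to it (Counter intersection).
    """
    shared = links.get(a, Counter()) & links.get(b, Counter())
    return sum(n for page, n in shared.items() if page not in (a, b))


links = {
    "George Bush": Counter({"White House": 1, "Pentagon": 50}),
    "White House": Counter({"George Bush": 1, "Pentagon": 50}),
}

print(direct_links("George Bush", "White House", links))    # 1
print(indirect_links("George Bush", "White House", links))  # 50
```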
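The adjustment of step 44 and the selection of step 46 can be sketched as follows. The raw scores supplied below are made-up inputs for illustration, the function names are hypothetical, and the exact-match test is assumed to be a plain string comparison; only the factor of 20 comes from the embodiment described above.

```python
EXACT_MATCH_FACTOR = 20  # from the embodiment: Score(A) = Score(A) * 20


def adjust(score, alias, page_title):
    """Boost the score when the alias and the page title match exactly."""
    return score * EXACT_MATCH_FACTOR if alias == page_title else score


def select_representative(candidates):
    """Pick the highest scoring (alias, page) pair after adjustment.

    candidates -- list of (alias, page_title, raw_score) tuples
    """
    adjusted = [(adjust(s, a, p), a, p) for a, p, s in candidates]
    best_score, alias, page = max(adjusted, key=lambda item: item[0])
    return alias, page, best_score


# Hypothetical raw scores, chosen only to exercise the code.
candidates = [
    ("George Bush", "George H. W. Bush", 30.0),
    ("George W. Bush", "George W. Bush", 5.0),
    ("George Bush", "George Bush (musician)", 2.0),
]
print(select_representative(candidates))
```

With these inputs the alias “George W. Bush” is selected, because its exact title match raises its score from 5.0 to 100.0.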
In one more example, “George Bush” is extracted as an entity in an article. The encyclopedia page, for example a disambiguation page, shows several names with links to corresponding pages, including “George W. Bush”, “George H. W. Bush”, “George P. Bush”, and “George Bush (musician)”. Other extracted entities of the article include “The Pentagon”, “White House”, and “Tony Blair”. The pages “George W. Bush” and “George H. W. Bush” have a high popularity score according to the disambiguation database, and they have a multiplicity of links to other entities. However, neither page is an exact match for “George Bush”. “George Bush” the musician, however, is an exact match, but it has a low popularity and no links with the other extracted entities “The Pentagon”, “White House”, and “Tony Blair”. Thus, according to the methods disclosed above, because “George W. Bush” has links to “Tony Blair” as well as to the other entities, “George W. Bush” will have the highest score and the encyclopedia page for the president “George W. Bush” will be selected as the actual entity in the article.
Modifications may be made to the above disclosed methods. For example, the correctness of the entity type of step 50 can be reinforced (step 52). In this embodiment, a first entity type is determined in step 32 and the entity type of step 50 is compared with the first entity type. If the first entity type of step 32 and the entity type of step 50 match, then the entity type of step 50 is flagged. The flag indicates that the entity type has a very high reliability of being correct.
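The reinforcement of step 52 amounts to a comparison of two type labels, which can be sketched as follows. The function name and the field names of the returned record are hypothetical.

```python
def reinforce(first_type, page_type):
    """Flag the page-derived entity type when it agrees with the first entity type.

    first_type -- entity type determined at extraction (step 32)
    page_type  -- entity type taken from the matched page (step 50)
    """
    return {
        "entity_type": page_type,
        "high_reliability": first_type == page_type,
    }


print(reinforce("person", "person"))
print(reinforce("organization", "person"))
```

In this sketch the entity type of step 50 is always retained, and the flag records only whether the two determinations agree.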
In another embodiment shown in
In an embodiment, after disambiguation (step 62) a record is created of the matching disambiguation database entry of the entity so that, at a later time, the abstract, brief description, or other information can be retrieved (step 64) from the matching encyclopedia page by simply referencing the record, rather than having to repeat the steps of disambiguation (step 62).
The foregoing detailed description has discussed only a few of the many forms that this invention can take. It is intended that the foregoing detailed description be understood as an illustration of selected forms that the invention can take and not as a definition of the invention. It is only the following claims, including all equivalents, that are intended to define the scope of this invention.
This application is related to U.S. patent application Ser. No. 11/463,061 filed Aug. 8, 2006 by Kenneth Alexander Ellis, and entitled “Method for creating a disambiguation database,” the entirety of which is hereby incorporated by reference.