The advent of the internet has resulted in an unprecedented information explosion. With thousands of documents uploaded each day, the internet has become the favorite place to search for information. A named entity (NE) search is one mechanism for finding the right information. A named entity generally refers to a word or group of words, such as the name of a company, a person, a location, a time, a date, or a numerical value. A named entity search may make the task of looking for relevant information easier. However, searching for a complex named entity, such as a group of words containing multiple simple named entities, is no small task, given that the corpus of documents searched could run to millions of documents if the search is performed on the internet.
A number of methods have been reported for named entity extraction. Some of these methods utilize machine learning techniques to train models to extract common named entities from high-quality newswire text. They focus on the use of statistical models, such as Hidden Markov Models, rule learning, and Maximum Entropy Markov Models, for a specific typical NE type. These studies learn the models or rules from a hand-tagged training corpus, so the models and rules are only effective on a similar corpus, and would perform poorly on other corpora with different statistical characteristics, genre, or style. Due to the high cost of training models for each specific NE type, these approaches cannot fulfill the need for general named entity extraction.
For a better understanding of the invention, embodiments will now be described, purely by way of example, with reference to the accompanying drawings, in which:
The following terms are used interchangeably throughout the document, including the accompanying drawings.
(a) “node” and “named entity”
(b) “document” and “electronic document”
Embodiments of the present invention provide methods, computer executable code and computer storage medium for extracting named entities (NE) from a document or a corpus of documents.
Embodiments of the present invention aim to perform an effective extraction of named entities on a low-quality corpus, and to extract entities of any type at minimum cost. The proposed method accommodates the diversity of documents (such as those in organizational webpages), and is efficient at extracting large numbers of named entities from a large-scale corpus. The embodiments effectively extract named entities from a large-scale document corpus where content redundancy is less distinct than in a web-scale corpus.
The method begins in step 110. In step 110, a document or a corpus of documents is accessed, and named entities (NE) appearing in the document or corpus of documents are identified, from which a set of seed entities can be formed manually or automatically using some existing resources.
The corpus of documents may be a collection of electronic documents, such as, but not limited to, a collection of web pages. The documents may be obtained from a repository, such as an electronic database. The electronic database may be an internal database, such as, an intranet of a company, or an external database, such as, Wikipedia. Also, the electronic database may be stored on a standalone personal computer, or spread across a number of computing machines, networked together, with a wired or wireless technology. For example, the electronic database may be hosted on a number of servers connected through a wide area network (WAN) or the internet.
In an embodiment, all possible named entities appearing in a corpus, such as web pages in an intranet, are identified without regard to their types. The step identifies both simple and complex named entities. To illustrate, simple entities, such as the name of a person (“Jack Sparrow”) or a location (“Bangkok”), may be identified. Complex named entities, such as product names (“Compaq Presario 3434 with HP Printer 4565”) and project names (“Entity Extraction Project in ABC Department”), may also be identified, regardless of their types.
In an embodiment, a collocation based method (such as the method described by D. Downey et al., “Locating complex named entities in web text,” in Proc. of IJCAI, 2007) may be used to identify named entities. The present embodiment, however, uses a different method to determine the borders of named entities. It uses terms with numbers as the identifiers of named entity borders, and uses a predefined threshold to select, as named entities, the candidates whose Symmetric Conditional Probabilities (SCPs) are above the threshold.
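The thresholding of candidates by SCP can be illustrated with a minimal sketch. It assumes the common two-word form of the measure, SCP(x, y) = p(x, y)^2 / (p(x) p(y)), with probabilities estimated from raw counts over a single token sequence. The function names `bigram_scp` and `select_candidates` are hypothetical, and the border detection using terms with numbers described above is not modelled here.

```python
from collections import Counter

def bigram_scp(tokens):
    """Return {(w1, w2): SCP} for every adjacent word pair in tokens.

    With all probabilities estimated over the same N tokens, N cancels:
    SCP = p(xy)^2 / (p(x) * p(y)) = c(xy)^2 / (c(x) * c(y)).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): (c * c) / (unigrams[w1] * unigrams[w2])
        for (w1, w2), c in bigrams.items()
    }

def select_candidates(tokens, threshold):
    """Keep the word pairs whose SCP is above the predefined threshold."""
    return {pair for pair, s in bigram_scp(tokens).items() if s > threshold}

tokens = "jack sparrow met jack sparrow in bangkok".split()
candidates = select_candidates(tokens, threshold=0.8)
```

In this toy sequence, “jack sparrow” always occurs as a unit and therefore scores 1.0, while pairs such as “sparrow met” score 0.5 and are dropped at a threshold of 0.8.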
In step 120, a named entity graph is constructed to discover the same-type probability between any given pair of named entities identified in step 110 above. Constructing the named entity graph involves a number of sub-steps, as illustrated in
Language Model Based Graph Construction
As is known, a graph is generally a collection of points where some points are connected by links. The points are called vertices (or nodes), and the links that connect some pairs of vertices are called edges. The edges may be directed or undirected. One of the main issues in graph construction is computing the weight of each edge, which encodes the conditional probability of the end node being of the same type as the start node. In an embodiment, a three-stage method is proposed to compute the weight of an edge and construct a named entity graph: (a) create a language model for each named entity (node), (b) compute the conditional probability on the basis of KL-Divergence, and (c) construct the graph using all the named entities.
In the first stage, a language model is created for each named entity (122). This is done by retrieving, for each named entity, the documents containing the named entity. The snippets around the named entity in the top-ranked retrieved documents are then combined into a virtual document. To illustrate, take the named entity “Jack Sparrow” and assume that an entity search for “Jack Sparrow” in a corpus of documents yields a few hundred documents. In the present embodiment, the proposed method would combine the snippets around the named entity (“Jack Sparrow”) in the top-ranked documents into a virtual document. The top-ranked documents could be titled, for example, “Pirate”, “Pirates of the Caribbean”, “Johnny Depp”, etc., and the snippets could be “film”, “movie”, “actor”, “Hollywood”, etc.
The created virtual document reflects the diversity of the snippets in which the named entity appears, and captures the major characteristics of the contexts of the named entity in the snippets. Therefore, the virtual page collection serves as a good collection for building a language model for each named entity. In an embodiment, the language model is constructed using the Dirichlet smoothing method.
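A Dirichlet-smoothed unigram language model may be sketched as follows. The sketch assumes the standard estimate p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu), where d is the virtual document and C is the collection of all virtual documents. The function name `dirichlet_lm`, the parameter `mu`, and the toy tokens are illustrative assumptions, not values taken from the embodiment.

```python
from collections import Counter

def dirichlet_lm(doc_tokens, collection_tokens, mu=2000.0):
    """Dirichlet-smoothed unigram model of one virtual document.

    Assumes every token of the virtual document also occurs in the
    collection, so the returned probabilities sum to 1 over the
    collection vocabulary.
    """
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    coll_len = len(collection_tokens)
    doc_len = len(doc_tokens)
    return {
        w: (doc_counts[w] + mu * coll_counts[w] / coll_len) / (doc_len + mu)
        for w in coll_counts
    }

virtual_doc = ["pirate", "film", "actor", "film"]
collection = virtual_doc + ["city", "thailand", "film"]
model = dirichlet_lm(virtual_doc, collection, mu=10.0)
```

Smoothing gives every vocabulary term a non-zero probability, which keeps a later KL-Divergence computation between two such models finite.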
In the second stage, conditional probability between each given pair of named entities is computed (124). In an embodiment, given a pair of entities, vi and vj, assuming the language models of vi and vj are Li and Lj respectively, on the basis of their KL-Divergence D(Lj|Li), the conditional probability may be computed as:
$$p(\mathrm{type}(v_j) = c_i \mid \mathrm{type}(v_i) = c_i) = e^{-D(L_j \mid L_i)}$$
where type(v_i) is the type of the entity v_i.
The Kullback-Leibler (KL) divergence is a fundamental quantity of information theory that measures how far apart two probability distributions are. KL-Divergence is always non-negative, and a larger KL-Divergence means a smaller conditional probability. When two language models are equal, the KL-Divergence takes its smallest value of 0 and the conditional probability takes its largest value of 1. As a result, the above equation is a good choice for converting KL-Divergence into conditional probability.
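The conversion from KL-Divergence to a same-type probability can be sketched as below. The sketch assumes that language models are dictionaries mapping terms to smoothed probabilities over a shared vocabulary, and that D(Lj|Li) denotes the divergence of Lj from Li; the argument order and the function names are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """D(p || q) for two distributions over the same vocabulary.

    q must be smoothed so that q[w] > 0 wherever p[w] > 0.
    """
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def same_type_probability(l_j, l_i):
    """exp(-D(Lj || Li)): equals 1 for identical models, and shrinks toward 0 as they diverge."""
    return math.exp(-kl_divergence(l_j, l_i))

l_i = {"film": 0.9, "city": 0.1}
l_j = {"film": 0.5, "city": 0.5}
weight = same_type_probability(l_j, l_i)
```

Because KL-Divergence is asymmetric, the weight from vi to vj generally differs from the weight from vj to vi, which is why the resulting graph has directed edges.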
In the third stage, the edges of a named entity (node) with other named entities (nodes) are established (126). This is done for each named entity. In an embodiment, a brute force method is used to establish the edges from a node to all the other nodes, and to assign the corresponding conditional probability to each edge as its weight. Each node in the named entity graph is a named entity, and each edge reflects the conditional probability of an end node (named entity) being of the same type as a start node (named entity).
Since such a method may result in a complex graph that prevents efficient computation, an empirically selected threshold value is used, and only edges with weights above this threshold are preserved.
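The brute force construction with thresholding can be sketched as a nested loop over ordered node pairs. The function `build_graph`, the callable `weight_of`, and the toy weights are hypothetical; in practice the weight would be the conditional probability derived from the language models.

```python
def build_graph(nodes, weight_of, threshold):
    """Return {start: {end: weight}} keeping only edges with weight > threshold.

    weight_of(start, end) is assumed to return the conditional probability
    of `end` being of the same type as `start`.
    """
    graph = {v: {} for v in nodes}
    for start in nodes:
        for end in nodes:
            if start == end:
                continue  # no self-loops
            w = weight_of(start, end)
            if w > threshold:
                graph[start][end] = w
    return graph

toy_weights = {
    ("a", "b"): 0.9, ("b", "a"): 0.7,
    ("a", "c"): 0.1, ("c", "a"): 0.2,
    ("b", "c"): 0.05, ("c", "b"): 0.6,
}
graph = build_graph(["a", "b", "c"], lambda s, e: toy_weights[(s, e)], threshold=0.5)
```

The loop evaluates the weight for every ordered pair, so its cost grows quadratically with the number of named entities.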
Simhash Based Model for Accelerating Graph Construction
Preserving only those edges whose weights exceed a threshold greatly simplifies the graph. However, calculating the KL-Divergence values between a named entity (node) and all the others is a time-consuming process. To speed up this process, in an embodiment, the method uses simhash to compute the similarities of the virtual documents and to filter out named entities (nodes) with lower similarities. The method is based on an observation: for three nodes (named entities) vi, vj and vm with virtual documents pi, pj and pm, let the simhash codes of these virtual pages be shi, shj and shm respectively. If the similarity of pm and pi is less than that of pm and pj, i.e., the Hamming distance between shm and shi is much larger than that between shm and shj, the KL-Divergence from vm to vi tends to be larger than that from vm to vj, and the conditional probability from vm to vi tends to be smaller than that from vm to vj. The simhash is therefore used to estimate the conditional probability in order to filter out low-weight edges in the entity graph, so that weights are computed only for the edges between similar nodes.
In an embodiment, a 64-bit simhash code is generated for each entity (node) based on its virtual document. Next, for each node, the Hamming distances between its simhash code and the simhash codes of all the other nodes are computed, and the nodes with Hamming distances greater than a predefined threshold are filtered out. Finally, a language model based method is used to compute the weights of the edges between a node and the remaining nodes.
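The filtering step can be sketched as follows. The sketch assumes term-frequency weights and a 64-bit BLAKE2b digest as the per-term hash; the embodiment does not specify either choice, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

def simhash64(tokens):
    """64-bit simhash of a token list, weighting each term by its frequency."""
    votes = [0] * 64
    for term, weight in Counter(tokens).items():
        digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += weight if (h >> bit) & 1 else -weight
    code = 0
    for bit in range(64):
        if votes[bit] > 0:
            code |= 1 << bit
    return code

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")

def candidate_neighbours(codes, node, max_distance):
    """Nodes whose simhash code lies within max_distance of the given node's code."""
    return [
        other for other, code in codes.items()
        if other != node and hamming(codes[node], code) <= max_distance
    ]
```

Only the nodes returned by `candidate_neighbours` would then need the more expensive language model based weight computation.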
In step 130, the seed entities set is expanded to include some related non-seed entities.
In step 140, a confidence propagation of the seed entities on the named entity graph is performed to predict the confidence that each non-seed entity is of the target type. The proposed method uses a novel algorithm to perform confidence propagation.
Given the expanded seed set S={(s1, c1), . . . , (si, ci), . . . , (sn, cn)}, where si and ci are the index and confidence of the ith seed in V respectively, and the constructed named entity graph G=<V, E> with the transition matrix T where
The following algorithm may be used to perform confidence propagation.
[Algorithm listing: data missing or illegible when filed.]
A confidence value Confi is obtained for every vi ∈ V after confidence propagation. The probability of vi being of the target type c* is measured using:
Depending upon the probability of each named entity, a predefined threshold may be used to determine whether it is of the target type.
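Since the filed algorithm listing is not available in this text, the following is only a generic illustration of propagating seed confidences over a weighted graph. It uses a random walk with restart, a swapped-in technique and not necessarily the algorithm of the embodiment. It assumes a transition matrix obtained by normalizing each node's outgoing edge weights, a damping factor `alpha`, and a fixed number of iterations `max_iter`; all names are hypothetical.

```python
def propagate(graph, seeds, alpha=0.85, max_iter=60):
    """Random walk with restart over {start: {end: weight}}.

    seeds maps seed nodes to their initial confidences. Every end node
    must also appear as a key of graph. Nodes without outgoing edges
    simply stop passing confidence on.
    """
    nodes = list(graph)
    total = sum(seeds.values())
    restart = {v: seeds.get(v, 0.0) / total for v in nodes}
    conf = dict(restart)
    for _ in range(max_iter):
        nxt = {v: (1 - alpha) * restart[v] for v in nodes}
        for u in nodes:
            out = sum(graph[u].values())
            if out == 0:
                continue
            for v, w in graph[u].items():
                # pass a share of u's confidence along each outgoing edge
                nxt[v] += alpha * conf[u] * w / out
        conf = nxt
    return conf

toy_graph = {"a": {"b": 0.9}, "b": {"a": 0.9}, "c": {"d": 0.8}, "d": {"c": 0.8}}
confidences = propagate(toy_graph, {"a": 1.0})
```

In this toy graph the seed "a" passes confidence only to "b", because no edge connects that pair to "c" and "d".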
The named entity graph 300 consists of eight entities. The eight entities are divided into three types marked with different shades of a color. The conditional probability between a given pair of named entities (nodes) is also shown. On this graph, given an expanded seed set S={(1, 1.0), (4, 0.85)}, and setting αB=0.85, and MB=60, the above described confidence propagation may be invoked to compute the named entity confidence vector
t*=(0.217,0.4346,0.1223,0.1801,0.0024,0.0011,0.0009,0.0001)
and the probability vector
p=(0.499,1,0.281,0.414,0.006,0.003,0.002,0.0002)
Using any threshold value between 0.006 and 0.281, the proposed method would be able to identify that the first four nodes are of the target type.
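The two vectors reported above are consistent, up to rounding, with dividing each confidence value by the largest confidence value. The following sketch checks that relationship and applies a threshold; the normalization is inferred from the reported numbers and is an assumption, and the function name `to_probabilities` is hypothetical.

```python
def to_probabilities(confidences):
    """Scale confidences so that the largest becomes 1 (assumed normalization)."""
    top = max(confidences)
    return [c / top for c in confidences]

t = [0.217, 0.4346, 0.1223, 0.1801, 0.0024, 0.0011, 0.0009, 0.0001]
p = to_probabilities(t)

# Any threshold between 0.006 and 0.281 separates the first four nodes.
threshold = 0.1
target_nodes = [i + 1 for i, prob in enumerate(p) if prob > threshold]
```

With a threshold of 0.1, `target_nodes` contains nodes 1 through 4, matching the statement above.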
The storage medium 420 (such as a hard disk) stores a number of programs including an operating system, application programs and other program modules. A user may enter commands and information into the computer system 400 through input devices, such as a keyboard 450, a touch pad (not shown) and a mouse 460. The monitor 440 is used to display textual and graphical information.
An operating system runs on processor 410 and is used to coordinate and provide control of various components within personal computer system 400 in
It would be appreciated that the hardware components depicted in
Further, the computer system 400 may be, for example, a desktop computer, a server computer, a laptop computer, or a wireless device such as a mobile phone, a personal digital assistant (PDA), a hand-held computer, etc.
The embodiment described provides an effective way of extracting named entities given a corpus of documents. Embodiments address the problem of extracting entities of any type from a general organization's web pages at minimum cost. The proposed weighted named entity graph is capable of encoding the complex relationships between the type of each named entity and the types of the others, so the propagation of seed confidences on the graph can make up for the lack of web-scale redundancy, and can support effective organization-scale extraction. Further, the confidence propagation on the named entity graph can be transformed into efficient matrix computation, which can support efficient extraction on a large-scale corpus.
It will be appreciated that the embodiments within the scope of the present invention may be implemented in the form of a computer program product including computer-executable instructions, such as program code, which may be run on any suitable computing environment in conjunction with a suitable operating system, such as, Microsoft Windows, Linux or UNIX operating system. Embodiments within the scope of the present invention may also include program products comprising computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, such computer-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium which can be used to carry or store desired program code in the form of computer-executable instructions and which can be accessed by a general purpose or special purpose computer.
It should be noted that the above-described embodiment of the present invention is for the purpose of illustration only. Although the invention has been described in conjunction with a specific embodiment thereof, those skilled in the art will appreciate that numerous modifications are possible without materially departing from the teachings and advantages of the subject matter described herein. Other substitutions, modifications and changes may be made without departing from the spirit of the present invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN10/72235 | 4/27/2010 | WO | 00 | 12/17/2012 |