Entity recognition is a common task in information processing. Entity recognition is typically performed on unstructured documents, such as text documents collected from the web. The entity recognition process seeks to identify named entities mentioned in the text. An entity may be anything with a name—e.g., a person, a city, a famous work of art, etc.
A typical entity recognizer uses a knowledge base of entities, and attempts to recognize those entities in a document that is being examined. The knowledge base contains a list of known entities, a canonical name for each entity (which distinguishes that entity from other entities in the knowledge base), and a set of one or more surface forms for each entity. The surface forms are the forms that are likely to be encountered in a document, and a given entity may have more than one surface form. For example, an entity might be the person whose name is “John Smith”. The canonical name for that entity might be “John Q. Smith, Jr.”, and the various surface forms of his name might be “John Smith”, “J. Smith”, “J. Q. Smith”, etc. Thus, an entity recognizer might look for these surface forms in the document. If one of these surface forms is observed in the document, the entity recognizer may declare that the entity “John Q. Smith, Jr.” has been observed in the document. Some sophisticated entity recognition techniques may take context into account when determining whether a match to one of the surface forms has been found (where context may refer to surrounding words, the title of the document, or any other information).
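As a minimal sketch of the knowledge-base lookup described above, the following Python fragment indexes surface forms and reports which canonical entity each observed form points to. The `Entity` class, the function names, and the naive substring search are illustrative assumptions, not part of the described system, and no context is taken into account.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A knowledge-base entry: one canonical name, one or more surface forms."""
    canonical_name: str
    surface_forms: set = field(default_factory=set)


def build_surface_index(entities):
    """Map each lowercased surface form to the canonical names that use it."""
    index = {}
    for entity in entities:
        for form in entity.surface_forms:
            index.setdefault(form.lower(), set()).add(entity.canonical_name)
    return index


def recognize(text, index):
    """Return sorted (surface form, canonical names) pairs found in the text.

    Assumption: a plain case-insensitive substring test stands in for
    whatever matching a real recognizer would use.
    """
    lowered = text.lower()
    hits = []
    for form, names in index.items():
        if form in lowered:
            hits.append((form, sorted(names)))
    return sorted(hits)


# Example data taken from the text above.
KB = [Entity("John Q. Smith, Jr.", {"John Smith", "J. Smith", "J. Q. Smith"})]
INDEX = build_surface_index(KB)
```

Calling `recognize("Yesterday J. Q. Smith spoke.", INDEX)` would report the surface form "j. q. smith" together with the canonical name "John Q. Smith, Jr.".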
One issue that arises in entity recognition is that of recognizing cultural entities, such as the names of movies, video games, books, etc. Person names and place names tend to have a distinctive lexicon—e.g., the word “Fred” generally has no meaning other than as a person's name. On the other hand, cultural entities generally have names that are ambiguous in the sense that they might refer to a cultural entity or might simply be words used in their normal sense. For example, the word “up” might refer to the name of a movie, the name of a video game based on the movie, a music album that is unrelated to either the movie or the video game—or might simply be used as an English adjective. Thus, identifying and disambiguating cultural entities presents a challenge.
Entities may be identified and disambiguated by using knowledge about the entities. Knowledge about cultural entities can be mined from existing resources. For example, there are databases of information about movies, books, video games, etc., from which concepts associated with the entity name can be gleaned. A movie has a set of characters, a set of actors, a genre, etc., and this information can be mined from existing resources. Similarly, video games have characters (and sometimes human actors) associated with them, and this information can be mined from existing resources. Using this information, a concept graph for an entity may be built. The concept graph contains entities (e.g., the name of a movie, the name of a character in the movie, the name of an actor in the movie, etc.), and the relationships between these entities. If an ambiguous term that might (or might not) refer to a cultural entity is encountered, that term can be compared to other entities that appear in a concept graph. If the ambiguous term refers to a particular cultural entity, then it is likely that other terms from the concept graph will appear in the ambiguous term's context. Additionally, words relating to a certain type of cultural entity might tend to appear near entities of that type. For example, “up” may be both a movie and a video game, but terms like “play,” “high score,” “Xbox,” etc., are more likely to appear near the word “up” when that term refers to the video game. In this way, it can be determined whether a given term refers to a cultural entity, and, if so, which type of cultural entity the term refers to.
Relationships in a concept graph can be measured to determine a degree of affinity, or relatedness, among concepts. The significance of a particular degree of relatedness can be determined using adaptive machine learning techniques. For example, concepts in a concept graph may be assigned affinity measures such as one, two, three, etc. The higher the affinity measure, the less related two concepts may be. Different types of measures of relatedness can be defined, and the different measures can be used with different disambiguation algorithms. Disambiguation may be performed by a parameterized classifier whose parameters specify how the relatedness of concepts in the concept graph affects the disambiguation decision. Machine learning techniques may be used to optimize the parameters in order to assign the appropriate significance to a given degree of relatedness among concepts.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Entity recognition is a process in which text is evaluated to identify and classify atomic elements. For example, the phrase “John Smith” might refer to a specific person. An entity recognition process may detect the presence of that phrase in a text, and may recognize that the phrase refers to a specific person.
In the simplest examples of entity recognition, a specific phrase unambiguously identifies a specific entity. In such an example, “John Smith” would refer to a specific person, and not to any other person or entity. In reality, entity detection is rarely this simple. A given person's name may have several different surface forms—e.g., “John Smith”, “John Q. Smith”, and “Johnny Smith” all may refer to the same person. Or, the same phrase may refer to different entities—e.g., there may be several people named “John Smith”, in which case the phrase “John Smith”, when detected in a text, has an ambiguous meaning. Various techniques have been devised to help to disambiguate entities.
One vexing problem in entity recognition is disambiguation of cultural entities. Cultural entities are entities whose meaning arises from popular culture, such as the titles of movies, books, video games, etc. One problem that arises is that, in some cases, cultural entities lack distinctness, which makes them difficult to distinguish from ordinary words. For example, in 2009 a movie named “Up” was released. However, “up” is a common English word. It is easy to use standard pattern matching techniques to detect the presence of the word “up” in a text. It is more difficult to determine whether that word is being used in its normal English sense, or as the title of a movie. Another problem that arises is that the same name may refer to several different entities. For example, the phrase “The Lord of the Rings” refers to a set of books, a set of movies, a set of video games, and various other products. Merely recognizing the phrase “The Lord of the Rings” in a text does not unambiguously identify which entity is being referenced.
The subject matter described herein provides a way to extract cultural entities from text. The techniques herein may be used to extract any type of cultural entities (entities related to movies, books, video games, music, television, etc.) from any type of text. These techniques use contextual clues to determine whether a particular phrase refers to a cultural entity, and what type of entity the phrase refers to. Information concerning cultural entities may be mined from readily available data sources, and the mined information may be used to recognize entities. Databases of movies are available on the web. These databases could be used to identify the titles of movies, as well as the names of actors and characters in the movies, the genre of the movie, etc. For example, the movie “Up” has characters named Russell and Carl. If the word “up” appears near these names, that fact suggests that the word “up” is referring to the title of a movie rather than an ordinary English adjective. A name like “The Lord of the Rings” is highly distinctive, and it is unlikely that this phrase would refer to anything other than a cultural entity. Determining whether it refers to a book, a movie, a video game, etc., is more challenging, but context can be used to make that determination. For example, if the phrase “The Lord of the Rings” occurs in proximity to words that suggest video games (e.g., “play”, “scores”, “Xbox”, etc.), this fact suggests that the phrase refers to a video game. Other phrases (e.g., “film,” “academy award,” “theater,” “rated PG,” etc.) may suggest that “The Lord of the Rings” refers to a movie.
Various algorithms described herein may be used to determine when a word or phrase refers to a cultural entity, and also to determine which entity the word or phrase refers to when different types of cultural entities have the same name. Additionally, machine learning techniques may be used to tune the algorithms in order to affect the way that they use information about cultural entities to disambiguate words or phrases.
Since the techniques described herein can work with any type of semantic resource, these techniques may provide the following aspects:
Turning now to the drawings,
Concept graph 200 provides a simple example of one way to model a particular type of cultural entity. However, this example shows that a cultural entity may be described both by its name (“Up”, in this example), as well as by its relationship to other entities (e.g., characters, actors, ratings, genres, etc.).
Returning now to
At 108, the context of a candidate entity is examined to determine whether it contains other entities that appear in the candidate entity's concept graph. Each node in the graph defines an entity that can be recognized in a document. In the example of
What follows is a description of the particular way(s) that entities in the concept graph—as well as other information—are used to disambiguate candidates. Using the techniques described below, it can be determined whether a candidate refers to a cultural entity, and which cultural entity it refers to. For example, techniques that follow may be used to determine whether the word “up” in a document refers to an ordinary word or a cultural entity. If it is found to refer to a cultural entity, these techniques may be used to determine which cultural entity it refers to. For example, the techniques described herein may be used to determine whether the word “up” refers to a movie by that name, a video game based on the movie, a 2002 Peter Gabriel musical album by that name, or just the English adjective “up”.
In order to understand how to recognize and disambiguate cultural entities, consider the following example. Suppose one is looking for references to video games. An entity extractor that is examining a document may see the word “Black,” which is known to be identical to the name of a video game, although that word could refer to a large number of things other than the video game of that name. Since the nature of the observed use of the word “Black” is ambiguous, it is a candidate in the sense that it might refer to a video game. However, it is known that video games are things of a certain type, and that certain actions (e.g., play, buy, win, lose, etc.) are associated with things of that type. Therefore, if actions such as win, lose, etc., are mentioned somewhere near the word “Black” (or, more generally, in the context of that word), then the word “Black” is more likely to be a mention of a game than if those actions had not appeared near the word “Black.” Likewise, other facts may be present that suggest that the word “Black” refers to a video game of that name. Video games tend to be purchased at certain stores with distinctive names (e.g., “GameStop”, “EB Games”, etc.), tend to be played on specific consoles (e.g., “Xbox”, “PS3”, etc.), and tend to be discussed on specific web sites devoted to video games. Thus, if this type of information appears in the context of the word “Black”, this fact increases the probability that the word “Black” refers to a video game instead of referring to something else. Information such as the consoles on which games are played, stores in which they are sold, the names of video game blogs, actions associated with video games, and other information can be mined from an appropriate semantic resource, such as a Wikipedia article on video games.
Additionally, there are semantic resources from which concepts relating specifically to the “Black” video game can be mined (e.g., the names of characters or places that appear in the game), and the presence of those concepts in the context of the word “black” may suggest that an instance of the word “black” refers to the video game of that name.
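The idea of type-indicative context can be illustrated with a short Python sketch that counts cue words near each occurrence of a candidate. The cue lists below reuse only the single-word examples given in this description; the names `TYPE_CUES` and `cue_counts`, the window radius, and the plain counting scheme are assumptions made for illustration.

```python
import re

# Illustrative cue words per entity type, drawn from the examples in the text.
TYPE_CUES = {
    "video_game": {"play", "buy", "win", "lose", "xbox", "ps3", "gamestop"},
    "movie": {"film", "theater"},
}


def context_window(tokens, position, radius):
    """Tokens within `radius` positions of `position`, excluding the token itself."""
    start = max(0, position - radius)
    return tokens[start:position] + tokens[position + 1:position + 1 + radius]


def cue_counts(text, candidate, radius=10):
    """Count type cues appearing near each occurrence of a one-word candidate."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = {type_name: 0 for type_name in TYPE_CUES}
    for position, token in enumerate(tokens):
        if token != candidate.lower():
            continue
        window = context_window(tokens, position, radius)
        for type_name, cues in TYPE_CUES.items():
            counts[type_name] += sum(1 for word in window if word in cues)
    return counts
```

A higher count for one type than another is only a hint; a real system would presumably weight the cues rather than count them equally.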
Semantic resources, such as the Wikipedia pages or other web pages mentioned above, may be mined in order to build a concept graph.
The following is a description of how graphs that have been built may be used to recognize cultural entities. Let the knowledge about concepts in selected domains be defined by an ontology comprising the set C of concepts, the set R of relations (each relation being defined over two concepts), and the set A of attributes (each attribute being defined over a concept). The ontology may be represented as a directed acyclic graph (DAG), with concepts denoted by nodes in the graph and relations denoted by edges relating one concept to another. Nodes in the graph are the entities for extraction, each associated with a weight α, where 0 ≤ α ≤ 1, and where α is a measure of distinctiveness of the concept in reference to the ontology and in reference to other objects in the world. For example, the word “they” may be the name of a cultural entity, but it also appears frequently as an ordinary English pronoun. Therefore, the word “they” is a highly ambiguous cultural reference, so such a word could be assigned a very low α value. On the other hand, the word “Xbox” is rarely used to refer to anything other than a video game console, which makes it a very unambiguous cultural reference. Therefore, “Xbox” could be assigned a high α value.
Let “-” be a binary operator that is applied to two nodes and returns the minimum number of edges in sequence connecting the nodes. For example, if ci and cj are nodes, then ci−cj=n, where n is the minimum number of edges that one would have to follow to travel from ci to cj. For every pair of concepts ci and cj, one may compute the “degree of affinity,” affin(ci, cj), representing the degree of relatedness. There are two such types of affinity, defined by equations (1) and (2):
affin1(ci,cj)=ci−cj if such exists (1)
affin2(ci,cj)=lca
In equation (2),
Equations (1) and (2) represent two notions of affinity between concepts in a graph. These different concepts of affinity are used in two algorithms described below. Intuitively, equation (1) is a simple distance between concepts, based on the number of nodes that one has to pass through to get from concept ci to concept cj—i.e., the number of edges that would be traversed on a path between concepts ci and cj. Equation (2), on the other hand, places significance on specific kinds of relations that have the capacity to indicate strong relatedness to other concepts. For example, relations of the form “type of” (concept ci is a type of concept cj), or “part of” (concept ci is a part of concept cj) tend to indicate a particular type of relatedness among concepts beyond the mere proximity that is measured by equation (1).
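The distance-style affinity of equation (1) can be sketched with a breadth-first search. This is a minimal illustration under stated assumptions: edge direction is ignored when counting hops, the function name `affin1` and the edge-list representation are chosen for the example, and the small graph (including the placeholder node "actor_of_carl") is hypothetical. Equation (2) is not sketched here.

```python
from collections import deque


def affin1(edges, source, target):
    """Minimum number of edges between two nodes, or None if no path exists."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)  # assumption: direction is ignored
    if source not in adjacency or target not in adjacency:
        return None
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return None


# Hypothetical fragment of a concept graph for the movie "Up".
EDGES = [
    ("Up", "Carl"),
    ("Up", "Russell"),
    ("Carl", "actor_of_carl"),
]
```

Returning `None` for unconnected nodes mirrors the "if such exists" condition in equation (1).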
In
In order to apply equation (2), first level affinity to the “X” node is initially determined by identifying those nodes that can be reached from “X” in one hop. Observing the direction of the arrows, the only three nodes that can be reached from “X” in one hop are the three nodes that are marked with a “1”. Other nodes are then assigned affinities greater than 1 as follows. A node that can reach the “X” node through a single directed edge in
These different affinity measures may be used in disambiguating candidate entities. For example, if a candidate entity is near another entity whose affinity in a particular graph is one, that fact may strongly indicate that the candidate entity is the cultural entity that the graph describes. If the candidate entity is near another entity whose affinity is two, this fact may also indicate that the candidate entity is the cultural entity described in the graph—although the presence of an affinity two entity does not suggest the identity of the candidate as strongly as an affinity one entity does.
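One way to picture how lower affinity values count as stronger evidence is to weight each graph entity found in the context by the reciprocal of its affinity. The reciprocal weighting and the name `affinity_evidence` are assumptions made purely for illustration; the description does not prescribe a particular weighting, and the entity names in the usage note are placeholders.

```python
def affinity_evidence(affinities, context_entities):
    """Sum 1/affinity over graph entities observed in the candidate's context.

    `affinities` maps entity name -> affinity to the candidate (1, 2, 3, ...).
    Entities that are not in the graph contribute nothing.
    """
    total = 0.0
    for entity in context_entities:
        affinity = affinities.get(entity)
        if affinity:  # skips entities missing from the graph and zero values
            total += 1.0 / affinity
    return total
```

With `{"Carl": 1, "studio": 2}` as the affinity map, a context containing only "Carl" scores 1.0 while a context containing only "studio" scores 0.5, which reflects the weaker indication given by an affinity-two entity.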
In order to use a concept graph to recognize cultural entities in a document, the document is examined using an n-gram sliding window procedure to obtain partially matching candidate sections in the document. The system may consider partial matches in order to account for different surface representations of the same concept. For example, the canonical name for an entity might be “The Lord of the Rings”, although the partial match “Lord of the Rings” might be accepted as a candidate.
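The sliding-window step might look like the following Python sketch, which enumerates n-grams up to the length of the canonical name and keeps those that are sufficiently similar to it. The use of `difflib.SequenceMatcher` as the similarity measure, the 0.8 threshold, and the function names are assumptions for illustration only.

```python
import re
from difflib import SequenceMatcher


def ngrams(tokens, max_n):
    """Yield (start index, token list) for every n-gram with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield start, tokens[start:start + n]


def candidate_sections(text, canonical_name, threshold=0.8):
    """Return (start, section, similarity) for sections that partially match."""
    tokens = re.findall(r"\w+", text)
    target = canonical_name.lower()
    max_n = len(canonical_name.split())
    found = []
    for start, gram in ngrams(tokens, max_n):
        section = " ".join(gram)
        similarity = SequenceMatcher(None, section.lower(), target).ratio()
        if similarity >= threshold:
            found.append((start, section, similarity))
    return found
```

For the text "I reread Lord of the Rings last year" and the canonical name "The Lord of the Rings", the section "Lord of the Rings" starting at token 2 would be among the accepted candidates.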
In order to effectively support a wide range of cultural entities in a non-scoped environment (i.e., when the entities mentioned in text have no domain constraints), a system first attempts to distinguish between candidates mentioned in reference to existing knowledge and candidates referencing other objects in the world. For example, a text section might mention “The tenant”, and a system may attempt to determine if these words refer to a movie of that name, or to a person who rents an apartment. One way to perform this recognition is to learn a prediction model that relies on semantic information within the context as an indicator. The prediction model uses features corresponding to three dimensions: an estimate of the distinctiveness of a candidate entity (e.g., the α value mentioned above), the similarity between a candidate section in text and the corresponding entity in the graph (via string similarity matching), and the degree of semantic support derived from entities in the graph that are present in the context of the candidate.
Retrieval of related concepts from the concept graph can be vulnerable to varying degrees of modeling sparseness. For example, different concepts and their relationships may be defined with different degrees of detail. To address this issue, we also consider an adaptive scheme in which a favorable neighborhood distance for a set of concepts is computed based on classification feedback. In other words, we have a classifier that responds to input from the concept graph as well as a neighborhood distance, and whose performance is used to identify a constructive neighborhood for the set of concepts.
More formally, we have a feature space X, a binary target space Y={−1, +1} and a set of training examples (xi, yi) | xi ∈ Xi, yi ∈ Y, i=1, . . . , N, produced once for concepts in a multi-domain ontology. Let the neighborhood distance
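The three feature dimensions could be assembled into a vector along the following lines. This is a sketch under stated assumptions: string similarity is approximated with `difflib.SequenceMatcher`, semantic support is taken to be the fraction of graph entities found in the context, and the function and parameter names are hypothetical. The description does not fix any of these choices.

```python
from difflib import SequenceMatcher


def candidate_features(section, entity_name, alpha, graph_entities, context_tokens):
    """Return [distinctiveness, string similarity, semantic support]."""
    # Dimension 2: similarity between the text section and the entity name.
    similarity = SequenceMatcher(None, section.lower(), entity_name.lower()).ratio()
    # Dimension 3: share of the entity's graph neighbors that occur in context
    # (an assumed definition of "semantic support").
    context = {token.lower() for token in context_tokens}
    present = [e for e in graph_entities if e.lower() in context]
    support = len(present) / len(graph_entities) if graph_entities else 0.0
    # Dimension 1: alpha is supplied by the caller as the distinctiveness weight.
    return [alpha, similarity, support]
```

A vector such as this would then be passed to whatever prediction model is trained to separate cultural references from ordinary uses of the same words.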
The following is an example of how disambiguation may be performed using information contained in concept graphs. Consider, for example, the text section “The Lord of the Rings”, which may refer to, say, twelve different cultural entities (e.g., several movies, several video games, several books, etc.). In order to disambiguate this candidate, the following approaches may be used.
The first approach (referred to herein as “Disambiguation I”) emphasizes heuristics dealing with the particular arrangement and characteristics of the ambiguous sections: for equally supported entities it favors the entities more similar to a section, and of those it favors a candidate associated with a longer section. The second approach (referred to herein as “Disambiguation II”) makes use of the notion of distance, both in the document and the concept graph. More distant nodes in the graph are considered less related, as are more distant pieces of supporting evidence within the text.
Disambiguation I works as follows. Let Ni be the set of entities in the neighborhood of entity ci in the concept graph, simi the similarity between the section and ci, secSizei the section size referring to ci, and the set A={i . . . k . . . j} the conflicting candidates.
Define support for entity as
Let B={ . . . m . . . }⊂A define the set of elements that satisfy max(simm)±δsim, where δsup and δsim are small fudge values. Return an entity ci from the set C that maximizes secSizei.
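The cascade of preferences in Disambiguation I (support first, then similarity, then section size) can be sketched as three successive filters. In this illustration the support value of each candidate is assumed to be precomputed and passed in, because its definition is not reproduced here; the dictionary keys, the function name, and the default tolerance values are likewise assumptions.

```python
def disambiguation_one(candidates, delta_sup=0.05, delta_sim=0.05):
    """Pick one entity name from conflicting candidates.

    Each candidate is a dict with keys:
      name      - entity identifier
      support   - precomputed support value (definition assumed external)
      sim       - similarity between the text section and the entity
      sec_size  - size of the text section referring to the entity
    """
    # Step 1: keep candidates whose support is within delta_sup of the maximum.
    best_support = max(c["support"] for c in candidates)
    supported = [c for c in candidates if c["support"] >= best_support - delta_sup]
    # Step 2: of those, keep candidates within delta_sim of the best similarity.
    best_sim = max(c["sim"] for c in supported)
    similar = [c for c in supported if c["sim"] >= best_sim - delta_sim]
    # Step 3: of those, return the candidate with the longest section.
    return max(similar, key=lambda c: c["sec_size"])["name"]
```

The tolerances let near-ties pass to the next criterion, so section size only decides among candidates that are roughly equal on support and similarity.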
Disambiguation II works as follows. Define the distance di,j0 between two entities ci and cj in a graph as follows:
where
Then return ci that maximizes Σj≠idi,j0dj→it.
Computer 700 includes one or more processors 702 and one or more data remembrance components 704. Processor(s) 702 are typically microprocessors, such as those found in a personal desktop or laptop computer, a server, a handheld computer, or another kind of computing device. Data remembrance component(s) 704 are components that are capable of storing data for either the short or long term. Examples of data remembrance component(s) 704 include hard disks, removable disks (including optical and magnetic disks), volatile and non-volatile random-access memory (RAM), read-only memory (ROM), flash memory, magnetic tape, etc. Data remembrance component(s) are examples of computer-readable storage media. Computer 700 may comprise, or be associated with, display 712, which may be a cathode ray tube (CRT) monitor, a liquid crystal display (LCD) monitor, or any other type of monitor.
Software may be stored in the data remembrance component(s) 704, and may execute on the one or more processor(s) 702. An example of such software is cultural entity extraction software 706, which may implement some or all of the functionality described above in connection with
The subject matter described herein can be implemented as software that is stored in one or more of the data remembrance component(s) 704 and that executes on one or more of the processor(s) 702. As another example, the subject matter can be implemented as instructions that are stored on one or more computer-readable storage media. (Tangible media, such as optical disks or magnetic disks, are examples of storage media.) Such instructions, when executed by a computer or other machine, may cause the computer or other machine to perform one or more acts of a method. The instructions to perform the acts could be stored on one medium, or could be spread out across plural media, so that the instructions might appear collectively on the one or more computer-readable storage media, regardless of whether all of the instructions happen to be on the same medium.
Additionally, any acts described herein (whether or not shown in a diagram) may be performed by a processor (e.g., one or more of processors 702) as part of a method. Thus, if the acts A, B, and C are described herein, then a method may be performed that comprises the acts of A, B, and C. Moreover, if the acts of A, B, and C are described herein, then a method may be performed that comprises using a processor to perform the acts of A, B, and C.
In one example environment, computer 700 may be communicatively connected to one or more other devices through network 708. Computer 710, which may be similar in structure to computer 700, is an example of a device that can be connected to computer 700, although other types of devices may also be so connected.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.