This document relates generally to identifying factual information and more particularly to computer implemented systems and methods for identifying factual information in a written document.
Automated scoring of essays involves evaluating various aspects of the essay itself, including grammar, usage, mechanics, organization and substantive content. For assessment of content, the focus has traditionally been on the topical appropriateness of the vocabulary. Recently, other aspects such as detection of sentiment or figurative language have also been considered. Although it is well known that a misleading premise, insufficient factual basis or an example that contradicts the reader's knowledge all detract from the quality of an essay, the effect that factual information in an essay has on the overall quality of the essay has not been addressed. It is believed that the use of factual information in an essay is correlated with the overall quality of the essay. Accordingly, identification and verification of factual information is important in a variety of contexts, including the scoring of essays and the like.
In accordance with the teachings herein, systems and methods are provided for identifying factual information in a written document. For example, a computer implemented method for identifying factual information in a written document may include identifying one or more named entities in the written document and identifying one or more noun phrases in the written document that are associated with one or more corresponding named entities. Using a named entity and a noun phrase, at least one query may be built by combining the named entity with a respective noun phrase. Since it is unknown at this stage whether the query corresponds to a fact, the query corresponds to an assertion, i.e., a statement that is believed to correspond to a fact but has not been verified as a fact. The query is submitted for comparison with a fact repository and an assessment is made as to whether the query presents a factual assertion, i.e., whether the assertion represented in the query is present in the fact repository. If it is present, a match is returned.
As another example, a system for identifying factual information in a written document may include one or more data processors and one or more computer readable mediums encoded with instructions for commanding the one or more data processors to perform processing steps. In the steps, one or more named entities in the written document may be identified and one or more noun phrases in the written document that are associated with one or more corresponding named entities may also be identified. Using a named entity and a noun phrase, at least one query may be built by combining the named entity with a respective noun phrase. Since it is unknown at this stage whether the query corresponds to a fact, the query corresponds to an assertion, i.e., a statement that is believed to correspond to a fact but has not been verified as a fact. The query is submitted for comparison with a fact repository and an assessment is made as to whether the query presents a factual assertion, i.e., whether the assertion represented in the query is present in the fact repository. If it is present, a match is returned.
As a further example, a computer readable medium may be encoded with instructions for commanding one or more data processors to perform processing steps. In the steps, one or more named entities in the written document may be identified and one or more noun phrases in the written document that are associated with one or more corresponding named entities may also be identified. Using a named entity and a noun phrase, at least one query may be built by combining the named entity with a respective noun phrase. Since it is unknown at this stage whether the query corresponds to a fact, the query corresponds to an assertion, i.e., a statement that is believed to correspond to a fact but has not been verified as a fact. The query is submitted for comparison with a fact repository and an assessment is made as to whether the query presents a factual assertion, i.e., whether the assertion represented in the query is present in the fact repository. If it is present, a match is returned.
In still further examples, the noun phrase may be identified from the same sentence as the corresponding named entity and/or the noun phrase may be identified from a sentence neighboring the one that contains the named entity. For example, the noun phrase may be identified from a neighboring sentence if the corresponding named entity is a person and the neighboring sentence, from which the noun phrase is identified, includes at least one of an appropriate personal pronoun or a portion of the named entity.
In still further examples, the noun phrases may be identified using a dependency path of sentence structure. For example, the dependency path may be an upward step followed by between one and four downward steps (e.g., 1, 2, 3, or 4 downward steps).
In still further examples, the process may further comprise building variants of the query. For example, the variant of the query may be constructed by modifying the noun phrase. For example, a variant may be created by the removal of determiners and/or pre-modifiers from the noun phrase. A variant may be created by modifying the noun phrase to only include a sequence of nouns ending with the head noun. Another variant may be a noun phrase that is modified such that it comprises only the word from the identified noun phrase that has the lowest frequency of occurrence. A further example of a variant includes a noun phrase that is modified such that it comprises only the rightmost capitalized word of the identified noun phrase, if the identified noun phrase includes capitalized parts.
In still further examples, the process may further comprise filtering matches to eliminate undesired matches. For example, the match may be filtered if the matched noun phrase in the fact repository comprises modal or hedged predicates. Additionally, the match may be filtered if the named entity or the noun phrase in the fact repository is more specific than the named entity or the noun phrase in the query. In a further example, the match may be filtered if any of a plurality of conditions is met. Such conditions may include, for example: (i) if a capitalized word follows the named entity or noun phrase in the fact repository but is not present in the portion of the written document from which the named entity or noun phrase is identified; (ii) if more than one capitalized or rare word precedes the named entity or noun phrase in the fact repository, those words are not present in the portion of the written document from which the named entity or noun phrase is identified, and those words are not honorifics; (iii) if the named entity or noun phrase in the fact repository is longer than eight words; or (iv) if more than three words follow the named entity or noun phrase in the fact repository. Additionally, the match may be filtered if the ratio of negative to positive predicates among a plurality of matches is greater than a predetermined threshold.
As discussed above, identification and verification of factual information may be important in a variety of contexts including the scoring of essays and the like. A fact can be understood in a number of different manners. For example, in the context of argumentation (e.g., an argumentative essay) the notion of a fact may be characterized as data which is common to several beings and for which there is agreement as to the correctness of that data. In some examples, a fact can be distinguished from a presumption, which may be a statement about what is normal and/or likely. In particular, this distinction in the scope of required agreement may be related to the referential device used in a particular statement. If the reference is more rigid, that is, less prone to change in time and to indeterminacy of the boundaries, the scope of necessary agreement is likely to be more precise. For example, statements made in connection with proper names may be more rigid than others (e.g., “Barack Obama” selects for one, and the same, person in 2010 and 1990 but “current U.S. president” selects for different people at different times).
In addition to identification of facts, it is also important to be able to verify that the identified statements are actually true. As discussed throughout this disclosure, the identified statements may be compared against a fact repository. For example, the fact repository may be an encyclopedia, the World Wide Web, a repository based on Open Information Extraction (OIE), and/or the TextRunner system.
In addition to identifying a named entity 110, the process continues with the identification of a corresponding noun phrase (NP) 115. A noun phrase is generally a word or phrase which includes a noun and the modifiers which distinguish it. Selection of the noun phrase may be based on, for example, a grammar-based approach. For example, the noun phrase may be identified using a dependency path. In an example, the dependency paths may be obtained from the Stanford Dependency Parser. In particular, the dependency path may be an upward step followed by between one and four downward steps. For example, it is believed that the most prolific family of paths starts with an upward step followed by between one and four downward steps. The first upward step may connect the named entity to the predicate of which it is an argument. The downward step(s) may connect the predicate to the head of another argument (e.g., noun phrase) or to an argument's head's modifier. Some examples of statements with different dependency paths include: “a Nobel Prize in a science field” (one downward step); “Chaucer, in the 14th century . . . ” (one downward step); “the prestige of the Nobel Prize” (one upward step); “Kidman's talent” (one upward step); “Kroemer received the Nobel Prize” (one upward step followed by one downward step); and “Kroemer received the Nobel Prize for his work on the Heterojunction Bipolar Transistor” (one upward step followed by two downward steps).
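By way of a non-limiting illustration, the path family described above (one upward step followed by between one and four downward steps) may be sketched in Python as a bounded walk over a dependency tree. The function name noun_phrase_candidates, the head-map representation, and the hand-written tree below are assumptions made for illustration only; they are not output of the Stanford Dependency Parser, and repeated words in a sentence would need unique token identifiers in practice.

```python
from collections import defaultdict

def noun_phrase_candidates(heads, entity, max_down=4):
    """Tokens reached from `entity` by one upward step, then 1..max_down downward steps.

    `heads` maps each token to its head token (None for the root).
    Returns a list of (token, number_of_downward_steps) pairs.
    """
    children = defaultdict(list)
    for token, head in heads.items():
        if head is not None:
            children[head].append(token)

    predicate = heads.get(entity)  # the single upward step
    if predicate is None:
        return []

    found = []
    frontier = [predicate]
    for depth in range(1, max_down + 1):
        next_frontier = []
        for node in frontier:
            for child in children[node]:
                if child == entity:
                    continue  # do not walk back down to the entity itself
                found.append((child, depth))
                next_frontier.append(child)
        frontier = next_frontier
    return found

# Hand-written (hypothetical) tree for:
# "Kroemer received the Nobel Prize for his work on the Heterojunction Bipolar Transistor"
HEADS = {
    "received": None,
    "Kroemer": "received",
    "Prize": "received",
    "the": "Prize",
    "Nobel": "Prize",
    "work": "received",
    "his": "work",
    "Transistor": "work",
    "Heterojunction": "Transistor",
    "Bipolar": "Transistor",
}

paths = dict(noun_phrase_candidates(HEADS, "Kroemer"))
```

In this sketch, "Prize" is reached with one downward step and "Transistor" with two, mirroring the two Kroemer examples above; the max_down parameter bounds the walk at four downward steps.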
In an example, the noun phrase may be contained within the same sentence as the corresponding named entity or it may be located in a neighboring sentence to the one with the named entity. For example, the noun phrase may be identified from a neighboring sentence if the corresponding named entity is a person and/or the neighboring sentence includes at least one of an appropriate personal pronoun and/or a portion of the named entity (e.g., just a last name of a person). In an example, the process may confirm that the gender of the pronoun matches that of the named entity and/or if the gender of the named entity cannot be confirmed, the process may not expand identification of the noun phrase into a neighboring sentence.
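As a hedged illustration of the neighboring-sentence condition, the following Python sketch shows one possible reading: expansion is allowed only for person entities, and only when the neighboring sentence contains a portion of the name or a pronoun matching a confirmed gender. The function name may_use_neighbor, the PRONOUNS table, and the entity-type and gender labels are hypothetical and would in practice come from a named entity recognizer or similar component.

```python
import re

# Illustrative pronoun table; a real system may use a richer resource.
PRONOUNS = {
    "female": {"she", "her", "hers"},
    "male": {"he", "him", "his"},
}

def may_use_neighbor(entity, entity_type, gender, neighbor_sentence):
    """Return True if a noun phrase may be taken from the neighboring sentence."""
    if entity_type != "PERSON":
        return False
    words = re.findall(r"[A-Za-z]+", neighbor_sentence)
    # A portion of the named entity (e.g., just the last name) is present.
    if any(part in words for part in entity.split()):
        return True
    # Otherwise require a pronoun that matches a confirmed gender; if the
    # gender is unknown (None), no pronoun-based expansion is made.
    lowered = {w.lower() for w in words}
    return bool(PRONOUNS.get(gender, set()) & lowered)
```

For instance, under these assumptions a call such as may_use_neighbor("Nicole Kidman", "PERSON", "female", "She won an award.") permits expansion, whereas the same call with an unconfirmed gender does not.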
In an example, the written document from which the named entity and noun phrase are identified is a test taker's essay, and the identification of factual information may be utilized in scoring that essay.
The named entity and the noun phrase are used to build a query 120. For example, the query may be structured as a 3-tuple query. For example, the structure of the query may be <NE, ?, NP>. In examples, the “?” may be the predicate that links the named entity with the noun phrase.
The query is submitted for comparison to a fact repository 125. For example, the fact repository may be an encyclopedia, the World Wide Web, a repository based on Open Information Extraction (OIE), and/or the TextRunner system. The comparison of the query with the fact repository assesses whether the query presents a factual assertion 130. In particular, the query is built with the belief that the assertion is factual but it is unknown whether the assertion is actually true. By comparing the query to the fact repository, the process determines whether there is a match within a data set that is believed to contain facts. If the query does match corresponding information within the fact repository, a match is returned 135. For example, the match may require that the fact repository contain a named entity and a noun phrase corresponding to the ones in the query. In another example, the named entity may need to be contained within the fact repository but the noun phrase may not need to be exactly present. In another example, neither the named entity nor the noun phrase in the fact repository would need to exactly match the query as long as some predetermined criterion is met. In yet another example, the predicate in the query may or may not need to be matched.
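As a minimal sketch of the query-building and matching steps, the following Python example represents the <NE, ?, NP> query as a 3-tuple and compares it against a small in-memory list of facts. The names Query, build_query, and find_matches, the toy repository contents, and the case-insensitive substring criterion are all assumptions for illustration; an actual fact repository such as an OIE-based system would be queried through its own interface and matching criteria.

```python
from typing import List, NamedTuple, Optional, Tuple

class Query(NamedTuple):
    named_entity: str
    predicate: Optional[str]  # None plays the role of "?" in <NE, ?, NP>
    noun_phrase: str

def build_query(named_entity: str, noun_phrase: str) -> Query:
    """Combine a named entity with a noun phrase, leaving the predicate open."""
    return Query(named_entity, None, noun_phrase)

def find_matches(query: Query, repository: List[Tuple[str, str, str]]):
    """Return facts whose arguments contain the query's entity and noun phrase.

    Case-insensitive substring containment is only one possible criterion.
    """
    ne = query.named_entity.lower()
    np_ = query.noun_phrase.lower()
    matches = []
    for subject, predicate, obj in repository:
        if ne not in subject.lower() or np_ not in obj.lower():
            continue
        if query.predicate is not None and query.predicate.lower() != predicate.lower():
            continue
        matches.append((subject, predicate, obj))
    return matches

# Toy repository (hypothetical contents drawn from the examples in this document).
REPOSITORY = [
    ("Kroemer", "received", "the Nobel Prize"),
    ("Kroemer", "worked on", "the Heterojunction Bipolar Transistor"),
]

query = build_query("Kroemer", "Nobel Prize")
matches = find_matches(query, REPOSITORY)
```

Leaving the predicate as None reflects the example in which the predicate may not need to be matched; supplying a predicate restricts the match accordingly.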
After completing the matching process for the identified named entity and corresponding noun phrase, the process determines whether there are any additional named entities and/or noun phrases 140. If there are, the process begins again and if there are not, the process terminates 145.
As illustrated in
In another example, the noun phrase can be modified to create a query variant that comprises a sequence of nouns ending with the head noun 220. For example, using the same example above, the noun phrase may be modified to “photograph.”
In another example, the noun phrase can be modified to create a query variant that comprises only the word from the noun phrase that has the lowest frequency of occurrence. For example, capitalized words may be given the lowest frequency, so that if the noun phrase contains any capitalized word the variant might contain the leftmost capitalized word (e.g., the first capitalized word), or, if an out-of-vocabulary word is present in the noun phrase, the out-of-vocabulary word. Accordingly, in an example, if the noun phrase contains a name, the name may be split such that only the first name is taken in the variant. For example, in the noun phrase “that author Orhan Phamuk” the variant noun phrase may be “Orhan.” If no capitalized word exists, the variant may simply select the rarest word from within the phrase. For example, if the noun phrase is “category 3 hurricane” the variant noun phrase may be “hurricane.”
In another example, the noun phrase can be modified to create a query variant that comprises only the rightmost capitalized word, if the noun phrase includes capitalized parts. For example, if the noun phrase was “the actress Nicole Kidman” the variant noun phrase would be “Kidman.” This variant may serve to select last names as a potential complement to the variant discussed above which potentially selects only first names.
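The lowest-frequency and rightmost-capitalized variants may be illustrated with the following Python sketch. The function names and the frequency table are hypothetical; in particular, the counts in FREQUENCY are made-up values chosen only to exercise the code, and whitespace tokenization is a simplification.

```python
def rarest_word_variant(noun_phrase, frequency):
    """Pick the word treated as having the lowest frequency of occurrence.

    Capitalized words are treated as rarest (leftmost one wins), then
    out-of-vocabulary words, then the word with the smallest count.
    """
    words = noun_phrase.split()
    if not words:
        return None
    capitalized = [w for w in words if w[0].isupper()]
    if capitalized:
        return capitalized[0]
    out_of_vocabulary = [w for w in words if w.lower() not in frequency]
    if out_of_vocabulary:
        return out_of_vocabulary[0]
    return min(words, key=lambda w: frequency[w.lower()])

def rightmost_capitalized_variant(noun_phrase):
    """Pick the rightmost capitalized word, or None if there is none."""
    capitalized = [w for w in noun_phrase.split() if w[0].isupper()]
    return capitalized[-1] if capitalized else None

# Made-up corpus counts, for illustration only.
FREQUENCY = {"category": 500, "3": 900, "hurricane": 40, "that": 10000, "author": 300}

first_name = rarest_word_variant("that author Orhan Phamuk", FREQUENCY)
rare_word = rarest_word_variant("category 3 hurricane", FREQUENCY)
last_name = rightmost_capitalized_variant("the actress Nicole Kidman")
```

Under these assumptions the three calls reproduce the examples in the text: "Orhan", "hurricane", and "Kidman", respectively.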
Although each of the four examples of variants are shown in
Matches may be filtered if the fact (e.g., named entity and/or noun phrase) in the fact repository comprises modal or hedged predicates 310. For example, matches based on predicates such as “might turn out to be” or “possibly attended” may be filtered out. Similarly, matches based on future tense predicates may be filtered out as well.
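A simple way to approximate the modal, hedged, and future-tense filter 310 is a keyword test on the predicate, as in the following Python sketch. The cue list in HEDGE_PATTERN is an assumption chosen for illustration and is not an exhaustive or authoritative inventory of modal or hedging expressions.

```python
import re

# Illustrative cue words for modal, hedged, or future-tense predicates.
HEDGE_PATTERN = re.compile(
    r"\b(might|may|could|possibly|perhaps|probably|allegedly|will|shall)\b",
    re.IGNORECASE,
)

def is_hedged_or_future(predicate):
    """Return True if the predicate contains a modal, hedging, or future cue."""
    return bool(HEDGE_PATTERN.search(predicate))

def drop_hedged(matches):
    """Keep only (subject, predicate, object) matches with unhedged predicates."""
    return [m for m in matches if not is_hedged_or_future(m[1])]
```

With this sketch, predicates such as "might turn out to be" or "possibly attended" are filtered out, while a plain predicate such as "received" is kept.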
Matches may be filtered if the fact in the fact repository is more specific than the one in the query 320. For example, the match may be filtered if any of the following conditions is met. The match may be filtered if a capitalized word follows the fact in the fact repository but is not present in the sentence (or neighboring sentence) from which the query was identified 330. The match may be filtered if more than one capitalized or rare word precedes the fact in the fact repository, those words are not present in the sentence (or neighboring sentence) from which the query was identified, and those words are not honorifics 340. The match may be filtered if the fact in the fact repository is longer than eight words 350. The match may be filtered if more than three words follow the fact in the fact repository 360.
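The four specificity conditions 330 through 360 may be sketched in Python as a single predicate over token lists. The function name too_specific, the HONORIFICS set, the is_rare callback, and the reading of "follows" as "appears anywhere among the following words" are assumptions made for illustration; the tokenization of the repository text into preceding, matched, and following words is likewise hypothetical.

```python
# Illustrative honorifics; a real list would likely be longer.
HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof.", "Sir"}

def too_specific(preceding, matched, following, source_tokens,
                 is_rare=lambda word: False):
    """Return True if a repository fact is more specific than the query.

    preceding / matched / following: token lists from the repository text.
    source_tokens: tokens of the sentence(s) the query was built from.
    is_rare: hypothetical callback that flags rare words.
    """
    source = set(source_tokens)
    # (330) a capitalized word follows the fact but is absent from the source
    if any(w[0].isupper() and w not in source for w in following):
        return True
    # (340) more than one capitalized or rare non-honorific word precedes it
    extra = [w for w in preceding
             if (w[0].isupper() or is_rare(w))
             and w not in source and w not in HONORIFICS]
    if len(extra) > 1:
        return True
    # (350) the fact is longer than eight words
    if len(matched) > 8:
        return True
    # (360) more than three words follow the fact
    return len(following) > 3

SOURCE = ["Kroemer", "received", "the", "Nobel", "Prize"]
```

For example, under these assumptions a repository fact in which "Nobel Prize" is followed by "in Physics" would be filtered when "Physics" does not appear in the source sentence, whereas a preceding honorific such as "Dr." would not by itself trigger the filter.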
Matches may also be filtered if the ratio of negative to positive predicates among a plurality of matches is greater than a predetermined threshold 370. For example, a query such as <Barack Obama, ?, US citizen> may be filtered out based on the following pattern of matches:
Additionally, matches may be filtered if the matches themselves reflect a lack of consensus and/or an argumentative statement.
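The negative-to-positive predicate ratio filter 370 may be illustrated with the following Python sketch. The negation cue list, the function names, and the default threshold of 0.5 are assumptions for illustration; the disclosure only requires that the threshold be predetermined, and the handling of the case with no positive predicates is one possible design choice.

```python
# Illustrative negation cues; not an exhaustive list.
NEGATION_WORDS = {"not", "never", "no"}

def is_negative(predicate):
    """Return True if the predicate contains a simple negation cue."""
    tokens = predicate.lower().replace("n't", " not").split()
    return any(token in NEGATION_WORDS for token in tokens)

def exceeds_negative_ratio(predicates, threshold=0.5):
    """Return True if negative/positive predicates exceed the threshold.

    `threshold` is an arbitrary illustrative value.
    """
    negative = sum(1 for p in predicates if is_negative(p))
    positive = len(predicates) - negative
    if positive == 0:
        # With no positive predicates, any negative predicate triggers the filter.
        return negative > 0
    return negative / positive > threshold
```

For instance, a hypothetical set of matched predicates such as "is", "is not", "isn't", and "was born" yields a ratio of 2 to 2, which exceeds the illustrative threshold, so the query would be filtered out.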
Although each of the examples of filters are shown in
Examples have been used to describe the invention herein and the scope of the invention may include other examples.
A disk controller 460 interfaces one or more optional disk drives to the system bus 452. These disk drives may be external or internal floppy disk drives such as 462, external or internal CD-ROM, CD-R, CD-RW or DVD drives such as 464, or external or internal hard drives 466. These various disk drives and disk controllers may be optional devices.
Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 460, the ROM 456 and/or the RAM 458. The processor 454 may access each component as required.
A display interface 468 may permit information from the bus 452 to be displayed on a display 470 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 472.
In addition to the standard computer-type components, the hardware may also include data input devices, such as a keyboard 473, or other input device 474, such as a microphone, remote control, pointer, mouse and/or joystick.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate a situation where only the disjunctive meaning may apply.
While this document uses examples to disclose the inventions described herein, it will be obvious to those skilled in the art that patentable scope of the invention may include other examples as well. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
This application claims the benefit of U.S. Provisional Application No. 61/622,819 filed on Apr. 11, 2012, the entire contents of which is incorporated herein by reference.
Number | Date | Country
---|---|---
61622819 | Apr 2012 | US