The present disclosure relates to techniques for providing quotations and, in particular, to methods, systems, and techniques for extracting, attributing, indexing, and searching for quotations from text documents using natural language processing based techniques.
Embodiments described herein provide enhanced computer- and network-based methods and systems for providing quotations. Example embodiments provide a content recommendation system (“CRS”) configured to recommend content items such as entity information, documents, video, advertisements, product information, and the like. As part of a content recommendation process, in some embodiments, the CRS is configured to provide quotations by extracting quotations from text documents and providing access to the extracted quotations in response to search requests received from users. Extracting a quotation from a text document includes identifying the quotation in the text document, as well as information about the identified quotation. Information about the identified quotation may include a textual or other representation of the quotation (e.g., what was said), an entity to which the quotation is attributed (e.g., a speaker of the quotation), entities that are referenced by the quotation (e.g., the subject of the quotation), and/or relationships between entities referenced by the quotation, or other context. Entities include people, places (e.g., locations), organizations (e.g., political parties, corporations, groups), events, concepts, products, substances, and the like. Table 4, below, includes a list of example entity types. Fewer or more entity types may be available. Information about a quotation may also or instead include various types of meta-information, such as the title of an article or document in which the quotation appears, a publication date, a credibility indication, and the like.
The CRS is further configured to store (e.g., index) the extracted quotations and corresponding information, such as quotation speaker and subject entities. Information about entities (e.g., entity name, entity type, and the like) and the relationships between them is represented and stored by the CRS, such that quotations can be accessed based on the entities associated therewith and based upon semantic relationships with those entities. In one embodiment, as discussed further below, the indexed quotations are recognized and stored according to natural language processing (“NLP”) techniques such that details of the quotations and their context can be “understood” by the CRS, beyond what keyword-based pattern matching will yield.
For example, in some embodiments, once the CRS has extracted quotations from a corpus of text documents, a user can interact with the CRS (e.g., via an interactive user interface) to search for quotations. Because the CRS has indexed not only quotation text, but various types of information about the quotations (e.g., speaker entities, subject entities, categorizations of the speakers, parts-of-speech related information, etc.), the CRS can provide search functionality that can be utilized to request quotations matching a rich set of search expressions. In one example embodiment, in addition to requesting all quotations by a particular speaker, a user can request all quotations by a particular speaker (e.g., Barack Obama) about a particular subject (e.g., health care).
In addition, in some embodiments, entities have one or more associated facets, which include finely grained characteristics of entities such as entity types, roles, and/or other characteristics. Example facets include actor, politician, athlete, nation, drug, sport, automobile, and the like. In such embodiments, users can search for quotations based on facet specifications. For example, a user can request all quotations by a particular class of speaker (e.g., a politician) about a particular class of subject (e.g., sports). Table 5, below, includes a list of example facets for the various entity types used in one embodiment. Other expressive search functionality is contemplated and supported by various embodiments, as discussed below.
The entity and relationship identifier 212 receives content information from the content ingester 211 and identifies entities and relationships that are referenced therein. Various automatic and semi-automatic techniques are contemplated for identifying entities within content items. In one embodiment, the identifier 212 uses natural language processing techniques, such as parts-of-speech tagging and relationship searching, to identify sentence components such as subjects, verbs, and objects, and to identify and disambiguate entities. Example relationship searching technology, which uses natural language processing to determine relationships between subjects and objects in ingested content, is described in detail in U.S. Pat. No. 7,526,425, filed Dec. 13, 2004, and entitled “METHOD AND SYSTEM FOR EXTENDING KEYWORD SEARCHING FOR SYNTACTICALLY AND SEMANTICALLY ANNOTATED DATA,” issued on Apr. 28, 2009, and example entity recognition and disambiguation technology is described in detail in U.S. patent application Ser. No. 12/288,158, filed Oct. 15, 2008, and entitled “NLP-BASED ENTITY RECOGNITION AND DISAMBIGUATION,” both of which are incorporated herein by reference in their entireties. Amongst other capabilities, the use of relationship searching enables the CRS 200 to establish second order (or greater order) relationships between entities and to store such information in the data store 217.
For example, given a sentence such as “Sean Connery starred in Goldfinger,” the identifier 212 may identify “Sean Connery” as the sentence subject, “starred” as the sentence verb (or action), and “Goldfinger” as the sentence object, along with the various modifiers present in the sentence. These parts-of-speech components of each sentence, along with their grammatical roles and other tags, may be stored in the relationship index 217c, for example as an inverted index as described in U.S. Pat. No. 7,526,425. As part of the indexing process, the CRS recognizes and disambiguates entities that are present in the text. Indications of these disambiguated entities are also stored with the sentence information when the sentence contains uniquely identifiable entities that the CRS already knows about. These entities are those that have been added previously to the entity store 217b. In some cases, the indexed text contains subjects and objects that indicate entities that are not yet known or not yet disambiguated. In this case, the indexing of the sentence may store as much information as it has in index 217c, but may not refer to a unique identifier of an entity in the entity store 217b. Over time, as the CRS encounters new entities, and in some cases with the aid of manual curation, new entities are added to the entity store 217b. In the above example, “Sean Connery” and “Goldfinger” may be unique entities already known to the CRS and present in the entity store 217b. In this case, their identifiers will be stored along with the sentence information in the relationship index 217c. The identified verbs also define relationships between the identified entities. These defined relationships (e.g., stored as subject-action-object or “SAO” triplets, or otherwise) are then stored in the relationship index 217c.
In the above example, a representation of the fact that the actor Sean Connery starred in the film Goldfinger would be added to the relationship index 217c. In some embodiments, the process of identifying entities may be at least in part manual. For example, entities may be provisionally identified by the identifier 212, and then submitted to curators (or other humans) for editing, finalization, review, and/or approval.
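As a minimal sketch of the indexing behavior described above, the following Python fragment builds a subject-action-object triple and attaches entity identifiers only when the subject or object is already present in an entity store. The names `ENTITY_STORE`, `Triple`, and `make_triple`, as well as the identifier values, are hypothetical and are used only for illustration; they are not part of the described system.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the entity store 217b: name -> unique identifier.
ENTITY_STORE = {"Sean Connery": "ent:1", "Goldfinger": "ent:2"}


@dataclass(frozen=True)
class Triple:
    """One subject-action-object (SAO) record for the relationship index."""
    subject: str
    action: str
    obj: str
    subject_id: Optional[str] = None  # None when the entity is not yet known
    object_id: Optional[str] = None


def make_triple(subject: str, action: str, obj: str) -> Triple:
    # Store as much as is known; identifiers are attached only for known entities.
    return Triple(subject, action, obj,
                  ENTITY_STORE.get(subject), ENTITY_STORE.get(obj))


known = make_triple("Sean Connery", "starred", "Goldfinger")
partly_known = make_triple("An unnamed critic", "praised", "Goldfinger")
```

Here `known` carries identifiers for both entities, while `partly_known` carries an identifier only for its object, mirroring the case in which a sentence is indexed before one of its entities has been added to the store.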
The entity and relationship identifier 212 may determine various other kinds of information about entities and relationships. In one embodiment, the identifier 212 also determines facets, which include finely grained characteristics of entities, such as entity types, roles, qualities, functions, and the like. For example, the entity Sean Connery may have various associated facets, including that of actor, producer, knight, and Scotsman. The facet information for entities may also be stored in the entity store 217b.
The quotation extractor 213 extracts quotations based on information about content items stored in the data store 217 by the ingester 211 and identifier 212. Extracting quotations from a document may include performing natural language processing upon the document, including linguistic and/or semantic analysis to perform functions such as paragraph and/or sentence detection, parts-of-speech tagging, lexical analysis to detect phrases, semantic analysis to determine how words are used in the document, and the like. In addition, extracting quotations may include identifying entities and associated relationships within the document, disambiguating entities, detecting quotation boundaries and/or verbs, and the like. Further, extracting quotations may include storing and/or indexing the detected quotations in a data store. In some embodiments, each quotation is represented by the extractor 213 as a triple that includes a speaker, a verb, and a quote, and this triple is recorded in one or more indexes (for example, inverted indexes) stored in the quotation store 217d. Additional information may be stored in association with an extracted quotation, such as entity information (e.g., one or more entity identifiers), speaker modifiers (e.g., terms modifying the quotation speaker), action modifiers (e.g., terms modifying the quotation verb), and the like. Additional techniques for quotation extraction are discussed with reference to
The quotation locator 214 provides access to stored (e.g., indexed) quotations based on a received quotation request from a user 202 or some other source. In one embodiment, the received quotation request includes a search query that specifies search information, such as speaker, subject, keyterms, or the like. The quotation locator 214 determines one or more quotations that match, or approximately match, the search query, and provides (e.g., transmits, sends, displays) the determined one or more quotations to the user 202. In some embodiments, the search query uses relationship searching, such as that described in U.S. Pat. No. 7,526,425, to identify matching quotations beyond that provided by simple keyword matching or regular expression techniques. Additional techniques for quotation location are discussed with reference to
The other content recommender 215 provides other types of content recommendations, such as for articles, entities, product information, advertisements, and the like. For example, in one embodiment, the other content recommender 215 is or includes an article recommender that determines articles that are related to collections of entities specified by the user 202. In other embodiments, the other content recommender 215 is or includes an entity recommender that determines entities that are related to collections of entities specified by the user 202. One such example entity recommender for use with collections of entities is described in detail in U.S. Patent Application No. 61/309,318, filed Mar. 1, 2010, and entitled “CONTENT RECOMMENDATION BASED ON COLLECTIONS OF ENTITIES,” which is incorporated herein by reference in its entirety.
The described techniques herein are not limited to the specific architecture shown in
In addition, although the described techniques for providing quotations are illustrated primarily with respect to text documents, other forms of content items are contemplated. For example, other embodiments may utilize at least some of the described techniques to provide quotations extracted from other forms of content, including video, audio, and the like. Also, text documents include document types beyond documents represented in a text format (e.g., ASCII or UNICODE documents). In particular, text documents include any documents that have any textual content, independent of format, such as PDF documents, Microsoft Office documents, and the like.
The various search techniques discussed above can be combined in various ways in other embodiments. For example, searches can be made for quotes by entities having a specified facet and that include one or more specified keyterms (e.g., quotes by politicians and containing the keyterms “global warming”). Or searches can be made for quotes made by specified entities about entities having a specified facet and including one or more specified keyterms (e.g., quotes by comedians about politicians and containing the keyterms “global warming”). A search language syntax provided and implemented by a specific example embodiment is described with reference to the section entitled “Quotation Recommendation Details in an Example Embodiment,” below.
Although the user interface techniques of
Note that one or more general purpose or special purpose computing systems/devices may be used to implement the content recommendation system 510. In addition, the computing system 500 may comprise one or more distinct computing systems/devices and may span distributed locations. Furthermore, each block shown may represent one or more such blocks as appropriate to a specific embodiment or may be combined with other blocks. Also, the content recommendation system 510 may be implemented in software, hardware, firmware, or in some combination to achieve the capabilities described herein.
In the embodiment shown, computing system 500 comprises a computer memory (“memory”) 501, a display 502, one or more Central Processing Units (“CPU”) 503, Input/Output devices 504 (e.g., keyboard, mouse, CRT or LCD display, and the like), other computer-readable media 505, and network connections 506. The content recommendation system 510 is shown residing in memory 501. In other embodiments, some portion of the contents, some or all of the components of the content recommendation system 510 may be stored on and/or transmitted over the other computer-readable media 505. The components of the content recommendation system 510 preferably execute on one or more CPUs 503 and extract and provide quotations, as described herein. Other code or programs 530 (e.g., an administrative interface, a Web server, and the like) and potentially other data repositories, such as data repository 520, also reside in the memory 501, and preferably execute on one or more CPUs 503. Of note, one or more of the components in
In a typical embodiment, the content recommendation system 510 includes a content ingester 511, an entity and relationship identifier 512, a quotation extractor 513, a quotation locator 514, a user interface (“UI”) manager 515, a quotation provider application program interface (“API”) 516, and a data store 517. The content ingester 511, entity and relationship identifier 512, UI manager 515, and API 516 are drawn in dashed lines to emphasize that in other embodiments, functions performed by one or more of these components may be performed externally to the content recommendation system 510. In other embodiments, the content recommendation system 510 includes other content recommendation modules that are configured to provide other types of content, such as article or entity recommendations based on user searches, user preferences, entity collections, and the like.
The content ingester 511 performs functions such as those described with reference to the content ingester 211 of
The entity and relationship identifier 512 performs functions such as those described with reference to the entity and relationship identifier 212 of
The UI manager 515 provides a view and a controller that facilitate user interaction with the content recommendation system 510 and its various components. For example, the UI manager 515 may provide interactive access to the content recommendation system 510, such that users can obtain quotations, generate quotations widgets, and the like. In some embodiments, access to the functionality of the UI manager 515 may be provided via a Web server, possibly executing as one of the other programs 530. In such embodiments, a user operating a Web browser executing on one of the client devices 560 can interact with the content recommendation system 510 via the UI manager 515.
The quotation extractor 513 performs functions such as those described with reference to the quotation extractor 213 of
The quotation locator 514 performs functions such as those described with reference to the quotation locator 214 of
In one embodiment, the quotation locator 514 operates synchronously in an on-demand manner, in that it performs its functions in response to a received request, such as in response to a user interface event processed by the UI manager 515. In another embodiment, the quotation locator 514 operates asynchronously, in that it automatically determines quotations for one or more queries. For example, the quotation locator 514 may automatically execute from time to time (e.g., once per hour, once per day) in order to generate bulk quotation information for commonly requested (or recently used) queries. The quotation locator 514 may execute upon the occurrence of other types of conditions, such as when new quotations are extracted and/or stored, and the like.
The API 516 provides programmatic access to one or more functions of the content recommendation system 510. For example, the API 516 may provide a programmatic interface to one or more functions of the content recommendation system 510 that may be invoked by one of the other programs 530 or some other module. In this manner, the API 516 facilitates the development of third-party software, such as user interfaces, plug-ins, widgets, news feeds, adapters (e.g., for integrating functions of the content recommendation system 510 into Web applications), and the like. In addition, the API 516 may be in at least some embodiments invoked or otherwise accessed via remote entities, such as one of the third-party applications 565, to access various functions of the content recommendation system 510. For example, a third-party application may request quotations from the content recommendation system 510 via the API 516. The API 516 may also be configured to provide quotations widgets (e.g., code modules) that can be integrated into third-party applications and that are configured to interact with the content recommendation system 510 to make at least some of the described functionality available within the context of other applications. The section entitled “Example Quotation Recommendation API,” below, describes an example API provided by one specific embodiment of an example CRS.
The data store 517 is used by the other modules of the content recommendation system 510 to store and/or communicate information. As discussed above, components 511-516 use the data store 517 to record various types of information, including content, information about stored content including entities and relationships, information about quotations, user information, and the like. Although the components 511-516 are described as communicating primarily through the data store 517, other communication mechanisms are contemplated, including message passing, function calls, pipes, sockets, shared memory, and the like.
The content recommendation system 510 interacts via the network 550 with content sources 555, third-party applications 565, and client computing devices 560. The network 550 may be any combination of media (e.g., twisted pair, coaxial, fiber optic, radio frequency), hardware (e.g., routers, switches, repeaters, transceivers), and protocols (e.g., TCP/IP, UDP, Ethernet, Wi-Fi, WiMAX) that facilitate communication between remotely situated humans and/or devices. The client computing devices 560 include desktop computing systems, notebook computers, mobile phones, smart phones, personal digital assistants, tablet computers, and the like.
Other or additional functions and/or data are contemplated. For example, in some embodiments, the content recommendation system 510 includes additional content recommendation components that are specialized to other types of content, such as for video, quotations, images, audio, advertisements, product information, and the like.
In an example embodiment, components/modules of the content recommendation system 510 are implemented using standard programming techniques. For example, the content recommendation system 510 may be implemented as a “native” executable running on the CPU 503, along with one or more static or dynamic libraries. In other embodiments, the content recommendation system 510 may be implemented as instructions processed by a virtual machine that executes as one of the other programs 530. In general, a range of programming languages known in the art may be employed for implementing such example embodiments, including representative implementations of various programming language paradigms, including but not limited to, object-oriented (e.g., Java, C++, C#, Visual Basic.NET, Smalltalk, and the like), functional (e.g., ML, Lisp, Scheme, and the like), procedural (e.g., C, Pascal, Ada, Modula, and the like), scripting (e.g., Perl, Ruby, Python, JavaScript, VBScript, and the like), and declarative (e.g., SQL, Prolog, and the like).
The embodiments described above may also use either well-known or proprietary synchronous or asynchronous client-server computing techniques. Also, the various components may be implemented using more monolithic programming techniques, for example, as an executable running on a single CPU computer system, or alternatively decomposed using a variety of structuring techniques known in the art, including but not limited to, multiprogramming, multithreading, client-server, or peer-to-peer, running on one or more computer systems each having one or more CPUs. Some embodiments may execute concurrently and asynchronously, and communicate using message passing techniques. Equivalent synchronous embodiments are also supported. Also, other functions could be implemented and/or performed by each component/module, and in different orders, and by different components/modules, yet still achieve the described functions.
In addition, programming interfaces to the data stored as part of the content recommendation system 510, such as in the data store 517, can be available by standard mechanisms such as through C, C++, C#, and Java APIs; libraries for accessing files, databases, or other data repositories; through markup languages such as XML; or through Web servers, FTP servers, or other types of servers providing access to stored data. The data store 517 may be implemented as one or more database systems, file systems, or any other technique for storing such information, or any combination of the above, including implementations using distributed computing techniques.
Different configurations and locations of programs and data are contemplated for use with the techniques described herein. A variety of distributed computing techniques are appropriate for implementing the components of the illustrated embodiments in a distributed manner, including but not limited to TCP/IP sockets, RPC, RMI, HTTP, and Web Services (XML-RPC, JAX-RPC, SOAP, and the like). Other variations are possible. Also, other functionality could be provided by each component/module, or existing functionality could be distributed amongst the components/modules in different ways, yet still achieve the functions described herein.
Furthermore, in some embodiments, some or all of the components of the content recommendation system 510 may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to one or more application-specific integrated circuits (“ASICs”), standard integrated circuits, controllers executing appropriate instructions, and including microcontrollers and/or embedded controllers, field-programmable gate arrays (“FPGAs”), complex programmable logic devices (“CPLDs”), and the like. Some or all of the system components and/or data structures may also be stored as contents (e.g., as executable or other machine-readable software instructions or structured data) on a computer-readable medium (e.g., as a hard disk; a memory; a computer network or cellular wireless network or other data transmission medium; or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the contents to perform at least some of the described techniques. Some or all of the system components and/or data structures may be stored as non-transitory content on one or more tangible computer-readable mediums. Some or all of the system components and data structures may also be stored as data signals (e.g., by being encoded as part of a carrier wave or included as part of an analog or digital propagated signal) on a variety of computer-readable transmission mediums, which are then transmitted, including across wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. 
Accordingly, embodiments of this disclosure may be practiced with other computer system configurations.
The illustrated process begins at block 602, where it receives an indication of a document. Receiving an indication of a document may include receiving a document reference (e.g., a file name, a uniform resource identifier, a database identifier). The received document reference may identify an original source document, such as may be found on a remote Web server or other document source, such as one of the content sources 255 described with reference to
At block 604, the process identifies entities appearing within the text document, such as using an NLP-based recognition and disambiguation process, for use in attributing any quotations found within the text document. Identifying entities may include performing linguistic and/or semantic analysis of the document. Linguistic and semantic analysis may include such operations as sentence/paragraph detection, parts-of-speech and/or grammatical role tagging, phrase or clause detection, and the like. Identifying entities may also include linking references to the same entity across the document, including by resolving pronoun co-references, aliases and abbreviations, and definite-noun anaphora, and the like. Identifying entities may include other operations, such as entity disambiguation, facet assignment, and the like.
At block 606, the process determines whether there are quotations within the text document. One way the CRS detects quotations is by determining whether one of a predetermined set of verbs likely to indicate a quotation is present within the text and whether quotation marks are present in the text. The presence of both indicates a higher likelihood that a quotation is present within the document. Thus, determining quotations includes detecting quotations by extracting one or more sentences of a potential quotation based on detected quotation verbs (e.g., say, comment, suggest) and/or quotation punctuation (e.g., double or single quotation marks).
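The two detection cues described for block 606 can be illustrated with a small Python sketch. It is deliberately conservative and reports a candidate only when both a quoted span and a quotation verb outside that span are present; an embodiment may equally act on either cue alone. The verb list, the regular expression, and the function name `detect_quotation` are assumptions made for this example, and the sketch handles only double quotation marks and a single quoted span per sentence.

```python
import re

# Assumed, illustrative list of quotation verbs (inflected forms included).
QUOTATION_VERBS = {
    "say", "says", "said",
    "comment", "comments", "commented",
    "suggest", "suggests", "suggested",
}

# Matches one span enclosed in straight or curly double quotation marks.
QUOTED_SPAN = re.compile(r'["“]([^"”]+)["”]')


def detect_quotation(sentence):
    """Return (quote_text, verb) when both cues are present, else None."""
    match = QUOTED_SPAN.search(sentence)
    if match is None:
        return None
    # Look for the verb only outside the quoted span, so that a verb used
    # inside the quotation itself does not count as the attribution verb.
    outside = sentence[:match.start()] + " " + sentence[match.end():]
    words = re.findall(r"[a-z]+", outside.lower())
    verb = next((w for w in words if w in QUOTATION_VERBS), None)
    if verb is None:
        return None
    return match.group(1), verb
```

A sentence with quotation marks but no quotation verb, such as a quoted sign or title, is rejected by this sketch, which is one reason the combination of cues is a stronger signal than punctuation alone.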
At block 608, the process attributes the quotation (e.g., by determining the speaker entity of the quotation) using, for example, the identified entities in block 604 and determines additional quotation information, for example determining the quotation verb; determining the quotation text; and determining other entities, facets, and/or keywords referenced in the quotation.
At block 610, the process stores (or otherwise indexes) the determined quotation information. Storing the determined quotation information may include, for each quotation detected at block 606, indexing a subject-action-object triple, where the subject field identifies the speaker entity of the quotation, the action field identifies the quotation verb, and the object field stores the quotation text. Other information may be indexed, including various modifiers (e.g., subject or verb modifiers), other entity information (e.g., entities and/or facets referenced by the quotation), time and date information, source information, and the like.
After block 610, the process returns.
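One possible in-memory shape for the information stored at block 610 is sketched below in Python. The class name `QuotationRecord`, its field names, and the sample values are hypothetical and invented for illustration; the source specifies only that a subject-action-object triple and optional additional information are indexed.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class QuotationRecord:
    speaker: str   # subject field: the speaker entity of the quotation
    verb: str      # action field: the quotation verb
    quote: str     # object field: the quotation text
    speaker_modifiers: List[str] = field(default_factory=list)
    referenced_entities: List[str] = field(default_factory=list)
    published: Optional[date] = None
    source: Optional[str] = None

    def as_triple(self) -> Tuple[str, str, str]:
        """Collapse the record into its (speaker, verb, quote) triple."""
        return (self.speaker, self.verb, self.quote)


# Invented sample values, for illustration only.
record = QuotationRecord(
    speaker="Jane Doe",
    verb="said",
    quote="The bridge will reopen in May.",
    speaker_modifiers=["city engineer"],
    referenced_entities=["bridge"],
    published=date(2010, 3, 1),
    source="example-news",
)
```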
Some embodiments perform one or more operations/aspects in addition to, or instead of, the ones described with reference to the process of
The illustrated process begins at block 702, where it receives a quotation search request. The quotation search request may be received from various sources, such as from an interactive user interface being operated by a user and/or a quotations widget or other code module configured to automatically request quotations.
Typically, the quotation search request specifies one or more features that are to be present in any quotation that matches the request. For example, the request may specify one or more quotation speakers (e.g., by indicating an entity or a facet), one or more quotation subjects (e.g., by indicating entities, facets, and/or keywords), and the like. In some embodiments, the specified features can be combined or modified, such as via Boolean operators (e.g., AND, OR, NOT).
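The combination of request features via Boolean operators can be sketched as composable predicates. The following Python fragment is an assumption-laden illustration rather than the search language of any embodiment: the helper names (`speaker_is`, `about`, `AND`, `OR`, `NOT`), the dictionary layout, and the placeholder quotations are all hypothetical.

```python
from typing import Callable, Dict, List

Quotation = Dict[str, str]
Predicate = Callable[[Quotation], bool]


def speaker_is(name: str) -> Predicate:
    return lambda q: q["speaker"] == name


def about(keyword: str) -> Predicate:
    # Simple case-insensitive substring match on the quotation text.
    return lambda q: keyword.lower() in q["text"].lower()


def AND(*preds: Predicate) -> Predicate:
    return lambda q: all(p(q) for p in preds)


def OR(*preds: Predicate) -> Predicate:
    return lambda q: any(p(q) for p in preds)


def NOT(pred: Predicate) -> Predicate:
    return lambda q: not pred(q)


# Placeholder quotations, invented for illustration.
quotes: List[Quotation] = [
    {"speaker": "Speaker A", "text": "Health care reform matters."},
    {"speaker": "Speaker A", "text": "The economy is improving."},
    {"speaker": "Speaker B", "text": "Health care costs are rising."},
]

# "Quotations by Speaker A that are not about the economy."
query = AND(speaker_is("Speaker A"), NOT(about("economy")))
matches = [q for q in quotes if query(q)]
```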
At block 704, the process locates one or more quotations that match the received search request. Locating matching quotations may include searching an index or other representation of quotation information, to determine one or more quotations that match or approximately match one or more features of the quotation search request. In one embodiment, locating matching quotations includes searching the quotations store 217d based upon NLP-based search techniques as described with reference to
At block 706, the process provides indications of the located quotations. Providing indications of the located quotations may include presenting, or transmitting for presentation, various information about the located quotations, such as quotation text, quotation attribution (e.g., speaker), context (e.g., text surrounding the quotation), and the like. Providing quotations may also include ranking or ordering the located quotations based on one or more factors, such as publication date, source credibility, and number of duplicate quotations.
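As one hedged illustration of ordering located quotations, the Python sketch below sorts by publication date first, then by source credibility, then by duplicate count. The source names these factors but does not specify how they are weighted or combined, so the lexicographic ordering, the field names, and the sample values here are assumptions.

```python
from datetime import date


def rank_quotations(quotations):
    """Order quotations: newest first, then most credible, then most duplicated."""
    return sorted(
        quotations,
        key=lambda q: (q["published"], q["credibility"], q["duplicates"]),
        reverse=True,
    )


# Invented sample records, for illustration only.
located = [
    {"id": "a", "published": date(2010, 3, 1), "credibility": 0.5, "duplicates": 2},
    {"id": "b", "published": date(2010, 3, 2), "credibility": 0.1, "duplicates": 0},
    {"id": "c", "published": date(2010, 3, 1), "credibility": 0.9, "duplicates": 1},
]
ranked_ids = [q["id"] for q in rank_quotations(located)]
```

In this sample, the most recent quotation is ranked first regardless of credibility, and the two quotations sharing a publication date are separated by credibility. A weighted score would be an equally plausible design.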
Some embodiments perform one or more operations/aspects in addition to, or instead of, the ones described with reference to the process of
In the following, additional example techniques for extracting and providing quotations are discussed.
Quotation Recommendation Details in an Example Embodiment
The following describes an approach to quotation extraction and search used by one example embodiment.
Overview
In this example embodiment, the quotation extraction and search subsystem of the content recommendation system (“CRS”) has three components:
1. Quotation extraction and attribution from text documents
2. Indexing of the extracted quotations and attributions in an efficient inverted index
3. Search of the indexed quotations
On any entity profile page, if the entity is a person, the CRS presents retrieved quotes about the person, followed by quotes made by the person. For example, on the profile page of President Obama:
http://www.evri.com/person/barack-obama-0x16f69/quotes
Quotes about Barack Obama
Quotes by Barack Obama
If the entity is not a person, the CRS presents quotes about the entity, e.g.,
The CRS also surfaces quotes about any keywords or phrases, e.g., http://www.evri.com/find/quotes?query=global+warming returns quotes about the phrase “global warming.”
Evri Public APIs:
In addition, the CRS exposes quotation search capability via a set of public APIs (see http://www.evri.com/developer/rest#API-GetQuotations).
Quotes Public API Examples:
The “facets” of an entity are typically discoverable from an ontology or taxonomy of entities. A list of example facets appears in the section entitled “Example Facets.” Fewer or more facets can be made available, and the set of facets used by the system is generally configurable. The facets may be organized into a hierarchical taxonomy. Therefore, a facet can be represented as a taxonomic path, e.g., [Person/Sports/Athlete/Football_Player]. At search time, the CRS has the flexibility to support queries on “Football_Player” or on any of its parent nodes in the taxonomic path, e.g., “Athlete” or “Person”.
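The facet matching described above can be sketched as follows. This is a minimal illustration, assuming a facet is stored as a slash-separated path string; the function name `facet_matches` and the case-insensitive comparison are assumptions for illustration and are not part of the described embodiment.

```python
def facet_matches(facet_path: str, query_facet: str) -> bool:
    """Return True if query_facet names any node on the taxonomic path.

    Assumes the path is a slash-separated string, optionally wrapped in
    square brackets, e.g. "[Person/Sports/Athlete/Football_Player]".
    """
    nodes = [node.lower() for node in facet_path.strip("[]").split("/")]
    return query_facet.lower() in nodes


path = "[Person/Sports/Athlete/Football_Player]"
print(facet_matches(path, "Football_Player"))  # leaf node
print(facet_matches(path, "Athlete"))          # parent node
print(facet_matches(path, "Musician"))         # not on the path
```

Under this representation, a query on a parent facet such as “Athlete” matches every entity whose path passes through that node, without the query having to enumerate the child facets.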
An example API is discussed in more detail below in the section entitled “Example Quotation Recommendation API.”
Quotation Extraction and Attribution
I. Linguistic and semantic analysis: Given a text document, the example CRS applies deep linguistic analysis that includes at least some of the following steps:
II. Quotation Verb Detection:
III. Attribution and Collapsing: Collapse each detected quotation into a triple of (speaker, verb, quote)
Each extracted quotation and attribution is stored as a triple in an inverted index structure of subject-action-object triples. The quotation triples are distinguished from regular triples by a flag, isQuotation. During search, only quotation triples are retrieved when the isQuotation flag is set in the query.
For entities identified both within and outside of the quote, the CRS indexes not only the entity names, but also their unique identifiers and assigned categories (e.g., types and facets). Therefore, during search, the CRS supports search for quotes by or about entities by their names, as well as by their IDs or categories (e.g., find quotes made by any college football coach, or find quotes about any hybrid cars).
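One way to picture this indexing scheme is the small in-memory sketch below. It is an assumption-laden illustration, not the embodiment's index: the class names `Entity`, `Triple`, and `TripleIndex`, the entity identifiers, and the use of Python sets as posting lists are all hypothetical. It shows each entity being indexed under its name, its identifier, and its facets, and a quotation flag filtering the results at search time.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str
    entity_id: str          # hypothetical identifier format
    facets: tuple = ()


@dataclass
class Triple:
    subject: Entity                       # speaker, for a quotation triple
    action: str                           # verb
    obj_text: str                         # quote text, for a quotation triple
    obj_entities: list = field(default_factory=list)
    is_quotation: bool = False


class TripleIndex:
    """Toy inverted index: term -> positions of triples containing that term."""

    def __init__(self):
        self.triples = []
        self.by_subject = defaultdict(set)
        self.by_object = defaultdict(set)

    @staticmethod
    def _terms(entity):
        # Index the name, the identifier, and every facet of the entity.
        return {entity.name.lower(), entity.entity_id.lower(),
                *(f.lower() for f in entity.facets)}

    def add(self, triple):
        pos = len(self.triples)
        self.triples.append(triple)
        for term in self._terms(triple.subject):
            self.by_subject[term].add(pos)
        for ent in triple.obj_entities:
            for term in self._terms(ent):
                self.by_object[term].add(pos)

    def search(self, speaker=None, about=None, quotations_only=True):
        hits = set(range(len(self.triples)))
        if speaker is not None:
            hits &= self.by_subject.get(speaker.lower(), set())
        if about is not None:
            hits &= self.by_object.get(about.lower(), set())
        return [self.triples[i] for i in sorted(hits)
                if self.triples[i].is_quotation or not quotations_only]


index = TripleIndex()
james = Entity("LeBron James", "hypothetical-id-1", ("basketball_player",))
pryor = Entity("Terrelle Pryor", "hypothetical-id-2", ("football_player",))
quote = ("I'm trying to be that guy who can really help him get through a lot "
         "of situations which he's never seen before but now he's seeing and "
         "understanding")
index.add(Triple(james, "say", quote, [pryor], is_quotation=True))
index.add(Triple(james, "counsel", "Pryor", [pryor]))  # a regular triple

results = index.search(speaker="LeBron James", about="football_player")
print(len(results), results[0].action)
```

Because the facet is indexed alongside the name and identifier, the same lookup path serves a query by name, by ID, or by category.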
The subject-modifier field supports searching for quotes made by speakers with certain properties, e.g., “Did anyone from Microsoft say anything about iPhone?”
Similarly, the action-modifier field supports searching for quotes within a particular context, e.g., “What did Obama say about global warming during his trip to China?”
Text snippet: Cleveland Cavaliers star LeBron James stuck his nose in the situation, admitting he is counseling Pryor on the pitfalls of being in the spotlight at a young age.
“I'm trying to be that guy who can really help him get through a lot of situations which he's never seen before but now he's seeing and understanding,” James said.
In this example, the CRS is able to link “James” as the last name of LeBron James. Through coreference resolution, the CRS resolves the pronouns “he” and “him” to Terrelle Pryor, a quarterback on the Ohio State University football team. Furthermore, the CRS tags Pryor with the facet “football_player”. Therefore, when a user queries for comments made by LeBron James about any football player, this quote would be returned as one of the results.
Nash said, “I would love to meet him, obviously, and to play hoops with the President would be kind of fun.”
This quote is from Steve Nash of the Phoenix Suns NBA basketball team, about President Obama. Through coreference resolution, the CRS recognizes that “him” and “the President” refer to President Obama. The CRS assigns the facet “basketball player” to Steve Nash. Therefore, when a user queries for comments made by any basketball player (or any athlete) regarding President Obama, this quote would be returned as one of the results.
“They might think they've got a pretty good jump shot or a pretty good flow, but our kids can't all aspire to be LeBron or Lil Wayne,” Obama said.
The CRS recognizes LeBron as LeBron James, the NBA basketball player, and Lil Wayne as a musician. The pronoun “they” is linked to “children” in the previous sentence. When the query is for Obama's quotes regarding basketball players or musicians, this quote would be returned.
Search
Query:
The CRS supports querying of quotations in many different ways, in the form of a template: What did <speaker> say about <subject>?
The parameter <speaker> can be specified as:
Furthermore, the speaker field can be constrained by some modifiers, e.g.,
The <subject> can be specified as:
The CRS also supports boolean combinations (e.g., AND, OR) of the above. For example:
The query result is a list of quotations, each of which contains the following:
Sometimes a quote is so long that the CRS needs to extract a snippet of a specified length that best matches the query request. During processing and indexing, the CRS has identified the entities in each sentence, as well as their positions within the sentence. Given a query request on a particular subject (specified as an entity or keyword), the CRS determines the snippet that has the most occurrences of the subject entity/keyword.
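The snippet selection can be sketched as a sliding window over the quote. The sketch below makes several assumptions that the description does not state: snippet length is measured in whitespace-separated words, the subject is matched as a case-insensitive word sequence rather than via indexed entity positions, and ties go to the earliest window. The function name `best_snippet` is hypothetical.

```python
PUNCT = ".,;:!?\"'"


def best_snippet(text: str, subject: str, max_words: int) -> str:
    """Return the max_words-long window with the most subject occurrences."""
    words = text.split()
    if len(words) <= max_words:
        return text
    target = subject.lower().split()
    n = len(target)
    normalized = [w.strip(PUNCT).lower() for w in words]
    # Word positions at which the subject phrase begins.
    starts = [i for i in range(len(words) - n + 1)
              if normalized[i:i + n] == target]
    best_start, best_count = 0, -1
    for s in range(len(words) - max_words + 1):
        # Count occurrences that fit entirely inside the window.
        count = sum(1 for p in starts if s <= p and p + n <= s + max_words)
        if count > best_count:
            best_start, best_count = s, count
    return " ".join(words[best_start:best_start + max_words])


text = ("The weather was fine. Later he said global warming is real "
        "and global warming needs action now.")
snippet = best_snippet(text, "global warming", max_words=8)
print(snippet)  # said global warming is real and global warming
```

A production system would more likely reuse the entity positions recorded at indexing time instead of rescanning the text, as the paragraph above suggests.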
Result Aggregation and Ranking:
The same statement by a speaker may be quoted in several different documents. When retrieving quotes, the CRS applies an aggregation process that detects duplicate quotes by computing the similarity between each pair of quotes.
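As an illustration of the aggregation step, the sketch below groups near-duplicate quotes. The description does not name a similarity measure, so Jaccard similarity over word sets and a threshold of 0.8 are assumptions, as are the function names. For brevity, each quote is compared against the first member of every existing group rather than against every other quote.

```python
import re


def tokens(text: str) -> set:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def aggregate(quotes, threshold=0.8):
    """Group quotes whose similarity to a group's first member meets the threshold."""
    groups = []
    for quote in quotes:
        for group in groups:
            if jaccard(quote, group[0]) >= threshold:
                group.append(quote)
                break
        else:
            groups.append([quote])
    return groups


quotes = [
    "I would love to meet him, obviously.",
    "I would love to meet him obviously",
    "Our kids can't all aspire to be LeBron or Lil Wayne.",
]
groups = aggregate(quotes)
print(len(groups), [len(g) for g in groups])
```

The size of each group gives a duplicate count that can be used as a signal when ordering results.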
The quotes are then ranked by a combination of the following factors:
Users can choose to sort the results by their default rank or purely by date.
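The two sort orders can be sketched as follows. The factors used here (publication date, source credibility, and number of duplicate quotations) are the ones named earlier in connection with block 706; the weights, the recency formula, the 0-to-1 credibility scale, and all names in the code are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class QuoteResult:
    text: str
    published: date
    credibility: float      # assumed to lie in [0, 1]
    duplicate_count: int    # number of duplicate quotations found


def default_score(q, today, w_recency=1.0, w_cred=1.0, w_dup=0.5):
    """Hypothetical weighted combination of the ranking factors."""
    age_days = max((today - q.published).days, 0)
    recency = 1.0 / (1.0 + age_days / 30.0)
    duplicates = min(q.duplicate_count, 10) / 10.0
    return w_recency * recency + w_cred * q.credibility + w_dup * duplicates


def sort_results(results, by="rank", today=None):
    today = today or date.today()
    if by == "date":
        return sorted(results, key=lambda q: q.published, reverse=True)
    return sorted(results, key=lambda q: default_score(q, today), reverse=True)


older = QuoteResult("widely repeated quote", date(2010, 3, 1), 0.9, 8)
newer = QuoteResult("recent quote", date(2010, 3, 20), 0.2, 0)
today = date(2010, 3, 30)

by_rank = sort_results([newer, older], by="rank", today=today)
by_date = sort_results([older, newer], by="date", today=today)
print(by_rank[0].text, "|", by_date[0].text)
```

In this example the older quote ranks first by default because of its credibility and duplicate count, while sorting purely by date puts the newer quote first.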
Example Quotation Recommendation API
A. Getting Quotations
Returns quotations made about a topic in the Evri corpus of news, blog and other web content. In addition, quotations made by a specific person may be returned.
The API can be invoked with a request of the following form:
quotations/[about]?[inputParameters]&speaker=SPEAKER&[inputParameters]
where SPEAKER is the URI, or href, of a person and applicable inputParameters include: facet, entityURI, includeDomains, excludeDomains, includeDates, includeMatchedLocations, and callback.
Quotations by a person about anything:
http://api.evri.com/v1/quotations?speaker=/person/barack-obama-0x16f69&appld=evri.com-restdoc
Quotations by anyone about a specific entity:
http://api.evri.com/v1/quotations/about?entityURI=/location/united-states-0x2ae4b&appld=evri.com-restdoc
Quotations by anyone about a facet:
http://api.evri.com/v1/quotations/about?facet=politician&appld=evri.com-restdoc
Quotations by a person about an entity:
http://api.evri.com/v1/quotations/about?entityURI=/person/george-w.-bush-0x1beeb&speaker=/person/barack-obama-0x16f69&appld=evri.com-restdoc
Quotations by a person about any entity of a facet:
http://api.evri.com/v1/quotations/about?facet=politician&speaker=/person/barack-obama-0x16f69&appld=evri.com-restdoc
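The request forms above can be assembled programmatically. The helper below is a hypothetical sketch, not part of the API: `build_quotations_url` and its argument names are assumptions, while the query parameter names (`entityURI`, `facet`, `speaker`) and the base URL are taken from the examples. Any additional parameters, such as the application identifier, are passed through exactly as the caller spells them.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.evri.com/v1/quotations"


def build_quotations_url(speaker=None, entity_uri=None, facet=None, extra=None):
    """Assemble a quotations request URL in the form shown in the examples."""
    params = []
    if entity_uri:
        params.append(("entityURI", entity_uri))
    if facet:
        params.append(("facet", facet))
    if speaker:
        params.append(("speaker", speaker))
    params.extend((extra or {}).items())
    # The examples use the "/about" resource whenever a subject is given.
    path = BASE_URL + ("/about" if (entity_uri or facet) else "")
    # Keep "/" unescaped so entity URIs appear as they do in the examples.
    return path + "?" + urlencode(params, safe="/")


url = build_quotations_url(
    speaker="/person/barack-obama-0x16f69",
    entity_uri="/person/george-w.-bush-0x1beeb",
)
print(url)
```

Keeping the speaker and subject as separate optional arguments mirrors the five request shapes listed above: speaker only, entity only, facet only, speaker with entity, and speaker with facet.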
B. Input Parameters
The following parameters affect output results. See the usage section for each resource to assess applicability.
Example Entity Types
The following Table defines several example entity types in an example embodiment. Other embodiments may incorporate different types.
Example Facets
The following Table defines several example facets in an example embodiment. Other embodiments may incorporate different facets.
All of the above U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet, including but not limited to U.S. Provisional Patent Application No. 61/319,029, entitled “NLP-BASED SYSTEMS AND METHODS FOR PROVIDING QUOTATIONS,” filed Mar. 30, 2010, are incorporated herein by reference, in their entireties.
From the foregoing it will be appreciated that, although specific embodiments have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of this disclosure. For example, the methods, techniques, and systems for content recommendation are applicable to other architectures. For example, instead of utilizing a Vector Space Model of document indexing, systems that are programmed to perform natural language processing (e.g., parts of speech tagging) can be employed. Also, the methods, techniques, and systems discussed herein are applicable to differing query languages, protocols, communication media (optical, wireless, cable, etc.) and devices (such as wireless handsets, mobile communications devices, electronic organizers, personal digital assistants, portable email machines, game machines, pagers, navigation devices such as GPS receivers, etc.).
Number | Date | Country | |
---|---|---|---|
20110246181 A1 | Oct 2011 | US |
Number | Date | Country | |
---|---|---|---|
61319029 | Mar 2010 | US |