This application relates to information services, such as information services for facts extracted from content meaning across differing sources on a wide area network. Content meaning can be derived through linguistic analysis, metadata, or other approaches.
Many approaches for extracting and using information from large networking environments, such as the Internet, have been proposed and implemented. Search engines and manually generated indexes are among the most common tools used for this purpose today, but there are literally hundreds of other specialized and/or complex data mining techniques that have been developed. And a large amount of effort is constantly being expended to improve and reengineer existing approaches as well as to develop new ones.
In one general aspect, the invention features a wide area network fact information service method that includes storing a plurality of canonical fact entries, storing one or more fact descriptor entries for each of the canonical fact entries, and ranking the canonical fact entries relative to each other based on the descriptor entries.
In preferred embodiments, the step of ranking can be a continuous process that recalculates the rank of canonical fact entries based on new descriptor entries. The step of ranking can be an iterative process that recalculates the rank of canonical fact entries based on new descriptor entries. The step of ranking can be a parallel process that recalculates the rank of canonical fact entries based on new descriptor entries. The ranking can be based on a predetermined ranking for entities associated with the canonical fact entries. The entity ranking can be based on a number of descriptor entries. The entity ranking can be based on a number of descriptor entries within a category. The entity ranking can be based on a sentiment/attitude value. The entity ranking can be based on co-occurrence with other facts and their rankings. The ranking can be based on a ranking of source credibility for data source providers associated with the descriptor entries. A same source provider can have different credibility values for different sources that it provides. A same source provider can have different credibility values for different categories of sources that it provides. The credibility ranking can be based on at least one user interest measure. The ranking can be based on a proximity measure for the descriptor entries. The proximity measure can be a temporal proximity measure. Proximity measures can be combined using a fading factor. The ranking can be based on a plurality of similarity measures that express a similarity between the canonical fact entries and/or fact descriptor entries. The similarity measures can relate to fact type and temporal overlap. The similarity measures can operate according to a hierarchy. The ranking can be based on publication times.
In another general aspect, the invention features a wide area network fact information service system that includes canonical fact entry storage, fact descriptor entry storage for storing one or more fact descriptor entries for each of the canonical fact entries, and a ranker for ranking the canonical fact entries relative to each other based on the descriptor entries.
In a further general aspect, the invention features a wide area network fact information service system that includes means for storing a plurality of canonical fact entries, means for storing one or more fact descriptor entries for each of the canonical fact entries, and means for ranking the canonical fact entries relative to each other based on the descriptor entries.
Systems according to the invention can be beneficial in that they can allow users to approach temporal information about facts in new and powerful ways, enabling them to search, analyze, and trigger external events based on complicated relationships in their past, present, and future temporal characteristics.
Referring to
The system 10 can also include research, monitoring, analysis, and execution machinery 30, which is responsive to the information sources 20. This part of the system can cooperate with a fact data warehouse 50, as well as several external interfaces. A data cache 40 can also be provided to speed up data retrieval in certain circumstances.
The external interfaces include a user interface, which is temporal logic based, for searching historical, present, and future facts 60, and a user interface for defining complex sequences of facts 70. The external interfaces also include a Web services interface, which is temporal logic based, for searching historical, present, and future facts 80, and a Web services-based programming interface for defining complex sequences of facts 90. The system 10 can also generate a “subscribable” fact stream for generated facts in the “real world” (e.g., buying a stock, creating a news story, triggering a supply chain update).
Facts are pieces of information about occurrences that can take place anywhere and can then be described, reported, or otherwise manifested or revealed in some form on a computer network. A sports feed can report facts for a game, for example, such as by updating a score tally. A sports blog can also focus on different facts from the same game and/or can describe the same facts from the same game in different ways.
The facts themselves can also be network-based. In the case of an electronic corporate securities filing, for example, the occurrence on the network of the filing itself can be a fact. And it can also act as a source of descriptive material for facts that it describes, such as a company's product release dates.
Facts and information about them are typically acquired by applying software such as entity and event extractors to text documents/sources. One approach to extraction is to linguistically analyze plain text, such as through the use of services from Reuters, ClearForest, InXight, and/or Attensity. Extraction can also involve simple harvesting where the content already contains meta-data, such as Resource Description Framework (RDF) tags.
If, for example, an article includes the following sentence:
“Fort Orange financial completes $3.3M stock offering.”
the system can use linguistic analysis to map the document date to the investment fact. Note that in some circumstances, techniques amounting to less-than-perfect linguistic analysis, such as entity-verb clustering, can be used without excessive loss of performance.
In another example, an article includes the following sentence:
“Look for a barrage of shareholder lawsuits against Yahoo next week”
In this case, the system can map the lawsuit fact to a “next week” timepoint (a scheduled future fact).
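The mapping of a relative expression such as “next week” to a timepoint can be illustrated with a minimal sketch. The function name resolve_next_week, the choice of a Monday-to-Sunday week, and the publication date used below are assumptions made for illustration only; the actual resolution rules used by the system are not specified here.

```python
from datetime import date, timedelta
from typing import Tuple


def resolve_next_week(publication_date: date) -> Tuple[date, date]:
    """Return the (Monday, Sunday) interval of the week following publication.

    Assumption: weeks run Monday..Sunday and "next week" is anchored
    to the publication date of the document containing the phrase.
    """
    # date.weekday() returns 0 for Monday, so this steps back to Monday.
    this_monday = publication_date - timedelta(days=publication_date.weekday())
    next_monday = this_monday + timedelta(days=7)
    return next_monday, next_monday + timedelta(days=6)


# Hypothetical publication date for the article (not taken from the source).
start, end = resolve_next_week(date(2008, 2, 14))
```

A scheduled future fact could then be stored with the interval (start, end) as its occurrence time, separately from the publication date of the article.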
Future facts can be scheduled facts, such as the expected Yahoo lawsuits or events extracted from an Internet calendar. They can also be predicted based on a variety of prediction methods. These can range from complex statistical forecasting methods to simple inferences, such as where a company's next annual meeting is predicted to be on the same day as all of its past annual meetings.
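The simple inference mentioned above, in which a next annual meeting is predicted from the dates of past meetings, can be sketched as follows. The function name and the rule of returning no prediction when the past dates disagree are illustrative assumptions, not a description of the prediction methods actually used.

```python
from datetime import date
from typing import List, Optional


def predict_next_annual_meeting(past_meetings: List[date]) -> Optional[date]:
    """Predict the next meeting date if all past meetings share a month and day.

    Returns None when there is no history, when the past dates do not agree,
    or when the shared day does not exist in the following year (Feb. 29).
    """
    if not past_meetings:
        return None
    month_days = {(d.month, d.day) for d in past_meetings}
    if len(month_days) != 1:
        return None  # no consistent pattern, so this simple inference does not apply
    month, day = next(iter(month_days))
    next_year = max(d.year for d in past_meetings) + 1
    try:
        return date(next_year, month, day)
    except ValueError:
        return None


# Hypothetical meeting history (not taken from the source).
history = [date(2006, 5, 15), date(2007, 5, 15), date(2008, 5, 15)]
prediction = predict_next_annual_meeting(history)
```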
Referring to
Above the fact loading layer 100 is a fact transformation layer 108, which can operate based on linguistics, semantics, and/or mathematics/statistics. Above the fact transformation layer are relations storage 110, a fact data warehouse 112, a fact in-memory segment 114 (cache), and an inverted future (timelines) module 116. At the next level is a fact modeling and computation engine 118, which can work with prediction, correlation, and probabilities. Layered above the fact modeling and computation engine is a temporal-based fact query language 120. A text search/modeling user interface 122, a graphical user interface framework 124, and an application programming interface/software development kit 126 are all layered over the temporal-based fact query language. Domain-specific applications 128 are in turn layered above these modules.
Examples of domain-specific applications can include:
Referring to
The system can present its results to the user in a variety of formats. It can present them in a simple hit list-based result output, similar to that of a traditional search engine, or it can use a temporally oriented format, such as a timeline. It can also use any other suitable user-oriented or machine-oriented format, such as more elaborate graphical user interfaces, RSS feeds, e-mail alerts, XML documents, or proprietary binary formats. Advertising can be associated with results, and this advertising can be targeted based on the specific facts and/or entities involved.
The system can provide a variety of types of services. A fact-based searching system can be provided for use by the general public or a specific segment. Fully customized, minimally filtered, or even raw fact feed subscriptions can also be provided. And more quantitative searching solutions could be provided, as well, such as for financial services applications.
One type of service is a news service. The service receives a user profile, which allows a user to specify interests. Information about facts relevant to these interests can then be provided to the user in a variety of formats, such as feeds, or an electronic newspaper format.
Mapping facts to temporal information in the database allows the system to answer questions that may be difficult to answer with traditional search engines. Here are some examples:
What will the pollen situation be in Boston next week?
Will terminal five be open next month?
What's happening in New York City this week?
When will movie X be released?
When is the next SARS conference?
When is Pfizer issuing debt next?
Where will George Bush be next week?
Systems according to the invention can also answer more complex questions about the relationship between facts, such as “what happened to similar entities in similar chains of events?”
Referring to
A software development kit 166 allows developers to iterate facts, perform transformations and predictions, and implement user interface elements. The system can also provide a search/query engine 168 as well as user experience templates 170 and rendering 172 to produce different types of interfaces, such as search, timeline, and newspaper interfaces. RSS feeds 174 can also be generated from the database.
The system described above has been implemented in connection with stored special-purpose software program instructions running on a general-purpose computer platform, but it could also be implemented in whole or in part using special-purpose hardware. And while the system can be broken into the series of modules and steps shown in the various figures for illustration purposes, one of ordinary skill in the art would recognize that it is also possible to combine them and/or split them differently to achieve a different breakdown.
Fact ranking is vital to providing a good user experience for both monitoring/alert and search applications. A ranking approach based on six concepts is proposed, and several ways of computing the event ranking are suggested.
Referring to
Facts (e.g., events) detected in documents are of course descriptions of “real world” facts; although they are therefore strictly “fact descriptors”, they are still referred to as “facts”. These event descriptors can be thought of as being related to a corresponding “canonical fact,” as shown in
In the model, a canonical fact has just a fact type and a set of entities, whereas a fact (or fact descriptor) has additional information such as publication date, source, etc.
The entities take different roles, e.g. for an acquisition event the two roles are “Company_Acquirer” and “Company_BeingAcquired” (these are Calais event types). All fact descriptors of the same type and with the same entities (in the corresponding roles) are linked to the canonical fact.
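The relationship between fact descriptors and canonical facts can be sketched as a small data model. The class names, the FactStore helper, and the dates and source names below are hypothetical; only the role names and the rule that descriptors with the same type and the same entities in the same roles share a canonical fact come from the description above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CanonicalFact:
    """A fact type plus its entities, keyed by role."""
    fact_type: str
    # Sorted (role, entity) pairs, so the key does not depend on argument order.
    roles: Tuple[Tuple[str, str], ...]


@dataclass
class FactDescriptor:
    """A detected fact: a canonical fact plus publication date and source."""
    canonical: CanonicalFact
    publication_date: date
    source: str


class FactStore:
    def __init__(self) -> None:
        self.by_canonical: Dict[CanonicalFact, List[FactDescriptor]] = {}

    def add(self, fact_type: str, roles: Dict[str, str],
            publication_date: date, source: str) -> FactDescriptor:
        key = CanonicalFact(fact_type, tuple(sorted(roles.items())))
        descriptor = FactDescriptor(key, publication_date, source)
        # Link the descriptor to its canonical fact, creating it if needed.
        self.by_canonical.setdefault(key, []).append(descriptor)
        return descriptor


store = FactStore()
a = store.add("Acquisition",
              {"Company_Acquirer": "Microsoft", "Company_BeingAcquired": "Yahoo"},
              date(2008, 2, 1), "source-a")
b = store.add("Acquisition",
              {"Company_BeingAcquired": "Yahoo", "Company_Acquirer": "Microsoft"},
              date(2008, 2, 4), "source-b")
# Same entities in swapped roles: a different canonical fact.
c = store.add("Acquisition",
              {"Company_Acquirer": "Yahoo", "Company_BeingAcquired": "Microsoft"},
              date(2008, 2, 5), "source-c")
```

In this sketch, descriptors a and b are linked to one canonical fact, while c is linked to another because the entities appear in different roles.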
Different existing entity extractors can be used, such as Calais or Basis. These can exhibit some shortcomings, however, in that they can have several entity IDs for what should be the same entity, for example “Microsoft”/“Microsoft Corp”/“Microsoft Corp.” and “IBM”/“International Business Machines”. Even if and when disambiguation is improved, the system will keep its own identities, to guard against future changes in the entity extractor. In the system's ontology, pointers are kept to the different entity extractor identities (hash values). In addition, pointers are kept to other information about the tsup_entity. The tsup entities and their mapping to extractor entities are implemented with two tables: tsup_entity and tsup_extractor_entity_map.
tsup_entity:
tsup_extractor_entity_map:
NOTE: first (master) associated extractor entity_id is used as the tsup_entity id.
The data set is built incrementally as new entities are detected in the output from the entity extractor. Initially, the data set can be populated from the Entity table in the database, and entries can be added to the tsup_extractor_entity_map table using the equivs file (/home/truve/equivs), which has the following format (alias; master):
AT&T Inc; AT&T
Alberto-Culver Co; Alberto-Culver
Alitalia SpA; Alitalia
Altria Group Inc; Altria
Amazon.com Inc; Amazon
American International Group Inc; AIG
American International Group; AIG
Arcelor Mittal; Arcelor
ArcelorMittal; Arcelor
BG Group Plc; BG Group
Bank of America Corp; Bank of America
Blackstone Group; Blackstone
Blackstone Group LP; Blackstone
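A minimal sketch of reading the (alias; master) format shown above into an alias lookup follows. The function names parse_equivs and resolve are hypothetical, and the sketch keeps the mapping in memory rather than writing to the tsup_extractor_entity_map table, whose columns are not described here.

```python
from typing import Dict, Iterable


def parse_equivs(lines: Iterable[str]) -> Dict[str, str]:
    """Map each alias to its master name from 'alias; master' lines."""
    alias_to_master: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        alias, sep, master = line.partition(";")
        if not sep:
            continue  # skip lines without the ';' separator
        alias_to_master[alias.strip()] = master.strip()
    return alias_to_master


def resolve(name: str, alias_to_master: Dict[str, str]) -> str:
    """Return the master name for an alias, or the name itself if unmapped."""
    return alias_to_master.get(name, name)


sample = [
    "American International Group Inc; AIG",
    "American International Group; AIG",
    "ArcelorMittal; Arcelor",
    "Blackstone Group LP; Blackstone",
]
equivs = parse_equivs(sample)
```

With this mapping, both “American International Group Inc” and “American International Group” resolve to the master name “AIG”.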
Six concepts are used to perform fact ranking:
Entity Ranking (ER)
Source Credibility (SC)
Initial Event Ranking (IER)
Proximity Measure (PM)
Similarity Measure (SM)
Derived Event Ranking (DER)
For a fact E with source S and included entities e1 . . . en, and with the source credibility function c, and entity ranking function r, the following holds:
IER(E)=f(c(S),g(r(e1) . . . r(en)))
The functions c, f and g all have the value range 0 . . . 1.
There are many ways to choose the functions f and g; one choice is to multiply the source credibility by the aggregated entity rankings, i.e. f(x,y)=x*y. The entity weight aggregation function g can be chosen e.g. as the maximum ranking of any included entity, or the mean ranking. Other aggregation functions are of course also possible.
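The choice f(x,y)=x*y with g taken as either the maximum or the mean can be sketched as follows. The function name and the numeric credibility and entity ranking values are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean
from typing import Callable, Sequence


def initial_event_ranking(
    source_credibility: float,
    entity_rankings: Sequence[float],
    aggregate: Callable[[Sequence[float]], float] = max,
) -> float:
    """IER(E) = f(c(S), g(r(e1) ... r(en))) with f(x, y) = x * y.

    `aggregate` plays the role of g; max and statistics.mean are the
    two choices named in the text.
    """
    if not entity_rankings:
        raise ValueError("a fact needs at least one entity to be ranked")
    return source_credibility * aggregate(entity_rankings)


# Hypothetical values: one source credibility and three entity rankings.
ier_max = initial_event_ranking(0.8, [0.9, 0.5, 0.1])         # 0.8 * 0.9
ier_mean = initial_event_ranking(0.8, [0.9, 0.5, 0.1], mean)  # 0.8 * 0.5
```

Using the maximum favors facts that involve at least one highly ranked entity, whereas the mean lowers the ranking of facts that also involve many low-ranked entities.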
Derived Event Ranking
The DER is calculated through an iterative function, with initial values being given by:
DER0(e)=IER(e)
The simplest DER function used, DERA, just “boosts” events if they are similar to other events with higher ranking (or, more precisely, higher ranking times similarity):
DERA(n+1)(e)=FOR ALL EVENTS i=1 . . . max: SUM(DERAn(ei)*SM(e,ei))/m
WHERE m=number of events (1 . . . max) for which SM(e,ei)>0 and DERAn(e)<DERAn(ei)*SM(e,ei)
Assume there are two documents: document 1 with facts A and B, and document 2 with facts C and D. The facts have the following IER:
IER(A)=4.0
IER(B)=2.0
IER(C)=4.0
IER(D)=1.0
Furthermore, assume that there is one SM>0 (apart from entities being fully similar to themselves!)
SM(B,C)=0.8
This is a high degree of similarity—for example the same event type and same entities but different times. This example is shown graphically in
The DERA function converges after 29 iterations with the following values:
DERA29(A)=4.0
DERA29(B)=3.2
DERA29(C)=4.0
DERA29(D)=1.0
So, event B has been given a higher ranking than its IER, and all other rankings remain unchanged.
Here is a slightly more complicated example, shown graphically in
IER(A)=4.0
IER(B)=2.0
IER(C)=4.0
IER(D)=10.0
SM(B,C)=0.8
SM(C,D)=0.5
In this case, the iterative method converges after 34 iterations with the following result:
DERA34(A)=4.0
DERA34(B)=4.0
DERA34(C)=5.0
DERA34(D)=10.0
So, both B and C have now been boosted, although C has been boosted less, since its similarity to the highly ranked D is lower.
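The two examples above can be reproduced with a minimal sketch of the DERA update. The sketch assumes one literal reading of the formula: an event is replaced by the mean of DERAn(ei)*SM(e,ei) over only those events ei that satisfy both conditions, it is left unchanged when no event qualifies, and SM is symmetric. It is not the implementation that produced the iteration counts reported above, and it does not attempt to reproduce those counts; only the final values are compared. The function names and the tolerance are hypothetical.

```python
from typing import Dict, Tuple


def dera_step(der: Dict[str, float],
              sm: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """One synchronous DERA update over all events."""
    updated = {}
    for event, rank in der.items():
        # Boosting terms: similar events whose rank times similarity exceeds
        # this event's current rank. An event never boosts itself, because
        # rank < rank * 1.0 is false.
        boosts = [der[other] * s
                  for (first, other), s in sm.items()
                  if first == event and s > 0 and rank < der[other] * s]
        updated[event] = sum(boosts) / len(boosts) if boosts else rank
    return updated


def dera(ier: Dict[str, float],
         similarities: Dict[Tuple[str, str], float],
         max_iterations: int = 100,
         tolerance: float = 1e-9) -> Dict[str, float]:
    """Iterate from DER0 = IER until no value changes by more than tolerance."""
    sm = {}
    for (a, b), s in similarities.items():
        sm[(a, b)] = s
        sm[(b, a)] = s  # assumption: similarity is symmetric
    der = dict(ier)
    for _ in range(max_iterations):
        nxt = dera_step(der, sm)
        if all(abs(nxt[e] - der[e]) <= tolerance for e in der):
            return nxt
        der = nxt
    return der


first = dera({"A": 4.0, "B": 2.0, "C": 4.0, "D": 1.0}, {("B", "C"): 0.8})
second = dera({"A": 4.0, "B": 2.0, "C": 4.0, "D": 10.0},
              {("B", "C"): 0.8, ("C", "D"): 0.5})
```

Under these assumptions, the first call leaves A, C, and D unchanged and raises B to 3.2, and the second call raises C to 5.0 and B to 4.0, matching the final values given in the two examples.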
Proof Sketch that DERA Always Converges
The DERA iterative method always converges for the following reasons:
An alternative method is to use both similarity and proximity to compute the derived ranking.
Counting Canonical Facts—The DERC Method
The DERA method uses the fact similarity measure SM to allow similar facts to influence each other's ranking. Taking this one step further, a system can start counting the number of occurrences of each “canonical fact”. As an example, there will be many facts relating to the possible Microsoft acquisition of Yahoo, with different times, sources and other differences, but all relating to the “canonical” fact:
EVENT_TYPE=ACQUISITION
ACQUIRER=MICROSOFT
ACQUIREE=YAHOO
A system can thus count the number of events relating to this canonical fact, and use the canonical fact count as a complementary way of ranking events.
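Counting events per canonical fact can be sketched with a simple tally. The event tuples, source names, and function name below are hypothetical; how the resulting count is combined with the other rankings is not specified here, so the sketch only produces the counts.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, Dict[str, str], str]  # (event type, roles, source)


def count_canonical_facts(events: Iterable[Event]) -> Counter:
    """Count events per canonical fact (event type plus entities in roles)."""
    counts: Counter = Counter()
    for event_type, roles, _source in events:
        # Time, source, and other differences are ignored in the key.
        counts[(event_type, tuple(sorted(roles.items())))] += 1
    return counts


# Hypothetical events from three sources.
events = [
    ("ACQUISITION", {"ACQUIRER": "MICROSOFT", "ACQUIREE": "YAHOO"}, "source-a"),
    ("ACQUISITION", {"ACQUIRER": "MICROSOFT", "ACQUIREE": "YAHOO"}, "source-b"),
    ("ACQUISITION", {"ACQUIREE": "YAHOO", "ACQUIRER": "MICROSOFT"}, "source-c"),
    ("ACQUISITION", {"ACQUIRER": "YAHOO", "ACQUIREE": "MICROSOFT"}, "source-a"),
]
counts = count_canonical_facts(events)
top_fact, top_count = counts.most_common(1)[0]
```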
Other Ranking Methods
The ranking of a document in which a fact occurs is also a useful measure. Using the Google PageRank of the document is the most straightforward way to do this.
Flowchart for Event Ranking
In summary, referring to
The IER is calculated when a fact is added to the database, using semi-static information from the database about entity ranking, source credibility and potentially additional information (document page ranking would be one example). The IER for a fact can be recomputed at any time, using updated input data.
The DER methods iteratively update the DER value (or values) of each fact, using the IER and DER values of all other facts to which it relates.
Illustrative Implementation Outline
Initial tsup_entity Table
Build the initial tsup_entity table based on the entity table and the equivalence file.
Adding to the tsup_entity Table
When a new extractor_entity is found (one not already in the tsup_entity table), create a new tsup_entity.
Logging
Every N minutes (currently, N=30):
Acquire current time (CT)
For each tsup_entity:
For each tsup_entity:
For each entity category (Company, Person, Country . . . ):
For each tsup_entity:
Find source_credibility sc
Find tsup_entity aggregate_rank er1 . . . ern of each entity related to the event
Compute two IER values as:
IER_mean=sc*sum(er1 . . . ern)/n
IER_max=sc*max(er1 . . . ern)
Store these two rank values for the event
NOTE: other functions might be used to calculate other IER values.
Derived Event Ranking
As described in the paragraph above.
Ahmad Shah Massoud was the head of the Northern Alliance in Afghanistan. He was killed on Sep. 9, 2001. A search in the event database on September 9th would have classified this to be a relatively unimportant event, and this would have been in accord with general sentiment about its importance. 48 hours later, however, it was deemed to be an extremely important event.
This change in importance resulted from reports of Massoud's killing being published on or just around Sep. 11, 2001 in highly ranked sources, and from their co-occurrence in documents with another key event, the 9/11 terrorist attacks (which for various reasons have been ranked high). Accordingly, 48 hours later, the event's generally agreed-upon importance, and likewise its ranking in the system, would be high.
While the temporal co-occurrence of the two events would no doubt be sufficient by itself to greatly increase the ranking of the Massoud story, the following factors could also have an effect:
1) Its importance within its category (e.g., Afghan politics or Middle East politics) was high, and that category became more important after 9/11.
2) Sentiment toward the Northern Alliance may have changed after 9/11.
3) There may have been a relatively close proximity measure between the 9/11 attacks and Massoud's name in the co-occurring documents.
4) The assassination and the 9/11 attacks are similar types of events (terrorist attacks involving explosives).
5) The two events both involved the same lower-level hierarchical entity, “Afghanistan,” rather than higher-level entities, such as “Middle East.”
It is important to distinguish between publication times for a fact and the occurrence time of the fact. Articles that predict an impending bankruptcy, for example, may describe its expected date with more and more precision as time progresses. Later-published articles can therefore be weighted more heavily in predicting the date of a fact. Trends about the expected time of occurrence of a fact can even be extracted from the evolution of its prediction in articles over time. And these trends can point to a more accurate prediction of the date of occurrence of the fact.
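Weighting later-published articles more heavily when predicting the date of a fact can be sketched as a weighted average of predicted dates. The exponential weighting, the half-life parameter, the function name, and the dates below are all assumptions made for illustration; no particular weighting scheme is prescribed here.

```python
from datetime import date
from typing import List, Optional, Tuple


def estimate_occurrence_date(
    predictions: List[Tuple[date, date]],
    as_of: date,
    half_life_days: float = 30.0,
) -> Optional[date]:
    """Estimate an occurrence date from (publication date, predicted date) pairs.

    Assumption: an article's weight halves for every `half_life_days`
    of age, so recently published predictions dominate the estimate.
    """
    weight_sum = 0.0
    weighted_ordinals = 0.0
    for published, predicted in predictions:
        age_days = (as_of - published).days
        if age_days < 0:
            continue  # ignore articles published after the evaluation time
        weight = 0.5 ** (age_days / half_life_days)
        weight_sum += weight
        weighted_ordinals += weight * predicted.toordinal()
    if weight_sum == 0.0:
        return None
    return date.fromordinal(round(weighted_ordinals / weight_sum))


# Hypothetical articles: an older one predicting June 1 and a newer one May 1.
articles = [
    (date(2009, 1, 1), date(2009, 6, 1)),
    (date(2009, 3, 1), date(2009, 5, 1)),
]
estimate = estimate_occurrence_date(articles, as_of=date(2009, 3, 1))
```

In this sketch the estimate falls between the two predicted dates but closer to the one from the more recently published article.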
Referring to
The present invention has now been described in connection with a number of specific embodiments thereof. However, numerous modifications which are contemplated as falling within the scope of the present invention should now be apparent to those skilled in the art. It is therefore intended that the scope of the present invention be limited only by the scope of the claims appended hereto. In addition, the order of presentation of the claims should not be construed to limit the scope of any particular term in the claims.
This patent application claims the benefit of provisional application 61/205,567, filed on Jan. 21, 2009, which is herein incorporated by reference. This patent application also relates to the subject matter of U.S. provisional application No. 60/940,643, filed on May 29, 2007, U.S. provisional application No. 61/068,967, filed on Mar. 11, 2008, and U.S. patent application Ser. No. 12/156,455, filed on May 29, 2008, which are all herein incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20100299324 A1 | Nov 2010 | US |
Number | Date | Country | |
---|---|---|---|
61205567 | Jan 2009 | US |