This invention relates to methods and apparatus for scoring media sources, including methods and apparatus that dynamically and automatically score media sources on their ability to predict events for each of a number of event types.
The above-referenced applications provide a system for predicting facts from sources such as internet news sources. For example, where an article references a scheduled future fact in a textually described prediction, such as “look for a barrage of shareholder lawsuits against Yahoo next week,” the system can map the lawsuit fact to a “next week” timepoint. Deriving occurrence timepoints from content meaning through linguistic analysis of textual sources in this way can allow users to approach temporal information about facts in new and powerful ways, enabling them to search, analyze, and trigger external events based on complicated relationships in their past, present, and future temporal characteristics. For example, users can use the extracted occurrence timepoints to answer the following questions that may be difficult to answer with traditional search engines:
What will the pollen situation be in Boston next week?
Will terminal five be open next month?
What's happening in New York City this week?
When will movie X be released?
When is the next SARS conference?
When is Pfizer issuing debt next?
Where will George Bush be next week? (see page 8, paragraphs 2-3)
In one general aspect, the invention features a computer-based method for extracting predictive information from a collection of stored, machine-readable electronic documents that includes accessing at least a subset of the electronic documents each including different machine-readable predictive information about one or more future facts occurring after a publication time for that document, and each with an identified source. The method includes extracting the predictive information about the one or more future facts from the accessed documents, acquiring verified information about one or more of the facts, and evaluating a measure of quality of the predictive information extracted from the documents based on the verified information about the facts.
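By way of illustration only, the evaluating step might be sketched as follows. The `Prediction` record, the `prediction_accuracy` function, the `tolerance_days` parameter, and the choice of a simple hit rate as the measure of quality are all assumptions made for this example and are not prescribed by the description above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record of one extracted prediction."""
    source: str
    fact_id: str
    predicted_on: date  # date on which the document says the fact will occur


def prediction_accuracy(
    predictions: Iterable[Prediction],
    verified: Dict[str, Optional[date]],
    tolerance_days: int = 0,
) -> float:
    """Fraction of verifiable predictions that came true on time.

    `verified` maps a fact id to the date the fact actually occurred,
    or to None if the fact was verified as not having occurred.
    Predictions with no verified information are skipped.
    """
    checked = 0
    hits = 0
    for p in predictions:
        if p.fact_id not in verified:
            continue  # nothing to compare against yet
        checked += 1
        actual = verified[p.fact_id]
        if actual is not None and abs((actual - p.predicted_on).days) <= tolerance_days:
            hits += 1
    return hits / checked if checked else 0.0
```

In this sketch the result could be stored against either the document or its source, and a tolerance of zero days requires the predicted and verified dates to match exactly.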
In preferred embodiments, the method can further include the step of associating a result of the step of evaluating for each of the documents with its corresponding document. The method can further include the step of associating a result of the step of evaluating for each of the documents with a source for its corresponding document. The step of associating can update a speed-of-prediction score for at least one of the sources. The step of associating can update a quality-of-prediction score for at least one of the sources. The steps of accessing, extracting, acquiring, evaluating, and associating can be repeated for a number of documents from a number of sources to derive and continuously update a set of scores for a plurality of sources. The method can further include the step of deriving a likelihood measure for at least one future event based on a set of predictions by different sources and the scores of those sources. The step of extracting can employ natural language processing by a computer. The step of accessing can access documents before the facts that they predict occur, with the documents being associated with a publication time that includes a machine-readable publication date, and with the step of evaluating updating a ranking of sources. The step of evaluating can evaluate a measure of how well a source is followed by other sources, with the step of updating a ranking updating the ranking based on this measure. The step of evaluating can evaluate a measure of how quickly a source predicts a fact, with the step of updating a ranking updating the ranking based on this measure. The step of evaluating can evaluate whether sources predict facts first, with the step of updating a ranking updating the ranking based on this measure. The step of acquiring verified information about one or more of the facts can acquire verified information that indicates whether the facts occurred and, if so, when.
The steps of accessing, extracting, acquiring, and evaluating can be performed for a number of different groups of sources of different types.
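As a non-limiting illustration, the likelihood measure for a future event could be derived as a score-weighted vote over the sources that made a prediction. The function name `event_likelihood`, the boolean form of each prediction, and the neutral fallback value of 0.5 are assumptions made for this sketch only.

```python
from typing import Mapping


def event_likelihood(
    predictions: Mapping[str, bool],
    scores: Mapping[str, float],
) -> float:
    """Score-weighted share of sources predicting that the event will occur.

    `predictions` maps a source name to True (predicts occurrence) or
    False (predicts non-occurrence); `scores` maps a source name to its
    non-negative prediction score. Unscored sources carry zero weight.
    """
    total = sum(scores.get(src, 0.0) for src in predictions)
    if total == 0:
        return 0.5  # assumption: no scored source gives no information either way
    in_favor = sum(scores.get(src, 0.0) for src, occurs in predictions.items() if occurs)
    return in_favor / total
```

Under this weighting, a single high-scoring source can outweigh several low-scoring sources that disagree with it, which is one possible reading of combining predictions with source scores.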
In another general aspect, the invention features a computer-based apparatus for extracting predictive information from a collection of stored, machine-readable electronic documents from a plurality of different sources. The apparatus includes an interface for accessing at least a subset of the electronic documents each including different machine-readable predictive information about one or more future facts occurring after a publication time for that document, and each with an identified source, a predictive information extraction subsystem operative to extract predictive information about the one or more future facts from the documents accessed by the interface, and a source ranker responsive to verified information about one or more facts about which information is included in documents from a plurality of the sources and being operative to provide a measure of source quality to the predictive information extraction subsystem.
In preferred embodiments, the source ranker can provide a speed-of-prediction score for at least one of the sources. The source ranker can provide a quality-of-prediction score for at least one of the sources. The source ranker can be operative to derive and continuously update a set of scores for a plurality of sources. The predictive information extraction subsystem can employ natural language processing by a computer. The source ranker can be operative to evaluate a measure of how well a source is followed by other sources. The source ranker can be operative to evaluate a measure of how quickly a source predicts a fact. The source ranker can be operative to evaluate whether sources predict facts first. The source ranker can be operative to evaluate a number of different groups of sources of different types.
In a further general aspect, the invention features a computer-based apparatus for extracting predictive information from a collection of stored, machine-readable electronic documents from a plurality of different sources. The apparatus includes means for accessing at least a subset of the electronic documents each including different machine-readable predictive information about one or more future facts occurring after a publication time for that document, and each with an identified source, means for extracting the predictive information about the one or more future facts from the accessed documents, means for acquiring verified information about one or more of the facts, and means for evaluating a measure of quality of the predictive information extracted from the documents based on the verified information about the facts.
Systems according to one aspect of the invention help to optimize systems that extract predictive information from sources such as textual documents by scoring media sources on their ability to predict events. Referring to
The canonical and clustered events correspond to “real world events,” broken down by appropriate time period. For example, all the natural disaster reports around Hurricane Irene can be grouped into an event cluster. Below, such clustered/canonical events are referred to as events for simplicity.
Some sources (newspapers, blogs, government sites, etc.) are presumably consistently “better” at predicting events than others. Validated events are events that have been validated through a process including human curation/validation (experts, crowd, etc.). Being “good” or “better” at prediction can carry different meanings, for example:
a. Being first to report upon validated events
b. Being first to initiate clusters (i.e. break news stories)
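Criterion (a) above can be sketched, purely as an example, by counting how often each source published the earliest report of a validated event. The `Report` record, the integer timestamp, and the rule that a tie goes to the report encountered first are assumptions of this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Set


class Report(NamedTuple):
    """Hypothetical record of one document reporting on an event."""
    source: str
    event_id: str
    published: int  # publication time, e.g. a Unix timestamp


def first_reporter_counts(reports: Iterable[Report], validated: Set[str]) -> Counter:
    """Count, per source, the validated events that the source reported first."""
    earliest: Dict[str, Report] = {}
    for r in reports:
        if r.event_id not in validated:
            continue  # only validated events contribute
        best = earliest.get(r.event_id)
        # Strict comparison: on a tie, the report seen first keeps the credit.
        if best is None or r.published < best.published:
            earliest[r.event_id] = r
    return Counter(r.source for r in earliest.values())
```

The same counting could be applied to criterion (b) by treating each event cluster as the unit and crediting the source of the earliest document in the cluster.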
Referring to
Execute the steps below on a historical archive:
Sort all sources S for each ET, rank-ordered by PS, and normalize PS to a scale of 0-100.
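The ranking and normalization step can be illustrated with the following sketch. It assumes that ET denotes an event type and PS a numeric per-source prediction score, and it uses min-max scaling within each event type to reach the 0-100 range; the text does not specify the normalization, so this is only one possible choice, and the function name `rank_sources` is hypothetical.

```python
from typing import Dict, List, Tuple


def rank_sources(raw: Dict[str, Dict[str, float]]) -> Dict[str, List[Tuple[str, float]]]:
    """For each event type, return (source, normalized score) pairs, best first.

    `raw` maps an event type to a mapping of source name -> raw score.
    Scores are min-max scaled to 0-100 within each event type.
    """
    ranked: Dict[str, List[Tuple[str, float]]] = {}
    for event_type, scores in raw.items():
        if not scores:
            ranked[event_type] = []
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        normalized = {
            # Assumption: if all sources tie, each gets the top score.
            src: 100.0 if span == 0 else 100.0 * (ps - lo) / span
            for src, ps in scores.items()
        }
        ranked[event_type] = sorted(
            normalized.items(), key=lambda kv: kv[1], reverse=True
        )
    return ranked
```

Because the scaling is done per event type, a source can rank at the top for one event type and near the bottom for another.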
The system described above has been implemented in connection with special-purpose software programs running on general-purpose computer platforms, but it could also be implemented in whole or in part using special-purpose hardware. And while the system can be broken into the series of modules and steps shown for illustration purposes, one of ordinary skill in the art would recognize that it is also possible to combine them and/or split them differently to achieve a different breakdown, and that the functions of such modules and steps can be arbitrarily distributed and intermingled within different entities, such as routines, files, and/or machines. Moreover, different providers can develop and operate different parts of the system.
The present invention has now been described in connection with a number of specific embodiments thereof. However, numerous modifications which are contemplated as falling within the scope of the present invention should now be apparent to those skilled in the art. Therefore, it is intended that the scope of the present invention be limited only by the scope of the claims appended hereto. In addition, the order of presentation of the claims should not be construed to limit the scope of any particular term in the claims.
This application claims the benefit under 35 U.S.C. 119(e) of U.S. provisional application Ser. No. 61/563,528 filed Nov. 23, 2011, which is herein incorporated by reference. This application is related to U.S. Application Serial Nos. 20100299324 and 20090132582 both entitled Information Service for Facts Extracted from Differing Sources on a Wide Area Network as well as to U.S. Application Ser. No. 61/550,371 and Ser. No. 13/657825 both entitled Search Activity Prediction, which are all herein incorporated by reference.
Number | Date | Country
---|---|---
61563528 | Nov 2011 | US