Claims
- 1. A system for identifying and matching company names and business events occurring in a document, the system comprising:
a. a crawler for downloading documents; b. a parser for parsing the downloaded documents; c. an evaluator for evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in each of the parsed documents; and d. an information extractor for identifying and matching business events to company names, the business events and company names being present in the selected documents.
- 2. The system of claim 1 wherein the crawler downloads documents identified by a pre-defined first set of links.
- 3. The system of claim 1 wherein the parser for parsing the downloaded documents breaks down the downloaded documents into components, the components comprising at least one of free text, titles and a second set of links to other documents.
- 4. The system of claim 1 wherein for each of the documents the amount of relevant information corresponds to a text portion of the document, and wherein the score corresponds to a ratio of the amount of relevant information to a size of the document.
- 5. The system of claim 1 wherein the information extractor further identifies co-references of the company names occurring in the text contained in the selected documents, the co-references being substitutes that are used to refer to company names in different parts of the text in the selected document.
- 6. The system of claim 1 wherein the information extractor further computes a match score for each of the matches found in each of the selected documents on the basis of a distance between a company name and a business event that constitute a match in the selected document.
- 7. The system of claim 1 wherein the information extractor further generates a confidence rating for each of the selected documents, the confidence rating being based on contributions from the matches between business events and company names and the contribution from the orphan events in the selected document.
- 8. A method for identifying and matching company names and business events, the method comprising the steps of:
a. crawling a first set of links on a network to download documents; b. parsing the downloaded documents; c. evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in the document; and d. processing the selected documents to generate company-event pairs from information present in text contained in the document.
- 9. The method of claim 8 wherein the step of crawling a first set of links comprises the steps of:
a. identifying the first set of links, the links being references to locations of documents on the network; and b. downloading the documents available at the locations on the network referenced by the first set of links.
- 10. The method of claim 8 wherein the step of parsing the downloaded documents comprises the steps of:
a. breaking down the documents into individual components, the components comprising at least one of free text, titles and a second set of links to other documents on the network; and b. adding the second set of links to the first set of links used for crawling.
- 11. The method of claim 8 wherein the step of evaluating the parsed documents further comprises the steps of:
a. assigning an information quantity score to each of the parsed documents on the basis of amount of relevant information contained in the parsed documents; and b. selecting the documents on the basis of the information quantity score assigned to the parsed documents.
- 12. The method of claim 11 wherein the information quantity score of the parsed document is computed as a ratio of free text contained in the document to a size of the document.
- 13. The method of claim 8 wherein the step of processing the selected documents to generate company-event pairs further comprises the steps of:
a. identifying the occurrences of company names and their co-references in each of the selected documents, the co-references being substitutes that are used to refer to company names in different parts of the text; b. identifying the occurrences of business events in each of the selected documents; and c. matching the identified business events to the identified company names in each of the selected documents.
- 14. The method of claim 8 wherein the step of processing the selected documents to generate company-event pairs further comprises computing a match score for each match between an identified company name and an identified business event, the match score being calculated on the basis of a distance between the identified company name and the identified business event in the document.
- 15. The method of claim 8 wherein the step of processing the selected documents to generate company-event pairs further comprises generating a confidence rating for each of the selected documents, wherein the confidence rating is calculated on the basis of contribution from matches between business events and company names and contribution from orphan events within each of the selected documents, the orphan events being business events that are not associated with any company name in the selected document.
- 16. A computer program product comprising a computer usable medium having a computer readable program code embodied therein for identifying and matching company names and business events, the computer program code performing the steps of:
a. crawling a pre-defined first set of links to download documents referenced by the pre-defined first set of links; b. parsing the downloaded documents; c. evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in the document; d. identifying company names and business events in the text contained in each of the selected documents; and e. matching the identified business events to the identified company names for each of the selected documents.
- 17. A system for identifying and matching company names and business events, the system comprising:
a. a crawler for downloading documents, the documents being referenced by links present in a pre-defined first set of links; b. a parser for parsing the downloaded documents to break the downloaded documents into components including at least one of free text, title and a second set of links; c. an evaluator for evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in the documents; and d. an information extractor for identifying and matching business events to company names, the business events and company names being present in the text contained in the selected documents; wherein the information extractor comprises:
i. a company name extractor for identifying company names in the text contained in the selected documents; ii. a business event extractor for identifying business events in the text contained in the selected documents; iii. an entity-event matcher for matching the identified business events to the identified company names for each of the selected documents and computing a match score for each of the matches in each of the selected documents; and iv. a confidence rating generator for generating a confidence rating for each of the selected documents.
- 18. The system of claim 17 wherein the entity-event matcher computes a match score for each match between an identified company name and an identified business event in a selected document based on a distance between the identified company name and the identified business event in the selected document.
- 19. The system of claim 17 wherein the confidence rating generator generates a confidence rating for each of the selected documents, wherein the confidence rating is calculated on the basis of contribution from matches between business events and company names and contribution from orphan events within each of the selected documents, the orphan events being business events that are not associated with any company name in the selected document.
- 20. A method for identifying and matching company names and business events, the method comprising the steps of:
a. crawling a network to download documents referenced by a pre-defined first set of links; b. parsing the downloaded documents to break down the downloaded documents into components, the components comprising at least one of free text, titles and a second set of links to other documents; c. evaluating the parsed documents to select documents on the basis of an information quantity score, the information quantity score being a measure of amount of relevant information contained in the parsed document; d. identifying the occurrences of business events in text contained in the selected documents; wherein identifying the occurrences of business events in text contained in the selected documents involves:
i. identifying the business events in the text by locating phrases exactly as they occur in the pre-defined set of phrases; and ii. identifying the business events by searching the text for variations of the phrases present in the pre-defined set of phrases; and e. identifying occurrences of company names in text contained in the selected documents; wherein identifying the occurrences of company names in text contained in the selected documents involves:
i. identifying the occurrences of company names in the text by searching for a set of company name suffix indicators in the text; and ii. applying a pre-defined set of heuristics to identify the company name preceding the identified company name suffix indicator; and f. matching identified business events to identified company names to generate company-business event pairs; wherein matching identified business events to identified company names to generate company-business event pairs involves:
i. determining a match between the identified business events and the identified company names for each of the selected documents; ii. computing a match score for each of the matches in each of the selected documents, the score being based on a distance between the identified company name and the identified business event in the selected document; and iii. calculating a confidence rating for each of the selected documents, wherein the confidence rating is calculated on the basis of contribution from matches between business events and company names and contribution from orphan events within each of the selected documents, the orphan events being business events that are not associated with any company name in the selected document.
- 21. The method of claim 20, wherein the contribution from matches between business events and company names occurring in a selected document is determined by calculating an average of scaled match score values of all matches in the selected document.
- 22. The method of claim 20, wherein the contribution from orphan events occurring in a selected document is determined by taking a minimum value among all scaled match scores and multiplying this value with a quotient of the number of orphan events and the total number of orphan events and positive matches occurring in the selected document, the positive matches being matches that have a forward reference or a backward reference associated with them.
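
The scoring quantities recited in claims 12, 21 and 22 can be illustrated with a short sketch. This is only one possible reading, not the claimed implementation. The function names are hypothetical, document size is assumed to be measured in characters, and the match scores are assumed to be already scaled, since the claims do not specify the scaling. The claims also do not state how the two contributions are combined into a confidence rating, so the sketch returns them separately.

```python
from statistics import mean
from typing import Sequence


def information_quantity_score(free_text: str, document: str) -> float:
    """Claim 12: ratio of free text to document size.

    Assumption: both quantities are measured in characters.
    """
    if not document:
        return 0.0
    return len(free_text) / len(document)


def match_contribution(scaled_scores: Sequence[float]) -> float:
    """Claim 21: average of the scaled match scores of all matches."""
    return mean(scaled_scores) if scaled_scores else 0.0


def orphan_contribution(
    scaled_scores: Sequence[float],
    orphan_count: int,
    positive_match_count: int,
) -> float:
    """Claim 22: minimum scaled match score multiplied by
    orphans / (orphans + positive matches).

    Assumption: returns 0.0 when there are no matches or no events to count.
    """
    total = orphan_count + positive_match_count
    if not scaled_scores or total == 0:
        return 0.0
    return min(scaled_scores) * orphan_count / total


# Example: three matches (assumed already scaled), one orphan event,
# and three positive matches in a selected document.
scores = [0.5, 0.25, 0.75]
print(information_quantity_score("abcd", "abcdefgh"))  # 0.5
print(match_contribution(scores))                      # 0.5
print(orphan_contribution(scores, 1, 3))               # 0.0625
```

In this example the orphan contribution is the lowest scaled match score (0.25) weighted by the share of orphan events among orphan events and positive matches (1 of 4).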
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation-in-part of application Ser. No. 10/218,620, entitled “Method And System For Event Phrase Identification,” assigned to General Electric Capital Corporation, filed on Aug. 15, 2002, which is hereby incorporated by reference.
Continuation in Parts (1)
| | Number | Date | Country |
|---|---|---|---|
| Parent | 10218620 | Aug 2002 | US |
| Child | 10336545 | Jan 2003 | US |