Claims
- 1. A method of searching and matching input data to stored data, comprising:
receiving input data having a plurality of elements, said input data representing a business entity; converting selected elements in said plurality of elements to a set of terms; searching stored data for a plurality of match candidates based on said set of terms; and providing a best match from said plurality of match candidates.
- 2. The method according to claim 1, wherein said converting step comprises:
parsing said plurality of elements to identify said set of terms, including a company name and an address; cleaning said set of terms, including removing extraneous words; and standardizing said set of terms.
- 3. The method according to claim 2, wherein said converting step further comprises:
validating said address having a street name and city name; correcting said street name and said city name, if necessary; and assigning a zip code, a latitude, and a longitude.
- 4. The method according to claim 3, wherein said converting step further comprises:
maintaining at least one reference table.
- 5. The method according to claim 2, wherein said converting step further comprises:
removing special characters in said set of terms; removing a last word in said company name if said last word is a standard company form; converting text in said set of terms to uppercase; depluralizing select text in said set of terms; standardizing select words in said set of terms; normalizing select phrases in said set of terms; and extracting a street number and a street name from said address.
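As an illustration only, the cleaning and standardizing steps recited in claim 5 might be sketched as follows. The lookup tables `COMPANY_FORMS` and `WORD_STANDARDS`, the depluralizing rule, and all function names are assumptions made for this sketch and are not taken from the claims; phrase normalization is omitted for brevity.

```python
import re

# Hypothetical lookup tables; the claims do not list actual entries.
COMPANY_FORMS = {"INC", "CORP", "LLC", "LTD", "CO"}
WORD_STANDARDS = {"STREET": "ST", "AVENUE": "AVE", "INTERNATIONAL": "INTL"}


def depluralize(word):
    # Naive assumed rule: drop a trailing S from longer words, keep "SS" endings.
    if len(word) > 3 and word.endswith("S") and not word.endswith("SS"):
        return word[:-1]
    return word


def clean_name(name):
    # Convert to uppercase, then replace special characters with spaces.
    words = re.sub(r"[^A-Z0-9 ]", " ", name.upper()).split()
    # Remove the last word if it is a standard company form.
    if len(words) > 1 and words[-1] in COMPANY_FORMS:
        words = words[:-1]
    # Depluralize each word, then map it to its standard form if one exists.
    singular = [depluralize(w) for w in words]
    return " ".join(WORD_STANDARDS.get(w, w) for w in singular)


def split_address(address):
    # Extract a leading street number and the remaining street name.
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    match = re.match(r"\s*(\d+)\s+(.*)", cleaned)
    if not match:
        return "", " ".join(cleaned.split())
    number, rest = match.groups()
    street = " ".join(WORD_STANDARDS.get(w, w) for w in rest.split())
    return number, street
```

Under these assumptions, `clean_name("Acme Widgets, Inc.")` yields `"ACME WIDGET"` and `split_address("123 Main Street")` yields `("123", "MAIN ST")`.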
- 6. The method according to claim 1, wherein said searching step further comprises:
generating a plurality of keys from said set of terms; limiting match candidates for certain keys in said plurality of keys that return counts surpassing a predetermined threshold; generating a cost function for select key intersections; prioritizing said key intersections according to said cost function; and retrieving said match candidates in order of said key intersections.
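One possible reading of the retrieval steps in claim 6 is sketched below. The in-memory `index` (a mapping from key to a set of record identifiers), the one-key-per-word scheme, and the cost function (the size of the smaller candidate set, which bounds the size of the intersection) are all assumptions of this sketch; the claim does not specify them. For simplicity, keys whose counts surpass the threshold are skipped entirely here, which is only one way of limiting them.

```python
from itertools import combinations


def generate_keys(terms):
    # Assumed scheme: one (field, word) key per word of each term.
    keys = []
    for field, value in terms.items():
        for word in value.split():
            keys.append((field, word))
    return keys


def retrieve_candidates(terms, index, max_count=1000):
    keys = [k for k in generate_keys(terms) if k in index]
    # Limit keys whose candidate counts surpass the predetermined threshold.
    usable = [k for k in keys if len(index[k]) <= max_count]

    # Assumed cost function: the smaller candidate set of each key pair.
    intersections = []
    for a, b in combinations(usable, 2):
        cost = min(len(index[a]), len(index[b]))
        intersections.append((cost, a, b))

    # Prioritize key intersections from lowest to highest cost.
    intersections.sort(key=lambda item: item[0])

    # Retrieve candidates in order of the prioritized intersections.
    seen, ordered = set(), []
    for _, a, b in intersections:
        for record_id in sorted(index[a] & index[b]):
            if record_id not in seen:
                seen.add(record_id)
                ordered.append(record_id)
    return ordered
```

A cheap intersection is tried first in this sketch because it returns few candidates, so a likely match can be found before broader key pairs are examined.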
- 7. The method according to claim 1, further comprising:
generating a confidence score for each match candidate based on a degree of match.
- 8. The method according to claim 7, further comprising:
providing an ordered list of selected match candidates based on said confidence score.
- 9. The method according to claim 7, wherein said confidence score is based on comparison scoring.
- 10. The method according to claim 9, wherein said comparison scoring step comprises:
determining a score for a business name, a street name, and a city name in a pair, said pair being said set of terms and one of said match candidates; classifying said pair into data segments using a decision tree; performing logistic modeling using said data segments; determining a match probability for said pair; and assigning a grade to said pair.
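The comparison scoring of claim 10 could be approximated, purely as a sketch, by the code below. The similarity measure (`difflib.SequenceMatcher`), the single-split stand-in for a decision tree, the logistic coefficients in `SEGMENT_MODELS`, and the grade cut-offs are all hypothetical values chosen for illustration and do not come from the claims.

```python
import math
from difflib import SequenceMatcher

# Hypothetical per-segment coefficients: (intercept, name, street, city).
SEGMENT_MODELS = {
    "strong_name": (-4.0, 5.0, 2.0, 1.5),
    "weak_name": (-6.0, 4.0, 3.0, 2.0),
}


def similarity(a, b):
    # Assumed field score: a ratio in [0, 1] from the standard library.
    return SequenceMatcher(None, a, b).ratio()


def classify(scores):
    # Stand-in for a decision tree: a single split on the name score.
    return "strong_name" if scores["name"] >= 0.8 else "weak_name"


def match_probability(inquiry, candidate):
    # Score the business name, street name, and city name of the pair.
    scores = {
        field: similarity(inquiry[field], candidate[field])
        for field in ("name", "street", "city")
    }
    # Apply the logistic model that belongs to the pair's data segment.
    b0, b_name, b_street, b_city = SEGMENT_MODELS[classify(scores)]
    z = (b0 + b_name * scores["name"]
         + b_street * scores["street"]
         + b_city * scores["city"])
    return 1.0 / (1.0 + math.exp(-z))


def grade(probability):
    # Assumed cut-offs for mapping a match probability to a grade.
    if probability >= 0.9:
        return "A"
    if probability >= 0.5:
        return "B"
    return "C"
```

With these assumed coefficients, an identical pair falls in the `strong_name` segment and receives grade `"A"`, while a pair sharing no characters in any field receives grade `"C"`.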
- 11. The method according to claim 10, wherein said comparison scoring step further comprises:
determining a uniqueness score based on a number of matching business names in said city name.
- 12. The method according to claim 10, wherein said comparison scoring step further comprises:
calculating a business density score for said pair.
- 13. The method according to claim 10, wherein said comparison scoring step further comprises:
calculating a zip score.
- 14. The method according to claim 10, wherein said comparison scoring step further comprises:
calculating an industry score by matching words in said business name to standard industrial classification (SIC) key words.
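The uniqueness score of claim 11 and the industry score of claim 14 might look like the following minimal sketch. The keyword table `SIC_KEYWORDS` is a small illustrative sample rather than an actual SIC keyword list, and the scoring values (1.0, 0.5, 0.0, and the reciprocal count) are assumptions; the claims do not define how either score is scaled.

```python
# Hypothetical keyword sample keyed by SIC code; not an official list.
SIC_KEYWORDS = {
    "5812": {"RESTAURANT", "CAFE", "DINER", "GRILL"},
    "7538": {"AUTO", "REPAIR", "GARAGE"},
}


def sic_codes(name):
    # Collect every SIC code whose key words appear in the business name.
    words = set(name.split())
    return {code for code, keywords in SIC_KEYWORDS.items() if words & keywords}


def industry_score(inquiry_name, candidate_name):
    inquiry_codes = sic_codes(inquiry_name)
    candidate_codes = sic_codes(candidate_name)
    # Assumed convention: neutral score when either name has no key word.
    if not inquiry_codes or not candidate_codes:
        return 0.5
    return 1.0 if inquiry_codes & candidate_codes else 0.0


def uniqueness_score(name, city, records):
    # Count the stored businesses with the same name in the same city.
    count = sum(1 for r in records if r["name"] == name and r["city"] == city)
    # Assumed scaling: a name shared by fewer businesses is more unique.
    return 1.0 / count if count else 0.0
```

In this sketch, `industry_score("JOES DINER", "JOE CAFE")` returns `1.0` because both names map to the same assumed code, whereas a diner compared with a garage returns `0.0`.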
- 15. A system for searching and matching input data to stored data, comprising:
a web services interface for accepting a match request and providing a best match, said match request including input data representing a business entity; a pre-processing layer having a cleaning, parsing, and standardizing component for converting said input data into a set of terms; an application layer having a match engine for processing said match request using said set of terms and producing said best match; and a database layer for retrieving match candidates from stored business entity information for said application layer.
- 16. The system according to claim 15, wherein said match engine comprises:
a decisioning component for determining said best match and an ordered list of said match candidates.
- 17. The system according to claim 16, wherein said web services interface also provides an ordered list of match candidates from said application layer.
- 18. The system according to claim 15, further comprising:
a plurality of memories in said pre-processing layer, said application layer, and said database layer; a plurality of asynchronous message queues in said pre-processing layer, said application layer, and said database layer; and a plurality of caching systems in said pre-processing layer, said application layer, and said database layer.
- 19. A computer readable medium having instructions for performing a method of searching and matching input data to stored data, said method comprising:
receiving a match request having a plurality of elements representing a business entity; pre-processing said plurality of elements to convert said plurality of elements into a set of terms; retrieving match candidates by searching a database based on said set of terms; evaluating said match candidates to determine a best match; and providing said best match.
- 20. The computer readable medium according to claim 19, wherein said pre-processing step comprises:
parsing said plurality of elements to identify said set of terms, including a company name and an address; cleaning said set of terms, including removing extraneous words; and standardizing said set of terms.
- 21. The computer readable medium according to claim 19, wherein said retrieving step comprises:
generating a plurality of keys from said set of terms; limiting match candidates for certain keys in said plurality of keys that return counts surpassing a predetermined threshold; prioritizing said key intersections according to a cost function; and retrieving said match candidates in order of said key intersections.
- 22. The computer readable medium according to claim 19, wherein said evaluating step comprises:
determining a score for a business name, a street name, and a city name in a pair, said pair being said set of terms and one of said match candidates; determining a uniqueness score based on a number of matching business names in said city name; calculating a business density score for said pair; calculating a zip score; and calculating an industry score by matching words in said business name to standard industrial classification (SIC) key words.
CROSS-REFERENCE
[0001] The present application claims priority to U.S. Provisional Application Ser. No. 60/424,789 filed on Nov. 8, 2002.
Provisional Applications (1)
| Number | Date | Country |
| 60424789 | Nov 2002 | US |