The invention relates to the field of geocoding and more particularly to a method and apparatus for geocoding with improved performance and accuracy of candidate match return through a number based input/output match.
Geocoding is a process of transforming and translating non-spatial location descriptive text, commonly referred to as an address, into a valid spatial representation by comparing location-specific elements to those in reference data. More specifically, geocoding involves programmatically assigning x and y coordinates (usually, but not limited to, earth coordinates, i.e., latitude and longitude) to records, lists and files containing location information (full addresses, partial addresses, zip codes, etc.). The geocoding process is typically based on the following characteristics: (i) reference data: consisting of the geographically coded information which will serve as a base to derive the appropriate geographic code for some, (ii) the addresses to be assigned a geographical reference: the address a user wishes to have geographically referenced and which contains attributes capable of being matched to the reference, (iii) output: geographic coordinates with precision results, and (iv) a decision algorithm: the methodology employed to get a match with the reference data by a process that includes address parsing, normalization, and weighting of the input dataset against the reference dataset.
A reference data library is compiled from a variety of sources, which range from administrative information, postal addresses, census information, street vectors and Points of Interest (POIs) to ancillary information on the location geometry that constitutes a physical address. When an input address is given, the reference data library is searched to find matches along a geographic hierarchy of ever-decreasing precision (point, line or polygon boundary) until a preset tolerance for a suitable match is met.
The search process for an address can be explained in a simplified manner as follows. To search for the address “951 Spruce St, Louisville, Colo. 80027, USA”, the geocoder process must perform a hierarchy of text searches and matches from the highest to the lowest administrative levels, followed by searches on street and house number. The search navigates the hierarchy from country, state, district, city, postcode, street and house number to unit number to derive the best match as an output. The amount of data scanned for matches mandates a highly efficient system with fast candidate retrieval. With text-based searches and matches, candidate retrieval is not optimized for speed.
To date, various geocoding software packages return output candidates based on string match algorithms. As a result, matching and weighting take time before the best match candidate is provided. Further, the complexity of retrieving exact/close matches increases if variations exist in the provided input address. There is a need for a more accurate solution that enables quick candidate matches to be determined and provided to the user.
According to embodiments of the invention, an automated computer geocoding system that improves the geocoder performance in comparison to traditional functionality of geocoding software is provided. The present invention utilizes a best candidate return in conjunction with a matched geocoded location for given geographic boundaries through number matching instead of string matching to achieve positional accuracy not currently obtainable in the prior art.
Therefore, it should now be apparent that the invention substantially achieves all the above aspects and advantages. Additional aspects and advantages of the invention will be set forth in the description that follows, and in part will be obvious from the description, or may be learned by practice of the invention. Moreover, the aspects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings illustrate presently preferred embodiments of the invention, and together with the general description given above and the detailed description given below, by way of example serve to explain the invention in more detail. As shown throughout the drawings, like reference numerals designate like or corresponding parts.
The geocoding process has undergone marked transitions to accommodate and exploit changes in parsing, normalization, and weighting to return the best match. Despite progress made in the process, the performance of matching and retrieval is slow, due to the time required to perform text to text based matching across the input candidate and reference data. As set forth above, prior art geocoding methods and apparatus are dependent on string matching between input and reference data for the best candidate retrieval. The processing time to perform string matching as compared with number matching is quite significant. This is because text-based searches require thorough scans of characters looking for instances of a given match and weights need to be assigned for variations to output the best possible candidate.
In accordance with the present invention, to provide the best candidate return (exact/close), input strings can be converted to numbers to match reference records and hence retrieval will be faster and more efficient, without requiring much effort in an underlying georeferenced address dictionary.
Reference is now made to
The reference data stored in databases 20, 22, 24 is an important component of any geocoder system because the addresses that are input and the locations that are eventually derived are matched against a set of attribute values of the reference data. Point data in database 20 are datasets where a single latitude and longitude is provided for a specific address. Segment data in database 22 are datasets where a street segment line, often a street centerline, is provided and interpolation is employed to relate the street centerline to a specific address. Parity rules, such as odd and even addresses lying on different sides of the street segment, can also be employed. The street segment centerline dataset in database 22 contains coordinates that describe the shape of each street and usually the range of house numbers found on each side of the street. The geocoding system 10 may compute a location for an address by linear interpolation of the street number with respect to the street address range. Other types of interpolation may also be used to determine a physical location for an address, such as squeeze distance (which might, for example, take into account a known characteristic that addresses are closer together at one end of the segment) and parity rules. The point-level datasets in database 20 yield higher address accuracy than datasets requiring the interpolation technique. The geographic dataset in database 24 will typically include data describing the geographic boundaries of different regions. For example, it might include the boundaries of different municipalities or zip code areas. If an address cannot be located in the point database 20 or the segment database 22, then a corresponding location may be assigned as being somewhere in a city or zip code that is included in the address. Typically, the corresponding location that is selected will be a centroid of that geographic area.
Determining a physical location from this data will most often result in the largest potential offset distance, but may still be useful for many purposes. The segment data in database 22 is a group of street segments. Each street segment contains a group of latitudes and longitudes (i.e., a group of ordered points), and the street is assumed to run in a straight line between each pair of consecutive points. A street segment must have at least two points, but can have many points. Most street segments contain a house number range (an address range), and reverse geocoding to a street segment works by interpolating the house number based on the house number range. The point data in database 20 is a group of point data locations, which are, essentially, latitudes and longitudes of the rooftops of addresses. This data allows precise pinpointing of an address to an exact location, whereas the street segment data above requires interpolation, which is not necessary for a point data match. There is usually only one house number associated with a point in the point data. When there are multiple house numbers, the point is a feature such as a high-rise building, in which case a convention may be implemented such as returning the lowest available unit as the match. The reference data stored in databases 20, 22, 24 can be built similarly to conventional approaches; according to the present invention, the changes described below are made to the database construction.
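By way of illustration only, the linear interpolation of a house number along a street segment described above can be sketched as follows. This is a minimal sketch and not the claimed implementation: the function name `interpolate_address` and its parameters are hypothetical, coordinates are treated as planar for simplicity, and parity rules and squeeze distance are not handled.

```python
import math

def interpolate_address(points, low, high, house_number):
    """Place house_number proportionally along an ordered list of points.

    points: ordered (lat, lon) pairs describing the street segment.
    low, high: the house number range of the segment (hypothetical names).
    """
    if len(points) < 2:
        raise ValueError("a street segment needs at least two points")
    if not low <= house_number <= high:
        raise ValueError("house number outside the segment's address range")
    # Fraction of the way along the segment: 0.0 at low, 1.0 at high.
    fraction = 0.0 if high == low else (house_number - low) / (high - low)
    # Length of each straight sub-segment (planar approximation).
    pairs = list(zip(points, points[1:]))
    lengths = [math.dist(a, b) for a, b in pairs]
    target = fraction * sum(lengths)
    # Walk the sub-segments until the target distance is reached.
    for (a, b), length in zip(pairs, lengths):
        if length > 0 and target <= length:
            t = target / length
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        target -= length
    # Rounding may leave a tiny remainder; fall back to the last point.
    return points[-1]
```

For example, a house number halfway through the address range of a two-point segment is placed at the midpoint of that segment. A point data match would bypass this function entirely, since the coordinates are stored directly.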
The point reference dataset stored in database 20, including attributes such as, for example, postal, address and geography point, is composed of point features with required geocoding attributes as illustrated in Tables 1-3 below.
A linear-based or line-based reference dataset, as stored in database 22, is composed of lines/polylines features with required geocoding attributes as illustrated in Table 4 below.
According to the present invention, some additional fields, such as, for example, Base Value, ASCII Code values, logarithmic values (at base 10) and threshold value fields as shown in Table 5, have been calculated based on a conversion function as described below and are added to the datasets that are stored in databases 20, 22, 24 in respective data tables such as geography points, postal points, address points, street segments, etc.
The base value will be used as a starting value to which the ASCII code is concatenated for the Logarithmic Value calculation. The base values are designed to keep a variability factor in address components like aliases, phonetics, transliterations, etc., and were determined based on different permutations and combinations to handle names and their aliases. The base numbers are kept large enough to differentiate across address elements when log values are calculated. The base value will differentiate the log values of address elements and will be useful in traversing address element searches in hierarchical fashion, as the search result will narrow down from the country to the lowest level of the hierarchy. The various base value levels defined are listed below in Table 6. These values have been determined based on different permutations and combinations as noted above. Geographic addresses of various countries were analyzed and various geocoding address examples were worked out to determine proper base values.
The base value will be concatenated with the string ASCII value of the Address Element record and a logarithmic value will be derived. These derived log values will be stored in the database as explained in Table 7 below and further illustrated through an Address string example.
The Address String “951 Spruce St, Louisville, Boulder, Colo., 80027 United States” was parsed into different constituents such as country, state, county, postcode, city, etc. The text strings of the address records were converted to ASCII numbers based on alphabet to ASCII lookup values as illustrated in Table 8.
Once the ASCII numbers were obtained, they were concatenated with the varying base numbers of the parsed elements (derived through permutations and combinations of optimal base value computations). The text marked in bold italics below represents the base value of the parsed elements; for Country, the base value is different from the base value of State or its equivalent hierarchy.
The logarithmic value (base 10) was calculated for the concatenated numbers (Base+ASCII). These log values were then stored in the database along with text information for faster lookup and query from the reference dataset. The same log values will be assigned to address element variations for aliases. For example, the Log Value for a Country name and Country ISO3 or Country ISO2 and Aliases will be the same. For example, for the United States, the Country Name is United States, the ISO3 is USA, and the ISO2 is US. All three of these will store the same log value, i.e. 65.
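By way of illustration only, the conversion function described above (base value concatenated with the ASCII codes of the address element, followed by a base 10 logarithm) can be sketched as follows. This is a sketch under stated assumptions: the base values in `BASE_VALUES` and the alias table `ALIASES` are hypothetical placeholders and do not reproduce Table 6, upper-casing the input is an assumed normalization step, and the numbers it produces are therefore not the values given in the tables.

```python
import math

# Hypothetical base values for illustration; the actual values are those of Table 6.
BASE_VALUES = {"country": 9000, "state": 8000, "city": 7000}

# Hypothetical alias table: every variant maps to one canonical spelling,
# so that a name, its ISO3 code and its ISO2 code share one log value.
ALIASES = {"USA": "UNITED STATES", "US": "UNITED STATES"}

def ascii_digits(text):
    """Concatenate the decimal ASCII code of each character, e.g. 'USA' gives '858365'."""
    return "".join(str(ord(ch)) for ch in text)

def log_value(level, text):
    """Return log10 of (base value + ASCII digits) for one parsed address element."""
    canonical = text.strip().upper()          # assumed normalization
    canonical = ALIASES.get(canonical, canonical)
    concatenated = int(str(BASE_VALUES[level]) + ascii_digits(canonical))
    # math.log10 accepts arbitrarily large Python integers.
    return math.log10(concatenated)
```

Under these assumptions, `log_value("country", "USA")`, `log_value("country", "US")` and `log_value("country", "United States")` all return the same number, while the same text parsed as a state yields a different number because its base value differs.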
Reference data created as described above with the numeric values calculated based on the conversion function will assist in faster performance and response time as opposed to conventional reference dictionaries. The modified reference dataset is then stored in the databases 20, 22, 24 for use by the geocoding system 10.
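By way of illustration only, a lookup against pre-calculated numeric values can be sketched as follows. The sketch assumes a hypothetical city-level base value (`CITY_BASE`), hypothetical function names, and a hypothetical numeric tolerance standing in for the threshold value field, whose exact use is not reproduced here. Reference keys are computed once and kept sorted, so that a query is a binary search on numbers rather than a character-by-character scan.

```python
import bisect
import math

CITY_BASE = 7000  # hypothetical base value, not taken from Table 6

def numeric_key(base, text):
    """log10 of the base value concatenated with the ASCII codes of the text."""
    digits = "".join(str(ord(ch)) for ch in text.strip().upper())
    return math.log10(int(str(base) + digits))

def build_index(names, base):
    """Pre-calculate keys once; return parallel sorted lists (keys, names)."""
    entries = sorted((numeric_key(base, name), name) for name in names)
    return [k for k, _ in entries], [n for _, n in entries]

def lookup(index, base, text, tolerance=1e-9):
    """Return reference names whose stored key lies within tolerance of the input key."""
    keys, names = index
    key = numeric_key(base, text)
    lo = bisect.bisect_left(keys, key - tolerance)
    hi = bisect.bisect_right(keys, key + tolerance)
    return names[lo:hi]

# Usage: the index is built once; each query is then a numeric comparison.
index = build_index(["Louisville", "Boulder", "Lafayette"], CITY_BASE)
candidates = lookup(index, CITY_BASE, "louisville")
```

Because the logarithm is held as a double-precision number in this sketch, names of equal length that share a long prefix can map to the same key, so a final text comparison on the shortlisted candidates may still be appropriate.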
Reference is now made to
Thus, using the geocoding process of the present invention results in a faster output candidate retrieval based on the combination of the geocoding process and pre-calculated numeric values in the reference data. While preferred embodiments of the invention have been described and illustrated above, it should be understood that they are exemplary of the invention and are not to be considered as limiting. Additions, deletions, substitutions, and other modifications can be made without departing from the spirit or scope of the present invention. Accordingly, the invention is not to be considered as limited by the foregoing description but is only limited by the scope of the appended claims.