The present invention relates generally to mapping product identifiers for the same products from different sources.
More and more consumer activities on the Internet and particularly in e-commerce involve finding deals about products and items. Several websites aggregate product information from multiple sources to provide deal information to consumers. Implicit in such aggregation activities is the assumption that different product descriptions of the same product in multiple sources can be identified, i.e., mapped.
Embodiments of the present invention address deficiencies of the prior art and introduce new technologies to the present art for integrating disparate descriptions of products and items from different data sources. In some cases conflicting identifier data is resolved to accomplish integration under rules of consistency, prior knowledge of data sources, and heuristics. Data healing is undertaken in certain situations so that a resolution of different descriptions may occur.
In one embodiment a method is described that creates a master identifier for uniquely identifying each item in a set of items. This master identifier is created from identifiers provided within the individual descriptions.
In another embodiment a method is described that assigns a score to each identifier comprising the master identifier and further computes a weighted total score for the master identifier. The score of the master identifier may be compared against a pre-determined threshold value to determine potential equivalence of products. The method envisions the use of frequency of occurrence of identifier values in computing the weighted sum values.
In another embodiment a method is described to determine if two or more items in a set of items are potentially distinct items, each item being described by a set of identifiers with values associated therewith, the set of items having a master identifier uniquely identifying each item in the set of items, the master identifier including one or more of the identifiers. The method comprises comparing the master identifiers for the two or more items and determining that the items are distinct if their master identifiers are neither equal to one another nor consistent with one another. The method further comprises comparing the master identifiers for the two or more items and determining whether the items are either equal to one another or consistent with one another in the event that a value for one or more of the input identifiers is unknown, missing or unavailable.
In another embodiment of the invention a method is described that deals with identifier values that are missing, unavailable or unknown. The method assigns values to the missing identifiers in a mutually consistent manner so that equivalence of items can be determined, if possible.
In another embodiment of the invention a method is described that, given an input description, quickly and easily locates potentially equivalent descriptions from a large data store of descriptions. The method assigns bit streams to each stored description and the input description in such a manner that a simple Boolean logic operation yields all the potentially equivalent descriptions to the input description. The method envisions implementing the embodiment in hardware, firmware and/or assembler language instruction sets.
In another embodiment of the present invention a method is described that, given an input description and a collection of potentially equivalent descriptions, checks for the equivalence of descriptions if a consistent set of assignments can be made of values to missing or unavailable identifier values. If such a consistent set of assignments cannot be found the method envisions the use of heuristics to heal data and then apply the process of resolving descriptions again. The method further envisions assuming a known identifier value to be erroneous and replacing it with another value, such replacement yielding a consistent assignment of values to identifiers. The method further envisions assigning a probability estimate to a derived equivalence of descriptions.
In another embodiment of the present invention, in order to achieve a possible equivalence of descriptions, a method is described that declares certain identifier values to be erroneous and replaces such values with heuristic estimates to obtain an equivalent description with a probabilistic estimate of correctness.
The inventions will now be more particularly described by way of example with reference to the accompanying drawings. Novel features believed characteristic of the inventions are set forth in the claims. The inventions themselves, as well as the preferred mode of use and further objectives and advantages thereof, are best understood by reference to the following detailed description of the embodiment in conjunction with the accompanying drawings, in which:
In the descriptions that follow, we will adopt the following usage of terms (however, the inventions presented herein shall not necessarily be limited by such usage):
A “product identifier” or “identifier” is an attribute associated with an item such as a product and which is extracted from a description of the item obtained from a data source such as a web site. Examples: UPC, title, price, etc.;
A “master identifier” consists of a particular subset of a set of identifiers that may be used to uniquely identify a product;
A “set of identifiers” or a “plurality of identifiers” (such as used in the product descriptions P1, P2, P3 and P4 shown below) is a group of identifiers describing a product or item;
A “web page”, in general, denotes a set of information objects being displayed on a computer monitor and accessible through a web browser such as Internet Explorer;
The term “web page being displayed” will generally refer to the process by which a web browser renders a web page causing it to be displayed on a computer monitor; and
A “website” comprises a collection of web pages at a single internet address, said web pages provided to web browsers by a web server.
The present invention relates to searching and identifying content on the Internet. Recent search requests more generally involve individual products, services and other items. Such requests are expected to increase as electronic commerce activity grows on the Internet. Implicit in such requests is the notion of comparisons of items across websites. For example, in order to find the cost of, say, a flight or a particular television set, various flights and television sets have to be compared across multiple websites. For instance, consumers can be provided with information on the cheapest price for a particular product across multiple merchants (websites) or user comments and other information for that product across multiple data sources. Information about products can be obtained from a wide variety of sources including, e.g., data feeds, APIs, bar codes, user generated data, and data that has been scraped from websites.
A problem with such comparisons is that one must ensure that the same product, service or other item is being compared across different sources.
Individual products, services or other items are identified on a website by using unique identifiers (IDs). Such IDs are often channel, merchant, or manufacturer specific, and thus not global. IDs may also be completely missing, as there may be no numbering scheme widely adopted in a particular business segment, e.g., artisan or handcrafted products such as wines. Even when products have globally unique identifiers like a UPC (Universal Product Code), an EAN (European Article Number), or a GTIN (Global Trade Item Number), the IDs used for products may be wrong, missing or misplaced.
Consequently, a mapping service is needed to map multiple product descriptions as one when they identify the same product. The mapping service can be used to understand, map and represent deals of products from multiple sources. For example, the mapping service can be used to determine that a price or any other structured or non-structured information from one source is also applicable to the same product having a different or no identifier from another source.
One aspect of the mapping problem is that the mapping process may need to consider thousands or millions of products emanating from various sources such as data feeds, scraped web sites, etc. A new product description may need to be mapped against millions of potential descriptions, which takes considerable time and computing resources.
The present invention provides a solution to the mapping problem in which the number of operations needed to determine a successful or unsuccessful mapping is reduced. Moreover, each operation uses considerably less time and computing resources.
The mapping problem may be stated in abstract terms as follows. We are given a database or a collection, i.e., a large number, of product descriptions that are assumed to describe a variety of products. We are then given a new product description. We are required to determine if the new product description is “equivalent” to any of the descriptions in the collection.
Consider the method depicted in
Forming Product IDs
The method of the present invention uses source information and manufacturer and product attributes such as title, historical information such as price, and other real-time and non-real-time pieces of data together to form a master identifier that can be used to globally identify the product or entity in question.
Consider, by way of example, the situation depicted in
In
In another embodiment the initial Master ID is based on a selected provider of a product. The selection is based on business motivators and other criteria, such as “source S is known to have reliable descriptions”, “using source T implies certain limitations that lower its value as a master identifier”, “source U in general gives good Cost-per-Action revenue” etc. The remaining descriptions are then matched against the Master ID, and that match is given a score. If a match score is high enough, the corresponding descriptions are merged and the process continues with the enriched data.
One major use of Master IDs is to determine when two products are distinct or if the descriptions could be merged into a single product. The method of the present invention takes the distinctiveness condition to be true if the Master IDs of the two products cannot be made to agree with each other. For example, if p1 is the product description with Master ID (UPC=123, EAN=456) and the description p2 has the Master ID (UPC=949, EAN=343) then the two Master IDs cannot be equated with each other (unless one or more identifier values are assumed to be incorrect or erroneous). However, if p1 has the Master ID (UPC=123, EAN=unknown) and p2 has the Master ID (UPC=unknown, EAN=456) then we can equate the two Master IDs consistently with each other by assuming that the unknown EAN value is “456” and the unknown UPC value is “123”. In other words if there does not exist a substitution of “values” for “unknowns” in two Master IDs that makes them consistent with each other then the two corresponding products are distinct (unless we assume that some identifier values are incorrect). We thus observe that the notion of consistency of two descriptions determines potential compatibility or otherwise of the two descriptions.
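The consistency test described above can be sketched in a few lines of Python. This is a minimal illustration only: the representation of a Master ID as a dictionary, the use of None for an unknown value, and the function names consistent and distinct are assumptions made for the example and are not part of the described method.

```python
from typing import Dict, Optional

# Assumed representation: identifier name -> value, None meaning "unknown".
MasterId = Dict[str, Optional[str]]


def consistent(a: MasterId, b: MasterId) -> bool:
    """Return True if unknown values can be substituted so that a and b agree."""
    for name in a.keys() & b.keys():
        va, vb = a[name], b[name]
        # Two known values that differ can never be reconciled by substitution.
        if va is not None and vb is not None and va != vb:
            return False
    return True


def distinct(a: MasterId, b: MasterId) -> bool:
    """Products are taken to be distinct when no consistent substitution exists."""
    return not consistent(a, b)


p1 = {"UPC": "123", "EAN": "456"}
p2 = {"UPC": "949", "EAN": "343"}
p3 = {"UPC": "123", "EAN": None}
p4 = {"UPC": None, "EAN": "456"}

print(distinct(p1, p2))  # True: both identifiers are known and disagree
print(distinct(p3, p4))  # False: unknown EAN=456 and unknown UPC=123 reconcile them
```

The sketch does not model the case in which a known identifier value is assumed to be erroneous; that case is left to the heuristic method.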
In an alternative embodiment the Master ID is an assigned value that collects multiple provider product descriptions into one collection, one of which is the master copy while the others are used to enrich it. For example, a master product [UPC=123, TITLE=xyz], enriched by [UPC=123, EAN=456, TITLE=xyz], gives a more complete single description [UPC=123, EAN=456, TITLE=xyz].
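The enrichment of a master copy by a provider description may be illustrated as follows. The dictionary representation and the function name enrich are hypothetical, and raising an error on a conflicting value is a simplifying assumption for this sketch rather than the behaviour of the described method.

```python
def enrich(master: dict, provider: dict) -> dict:
    """Return a copy of the master description with gaps filled from the provider."""
    merged = dict(master)
    for name, value in provider.items():
        if value is None:
            continue  # the provider has nothing to contribute for this identifier
        if merged.get(name) is None:
            merged[name] = value  # new or previously unknown identifier
        elif merged[name] != value:
            # Assumption: conflicts are surfaced to the caller, not resolved here.
            raise ValueError(f"conflicting values for {name}")
    return merged


master = {"UPC": "123", "TITLE": "xyz"}
provider = {"UPC": "123", "EAN": "456", "TITLE": "xyz"}
print(enrich(master, provider))
```

With the two descriptions from the example, the result contains UPC=123, EAN=456 and TITLE=xyz.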
With the above exposition in mind consider
Now assume the input new description has Master ID (UPC, unknown), i.e., it has an associated string I=“10”. Now compute NOT(I XOR S) for each value of column S. The result is shown in the last two columns in
We now make the following definition. If a value in the last column of
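The Boolean operation NOT(I XOR S) on the associated strings can be sketched as below. The sketch assumes a two-identifier Master ID ordered as (UPC, EAN), with a bit set to 1 when the identifier value is known and 0 when it is unknown, which matches the string “10” given for the Master ID (UPC, unknown). The function names presence_bits and xnor are hypothetical, and the sketch only computes the result bits; it does not reproduce the definition that classifies them.

```python
def presence_bits(master_id: dict, order=("UPC", "EAN")) -> int:
    """Encode which identifiers are known: 1 for a known value, 0 for unknown."""
    bits = 0
    for name in order:
        bits = (bits << 1) | int(master_id.get(name) is not None)
    return bits


def xnor(i: int, s: int, width: int = 2) -> int:
    """Compute NOT(i XOR s), truncated to `width` bits."""
    mask = (1 << width) - 1
    return ~(i ^ s) & mask


I = presence_bits({"UPC": "123", "EAN": None})  # the string "10"
for S in (0b00, 0b01, 0b10, 0b11):
    print(format(S, "02b"), format(xnor(I, S), "02b"))
```

A result bit is 1 exactly where the input description and the stored description agree on whether that identifier is known. Because the operation is a single bitwise instruction per stored string, it is well suited to scanning a large collection.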
The Subordinate Methods S1 and S2
The S1 method receives as input a collection of descriptions known as PC and a description known as “input description”, and it needs to determine if the elements of the collection are consistent with the input description, i.e., whether the corresponding descriptions can be equated. The method operates by utilizing the notion of a substitution. Given an identifier with a known value and another identifier with an unknown value, a substitution replaces the unknown value with the known value. If unknown values cannot be consistently replaced then a substitution does not exist. For example, consider the following potentially consistent descriptions A=(UPC=123, EAN=456) and B=(UPC=123, EAN=unknown). The substitution unknown=456 is consistent. Now consider the case of a third description C=(UPC=123, EAN=789), which is also potentially consistent with descriptions A and B. There is no consistent assignment of values to the unknown identifier that equates all three descriptions. The merge method operates by finding a consistent substitution that equates the input description with the descriptions in the given group of descriptions. If a consistent substitution does not exist the merge method transitions control to the Heuristic Method and terminates.
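The search for a consistent substitution can be illustrated with the following sketch. The dictionary representation (None for an unknown value) and the function name find_substitution are assumptions made for the example; returning None stands in for the transfer of control to the Heuristic Method.

```python
from typing import Dict, List, Optional

Description = Dict[str, Optional[str]]


def find_substitution(input_desc: Description,
                      group: List[Description]) -> Optional[Description]:
    """Return one description equating the input with every group member, or None."""
    resolved = dict(input_desc)
    for desc in group:
        for name, value in desc.items():
            if value is None:
                continue  # an unknown value constrains nothing
            known = resolved.get(name)
            if known is None:
                resolved[name] = value  # substitute the known value for the unknown
            elif known != value:
                return None  # two different known values: no substitution exists
    return resolved


A = {"UPC": "123", "EAN": "456"}
B = {"UPC": "123", "EAN": None}
C = {"UPC": "123", "EAN": "789"}

print(find_substitution(B, [A]))     # the substitution unknown=456 succeeds
print(find_substitution(B, [A, C]))  # None: A and C disagree on EAN
```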
The working of the S1 method as described above is shown in
The Method S2
In step 100 the method receives the input and in step 200 attempts to determine if the identifier values in the input description and the descriptions in the group PI agree. If no agreement is found, the method transitions to the heuristic method (step 300). Otherwise, in step 500 it transitions to step 200 of
In an alternative embodiment to methods S1 and S2 the data can be “healed” by replacing values considered erroneous. The Master ID is enriched with known provider data and where new identifiers (ID) are seen, the result can be:
An ID with a different value than one already merged into the Master ID will need to overcome a negative matching score by the provider product data having other (stronger) matching values or explicit curation.
The heuristic scoring method is used in all matches of the provider data to the master data.
Heuristic Method
The heuristic method performs two main functions.
In the first case it receives as input a group of descriptions for which a consistent substitution has not been found. Either the collection of descriptions must be declared as belonging to distinct products, or some remedial measure must be taken. Consider, by way of example, the following three descriptions, as indicated by their Master IDs, from the above exposition.
There is no consistent substitution that will equate the three descriptions. So, it is possible that we are dealing with three distinct products, or with two distinct products. The latter case can be effectuated by assuming that the “unknown” value for description B is 456, which equates descriptions A and B. Alternatively, one may assume that the unknown value is 789, which equates descriptions B and C.
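The alternatives open to the heuristic method in this first case can be enumerated mechanically. The sketch below assumes that the three descriptions are A=(UPC=123, EAN=456), B=(UPC=123, EAN=unknown) and C=(UPC=123, EAN=789) from the earlier substitution example; the function name candidate_groupings is hypothetical, and the choice among the alternatives is not modelled.

```python
def candidate_groupings(partial: dict, others: dict, name: str) -> dict:
    """Map each candidate value for the unknown identifier `name` to the labels
    of the descriptions that become equal to `partial` under that value."""
    groupings = {}
    for desc in others.values():
        value = desc.get(name)
        if value is None:
            continue  # an unknown value offers no candidate
        filled = {**partial, name: value}
        groupings[value] = [label for label, d in others.items() if d == filled]
    return groupings


A = {"UPC": "123", "EAN": "456"}
B = {"UPC": "123", "EAN": None}
C = {"UPC": "123", "EAN": "789"}

print(candidate_groupings(B, {"A": A, "C": C}, "EAN"))
```

Each key of the result is one possible assumption about the unknown EAN, and each value lists the descriptions that B would be merged with under that assumption.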
In the second case, the heuristic method receives as input a group of descriptions in which the identifier values are in disagreement. For example, consider the two descriptions, as indicated by their Master IDs.
It is required that the heuristic methods take remedial action and make the descriptions equivalent, or declare them as distinct. In this example one remedial course of action could be to declare one of the UPC values as erroneous, say UPC=789, and assume that it is UPC=123 as a corrected value.
Thus the heuristic method and system are required to make decisions programmatically that are based on assumptions regarding missing identifier values, incorrect identifier values, etc. The heuristic system creates a “quantifiable probability” between the matches from the sources. The probability differs between the data and the source. The probability is calculated from a mathematical formula involving confidence in decisions based on prior known decisions. One such form of conditional probabilistic reasoning is derived from Bayes' theorem.
By way of example, the probability calculation can take into account the following:
If the method receives a globally unique identifier, it gives a strong weighting to the probability, e.g., UPC or GTIN can get scores of 80.
If the method receives a manufacturer's part number that is only locally relevant and re-used many times, it gives it a lower score, e.g., 20.
If the method receives identifiers whose values differ, the same score can be used with a negative sign, e.g., if the UPC does not match, the score is −80.
If the method receives product title, manufacturer's business entity name, category, price or other such identifier values, the method uses heuristics to determine the score. The score depends on the strength of the match. The scores can be tuned and weighted based on historical information, categories and price points. The method and system support the tuning of these scores and weights.
The method has a tunable threshold to decide if two product descriptions are of the same product. If the score is below the threshold the mapping does not occur. If the score is above the threshold the mapping occurs and identifiers, attributes, and other structured and non-structured data are mapped into the same product cluster.
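The scoring rules and threshold test above may be sketched as follows. The weights 80 and 20 are taken from the examples in the text; the identifier name MPN for the manufacturer's part number, the threshold value of 50, the strict comparison against the threshold, and the function names are assumptions made for illustration. Heuristic scoring of titles, categories and prices is omitted.

```python
# Weights from the examples above; "MPN" is an assumed name for the
# manufacturer's part number.
WEIGHTS = {"UPC": 80, "GTIN": 80, "MPN": 20}
THRESHOLD = 50  # hypothetical, tunable


def match_score(a: dict, b: dict, weights: dict = WEIGHTS) -> int:
    """Add the weight for each matching identifier, subtract it for a mismatch."""
    score = 0
    for name, weight in weights.items():
        va, vb = a.get(name), b.get(name)
        if va is None or vb is None:
            continue  # a missing value contributes nothing
        score += weight if va == vb else -weight
    return score


def same_product(a: dict, b: dict, threshold: int = THRESHOLD) -> bool:
    """Assumption: the mapping occurs only when the score exceeds the threshold."""
    return match_score(a, b) > threshold


x = {"UPC": "123", "MPN": "X1"}
y = {"UPC": "123", "MPN": "X9"}
z = {"UPC": "999", "MPN": "X1"}

print(match_score(x, y), same_product(x, y))  # 60 True
print(match_score(x, z), same_product(x, z))  # -60 False
```

Keeping the weights and the threshold as plain parameters reflects the tuning that the method is said to support.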
The heuristic method and system allow manual curation. Descriptions may be declared explicitly to belong, or not to belong, to a specific cluster.
The mapping methods described above may be implemented in software, hardware, firmware or any combination thereof. The processes are preferably implemented in one or more computer programs executing on a programmable computer system including a processor, a computer-readable storage medium readable by the processor (including, e.g., volatile and non-volatile memory and/or storage elements), and input and output devices. Each computer program could be a set of instructions in a code module resident in random access memory of the computer. Until required the program instructions could be stored in another computer memory (e.g., in a hard drive, or in a removable memory such as an optical disk, external hard drive, memory card, or flash drive) or stored on another computer system and downloaded via the Internet or some other network.
Accordingly, the foregoing descriptions and attached drawings are by way of example only, and are not intended to be limiting.
While the present inventions have been illustrated by a description of various embodiments and while these embodiments have been set forth in considerable detail, it is intended that the scope of the inventions be defined by the appended claims. Those skilled in the art will appreciate that modifications to the foregoing preferred embodiments may be made in various aspects. It is deemed that the spirit and scope of the inventions encompass such variations to be preferred embodiments as would be apparent to one of ordinary skill in the art and familiar with the teachings of the present application.
Additionally, elements and components described herein may be further divided into additional components or joined together to form fewer components for performing the same functions.