The following detailed description refers to the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the following description to refer to the same or similar parts. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only. While several examples are described in this document, modifications, adaptations, and other implementations are possible. Accordingly, the following detailed description does not limit the disclosed examples. Instead, the proper scope of the disclosed examples may be defined by the appended claims.
Data includes features of various types, including numeric, categorical, etc. A categorical feature can describe an entity such as a country, product name, product family, business name, business unit, etc. For example, sales opportunity data contains many features that describe entities, including the product, the product family being sold, the business unit selling it, and the customer who purchased the product. Such entities may undergo changes over time due to, for example, changes in organization structure, product family categorization or renaming, and mergers and acquisitions of companies, resulting in changes in the values of those entities. These changes in entities, both names and context, pose a challenge to data analytics, as old entity values do not match new entity values in certain features. For example, when a company is acquired by another, the company's name will change.
Such changes in entity values over time may generate inconsistencies in the data. Data inconsistencies can pose many technical challenges. Suppose that a company has been collecting sales data for the past several years and wants to use the data to predict the outcome of a new sales opportunity. The business unit that created the product being sold may be a strong predictor of the outcome. However, it is possible that the company underwent a re-organization and/or renaming over the years such that the specific product is associated with various different names of business units in the past sales data. The mismatch in business unit names makes it difficult for a machine learning method to identify the business unit as a strong predictor.
Examples disclosed herein provide technical solutions to these technical problems by identifying a feature that is common to a first dataset and a second dataset, wherein a first value of the feature in the first dataset is different from a second value of the feature in the second dataset; determining a first predicted value of the feature in the first dataset based on a second dataset classifier trained on the second dataset; determining a second predicted value of the feature in the second dataset based on a first dataset classifier trained on the first dataset; determining a first similarity score between the first value and the first predicted value; determining a second similarity score between the second value and the second predicted value; and generating a bipartite graph that comprises a first node indicating the first value, a second node indicating the second value, and an edge indicating the first or second similarity score.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term “plurality,” as used herein, is defined as two or more than two. The term “another,” as used herein, is defined as at least a second or more. The term “coupled,” as used herein, is defined as connected, whether directly without any intervening elements or indirectly with at least one intervening element, unless otherwise indicated. Two elements can be coupled mechanically, electrically, or communicatively linked through a communication channel, pathway, network, or system. The term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will also be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms, as these terms are only used to distinguish one element from another unless stated otherwise or the context indicates otherwise.
The various components (e.g., components 129, 130, and/or 140) depicted in
Data inconsistencies resolving system 110 may comprise a common feature identifying engine 121, a classifier training engine 122, a mapping engine 123, a bipartite graph engine 124, a display engine 125, and/or other engines. The term “engine”, as used herein, refers to a combination of hardware and programming that performs a designated function. As is illustrated with respect to
Common feature identifying engine 121 may identify a feature (referred to herein as the “feature k”) that is common to a first dataset and a second dataset, wherein at least one value of the feature in the first dataset is different from at least one value of the feature in the second dataset. The first dataset and the second dataset may include the same set of features or at least one common feature. For example, in the case of the sales opportunity data discussed above, both datasets would have the same set of features such as product, country, price, business unit, etc.
While the first and second datasets have the same set of features or at least one common feature, the values of a particular feature in the first dataset may be different from the values of the same feature in the second dataset. Common feature identifying engine 121 may identify the unique values of the particular feature in the first dataset and the unique values of the same feature in the second dataset. When the unique values of the feature in the first and second datasets are not identical, a mismatch in the values in that feature may be detected. This mismatch may indicate that there has been a change in the values of that feature.
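As a non-limiting illustration, the following sketch shows how such a mismatch check might look, assuming the two datasets are held as pandas DataFrames; the column names and values are hypothetical and are not taken from the figures.

```python
# A minimal sketch of common-feature identification; column names and values
# are illustrative assumptions, not the data shown in the figures.
import pandas as pd

first_dataset = pd.DataFrame({
    "product":       ["P1", "P2", "P1"],
    "business_unit": ["BU-A", "BU-B", "BU-A"],
})
second_dataset = pd.DataFrame({
    "product":       ["P1", "P2", "P2"],
    "business_unit": ["BU-A2", "BU-B", "BU-A2"],
})

# Features common to both datasets.
common_features = set(first_dataset.columns) & set(second_dataset.columns)

# Flag a common feature when its unique values differ between the datasets;
# such a mismatch may indicate that the values of that feature changed over time.
mismatched_features = [
    k for k in sorted(common_features)
    if set(first_dataset[k].unique()) != set(second_dataset[k].unique())
]
print(mismatched_features)  # ['business_unit']
```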
Suppose that an example dataset as illustrated in
Classifier training engine 122 may train a first dataset classifier on the first dataset. As used herein, a “classifier” may refer to any machine learning classifier (e.g., Nearest Neighbor classifier) that may be trained using a training dataset to classify a plurality of data elements into a plurality of classes. The classifier may predict the classification of each element and/or make an assessment of the confidence in that prediction (e.g., determine a confidence score).
The training set that is used to train the first dataset classifier may be a portion of the first dataset. The portion of the first dataset may exclude the feature (e.g., the feature k) comprising a first set of values. In the example illustrated in
Similarly, classifier training engine 122 may train a second dataset classifier on the second dataset. The training set that is used to train the second dataset classifier may be a portion of the second dataset. The portion of the second dataset may exclude the feature (e.g., the feature k) comprising a second set of values. In the example illustrated in
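The sketch below illustrates one way the two classifiers might be trained, assuming the scikit-learn library and the hypothetical DataFrames from the earlier sketch; the helper name, the feature name "business_unit" standing in for the feature k, and the choice of a Nearest Neighbor classifier are all illustrative assumptions.

```python
# A minimal training sketch; each classifier is trained on the portion of its
# dataset that excludes feature k, with the feature-k values as class labels.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder

FEATURE_K = "business_unit"  # hypothetical name for the mismatched feature k

def train_dataset_classifier(dataset, feature_k=FEATURE_K):
    """Train a classifier that predicts feature k from the remaining features."""
    X = dataset.drop(columns=[feature_k])   # portion of the dataset excluding feature k
    y = dataset[feature_k]                  # values of feature k serve as class labels
    # Encode categorical features; categories unseen at training time map to -1.
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    classifier = KNeighborsClassifier(n_neighbors=1).fit(encoder.fit_transform(X), y)
    return classifier, encoder

first_dataset_classifier, first_encoder = train_dataset_classifier(first_dataset)
second_dataset_classifier, second_encoder = train_dataset_classifier(second_dataset)
```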
Mapping engine 123 may determine, using the second dataset classifier, first mappings from the first set of values to the second set of values. An example of the first mappings is illustrated in
In computing such similarity scores, mapping engine 123 may determine, for each data record of the first dataset (or a portion of the first dataset), a predicted value of the feature (e.g., the feature k) using the second dataset classifier. Returning to the example above, for each data record (e.g., starting from the data record identified by Id 1) of the first dataset in
Mapping engine 123 may determine a first similarity score between a first value of the feature in the first dataset and a first predicted value, where the first predicted value may have been predicted using the second dataset classifier for the data record that contains the first value. The first similarity score may, for example, be computed based on the number of data records in the first dataset that have the first value and were classified with the first predicted value by the second dataset classifier. In
Similarly, mapping engine 123 may determine, using the first dataset classifier, second mappings from the second set of values to the first set of values. An example of the second mappings is illustrated in
In computing such similarity scores, mapping engine 123 may determine, for each data record of the second dataset (or a portion of the second dataset), a predicted value of the feature (e.g., the feature k) using the first dataset classifier. Returning to the example above, for each data record (e.g., starting from the data record identified by Id 1) of the second dataset in
Mapping engine 123 may determine a second similarity score between a second value of the feature in the second dataset and a second predicted value, where the second predicted value may have been predicted using the first dataset classifier for the data record that contains the second value. The second similarity score may, for example, be computed based on the number of data records in the second dataset that have the second value and were classified with the second predicted value by the first dataset classifier. In
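A sketch of this scoring step is shown below, continuing the hypothetical helpers from the previous sketches; the score for each (value, predicted value) pair is simply the count of records carrying that value that the other dataset's classifier assigned to the predicted value.

```python
# A minimal mapping sketch; it reuses the hypothetical classifiers and encoders
# trained above and counts, for each actual value of feature k, how many of its
# records the other dataset's classifier labeled with each predicted value.
from collections import Counter

def mappings_with_scores(dataset, other_classifier, other_encoder, feature_k=FEATURE_K):
    X = other_encoder.transform(dataset.drop(columns=[feature_k]))
    predicted = other_classifier.predict(X)        # predicted values of feature k
    actual = dataset[feature_k].to_numpy()         # actual values of feature k
    return Counter(zip(actual, predicted))         # (actual, predicted) -> record count

# First mappings: values of feature k in the first dataset -> values in the second.
first_mappings = mappings_with_scores(first_dataset, second_dataset_classifier, second_encoder)
# Second mappings: values of feature k in the second dataset -> values in the first.
second_mappings = mappings_with_scores(second_dataset, first_dataset_classifier, first_encoder)
```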
Note that the first and second mappings (e.g., as illustrated in
In some implementations, mapping engine 123 may normalize the similarity scores (e.g., the first similarity score, the second similarity score, etc.). The normalization can ensure that the similarity score is invariant to the sample size of each value both in the first dataset and the second dataset. One way of normalizing the similarity scores is to normalize each score to the range of 0-1. Alternatively or additionally, any other normalization methods may be used. Mapping engine 123 may remove mappings based on low similarity scores and/or normalized similarity scores. For example, mapping engine 123 may compare the normalized score against a threshold. If the normalized score is equal to or less than the threshold, the score may be set to zero or to a predetermined number.
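One plausible normalization, sketched below using the hypothetical mappings above, divides each count by the number of records that carry the corresponding value, which keeps the score in the range 0-1, and then zeroes out scores equal to or below a threshold; the threshold value is an illustrative assumption.

```python
# A minimal normalization sketch; dividing by the per-value record count keeps the
# score in the range 0-1 and independent of how often each value occurs.
def normalize_scores(mappings, dataset, feature_k=FEATURE_K, threshold=0.2):
    value_counts = dataset[feature_k].value_counts()
    normalized = {}
    for (value, predicted), count in mappings.items():
        score = count / value_counts[value]          # fraction of this value's records
        normalized[(value, predicted)] = score if score > threshold else 0.0
    return normalized

first_scores = normalize_scores(first_mappings, first_dataset)
second_scores = normalize_scores(second_mappings, second_dataset)
```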
In some implementations, mapping engine 123 may generate a combined similarity score that combines the first similarity score and the second similarity score. The two scores (or two normalized scores) may be combined in various ways. For example, they may be combined by adding, multiplying, and/or taking a maximum or minimum value between the two scores. The combined score may be further normalized using any of the normalization methods as discussed herein.
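The sketch below combines the two directional scores by taking their maximum, which is one of the combinations mentioned above; adding or multiplying the scores would follow the same pattern.

```python
# A minimal combination sketch; the first scores are keyed (first value, second value)
# and the second scores (second value, first value), so the key is flipped when the
# two directions are merged.
def combine_scores(first_scores, second_scores):
    pairs = {(v1, v2) for (v1, v2) in first_scores}
    pairs |= {(v1, v2) for (v2, v1) in second_scores}
    return {
        (v1, v2): max(first_scores.get((v1, v2), 0.0),
                      second_scores.get((v2, v1), 0.0))
        for (v1, v2) in pairs
    }

combined_scores = combine_scores(first_scores, second_scores)
```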
Bipartite graph engine 124 may generate a bipartite graph based on the first and/or second mappings. The bipartite graph (e.g., as illustrated in
In some implementations, the edge may be bi-directional when both the first and second mappings exist between a pair of feature values. In some implementations, the bi-directional edge may indicate the combined similarity score as discussed herein with respect to mapping engine 123. When the combined similarity score is greater than zero or a predetermined threshold, the bi-directional edge may be created between the pair of feature values in the bipartite graph.
In some implementations, the edge may be visually different depending on the first, second, and/or the combined similarity scores associated with the edge. The appearance of the edge may vary in thickness, darkness, color, shape, and/or other visual characteristics of the edge based on the similarity score. For example, an edge with a higher similarity score may appear differently (e.g., thicker line) from another edge with a lower similarity score.
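A sketch of the graph construction is shown below, assuming the networkx library and the combined scores from the previous sketches; an edge is added only for pairs with a positive combined score, and the score is stored as an edge weight so that, for example, edge thickness can reflect it when the graph is drawn.

```python
# A minimal bipartite-graph sketch; node labels and the width scaling factor are
# illustrative assumptions.
import networkx as nx

bipartite_graph = nx.Graph()
for (first_value, second_value), score in combined_scores.items():
    if score > 0:                                   # keep only surviving mappings
        bipartite_graph.add_node(("first", first_value), bipartite=0)
        bipartite_graph.add_node(("second", second_value), bipartite=1)
        bipartite_graph.add_edge(("first", first_value),
                                 ("second", second_value), weight=score)

# Edge widths proportional to the similarity score, e.g., for display purposes.
edge_widths = [4 * bipartite_graph[u][v]["weight"] for u, v in bipartite_graph.edges]
```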
Display engine 125 may cause a display of the bipartite graph to enable a user to interact with the bipartite graph via the display. The user may interact with the bipartite graph by, for example, adding, modifying, or deleting at least one of the nodes or edges of the bipartite graph. In some instances, the user may modify the similarity score associated with a particular edge. This allows the user to review, verify, and/or confirm the discovered mappings between the first dataset and the second dataset.
In performing their respective functions, engines 121-125 may access data storage 129 and/or other suitable database(s). Data storage 129 may represent any memory accessible to data inconsistencies resolving system 110 that can be used to store and retrieve data. Data storage 129 and/or other database may comprise random access memory (RAM), read-only memory (ROM), electrically-erasable programmable read-only memory (EEPROM), cache memory, floppy disks, hard disks, optical disks, tapes, solid state drives, flash drives, portable compact disks, and/or other storage media for storing computer-executable instructions and/or data. Data inconsistencies resolving system 110 may access data storage 129 locally or remotely via network 50 or other networks.
Data storage 129 may include a database to organize and store data. The database may reside in a single or multiple physical device(s) and in a single or multiple physical location(s). The database may store a plurality of types of data and/or files and associated data or file description, administrative information, or any other data.
In the foregoing discussion, engines 121-125 were described as combinations of hardware and programming. Engines 121-125 may be implemented in a number of fashions. Referring to
In
In the foregoing discussion, engines 121-125 were described as combinations of hardware and programming. Engines 121-125 may be implemented in a number of fashions. Referring to
In
Machine-readable storage medium 310 (or machine-readable storage medium 410) may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. In some implementations, machine-readable storage medium 310 (or machine-readable storage medium 410) may be a non-transitory storage medium, where the term “non-transitory” does not encompass transitory propagating signals. Machine-readable storage medium 310 (or machine-readable storage medium 410) may be implemented in a single device or distributed across devices. Likewise, processor 311 (or processor 411) may represent any number of processors capable of executing instructions stored by machine-readable storage medium 310 (or machine-readable storage medium 410). Processor 311 (or processor 411) may be integrated in a single device or distributed across devices. Further, machine-readable storage medium 310 (or machine-readable storage medium 410) may be fully or partially integrated in the same device as processor 311 (or processor 411), or it may be separate but accessible to that device and processor 311 (or processor 411).
In one example, the program instructions may be part of an installation package that when installed can be executed by processor 311 (or processor 411) to implement data inconsistencies resolving system 110. In this case, machine-readable storage medium 310 (or machine-readable storage medium 410) may be a portable medium such as a floppy disk, CD, DVD, or flash drive or a memory maintained by a server from which the installation package can be downloaded and installed. In another example, the program instructions may be part of an application or applications already installed. Here, machine-readable storage medium 310 (or machine-readable storage medium 410) may include a hard disk, optical disk, tapes, solid state drives, RAM, ROM, EEPROM, or the like.
Processor 311 may be at least one central processing unit (CPU), microprocessor, and/or other hardware device suitable for retrieval and execution of instructions stored in machine-readable storage medium 310. Processor 311 may fetch, decode, and execute program instructions 321-323, and/or other instructions. As an alternative or in addition to retrieving and executing instructions, processor 311 may include at least one electronic circuit comprising a number of electronic components for performing the functionality of at least one of instructions 321-323, and/or other instructions.
Processor 411 may be at least one central processing unit (CPU), microprocessor, and/or other hardware device suitable for retrieval and execution of instructions stored in machine-readable storage medium 410. Processor 411 may fetch, decode, and execute program instructions 421-425, and/or other instructions. As an alternative or in addition to retrieving and executing instructions, processor 411 may include at least one electronic circuit comprising a number of electronic components for performing the functionality of at least one of instructions 421-425, and/or other instructions.
Method 500 may start in block 521 where method 500 may identify a feature that is common to a first dataset and a second dataset, wherein a first value of the feature in the first dataset is different from a second value of the feature in the second dataset. While the first and second datasets have the same set of features or at least one common feature, the values of a particular feature in the first dataset may be different from the values of the same feature in the second dataset. When the unique values of the feature in the first and second datasets are not identical, a mismatch in the values in that feature may be detected. This mismatch may indicate that there has been a change in the values of that feature.
In block 522, method 500 may determine a first predicted value of the feature in the first dataset based on a second dataset classifier trained on the second dataset. For example, for each data record (e.g., starting from the data record identified by Id 1) of the first dataset in
In block 523, method 500 may determine a first similarity score between the first value and the first predicted value, where the first predicted value may have been predicted using the second dataset classifier for the data record that contains the first value. The first similarity score may, for example, be computed based on the number of data records in the first dataset that have the first value and were classified with the first predicted value by the second dataset classifier. In
In block 524, method 500 may determine a second predicted value of the feature in the second dataset based on a first dataset classifier trained on the first dataset. For example, for each data record (e.g., starting from the data record identified by Id 1) of the second dataset in
In block 525, method 500 may determine a second similarity score between the second value and the second predicted value, where the second predicted value may have been predicted using the first dataset classifier for the data record that contains the second value. The second similarity score may, for example, be computed based on the number of data records in the second dataset that have the second value and were classified with the second predicted value by the first dataset classifier. In
Note that the first and second similarity scores are not necessarily identical since the predictions depend on different training sets (e.g., the first similarity score based on the second dataset and the second similarity score based on the first dataset). For example, in
In block 526, method 500 may generate a bipartite graph that comprises a first node indicating the first value, a second node indicating the second value, and an edge indicating the first or second similarity score. The bipartite graph (e.g., as illustrated in
Referring back to
Method 600 may start in block 621 where method 600 may identify a feature (e.g., feature k) that is common to a first dataset and a second dataset, wherein a first value of the feature in the first dataset is different from a second value of the feature in the second dataset. While the first and second datasets have the same set of features or at least one common feature, the values of a particular feature in the first dataset may be different from the values of the same feature in the second dataset. When the unique values of the feature in the first and second datasets are not identical, a mismatch in the values in that feature may be detected. This mismatch may indicate that there has been a change in the values of that feature.
In block 622, method 600 may train a second dataset classifier using a portion of the second dataset. The portion of the second dataset may include a plurality of features except the feature (e.g., feature k). In the example illustrated in
In block 623, method 600 may train a first dataset classifier using a portion of the first dataset. The portion of the first dataset may include the plurality of features except the feature (e.g., feature k). In the example illustrated in
In block 624, method 600 may determine a first predicted value of the feature in the first dataset based on the second dataset classifier. For example, for each data record (e.g., starting from the data record identified by Id 1) of the first dataset in
In block 625, method 600 may determine a first similarity score between the first value and the first predicted value, where the first predicted value may have been predicted using the second dataset classifier for the data record that contains the first value. The first similarity score may, for example, be computed based on the number of data records in the first dataset that have the first value and were classified with the first predicted value by the second dataset classifier. In
In block 626, method 600 may determine a second predicted value of the feature in the second dataset based on the first dataset classifier. For example, for each data record (e.g., starting from the data record identified by Id 1) of the second dataset in
In block 627, method 600 may determine a second similarity score between the second value and the second predicted value, where the second predicted value may have been predicted using the first dataset classifier for the data record that contains the second value. The second similarity score may, for example, be computed based on the number of data records in the second dataset that have the second value and were classified with the second predicted value by the first dataset classifier. In
Note that the first and second similarity scores are not necessarily identical since the predictions depend on different training sets (e.g., the first similarity score based on the second dataset and the second similarity score based on the first dataset). For example, in
In block 628, method 600 may normalize the first or second similarity score. The normalization can ensure that the similarity score is invariant to the sample size of each value both in the first dataset and the second dataset. One way of normalizing the similarity score is to normalize each score to the range of 0-1. Alternatively or additionally, any other normalization methods may be used.
Some mappings may be removed based on low similarity scores and/or normalized similarity scores. In block 629, method 600 may compare the normalized score against a threshold. If the normalized score is equal to or less than the threshold, the score may be set to zero (block 630).
In block 631, method 600 may combine the first and second similarity scores. The two scores (or two normalized scores) may be combined in various ways. For example, they may be combined by adding, multiplying, and/or taking a maximum or minimum value between the two scores. The combined score may be further normalized using any of the normalization methods as discussed herein.
In block 632, method 600 may generate a bipartite graph based on the combined similarity score. For example, when the combined similarity score is greater than zero or a predetermined threshold, a bi-directional edge may be created between the first value and the second value in the bipartite graph.
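For reference, the sketch below chains the hypothetical helpers from the earlier sketches into a single pass corresponding roughly to blocks 621-632; it is an illustrative composition under the stated assumptions, not a definitive implementation of method 600.

```python
# A minimal end-to-end sketch of method 600, reusing the hypothetical helpers
# defined earlier (train_dataset_classifier, mappings_with_scores,
# normalize_scores, combine_scores) and the networkx library.
def resolve_feature_inconsistencies(first_dataset, second_dataset, feature_k, threshold=0.2):
    clf2, enc2 = train_dataset_classifier(second_dataset, feature_k)        # block 622
    clf1, enc1 = train_dataset_classifier(first_dataset, feature_k)         # block 623
    first_scores = normalize_scores(                                        # blocks 624-625
        mappings_with_scores(first_dataset, clf2, enc2, feature_k),         # and 628-630
        first_dataset, feature_k, threshold)
    second_scores = normalize_scores(                                       # blocks 626-627
        mappings_with_scores(second_dataset, clf1, enc1, feature_k),        # and 628-630
        second_dataset, feature_k, threshold)
    combined = combine_scores(first_scores, second_scores)                  # block 631
    graph = nx.Graph()                                                      # block 632
    for (v1, v2), score in combined.items():
        if score > 0:
            graph.add_edge(("first", v1), ("second", v2), weight=score)
    return graph

graph = resolve_feature_inconsistencies(first_dataset, second_dataset, "business_unit")
```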
Referring back to
Bipartite graph 1100 may be presented to a user to enable the user to interact with bipartite graph 1100 via a display. The user may interact with bipartite graph 1100 by adding, modifying, or deleting at least one of the nodes or edges of bipartite graph 1100. In some instances, the user may modify the similarity score associated with a particular edge. This allows the user to review, verify, and/or confirm the discovered mappings between first dataset 1110 and second dataset 1120.
The foregoing disclosure describes a number of example implementations for resolution of data inconsistencies. The disclosed examples may include systems, devices, computer-readable storage media, and methods for resolution of data inconsistencies. For purposes of explanation, certain examples are described with reference to the components illustrated in
Further, all or part of the functionality of illustrated elements may co-exist or be distributed among several geographically dispersed locations. Moreover, the disclosed examples may be implemented in various environments and are not limited to the illustrated examples. Further, the sequence of operations described in connection with