The present disclosure relates generally to the field of machine learning technology. More specifically, the present disclosure relates to machine learning systems and methods for performing entity resolution using a flexible minimum weight set packing framework.
In the field of machine learning, entity resolution is the task of disambiguating records that correspond to real world entities across and within datasets. Entity resolution can be described as recognizing when two observations relate to the same entity despite having been described differently (e.g., duplicates of the same person with different names in an address book), or recognizing when two observations do not relate to the same entity despite having been described similarly (e.g., two observations with the same name, where the first has a Jr. suffix and the second has a Sr. suffix). Entity resolution also relates to the ability to remember the relationship between these entities. The applications of entity resolution are vast for the public sector and for federal datasets related to banking, healthcare, insurance, transportation, finance, law enforcement, and the military.
As the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes a greater challenge. Entity resolution can reduce this complexity by de-duplicating and linking entities. Traditional approaches tackle entity resolution with hierarchical clustering. However, these approaches are heuristic and inexact, as they do not benefit from a formal optimization formulation.
Therefore, there is a need for computer systems and methods which can perform entity resolution using an optimized formulation, thereby improving speed and utilizing fewer computational resources. These and other needs are addressed by the machine learning systems and methods of the present disclosure.
The present disclosure relates to machine learning systems and methods for performing entity resolution using a flexible minimum weight set packing framework. The system uses attributes of a table to determine whether two observations represent the same real world entity. Specifically, pair identification is performed such that pairs are selected in a high recall-low precision region of a precision-recall curve. This serves to eliminate the overwhelming majority of bad matches while keeping the possible good matches, and exploits the fact that the number of false matches is significantly greater than the number of true matches in entity resolution problems. More specifically, the system first generates a limited set of pairs of observations, where each pair of observations may be co-assigned in a hypothesis. The system then generates a probability score for each pair of observations. The probability score is defined over a given pair of observations and is the probability that the pair is associated with a common entity in ground truth. The system then defines problem specific cost terms of a single hypothesis, where the cost terms are associated with pairs of possibly co-associated observations. For example, the system can generate cost terms by adding a bias to the negative of the probability scores. The system then determines a negative (or lowest) reduced cost hypothesis (a process which can be referred to as "pricing"). The system then performs entity resolution using a F-MWSP formulation. Specifically, using the F-MWSP formulation, the system packs observations into hypotheses based on the cost terms. This generates a bijection from the hypotheses in the packing to real world entities.
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present disclosure relates to machine learning systems and methods for performing entity resolution using a flexible minimum weight set packing framework, as described in detail below in connection with
The present system describes an optimized approach to entity resolution. Specifically, the present system models entity resolution as correlation-clustering, which the present system treats as a weighted set-packing problem and denotes as an integer linear program (“ILP”). Sources in the input data correspond to elements, and entities in output data correspond to sets/clusters. As will be described in greater detail below, the present system performs optimization of weighted set packing by relaxing integrality in an ILP formulation. Since the set of potential sets/clusters cannot be explicitly enumerated, the present system performs optimization using column generation. In addition, the present system generates flexible dual optimal inequalities (“F-DOIs”) which tightly lower-bound dual variables during optimization and accelerate the column generation. The system applies this formulation to entity resolution to achieve improved accuracy and increase speed using fewer computational resources when processing input data (e.g., datasets).
The classifier system 14 includes a blocking module 16, a scoring module 18, and a labeled subset 20. The blocking module 16 applies a blocking technique to the input data 12, which generates a limited set of pairs of observations which can be co-assigned in a common hypothesis. The scoring module 18 generates a probability score for each pair of observations. The scoring module 18 can be trained by a learning algorithm using the labeled subset 20 to distinguish between observation pairs that are/are not part of a common entity in ground truth (information provided by direct observation as opposed to by inference). This will be explained in greater detail below.
The classifier system 14 generates output data that is fed into the F-MWSP system 22. The F-MWSP system 22 includes a clustering system 24 and processes the output data to generate hypotheses. Specifically, given input data (e.g., a dataset of observations each associated with up to one object), the system 10 packs (or partitions) the observations into groups called hypotheses (or entities) such that there is a bijection from the hypotheses to unique entities in the dataset. The system 10 partitions the observations into hypotheses so that: (1) all observations of any real world entity are associated with exactly one selected hypothesis; and (2) each selected hypothesis is associated with observations of exactly one real world entity. The processes of the F-MWSP system 22 will be explained in greater detail below.
In step 32, the system 10 generates a limited set of pairs of observations, where each pair of observations may be co-assigned in a hypothesis. In an example, the blocking module 16 filters out a portion of the pairs of observations from the input data 12. This leaves a proportion of the pairs for further processing.
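Step 32 can be pictured with a minimal blocking sketch in Python. This is an illustration only, not the Dedupe API referred to elsewhere in this disclosure; the records, the `block_pairs` helper, and the choice of blocking key (a name prefix) are all hypothetical.

```python
from itertools import combinations
from collections import defaultdict

def block_pairs(records, key_fn):
    """Group records by a blocking key and emit only within-block candidate
    pairs, filtering out the vast majority of cross-block pairs."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_fn(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs

# Hypothetical records; block on the first three letters of the lowercased name.
records = [{"name": "Jon Smith"}, {"name": "Jonathan Smith"}, {"name": "Ann Lee"}]
pairs = block_pairs(records, key_fn=lambda r: r["name"][:3].lower())
```

Here only the pair of records sharing the "jon" key survives blocking; the cross-block pairs are never generated, which is the high recall-low precision filtering described above.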
In step 34, the system 10 generates a probability score for each pair of observations using the scoring module 18. The probability score is defined over a given pair of observations and is the probability that the pair is associated with a common entity in ground truth. As discussed above, the scoring module 18 can be trained by any learning algorithm on annotated data (e.g., the labeled subset 20) to generate the probability scores.
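The scoring step can likewise be sketched with a hand-rolled stand-in. The Jaccard feature, the `match_probability` helper, and the fixed weight/bias are hypothetical; in practice the weight and bias would be fit by a learning algorithm (e.g., ridge logistic regression) on the labeled subset 20.

```python
import math

def pair_feature(rec_a, rec_b):
    """Toy similarity feature: Jaccard overlap of name tokens (hypothetical)."""
    ta = set(rec_a["name"].lower().split())
    tb = set(rec_b["name"].lower().split())
    return len(ta & tb) / len(ta | tb)

def match_probability(rec_a, rec_b, weight=4.0, bias=-2.0):
    """Logistic probability that the pair refers to a common entity; the
    weight and bias stand in for parameters fit on labeled pairs."""
    z = weight * pair_feature(rec_a, rec_b) + bias
    return 1.0 / (1.0 + math.exp(-z))

p = match_probability({"name": "Jon Smith"}, {"name": "Jon A Smith"})
```

An exact name match scores higher than a partial one, as expected of a probability that increases with pairwise similarity.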
In step 36, the system 10 performs entity resolution using a F-MWSP formulation. Specifically, using the F-MWSP formulation, the system 10 packs observations into hypotheses based on the cost terms. This generates a bijection from the hypotheses in the packing to real world entities. Step 36 will be explained in further detail below with respect to
The system 10 defines the cost of the hypothesis g∈G, where G is the set of all possible hypotheses. The set G is described using a matrix G∈{0, 1}|D|×|G|, where Gdg=1 if the hypothesis g includes observation d, and otherwise Gdg=0. It is a structural property of the problem domain that most pairs of observations cannot be part of a common hypothesis. For such pairs d1, d2, the system 10 sets θd1d2=∞. These are the pairs not identified by the blocking module 16 as being feasible. The system 10 uses θdd=0 for all d∈D. The system 10 defines the cost of the hypothesis g∈G as shown in Equation 1, below:
With the cost of the hypothesis defined, the system 10 can treat entity resolution as a MWSP (minimum weight set packing) problem, and solve it using column generation. Any observation not associated with any selected hypothesis in the solution to the MWSP problem is defined to be in a hypothesis by itself of zero cost.
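The MWSP view can be illustrated with a tiny brute-force sketch, under the assumption (consistent with the discussion above) that a hypothesis's cost sums its pairwise θ terms, that θ=∞ forbids co-assignment, and that unpacked observations sit in zero-cost singleton hypotheses. The helper names and the toy θ values are hypothetical; the actual system solves this via the LP relaxation and column generation described below.

```python
from itertools import combinations, product

INF = float("inf")

def hypothesis_cost(g, theta):
    """Cost of a hypothesis g (a frozenset of observations): the sum of its
    pairwise theta terms; a pair with theta = inf forbids co-assignment.
    A singleton hypothesis has cost 0 (theta_dd = 0)."""
    cost = 0.0
    for d1, d2 in combinations(sorted(g), 2):
        t = theta.get((d1, d2), INF)  # unlisted pairs are infeasible
        if t == INF:
            return INF
        cost += t
    return cost

def brute_force_mwsp(theta, observations):
    """Brute-force stand-in for the MWSP ILP on a tiny instance: enumerate
    all packings of feasible multi-observation hypotheses and keep the
    cheapest one; observations left out sit in zero-cost singletons."""
    candidates = [frozenset(s) for r in range(2, len(observations) + 1)
                  for s in combinations(observations, r)]
    candidates = [g for g in candidates if hypothesis_cost(g, theta) < INF]
    best, best_cost = [], 0.0  # the all-singletons packing has cost 0
    for gamma in product((0, 1), repeat=len(candidates)):
        chosen = [g for g, use in zip(candidates, gamma) if use]
        covered = [d for g in chosen for d in g]
        if len(covered) != len(set(covered)):
            continue  # some observation is in two selected hypotheses
        cost = sum(hypothesis_cost(g, theta) for g in chosen)
        if cost < best_cost:
            best, best_cost = chosen, cost
    return best, best_cost

# Illustrative pairwise costs: negative = likely the same entity.
theta = {(0, 1): -0.9, (2, 3): -0.4, (0, 2): INF, (1, 2): 0.6}
packing, total = brute_force_mwsp(theta, [0, 1, 2, 3])
```

The optimal packing here groups {0, 1} and {2, 3}, the two negatively-priced pairs, while respecting the infeasibility of co-assigning 0 and 2.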
The following will discuss an integer linear program ("ILP") formulation of the MWSP problem. An observation corresponds to an element in the set-packing context and a data source in the entity resolution context. Term D is used to denote the set of observations, which are indexed by term d. A hypothesis corresponds to a set in the set-packing context, and an entity in the entity resolution context. Given a set of observations D, the set of all hypotheses is the power set of D, which is denoted as term G and indexed by term g.
A real valued cost Γg is associated with each g∈G, where Γg is the cost of including g in the packing. The hypothesis g containing no observations is defined to have cost Γg=0. A packing is described using γ∈{0, 1}|G|, where γg=1 indicates that the hypothesis g is included in the solution, and otherwise γg=0. Thus, the MWSP problem written as an ILP is expressed by Equation 2, below:
The constraints in Equation 2 enforce that no observation is included in more than one selected hypothesis in the packing. Solving Equation 2 is challenging for two key reasons. First, MWSP is an NP-hard problem. Second, term G is too large to be considered in optimization. To tackle the first key reason, the system 10 relaxes the integrality constraints on γ, resulting in a linear program expressed by Equation 3, below:
The system 10 can circumvent the second key reason using column generation. Specifically, a column generation algorithm constructs a small sufficient subset of G (which is denoted Ĝ and initialized empty), such that an optimal solution to Equation 3 exists for which only hypotheses in Ĝ are used. Thus, column generation avoids explicitly enumerating term G, which grows exponentially in |D|. Primal-dual optimization over Ĝ, which is referred to as the restricted master problem ("RMP"), is expressed by Equations 4 and 5, below:
Equation 5
The system 10 can solve Equation 6 using a specialized solver exploiting specific structural properties of the problem domain. In many problem domains, pricing algorithms return multiple negative reduced cost hypotheses in G. In these cases, some or all returned hypotheses with negative reduced cost are added to Ĝ.
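The overall loop of column generation can be sketched as follows, with the RMP solver and pricing oracle abstracted as callables (their interfaces here are hypothetical):

```python
def column_generation(solve_rmp, pricing, max_iters=1000):
    """Skeleton of the column generation loop: solve the restricted master
    problem (RMP) over the current subset G_hat (Equations 4 and 5), then ask
    the pricing oracle for negative reduced cost hypotheses; terminate when
    the oracle returns none."""
    G_hat = []
    for _ in range(max_iters):
        duals = solve_rmp(G_hat)   # dual solution of the RMP
        new_cols = pricing(duals)  # some/all negative reduced cost hypotheses
        if not new_cols:
            break                  # no negative reduced cost hypothesis remains
        G_hat.extend(new_cols)     # add returned hypotheses to G_hat
    return G_hat
```

The loop grows Ĝ only with columns that can improve the current RMP solution, which is how the exponential set G is never enumerated explicitly.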
The column generation terminates when no negative reduced cost hypotheses remain in term G (e.g.,
If Equation 4 produces a binary valued γ at termination of column generation (i.e., the LP relaxation is tight), then γ is provably the optimal solution to Equation 2. However, if γ is fractional at termination of the column generation, an approximate solution to Equation 2 can be obtained by the system 10 by replacing G in Equation 2 with Ĝ (e.g.,
The convergence of the algorithm in
It is noted that removal of a small number of observations rarely causes a significant change to the cost of a hypothesis in Ĝ. As such, the system 10 can use varying DOIs, which will now be discussed. Term
It is noted that Ξd may increase (but not decrease) over the course of column generation as Ĝ grows. The computation of Ξdg is performed by the system 10 using problem specific worst case analysis for each g upon addition to Ĝ.
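One plausible form of such a worst case analysis, assuming the pairwise grouping later used for Equation 19 (negative θ terms counted in full, positive θ terms halved, and the result clipped at zero), can be sketched as follows; the function name and θ encoding are hypothetical:

```python
def xi(d, g, theta):
    """Worst case bound Xi_dg on the cost increase when observation d is
    removed from hypothesis g; zero for observations outside g. Missing
    theta entries are treated as 0 for this sketch (infeasible pairs never
    co-occur inside a hypothesis)."""
    if d not in g:
        return 0.0
    total = 0.0
    for d2 in g:
        if d2 == d:
            continue
        t = theta.get((min(d, d2), max(d, d2)), 0.0)
        if t < 0:
            total += -t         # negative pairwise terms counted in full
        elif t > 0:
            total += -t / 2.0   # positive pairwise terms halved
    return max(total, 0.0)      # enforce non-negativity of the bound

theta = {(0, 1): -0.8, (0, 2): 0.4}
```

For example, xi(0, {0, 1, 2}, theta) combines the full −0.8 term with half of the +0.4 term, yielding a bound of 0.6.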
A drawback of varying DOI is that Ξd depends on all hypotheses in Ĝ(as defined in Equation 9), while often only a small subset of Ĝ are active (selected) in an optimal solution to Equation 4. Thus, during the process of the algorithm in
The following will discuss a MWSP formulation using column generation featuring the F-DOIs. Given any g∈G, term Ξdg is positive if Gdg=1 and otherwise Ξdg=0, and is defined such that for all non-empty s⊆g the bound expressed in Equation 10, below, is satisfied:
Term Zd is the set of unique positive values of Ξdg over all g∈Ĝ, which are indexed by term z. The system 10 orders the values in Zd from smallest to largest as [ωd1, ωd2, ωd3 . . . ]. Term Ξdg is described using Zdzg∈{0, 1}, where Zdzg=1 if Ξdg≥ωdz. Additionally, term Ξdg is described using Ξdz as follows: Ξdz=ωdz−ωd(z-1) ∀z∈Zd, z≥2; Ξd1=ωd1. The system 10 uses term Z to model the MWSP problem as a primal/dual LP, as expressed in Equations 11 and 12, below, where the F-DOIs are inequalities −Ξdz≤λdz:
The system 10 conducts efficient pricing under the MWSP formulation of Equation 11 using Equation 13, below:
Returning to
Term Dd* is the set of observations that may be grouped with observation d*, which can be referred to as its neighborhood. Since the lowest reduced cost hypothesis contains some d*∈D, the system 10 can solve Equation 6 by solving Equation 14 for each d*∈D.
In step 64, the system 10 decreases the number of observations considered in the pricing sub-problems, particularly those with large numbers of observations. The system 10 performs step 64 by associating a unique rank rd to each observation d∈D, such that rd increases with |Dd|, i.e., the more neighbors an observation has, the higher the rank the system 10 assigns to it. To ensure that each observation has a unique rank, the system 10 can break ties arbitrarily.
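The ranking of step 64 can be sketched directly, assuming neighborhoods are given as a dict from each observation to its set of neighbors (names hypothetical); ties in neighborhood size are broken here by observation id:

```python
def rank_observations(neighborhoods):
    """Assign each observation a unique rank r_d that increases with the
    size of its neighborhood; ties broken arbitrarily (here by id)."""
    order = sorted(neighborhoods, key=lambda d: (len(neighborhoods[d]), d))
    return {d: r for r, d in enumerate(order)}

# Hypothetical neighborhoods: observation 0 has the most neighbors.
nbrs = {0: {1, 2, 3}, 1: {0}, 2: {0, 3}, 3: {0, 2}}
ranks = rank_observations(nbrs)
```

Observation 0, with three neighbors, receives the highest rank; observations 2 and 3 tie on size and are ordered by id.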
Given that d* is the lowest ranking observation in the hypothesis, the system 10 considers the set of observations subject to d∈{Dd*∩{d: rd≥rd*}}, which is defined as D*d*. The resultant pricing sub-problem is expressed by Equation 15, below:
In step 66, the system 10 removes superfluous sub-problems. Specifically, the system 10 relaxes the constraint Gd*g=1 in Equation 15. It is noted that for any d2∈D s.t. D*d*⊂D*d2, the lowest reduced cost hypothesis over D*d2 has no greater reduced cost than that over D*d*. Neighborhood D*d* can be referred to as being non-dominated if no d2∈D exists s.t. D*d*⊂D*d2. During pricing, the system 10 iterates over non-dominated neighborhoods. For a given non-dominated neighborhood D*d*, the pricing sub-problem is expressed as Equation 16, below:
In step 68, the system 10 performs exact and/or heuristic pricing. Specifically, the system 10 frames Equation 16 as an ILP, which the system 10 solves using a mixed integer linear programming ("MILP") solver. Decision variables x, γ are set as follows. Binary variable xd is set to 1 to indicate that d is included in the hypothesis being generated, and otherwise xd=0. Variable γd1d2 is set to 1 to indicate that both d1, d2 are included in the hypothesis being generated, and otherwise γd1d2=0. The system 10 defines ε−={(d1, d2): θd1d2=∞} as the set containing pairs of observations that cannot be grouped together, and ε+={(d1, d2): θd1d2<∞} as the set containing pairs of observations that can be grouped together. Using these terms, the solution to Equation 16 as a MILP is expressed in Equation 17, below:
Equation 17 is subject to the following four constraints:
xd1+xd2≤1 ∀(d1,d2)∈ε− Constraint 1:
γd1d2≤xd1 ∀(d1,d2)∈ε+ Constraint 2:
γd1d2≤xd2 ∀(d1,d2)∈ε+ Constraint 3:
xd1+xd2−γd1d2≤1 ∀(d1,d2)∈ε+ Constraint 4:
Equation 17 defines the reduced cost of the hypothesis being constructed. Constraint 1 enforces that pairs for which θd1d2=∞ are not included in a common hypothesis. Constraints 2-4 enforce that γd1d2=xd1xd2. It is noted that since variable x is binary, variable γ must also be binary so as to obey Constraints 2-4. Thus, the system 10 does not need to explicitly enforce γ to be binary.
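That Constraints 2-4 pin γd1d2 to the product xd1xd2 for binary x can be verified by enumeration; the sketch below (names hypothetical) lists, for each binary (xd1, xd2), the binary γ values satisfying Constraints 2-4:

```python
from itertools import product

def feasible_gammas(x1, x2):
    """Binary gamma values satisfying Constraints 2-4 for fixed binary x1, x2:
    gamma <= x1, gamma <= x2, and x1 + x2 - gamma <= 1."""
    return [g for g in (0, 1) if g <= x1 and g <= x2 and x1 + x2 - g <= 1]

checks = {(x1, x2): feasible_gammas(x1, x2) for x1, x2 in product((0, 1), repeat=2)}
```

In every case exactly one γ is feasible, and it equals x1*x2, so binary γ need not be enforced explicitly.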
It is noted that the system 10 solving Equation 16 using Equation 17 and Constraints 1-4 for each non-dominated neighborhood can be too time intensive for some scenarios. This is because Equation 16 generalizes max-cut, which is NP-hard. Accordingly, the system 10 can use heuristic methods (e.g., heuristic pricing) to solve Equation 16. Using heuristic pricing as applied in machine learning/computer vision, the system 10 decreases the computation time of pricing by decreasing the number of sub-problems solved, and by solving those sub-problems heuristically.
Regarding early termination of pricing, it is noted that solving pricing (exactly or heuristically) over a limited subset of the sub-problems produces an approximate minimizer of Equation 6. The system 10 decreases the number of sub-problems solved during a given iteration of column generation as follows. The system 10 terminates pricing in a given iteration when M negative reduced cost hypotheses have been added to Ĝ in that iteration of column generation (M is a user defined constant; M=50 is used by way of example). This process can be referred to as partial pricing.
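Partial pricing can be sketched as a simple early-exit loop over the pricing sub-problems; the `solve_one` oracle returning a (hypothesis, reduced cost) pair is a hypothetical interface:

```python
def partial_pricing(subproblems, solve_one, M=50):
    """Early-exit pricing: stop once M negative reduced cost hypotheses have
    been found in this column generation iteration (M is user defined)."""
    found = []
    for sp in subproblems:
        hypothesis, reduced_cost = solve_one(sp)
        if reduced_cost < 0:
            found.append(hypothesis)
            if len(found) >= M:
                break  # terminate pricing for this iteration
    return found
```

With M=2 and four sub-problems all yielding negative reduced costs, only the first two are collected and the remaining sub-problems are skipped.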
The system 10 can solve the sub-problems approximately (e.g., solve Equation 17 with constraints 1-4) using a quadratic pseudo-Boolean optimization with improve option (“QPBO-I”) method. It is noted that the use of heuristic pricing does not prohibit the exact solution of Equation 3. The system 10 can switch to exact pricing after heuristic pricing fails to find a negative reduced cost hypothesis in G.
Returning to step 36 of
The system 10 bounds the components of Equation 18 as follows. For θd1d2<0, the system upper bounds −θd1d2 max([d1∈s], [d2∈s]) with: −θd1d2([d1∈s]+[d2∈s]). For θd1d2>0, the system 10 upper bounds −θd1d2 max([d1∈s], [d2∈s]) with: −(θd1d2/2) ([d1∈ s]+[d2∈s]). The system then plugs the upper bounds into Equation 18, grouped by [d∈s], and enforces non-negativity of the result. Equation 18≤[d∈s]Ξdg where Ξdg=0 for d∉g, is expressed in Equation 19, below:
Testing and analysis of the above systems and methods will now be discussed in greater detail. Specifically, the following will discuss different properties of the F-MWSP clustering algorithm and evaluate the performance scores on certain benchmark datasets. The classifier system 14 used an entity resolution library called Dedupe to perform blocking and scoring functionalities. Dedupe offers attribute type specific blocking rules and a ridge logistic regression algorithm as a default for scoring. The classifier system 14 can additionally take the domain of the dataset into account, thus significantly boosting the performance of the clustering outcome.
To understand the benefits of F-MWSP clustering, it is helpful to first conduct an ablation study on a single dataset. The dataset chosen in this section is called patent_example and is available in the Dedupe library. Dataset patent_example is a labeled dataset listing patent statistics of Dutch innovators. It has 2379 entities and 102 clusters, where the mean cluster size is 23. The dataset was split into two halves, and the second half was set aside only to report the accuracies. From the first half of the dataset, which is visible to the learning algorithm, approximately 1% of the total matches were randomly sampled and provided to the classifier system 14 as labeled data.
The present system also provides tractable solutions to the pricing problem. Regarding solving pricing exactly or heuristically, exact pricing is often not feasible in entity resolution owing to the large neighborhoods of some sub-problems. However, using the heuristic solver, the present system cuts down the computation time by a large fraction. For example, dataset patent_example takes at least 1 hour to complete with the exact solver, while with the heuristic solver it takes approximately 20 seconds.
Experiments were also conducted with additional entity resolution benchmark datasets. Specifically, on the csv_example dataset (which is available in Dedupe and akin to patent_example), the F-MWSP formulation achieves a higher F1 score of 95.2%, against 94.4% for hierarchical clustering, the default in Dedupe.
The functionality provided by the present disclosure could be provided by computer software code 106, which could be embodied as computer-readable program code stored on the storage device 104 and executed by the CPU 112 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 108 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 102 to communicate via the network. The CPU 112 could include any suitable single-core or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the computer software code 106 (e.g., Intel processor). The random access memory 114 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure.
This application claims priority to U.S. Provisional Patent Application Ser. No. 62/898,681 filed on Sep. 11, 2019, the entire disclosure of which is hereby expressly incorporated by reference.