The systems, apparatuses, and methods described herein generally relate to machine learning techniques, and, in particular, to predictive analytics solutions using distributed data.
Typically, data sets used in predictive analytics (PA) solutions are represented as a collection of instances, where each instance stores the values of several attributes/features. Most existing predictive analytics tools (e.g., those using knowledge discovery/data mining/predictive analytics techniques) assume that all the data must be collected on a single host machine and represented by a homogeneous data and metadata structure.
In computing, extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system that represents the data differently from the source(s). The term comes from the three basic steps needed: extracting (selecting and exporting) data from the source, transforming the way the data is represented into the form expected by the destination, and loading (writing or importing) the transformed data into the destination system.
As we experience exponential growth in data, this assumption requires the definition and implementation of complex ETL processes, and in many siloed data-collecting scenarios it is technically infeasible and/or cost-prohibitive. Data silos were created to address specific business objectives, and as such, most enterprise data warehousing systems are challenged by the inability to aggregate data to support predictive analytics-based decision-making.
The distributed nature of data exhibits two types of data fragmentation (see
In addition, almost all predictive analytics algorithms require the data sets to be stored entirely in main memory. If the memory requirements exceed the available main memory, the algorithm may fail to complete, or may complete only after an excessive runtime. However, with data fragments as units of distribution, the analysis task can be divided into several sub-tasks that operate together in parallel. The distributed data analysis approach therefore makes better use of the available networked computing infrastructure.
Following the above observations, there have been different, mostly academic research and development-oriented efforts (as such solutions are not available in open source repositories) directed towards data analysis from distributed data sources. The problem with most of these efforts is that although they allow the data sources to be distributed over a network of data silos, they assume that the distributed data of common entities is defined over the same set of features. In other words, they assume that the data is partitioned horizontally (
The Distributed DensiCube modeler and scorer extend the same predictive analytics algorithms that have already been implemented (i.e., Bottomline's DensiCube solution, as partially described in U.S. Pat. No. 9,489,627, issued to Jerzy Bala on Nov. 8, 2016, said patent incorporated herein by reference in its entirety) to enable their execution in distributed data siloed environments. The algorithm described in U.S. Pat. No. 9,489,627, when distributed, is one possible embodiment of the inventions herein. Other machine learning algorithms could also be used.
The immediate benefits of the Distributed DensiCube include:
The Distributed DensiCube approach represents a paradigm shift of moving from the currently predominant Data Centric approaches to predictive analytics, i.e., approaches that transform, integrate, and push data from distributed silos to predictive analytics agents, to the future Decision Centric (Predictive Analytics Bot Agent-based) approaches, i.e., approaches that push predictive analytics agents to the data locations and, by collaborating, support decision-making in the distributed data environments.
Collaborating Predictive Analytics Bot Agents can facilitate numerous opportunities for enterprise data warehousing to provide faster, more predictive/prescriptive, and time and cost-saving decision-making solutions for their customers.
An example of the use of this is in banking applications, where each branch has its own database 610, 620, 630 of customers 611-619. For privacy, security, and performance reasons, the data is kept in the branches, but the bank needs to use the data from each branch for its machine learning algorithms. The predictive analytics data needs to be aggregated into a model without transferring the data to a central location.
Similarly, when opening a new account at a bank, machine learning models need to be built for predictive analytics. The data for the customer 802 may be in the branch database for the customer name and address, the customer's credit history may be in a separate database with a credit bureau (such as Equifax, Experian, and TransUnion) 803, and the customer's real estate holdings and mortgages may be in a third database at the registry of deeds 804 (see
There is a need in the industry for the building of machine learning models using distributed data without moving the data.
A distributed method for creating a machine learning rule set is described herein. The method is made up of the steps of (1) preparing, on a computer, a set of data identifiers to identify the data elements for training the machine learning rule set; (2) sending the set of data identifiers to a plurality of data silos; (3) executing, on each data silo, a machine learning algorithm using the data elements and the data identifiers on the data silo to derive a silo-specific rule set; (4) calculating, on each data silo, a quality control metric on the silo-specific rule set; (5) sending the quality control metric from each data silo to the computer; and (6) combining, on the computer, the quality control metrics from each data silo into a combined quality control metric.
In some embodiments, the quality control metric is an F-Score. The combined quality control metric could use a weighted algorithm. The data silos could be made up of a special-purpose processor and a special-purpose storage facility.
In some embodiments, the method also includes sending the silo-specific rule sets to the computer from at least one of the plurality of data silos. And the method could further include sending a plurality of silo-specific rule sets and quality control metrics associated with the silo-specific rule sets, from the data silos to the computer. Yet in other embodiments, the silo-specific rule sets are not returned to the computer. In some cases, a set of training results are sent with the identifiers to the plurality of data silos from the computer. The machine learning algorithm could create a test rule by adding a condition, calculating a test quality metric, and saving the test rule and test quality metric if the quality metric is better than previously saved test quality metrics. In some cases, the condition could be a range locating clusters of data.
A distributed system for creating a machine learning rule set is also described herein. The system is made up of a computer, a network, and a plurality of data silos. The computer executes software to prepare a set of data identifiers to identify data elements in a plurality of data silos. The network is connected to the computer and the data silos and sends data between them. The plurality of data silos each independently executes machine learning software to create a silo-specific rule set based on the data identifiers and silo-specific data elements, and calculate silo-specific quality control metrics for the silo-specific rule set, and the data silos return the silo-specific quality control metrics to the computer. The computer executes software to combine the quality control metrics from each data silo into a combined quality control metric.
A distributed method for creating a machine learning rule set is also described here, where the method is made up of the following steps. First of all, preparing, on a computer, a set of data identifiers to identify the data elements representing similar events for training the machine learning rule set. Next, sending the set of data identifiers to a plurality of data silos. Then, receiving a quality control metric from each data silo, where the quality control metric from each data silo represents the quality control metric calculated using a silo-specific rule set that was derived from a machine learning algorithm using the data elements and the data identifiers on the data silo. Finally, combining the quality control metrics from each data silo into a combined quality control metric.
In addition, non-transitory computer-readable media is described that is programmed to prepare, on a computer, a set of data identifiers to identify the data elements representing similar events for training the machine learning rule set. The media is further programmed to send the set of data identifiers to a plurality of data silos, and receive a quality control metric from each data silo. The program also combines the quality control metrics from each data silo into a combined quality control metric. The quality control metric from each data silo represents the quality control metric calculated using a silo-specific rule set that was derived from a machine learning algorithm using the data elements and the data identifiers on the data silo.
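To make the data flow concrete, the following is a minimal sketch of the method summarized above. The class and function names (DataSilo, SiloResult, combine) and the example-count weighting are illustrative assumptions; the method does not prescribe any particular identifiers or weighting scheme.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class SiloResult:
    f_score: float     # quality control metric computed inside the silo
    n_examples: int    # optional weight used when combining metrics


class DataSilo:
    """One data silo; its raw rows never leave this object."""

    def __init__(self, rows: Dict[str, dict],
                 learn_and_score: Callable[[Sequence[dict]], float]):
        self._rows = rows                        # keyed by a shared identifier
        self._learn_and_score = learn_and_score  # local ML algorithm + quality metric

    def train(self, data_ids: Sequence[str]) -> SiloResult:
        # Slice the local data to the identified training examples and learn locally.
        selected = [self._rows[i] for i in data_ids if i in self._rows]
        return SiloResult(self._learn_and_score(selected), len(selected))


def combine(results: Sequence[SiloResult]) -> float:
    """Weighted combination of the per-silo quality control metrics."""
    total = sum(r.n_examples for r in results)
    return sum(r.f_score * r.n_examples for r in results) / total if total else 0.0


# Usage: the central computer only ever sees identifiers and metrics.
silo = DataSilo({"tx1": {"amount": 10, "label": 1}}, lambda rows: 0.8)
print(combine([silo.train(["tx1", "tx2"])]))
```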
In some aspects, the techniques described herein relate to a machine learning apparatus including: a first data storage device including a first distributed data set; a first network connector connected to a network, the first network connector in communications with a second network connector on a second data storage device on a machine learning server, the second data storage device including a second distributed data set; a model orchestrator, stored in the first data storage device and executing on the machine learning apparatus, the model orchestrator programmed to publish a set of data identifiers including data elements and data features, and programmed to send the set of the data identifiers through the first network connector to the second network connector to a second prediction manager executing on the machine learning server; a first prediction manager connected to the first data storage device programmed to receive the set of the data identifiers from the model orchestrator and to calculate a first quality control metric and a first rule set using a first machine learning algorithm on the first distributed data set; a prediction orchestrator programmed to receive the first quality control metric and the first rule set from the first prediction manager and to receive from the second prediction manager a second quality control metric and a second rule set determined from the second distributed data set; and the prediction orchestrator further programmed to combine the first rule set and the second rule set into a common rule set and to combine the first quality control metric and the second quality control metric into a combined quality control metric.
In some aspects, the techniques described herein relate to a machine learning apparatus further including a data set template, stored in the first data storage device, that contains a definition of the first distributed data set and the second distributed data set.
In some aspects, the techniques described herein relate to a machine learning apparatus wherein the model orchestrator publishes the data set template.
In some aspects, the techniques described herein relate to a machine learning apparatus wherein the combined quality control metric uses a weighted algorithm.
In some aspects, the techniques described herein relate to a machine learning apparatus wherein the combined quality control metric is an F-score.
In some aspects, the techniques described herein relate to a machine learning apparatus wherein the network is the Internet.
In some aspects, the techniques described herein relate to a machine learning apparatus wherein the network is a local area network.
In some aspects, the techniques described herein relate to a machine learning apparatus wherein the first machine learning algorithm creates a test rule by adding a condition, calculating a test quality metric, and saving the test rule and the test quality metric if the test quality metric is better than previously saved test quality metrics.
In some aspects, the techniques described herein relate to a machine learning apparatus wherein the condition is a range locating clusters of data.
In some aspects, the techniques described herein relate to a machine learning apparatus wherein the second distributed data set is kept private from the machine learning apparatus.
In some aspects, the techniques described herein relate to a machine learning method including: connecting a machine learning apparatus, including a first distributed data set stored on a first data storage device with a second data storage device on a machine learning server, the second data storage device including a second distributed data set; publishing, by a model orchestrator on the machine learning apparatus, a set of data identifiers including data elements and data features; sending, by the model orchestrator, the set of the data identifiers to a second prediction manager on the machine learning server over a network; receiving, by a first prediction manager on the machine learning apparatus, the set of the data identifiers from the model orchestrator; calculating, by the first prediction manager, a first quality control metric and a first rule set using a first machine learning algorithm on the first distributed data set; receiving, by a prediction orchestrator, the first quality control metric and the first rule set from the first prediction manager; receiving, by the prediction orchestrator from the second prediction manager, a second quality control metric and a second rule set as determined from the second distributed data set; combining, by the prediction orchestrator the first rule set and the second rule set into a common rule set; and combining, by the prediction orchestrator, the first quality control metric and the second quality control metric into a combined quality control metric.
In some aspects, the techniques described herein relate to a machine learning method further including creating a data set template, stored in the first data storage device, that contains a definition of the first distributed data set and the second distributed data set.
In some aspects, the techniques described herein relate to a machine learning method further including publishing, by the model orchestrator, the data set template.
In some aspects, the techniques described herein relate to a machine learning method wherein the combined quality control metric uses a weighted algorithm.
In some aspects, the techniques described herein relate to a machine learning method wherein the combined quality control metric is an F-score.
In some aspects, the techniques described herein relate to a machine learning method wherein the network is the Internet.
In some aspects, the techniques described herein relate to a machine learning method wherein the network is a local area network.
In some aspects, the techniques described herein relate to a machine learning method further including creating, by the first machine learning algorithm, a test rule by adding a condition, calculating a test quality metric, and saving the test rule and the test quality metric if the test quality metric is better than previously saved test quality metrics.
In some aspects, the techniques described herein relate to a machine learning method wherein the condition is a range locating clusters of data.
In some aspects, the techniques described herein relate to a machine learning method wherein the second distributed data set is kept private from the machine learning apparatus.
The following description outlines several possible embodiments to create models using distributed data. The Distributed DensiCube modeler and scorer described below extend the predictive analytic algorithms described in U.S. Pat. No. 9,489,627 to enable their execution in distributed data environments and to extend them into quality analytics. The rule learning algorithm for DensiCube is briefly described below. But the DensiCube machine learning algorithm is only one embodiment of the inventions herein. Other machine learning algorithms could also be used.
The rule learning algorithm induces a set of rules. A rule itself is a conjunction of conditions, each for one attribute. A condition is a relational expression in the form:
A=V,
where A is an attribute and V is a nominal value for a symbolic attribute or an interval for a numeric attribute. The rule induction algorithm allows for two important learning parameters 102: minimum recall and minimum precision. More specifically, rules generated by the algorithm must satisfy the minimum recall and minimum precision requirements 105 as set by these parameters 102. The algorithm repeats the process of learning a rule 103 for the target class and removing all target class examples covered by the rule 104 until no rule can be generated to satisfy the minimum recall and minimum precision requirements 105 (
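As a sketch of this covering loop, the Python below assumes a hypothetical learn_one_rule callable that returns the best single rule (or None) for the remaining target-class examples; the Rule fields are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    covers: Callable[[dict], bool]   # membership test for an example
    recall: float
    precision: float


def induce_rule_set(positives: List[dict], negatives: List[dict],
                    min_recall: float, min_precision: float,
                    learn_one_rule: Callable[[List[dict], List[dict]], Optional[Rule]]) -> List[Rule]:
    rules, remaining = [], list(positives)
    while remaining:
        rule = learn_one_rule(remaining, negatives)   # step 103: learn one rule for the target class
        if rule is None or rule.recall < min_recall or rule.precision < min_precision:
            break                                     # stop when requirements 105 cannot be met
        rules.append(rule)
        remaining = [p for p in remaining if not rule.covers(p)]  # step 104: drop covered target examples
    return rules
```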
In learning a rule, as seen in
Looking at 211 and 212, the rule 212 covers all of the positive and negative values, and rule 211 is empty. This rule set is then scored and compared to the base rule 201. The best rule is stored.
Next, the algorithm increments the x-axis split between the rules, creating rules 231 and 232. The rules are scored and compared to the previous best rule.
The process is repeated until all but one increment on the x-axis is left. These rules 241, 242 are then scored, compared, and stored if the score is better.
Once the x-axis has been searched, the best rules are then split on the y-axis (for example, 251,252) to find the best overall rule. This process may be repeated for as many axes as found in the data.
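A simplified sketch of this split search on a single numeric attribute follows; score_rule is a hypothetical stand-in for the F-measure evaluation, and the actual algorithm repeats this search over every attribute/axis, keeping the best rule found so far.

```python
def best_split_on_attribute(values, labels, score_rule):
    """Try every split point along one axis and keep the best-scoring side."""
    best_split, best_score = None, float("-inf")
    for split in sorted(set(values))[:-1]:              # each increment along the axis
        low  = [(v, y) for v, y in zip(values, labels) if v <= split]
        high = [(v, y) for v, y in zip(values, labels) if v > split]
        for side in (low, high):                        # e.g., candidate rules 231 and 232
            s = score_rule(side)
            if s > best_score:
                best_split, best_score = split, s       # store the best rule so far
    return best_split, best_score


# Example: score by the fraction of positives covered (a toy stand-in for F-measure).
print(best_split_on_attribute([1, 2, 3, 4], [0, 0, 1, 1],
                              lambda side: sum(y for _, y in side) / len(side) if side else 0.0))
```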
In the Distributed DensiCube algorithm, the functions shown in
In the Distributed DensiCube algorithm, the entire process described in
Looking at
Every rule induction algorithm uses a metric to evaluate or rank the rules that it generates. Most rule induction algorithms use accuracy as the metric. However, accuracy is not a good metric for imbalanced data sets. The algorithm uses an F-measure as the evaluation metric. It selects the rule with the largest F-measure score. F-measure is widely used in information retrieval and in some machine learning algorithms. The two components of F-measure are recall and precision. The recall of a target class rule is the ratio of the number of target class examples covered by the rule to the total number of target class examples. The precision of a target class (i.e., misstatement class) rule is the ratio of the number of target class examples covered by the rule to the total number of examples (from both the target and non-target classes) covered by that rule. F-measure of a rule r is defined as:
F-measure(r) = ((1 + β²) × precision(r) × recall(r)) / (β² × precision(r) + recall(r)),
where β is the weight. When β is set to 1, recall and precision are weighted equally. F-measure favors recall with β>1 and favors precision with β<1. F-measure can be used to compare the performances of two different models/rules. A model/rule with a larger F-measure is better than a model/rule with a smaller F-measure.
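For reference, a direct computation of the F-measure from precision and recall, assuming the standard Fβ form consistent with the description above:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta > 1 favors recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# A rule with precision 0.5 and recall 0.8 gets F1 ≈ 0.615.
print(f_measure(0.5, 0.8))
```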
The algorithms incorporate a method, called prototype generation, to facilitate ranking with rules. For each rule generated by the rule learning algorithm, two prototypes are created. In generating prototypes, the software ignores symbolic conditions, because examples covered by a rule share the same symbolic values. Given a rule R with m numeric conditions: AR1=VR1∧AR2=VR2∧. . . ∧ARm=VRm, where ARi is a numeric attribute and VRi is a range of numeric values, the positive prototype of R is P(R)=(pR1, pR2, . . . , pRm) and the negative prototype of R is N(R)=(nR1, nR2, . . . , nRm), where both pRi ϵ VRi and nRi ϵ VRi. pRi and nRi are computed using the following formulas:
pRi = (1/|R(POS)|) × Σe∈R(POS) eRi and nRi = (1/|R(NEG)|) × Σe∈R(NEG) eRi,
where R(POS) and R(NEG) are the sets of positive and negative examples covered by R respectively, e=(eR1, eR2, . . . , eRm) is an example, and eRi ϵ VRi for i=1, . . . , m, because e is covered by R.
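A small sketch of prototype generation, following the averaging formulas above (the names and dictionary-based layout are illustrative, and the symbolic attributes are assumed to have been dropped already):

```python
from statistics import mean
from typing import Dict, List, Tuple


def prototypes(covered_pos: List[Dict[str, float]],
               covered_neg: List[Dict[str, float]],
               numeric_attrs: List[str]) -> Tuple[Dict[str, float], Dict[str, float]]:
    # Each prototype coordinate is the average of the covered examples on that attribute.
    positive = {a: mean(e[a] for e in covered_pos) for a in numeric_attrs}
    negative = {a: mean(e[a] for e in covered_neg) for a in numeric_attrs}
    return positive, negative


pos, neg = prototypes([{"x": 1.0}, {"x": 3.0}], [{"x": 10.0}], ["x"])
print(pos, neg)   # {'x': 2.0} {'x': 10.0}
```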
Given a positive prototype P(R)=(pR1, pR2, . . . , pRm) and a negative prototype N(R)=(nR1, nR2, . . . , nRm) of rule R, the score of an example e=(eR1, eR2, . . . , eRm) is 0 if e is not covered by R. Otherwise, e receives a score between 0 and 1 computed using the following formula:
score(e,R) = ½ (1 + (Σi=1..m wRi × (|eRi − nRi| − |eRi − pRi|) / |pRi − nRi|) / (Σi=1..m wRi)),
where wRi is the weight of the Rith attribute of R. The value of
(|eRi − nRi| − |eRi − pRi|) / |pRi − nRi|
is between −1 and 1. When eRi>nRi>pRi or pRi>nRi>eRi, it is −1. When eRi>pRi>nRi or nRi>pRi>eRi, it is 1. When eRi is closer to nRi than pRi, it takes a value between −1 and 0. When eRi is closer to pRi than nRi, it takes a value between 0 and 1. The value of score(e, R) is thereby normalized to the range of 0 and 1. If pRi=nRi, then (|eRi − nRi| − |eRi − pRi|) / |pRi − nRi| is set to 0.
wRi is computed using the following formula:
wRi = |pRi − nRi| / (maxRi − minRi),
where maxRi and minRi are the maximum and minimum values of the Rith attribute of R, respectively. A large difference between pRi and nRi implies that the values of positive examples are very different from the values of negative examples on the Rith attribute, so the attribute should distinguish positive examples from negative ones well.
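Putting these pieces together, the following hedged sketch scores an example against a rule's prototypes using the per-attribute quantity and weights described above; the names and dictionary-based data layout are illustrative.

```python
from typing import Dict, Tuple


def attribute_weight(p: float, n: float, attr_max: float, attr_min: float) -> float:
    # Weight grows with the separation of the two prototypes on this attribute.
    return abs(p - n) / (attr_max - attr_min) if attr_max > attr_min else 0.0


def score_example(e: Dict[str, float],
                  pos: Dict[str, float], neg: Dict[str, float],
                  ranges: Dict[str, Tuple[float, float]]) -> float:
    """ranges maps attribute -> (maxRi, minRi); returns a score in [0, 1]."""
    num = den = 0.0
    for a, p in pos.items():
        n = neg[a]
        w = attribute_weight(p, n, *ranges[a])
        d = 0.0 if p == n else (abs(e[a] - n) - abs(e[a] - p)) / abs(p - n)
        num += w * d                # d is +1 near the positive prototype, -1 near the negative one
        den += w
    return 0.5 * (1.0 + num / den) if den else 0.0


print(score_example({"x": 2.5}, {"x": 2.0}, {"x": 10.0}, {"x": (10.0, 1.0)}))  # 0.9375
```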
A rule induction algorithm usually generates a set of overlapped rules. Two methods, Max and Probabilistic Sum, for combining example scores of multiple rules are used by the software. Both methods have been used in rule-based expert systems. The max approach simply takes the largest score of all rules. Given an example e and a set of n rules R={R1, . . . , Rn,}, the combined score of e using Max is computed as follows:
score(e,R)=maxi=1n{Precision (Ri)×score(e,R)},
where precision(Ri) is the precision of Ri. There are two ways to determine score(e,Ri) for a hybrid rule. The first way returns the score of e received from rule Ri for all e's. The second way returns the score of e received from Ri only if the score is larger than or equal to the threshold of Ri; otherwise, the score is 0. For a normal rule, score(e,Ri) is simply the score of e received from Ri.
For the probabilistic sum method, the formula can be defined recursively as follows.
score(e,{R1})=score(e,R1)
score(e,{R1,R2})=score(e,R1)+score(e,R2)−score(e,R1)×score(e,R2)
score(e,{R1, . . . , Rn})=score(e,{R1, . . . , Rn−1})+score(e,Rn)−score(e,{R1, . . . , Rn−1})×score(e,Rn)
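The two combination methods can be sketched as follows, where each rule is represented, for a fixed example e, by a (precision, score) pair; the function names are illustrative.

```python
from functools import reduce
from typing import List, Tuple


def combine_max(rules: List[Tuple[float, float]]) -> float:
    # Max: the largest precision-weighted score over all rules.
    return max((prec * s for prec, s in rules), default=0.0)


def combine_prob_sum(rules: List[Tuple[float, float]]) -> float:
    # Probabilistic sum: s1 + s2 - s1*s2, applied recursively across the rules.
    return reduce(lambda acc, r: acc + r[1] - acc * r[1], rules, 0.0)


print(combine_max([(0.9, 0.5), (0.7, 0.8)]))       # ≈ 0.56
print(combine_prob_sum([(0.9, 0.5), (0.7, 0.8)]))  # 0.5 + 0.8 - 0.4 = 0.9
```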
Turning to
By allowing for distributed execution, the Distributed DensiCube algorithm allows for a number of important benefits. First of all, the privacy of the data assets in the model generation and prediction modes of operation is preserved by keeping the data in its original location and limiting access to the specific data. Second, the cost of implementing complex ETL processes and data warehousing, in general, is reduced by eliminating the costs of transmission to and storage in a central location. Third, these inventions increase performance by allowing parallel execution of the DensiCube algorithm (i.e., executing the predictive analytics algorithms on distributed computing platforms). In addition, this distributed approach gives the Distributed DensiCube algorithm the capability to provide unsupervised learning (e.g., fraud detection from distributed data sources). Finally, it allows predictive analytics solutions to operate and react in real-time on a low-level transactional streaming data representation without requiring data aggregation.
The Distributed DensiCube approach represents a paradigm shift moving from the currently predominant Data Centric approaches to predictive analytics, i.e., approaches that transform, integrate, and push data from distributed silos to predictive analytics agents, to the future Decision Centric (predictive analytics bot agent-based) approaches, i.e., approaches that push predictive analytics agents to the data locations and by collaborating support decision-making in the distributed data environments.
Essentially, the distributed DensiCube algorithm operates the DensiCube algorithm on each server 503, 505, 507, analyzing the local data in the databases 504, 506, 508. The best rule or best set of rules 405 from each server 503, 505, 507 is then combined into the best overall rule. In some embodiments, several servers could work together to derive a best rule that is then combined with the results from another server.
Collaborating predictive analytics bot agents can facilitate numerous opportunities for enterprise data warehousing to provide faster, more predictive, more prescriptive, and time and cost-saving decision-making solutions for their customers.
The following sections describe the concept behind the Distributed DensiCube approach. As mentioned in the previous section, the Distributed DensiCube solution continues to use the same modeling algorithms as the current non-distributed predictive analytics solution (with modifications to the scoring algorithms to support privacy by preserving the data assets in silos).
The Distributed DensiCube operates on distributed entities at different logical and/or physical locations.
The distributed entity represents a unified virtual feature vector describing an event (e.g., financial transaction, customer campaign information). Feature subsets 704, 705 of this representation are registered/linked by a common identifier (e.g., transaction ID, Enrolment Code, Invoice ID, etc.) 707. Thus, the distributed data 701 represents a virtual table 706 of joined feature subsets 704, 705 linked by their common identifier 707 (see
In
As an example of the distributed DensiCube algorithm, see
The credit agency database 803 contains three fields, the ID(SSN), the Credit Score, and the Total Debt fields. The registry of deeds database 804 also has three fields in this example, the ID(SSN), a home ownership field, and a home value field. In our example, there are a number of reasons that the data in the credit agency 803 needs to be kept separate from the registry data 804, and both of those datasets need to be kept separate from the bank data 802. As a result, the DensiCube algorithm is run three times on each of the databases 802, 803, 804. In another embodiment, two of the servers could be combined, with the algorithm running on one of the servers. This embodiment is seen in
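To illustrate the virtual joined representation implied by this example, the toy sketch below links the three silos' feature subsets by a shared identifier (here, a made-up SSN); in the actual Distributed DensiCube approach this joined row is never materialized centrally, since each silo learns on its own columns.

```python
# Three silos hold different feature subsets of the same entity, linked only by SSN.
bank     = {"111-22-3333": {"name": "A. Smith", "balance": 5200.0}}
credit   = {"111-22-3333": {"credit_score": 710, "total_debt": 18000.0}}
registry = {"111-22-3333": {"homeowner": True, "home_value": 350000.0}}


def virtual_row(entity_id: str, *silos: dict) -> dict:
    """Conceptual joined feature vector; shown only for illustration."""
    row = {"id": entity_id}
    for silo in silos:
        row.update(silo.get(entity_id, {}))
    return row


print(virtual_row("111-22-3333", bank, credit, registry))
```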
As seen in
All the above components collaborate to generate models and use them for scoring, and at the same time, preserve the privacy of the data silos 1002. There are three levels of privacy that are possible in this set of inventions. The first level could preserve the data in the silos, providing privacy only for the individual data records. A second embodiment preserves the attribute values of the data in the silos, preventing the model from knowing those values; the second embodiment may also hide the features (names of attributes) by instead returning a pseudonym for the features. In the third embodiment, the features themselves are kept hidden in the silos. For example, in the first level, the fact that the range of the credit scores is between 575 and 829 is reported back to the modeler 1003, but the individual records are kept hidden. In the second embodiment, the modeler 1003 is told that credit scores are used, but the range is kept hidden on the data silo 1002. In the third embodiment, the credit score feature itself is kept hidden from the modeler 1003. In this third embodiment, the model itself is distributed on each data silo, and the core modeler 1003 has no knowledge of the rules used on each data silo 1002.
The collaboration between distributed components results in a set of rules generated through a rule-based induction algorithm. The DensiCube induction algorithm, in an iterative fashion, determines the data partitions based on the feature rule's syntactic representation (e.g., if feature F>20 and F<25). It dichotomizes (splits) the data into partitions. Each partition is evaluated by computing statistical quality measures. Specifically, DensiCube uses an F-score measure to compute the predictive quality of a specific partition. In binary classification, the F-score measure is a measure of a test's accuracy and is defined as the weighted harmonic mean of the test's precision and recall. Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.
Specifically, the following steps are executed by Distributed DensiCube (a simplified sketch of this loop follows the list):
1) The modeler 1003 invokes feature managers 1004 that subsequently start data partitioning based on the local set of features at the data silo 1002. This process is called specialization.
2) Feature managers 1004 push their computed partitions (i.e., using the data identifier as the partition identifier) and their corresponding evaluation measures (e.g., F-score) to modelers 1003.
3) Each feature model manager 1008 compares evaluation measures of the sent partitions and selects the top N best partitions (i.e., it establishes the global beam search for the top-performing partitions and their combinations).
4) Subsequently, the modeler 1003 proceeds to the process of generating partition combinations. The first iteration of such combinations syntactically represents two-conditional rules (i.e., a partition is represented by a joint of lower and upper bounds of two features). Once this process is completed, the identifiers of the two-conditional rules are sent to the feature managers 1004. Once received, the feature managers 1004 evaluate the new partitions identified by the identifiers by executing the next specialization iteration.
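The following is a simplified, hedged sketch of steps 1-4 above; the feature-manager object and its specialize method are hypothetical, and only partition identifiers and F-scores cross the network, never raw rows.

```python
from typing import Dict, List


def modeling_round(feature_managers: List, beam_width: int, iterations: int) -> List[str]:
    beam: List[str] = [""]                      # start from the empty (unconditioned) partition
    for _ in range(iterations):
        candidates: Dict[str, float] = {}
        for fm in feature_managers:
            # Steps 1-2: each feature manager specializes the current partitions locally
            # and pushes back (partition identifier, F-score) pairs.
            for pid, score in fm.specialize(beam):
                candidates[pid] = max(score, candidates.get(pid, 0.0))
        if not candidates:
            break
        # Step 3: global beam search keeps the top-N best partitions.
        beam = sorted(candidates, key=candidates.get, reverse=True)[:beam_width]
        # Step 4: the surviving identifiers are sent back for the next specialization round.
    return beam
```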
A data manager 1012 is a logical construct which is comprised of a data orchestrator 1005 and one or more feature data managers 1006, which cooperate to manage data sets. Data sets can be used to create models and/or to make predictions using models. A data orchestrator 1005 is a component which provides services to maintain Data Sets, is identified by its host domain and port, and has a name which is not necessarily unique. A feature data manager 1006 is a component which provides services to maintain Feature Data Sets 1203, is identified by its host domain and port, and has a name which is not necessarily unique. A data set lives in a data orchestrator 1005, has a unique ID within the data orchestrator 1005, consists of a junction of Feature Data Sets 1203, joins Feature Data Sets 1203 on specified unique features, and is virtual tabular data (see
A model manager 1013 is a logical construct which is comprised of a model orchestrator 1007 and one or more feature model managers 1008, which cooperate to generate models.
A prediction manager 1014 is a logical construct which is comprised of a prediction orchestrator 1010 and one or more feature prediction managers 1011, which cooperate to create scores and statistics (a.k.a. predictions).
The distributed scoring process is accomplished in two steps. First, partial scores are calculated on each feature manager 1004 on each server. Then, complete scores are calculated from the partial scores.
The combined scores are the sum of the scores from each server divided by the sum of the weights from each server, multiplied by two:
In this formula, the score for servers A and B are similar to the DensiCube scoring described above.
The weights are also determined for each location, as above.
With the combined score, we have a metric to show the validity of the selected model.
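A hedged sketch of this two-step combination follows; each server contributes only a partial score and a partial weight, and the factor of two mirrors the verbal description above (the exact normalization is given in the referenced figure, which is not reproduced here, so this is an assumption).

```python
from typing import List, Tuple


def combine_partial_scores(partials: List[Tuple[float, float]]) -> float:
    """partials = [(partial_score, partial_weight), ...] from servers A, B, ..."""
    total_score = sum(score for score, _ in partials)
    total_weight = sum(weight for _, weight in partials)
    # Sum of the per-server scores divided by the sum of the per-server weights, times two.
    return 2.0 * total_score / total_weight if total_weight else 0.0


print(combine_partial_scores([(0.4, 1.1), (0.3, 0.9)]))  # 0.7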
At the initialization of the machine learning model generation process, each feature manager 1004 is set up on the local servers 1002. Each feature manager 1004 must be uniquely named (e.g., within the subnet where it lives). The port number where the feature manager 1004 can be reached needs to be defined. Access control needs to be configured, with a certificate for the feature manager 1004 installed and the public key for each modeler 1003 and feature prediction manager 1011 installed to allow access to this feature manager 1004. Each local feature manager 1004 needs to broadcast the name, host, port, and public key of the feature manager 1004. In some embodiments, the feature manager 1004 needs to listen to other broadcasts to verify uniqueness.
Next, the data sources are defined. As seen in
Each Data Source shall be described by a name for the data source and a plurality of columns, where each column has a name, a data type, and a uniqueness field. Data Sources can be used by feature model managers 1008 or feature prediction managers 1011 or both. Data Sources are probably defined by calls from a modeler 1003.
The next step involves defining the Data Set Templates. A Data Set Template is a specification of how to join Data Sources defined within a feature data manager 1006. Each Data Set Template must be uniquely identified by name within a feature data manager 1006. A Data Set Template is a definition of Columns without regard to the Rows in each Data Source. For example, a Data Set Template could be represented by a SQL select statement with columns and join conditions, but without a where clause to limit rows. Data Set Templates can be used by feature model managers 1008 or feature prediction managers 1011 or both. Data Set Templates are probably defined by calls from a feature model manager 1008.
Once the Data Set Templates are set up, the next step is to define the Data Sets. A Data Set is tabular data which is a subset of a data from the Data Sources defined within a feature data manager 1006. Each Data Set must be uniquely identified by name within a feature data manager 1006. A Data Set is defined by a Data Set Template to define the columns and a set of filters to define the rows. For example, the filter could be the where clause in a SQL statement. Data Sets can be used by modelers 1003 or feature prediction managers 1011 or both. Data Sets are probably defined by calls from a modeler 1003.
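As an illustration of the Data Set Template versus Data Set distinction, using the SQL analogy above (the table, column, and filter names are made up):

```python
# Data Set Template: columns and join conditions, with no row filter.
TEMPLATE = """
    SELECT a.ssn, a.balance, c.credit_score, c.total_debt
    FROM accounts a
    JOIN credit c ON c.ssn = a.ssn
"""


def data_set(template: str, row_filter: str) -> str:
    """A Data Set applies a row filter (a WHERE clause) to a Data Set Template."""
    return f"{template.rstrip()}\n    WHERE {row_filter}"


print(data_set(TEMPLATE, "a.opened_date >= '2019-01-01'"))
```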
In
In the setup of the model orchestrator 1007, each modeler 1003 should be uniquely named, at least within the subnet where it lives. However, in some embodiments, the uniqueness may not be enforceable. Next, the access control is configured by installing a certificate for the modeler 1003 and installing the public key for each feature manager 1004 containing pertinent data. The public key for each feature prediction manager 1011 is also installed, to which this modeler 1003 can publish.
Once set up, the model orchestrator 1007 establishes a connection to each feature model manager 1008.
Then the Model Data Set templates are defined. A Model Data Set Template is a conjunction of Data Set Templates from feature data managers 1006. Each Data Set Template must be uniquely named within the feature manager 1004. The Data Set Templates on feature data managers 1006 are defined, as are the join conditions. A join condition is an equality expression between unique columns on two Data Sets. For example <Feature Manager A>.<Data Set Template 1>.<Column a>==<Feature Manager B>.<Data Set Template 2>.<Column b>. Each data set participating in the model data set must be joined such that a singular virtual tabular data set is defined.
After the templates are defined, the model data sets themselves are defined. A Model Data Set is a conjunction of Data Sets from feature data managers 1006. The Model Data Set is a row filter applied to a Model Data Set Template. Each Data Set must be uniquely named within a Model Data Set Template. Then the data sets on the feature data managers 1006 are defined. This filters the rows.
Next, the Modeling Parameters are defined. Modeling Parameters define how a Model is created on any Model Data Set which is derived from a Model Data Set Template. Each Modeling Parameters definition must be unique within a Model Data Set Template.
Then, a model is created and published. A model is created by applying Modeling Parameters to a Model Data Set. Each Model must be uniquely identified by name within a Model Data Set. A Model can be published to a feature prediction manager 1011. Publishing will persist the Model artifacts in the feature model managers 1008 and feature prediction managers 1011. Following are some of the artifacts which will be persisted to the feature model manager 1008 and/or the feature prediction manager 1011: data set templates, model data set templates, and the model.
The prediction orchestrator 1010 setup begins with the configuration of the access control. This is done by installing a certificate for the feature prediction manager 1011 and installing the public key for each modeler 1003 allowed to access this prediction orchestrator 1010. The public key for each feature manager 1004 containing pertinent data is also installed. Each prediction orchestrator 1010 should be uniquely named, but in some embodiments, this may not be enforced.
Next, a connection to each feature prediction manager 1011 is established and to a model orchestrator 1007. The model orchestrator 1007 will publish the Model Data Set Template and Model to the prediction orchestrator 1010.
The scoring data sets are then defined. A Scoring Data Set is a conjunction of Data Sets from the feature data managers 1006. It is a row filter applied to a Model Data Set Template. Each Data Set must be uniquely named within a Model Data Set Template. The data sets on the feature data managers 1006 are defined (this filters the rows).
Then the Scoring Parameters are defined. Scoring Parameters define how Scores are calculated on any Score Data Set which is derived from a Model Data Set Template. Each Scoring Parameters definition must be unique within a Model Data Set Template.
Finally, a Scoring Data Set is defined. Partial Scores are calculated on each feature manager 1004 in the feature prediction manager 1011. See
Looking at
The feature managers 1004 on each of the data silos 1002 then initialize the site 1311, 1321, 1331. The data on the silo 1002 is then sliced, using the list of IDs and the features 1312, 1322, 1332 into a data set of interest, by the feature data manager 1006. The DensiCube algorithm 1313, 1323, 1333 is then run by the feature model manager 1008 on the data of interest, as seen in
The rules, in some embodiments, are then returned to the prediction orchestrator 1010 where they are combined into an overall rule 1304, as seen in FIG. 11A. Next, the F-Scores are combined 1305 by the prediction orchestrator 1010 into an overall F-Score for the generated rule using the formulas in
As noted above, the scoring algorithms are modified to support privacy preservation in the data silos.
The foregoing devices and operations, including their implementation, will be familiar to, and understood by, those having ordinary skill in the art.
The above description of the embodiments, alternative embodiments, and specific examples, are given by way of illustration and should not be viewed as limiting. Further, many changes and modifications within the scope of the present embodiments may be made without departing from the spirit thereof, and the present invention includes such changes and modifications.
BACKGROUND
This application is a continuation patent application from U.S. patent application Ser. No. 17/864,704, "Machine Learning Engine using a Distributed Predictive Analytics Data Set", filed Jul. 14, 2022, by Paul Green and Jerzy Bala, now U.S. Pat. No. 11,609,971, issued on Mar. 20, 2023, said application is incorporated herein by reference in its entirety. U.S. patent application Ser. No. 17/864,704 is a continuation patent application from U.S. patent application Ser. No. 16/355,985, "Distributed Predictive Analytics Data Set", filed Mar. 18, 2019, by Jerzy Bala and Paul Green, now U.S. Pat. No. 11,416,713, issued on Aug. 16, 2022, said application is incorporated herein by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
4575793 | Morel et al. | Mar 1986 | A |
5228122 | Cahn et al. | Jul 1993 | A |
5559961 | Blonder | Sep 1996 | A |
5600735 | Seybold | Feb 1997 | A |
5600835 | Garland et al. | Feb 1997 | A |
5634008 | Gaffaney et al. | May 1997 | A |
5644717 | Clark | Jul 1997 | A |
5790798 | Beckett et al. | Aug 1998 | A |
5845369 | Dunchock | Dec 1998 | A |
5912669 | Hsia | Jun 1999 | A |
5961592 | Hsia | Oct 1999 | A |
5970482 | Pham et al. | Oct 1999 | A |
6044401 | Harvey | Mar 2000 | A |
6192411 | Chan et al. | Feb 2001 | B1 |
6205416 | Butts et al. | Mar 2001 | B1 |
6256737 | Bianco et al. | Jul 2001 | B1 |
6523016 | Michalski | Feb 2003 | B1 |
6651099 | Dietz et al. | Nov 2003 | B1 |
6675164 | Kamath et al. | Jan 2004 | B2 |
6687693 | Cereghini | Feb 2004 | B2 |
6708163 | Kargupta et al. | Mar 2004 | B1 |
6801190 | Robinson et al. | Oct 2004 | B1 |
6845369 | Rodenburg | Jan 2005 | B1 |
7092941 | Campos | Aug 2006 | B1 |
7174462 | Pering et al. | Feb 2007 | B2 |
7308436 | Bala et al. | Dec 2007 | B2 |
7415509 | Kaltenmark et al. | Aug 2008 | B1 |
7730521 | Thesayi et al. | Jun 2010 | B1 |
7822598 | Carus et al. | Oct 2010 | B2 |
7831703 | Krelbaum et al. | Nov 2010 | B2 |
7860783 | Yang et al. | Dec 2010 | B2 |
7992202 | Won et al. | Aug 2011 | B2 |
8229875 | Roychowdhury | Jul 2012 | B2 |
8229876 | Roychowdhury | Jul 2012 | B2 |
8392975 | Raghunath | Mar 2013 | B1 |
8429745 | Casaburi et al. | Apr 2013 | B1 |
8433791 | Krelbaum et al. | Apr 2013 | B2 |
8515862 | Zhang et al. | Aug 2013 | B2 |
8638939 | Casey et al. | Jan 2014 | B1 |
8650624 | Griffin et al. | Feb 2014 | B2 |
8776213 | McLaughlin et al. | Jul 2014 | B2 |
8844059 | Manmohan | Sep 2014 | B1 |
8881005 | Al et al. | Nov 2014 | B2 |
9015036 | Karov et al. | Apr 2015 | B2 |
9489627 | Bala | Nov 2016 | B2 |
9529678 | Krelbaum et al. | Dec 2016 | B2 |
9537848 | McLaughlin et al. | Jan 2017 | B2 |
9607103 | Anderson | Mar 2017 | B2 |
9667609 | McLaughlin et al. | May 2017 | B2 |
9691085 | Scheidelman | Jun 2017 | B2 |
10037533 | Caldera | Jul 2018 | B2 |
10152680 | Myrick et al. | Dec 2018 | B1 |
10235356 | Amend et al. | Mar 2019 | B2 |
10242258 | Guo et al. | Mar 2019 | B2 |
10320800 | Guo et al. | Jun 2019 | B2 |
10402817 | Benkreira et al. | Sep 2019 | B1 |
10414197 | Jesurum | Sep 2019 | B2 |
10440015 | Pham et al. | Oct 2019 | B1 |
10467631 | Dhurandhar et al. | Nov 2019 | B2 |
10510083 | Vukich et al. | Dec 2019 | B1 |
10511605 | Ramberg et al. | Dec 2019 | B2 |
10523681 | Bulgakov et al. | Dec 2019 | B1 |
10552837 | Jia et al. | Feb 2020 | B2 |
10552841 | Dixit | Feb 2020 | B1 |
10607008 | Byrne et al. | Mar 2020 | B2 |
10607228 | Gai | Mar 2020 | B1 |
10621587 | Binns et al. | Apr 2020 | B2 |
10699075 | Amend et al. | Jun 2020 | B2 |
10824809 | Kutsch et al. | Nov 2020 | B2 |
11003999 | Gil | May 2021 | B1 |
11042555 | Kane et al. | Jun 2021 | B1 |
11194846 | Stenneth | Dec 2021 | B2 |
11321330 | Pandis | May 2022 | B1 |
11323513 | Vibhor | May 2022 | B1 |
20020019945 | Houston et al. | Feb 2002 | A1 |
20020056043 | Glass | May 2002 | A1 |
20020065938 | Jungck et al. | May 2002 | A1 |
20020080123 | Kennedy et al. | Jun 2002 | A1 |
20020099649 | Lee et al. | Jul 2002 | A1 |
20020163934 | Moore et al. | Nov 2002 | A1 |
20020194159 | Kamath | Dec 2002 | A1 |
20030041042 | Cohen | Feb 2003 | A1 |
20030083764 | Hong | May 2003 | A1 |
20030110394 | Sharp et al. | Jun 2003 | A1 |
20030135612 | Huntington et al. | Jul 2003 | A1 |
20030212629 | King | Nov 2003 | A1 |
20030233305 | Solomon | Dec 2003 | A1 |
20040034666 | Chen | Feb 2004 | A1 |
20040186882 | Ting | Sep 2004 | A1 |
20040193512 | Gobin et al. | Sep 2004 | A1 |
20050021650 | Gusler et al. | Jan 2005 | A1 |
20050154692 | Jacobsen et al. | Jan 2005 | A1 |
20050081158 | Hwang | Apr 2005 | A1 |
20050177483 | Napier et al. | Aug 2005 | A1 |
20060101048 | Mazzagatti et al. | May 2006 | A1 |
20060155751 | Geshwind et al. | Jul 2006 | A1 |
20060190310 | Gudia et al. | Aug 2006 | A1 |
20060212270 | Shiu et al. | Sep 2006 | A1 |
20070277224 | Osborn et al. | Nov 2007 | A1 |
20080028446 | Burgoyne | Jan 2008 | A1 |
20080104007 | Bala | May 2008 | A1 |
20090059793 | Greenberg | Mar 2009 | A1 |
20090094677 | Pietraszek et al. | Apr 2009 | A1 |
20090140838 | Newman et al. | Jun 2009 | A1 |
20090174667 | Kocienda et al. | Jul 2009 | A1 |
20090201257 | Saitoh et al. | Aug 2009 | A1 |
20090202153 | Cortopassi et al. | Aug 2009 | A1 |
20090307176 | Jeong et al. | Dec 2009 | A1 |
20090313693 | Rogers | Dec 2009 | A1 |
20100066540 | Theobald et al. | Mar 2010 | A1 |
20100130181 | Won | May 2010 | A1 |
20100169958 | Werner et al. | Jul 2010 | A1 |
20100225443 | Bayram et al. | Sep 2010 | A1 |
20110055907 | Narasimhan et al. | Mar 2011 | A1 |
20110070864 | Karam et al. | Mar 2011 | A1 |
20110082911 | Agnoni et al. | Apr 2011 | A1 |
20110145587 | Park | Jun 2011 | A1 |
20110251951 | Kolkowitz et al. | Oct 2011 | A1 |
20110298753 | Chuang et al. | Dec 2011 | A1 |
20120041683 | Vaske et al. | Feb 2012 | A1 |
20120124662 | Baca et al. | May 2012 | A1 |
20120127102 | Uenohara et al. | May 2012 | A1 |
20120151553 | Burgess et al. | Jun 2012 | A1 |
20130071816 | Singh et al. | Mar 2013 | A1 |
20130117246 | Cabaniols et al. | May 2013 | A1 |
20130231974 | Harris et al. | Sep 2013 | A1 |
20130339141 | Stibel et al. | Dec 2013 | A1 |
20140006347 | Qureshi et al. | Jan 2014 | A1 |
20140067656 | Cohen et al. | Mar 2014 | A1 |
20140149130 | Getchius | May 2014 | A1 |
20140366159 | Cohen | Dec 2014 | A1 |
20150039473 | Hu et al. | Feb 2015 | A1 |
20150220509 | Karov Zangvil et al. | Aug 2015 | A1 |
20150264573 | Giordano et al. | Sep 2015 | A1 |
20150348041 | Campbell et al. | Dec 2015 | A1 |
20160041984 | Kaneda et al. | Feb 2016 | A1 |
20160352759 | Zhai | Dec 2016 | A1 |
20170039219 | Acharya et al. | Feb 2017 | A1 |
20170154382 | McLaughlin et al. | Jun 2017 | A1 |
20170163664 | Nagalla et al. | Jun 2017 | A1 |
20170177743 | Bhattacharjee et al. | Jun 2017 | A1 |
20170300911 | Alnajem | Oct 2017 | A1 |
20180107944 | Lin et al. | Apr 2018 | A1 |
20180342328 | Chan | Nov 2018 | A1 |
20180349924 | Shah et al. | Dec 2018 | A1 |
20190197189 | Studnicka | Jun 2019 | A1 |
20190205977 | Way | Jul 2019 | A1 |
20190228411 | Hernandez-Ellsworth et al. | Jul 2019 | A1 |
20190347281 | Natterer | Nov 2019 | A1 |
20190349371 | Smith et al. | Nov 2019 | A1 |
20190373001 | Deeb et al. | Dec 2019 | A1 |
20190392487 | Duke | Dec 2019 | A1 |
20200019964 | Miller et al. | Jan 2020 | A1 |
20200117800 | Ramberg et al. | Apr 2020 | A1 |
20210049326 | Amend et al. | Feb 2021 | A1 |
20210224663 | Gil | Jul 2021 | A1 |
20220156526 | Chen | May 2022 | A1 |
Number | Date | Country |
---|---|---|
1211865 | Jun 2002 | EP |
1706960 | Oct 2006 | EP |
2653982 | Oct 2013 | EP |
2636149 | Oct 2016 | EP |
176551 | Sep 2012 | IL |
219405 | Mar 2007 | IN |
10-0723738 | May 2007 | KR |
201723907 | Jul 2017 | TW |
0125914 | Apr 2001 | WO |
0287124 | Oct 2002 | WO |
2002100039 | Dec 2002 | WO |
0373724 | Sep 2003 | WO |
2005067209 | Jul 2005 | WO |
2012061701 | May 2012 | WO |
2014145395 | Sep 2014 | WO |
2017096206 | Jun 2017 | WO |
2017209799 | Dec 2017 | WO |
Entry |
---|
“Distributed Mining of Classification Rules”, By Cho and Wuthrich, 2002 http://www.springerlink.com/(21nnasudlakyzciv54i5kxz0)/app/home/contribution.asp?referrer-parent&backto=issue, 1,6;journal,2,3,31;linkingpublicationresults, 1:105441, 1. |
Bansal, Nikhil, Avrim Blum, and Shuchi Chawla. “Correlation clustering.” Machine Learning 56.1-3 (2004): 89-113. |
Finley, Thomas, and Thorsten Joachims. “Supervised clustering with support vector machines.” Proceedings of the 22nd international conference on Machine learning, ACM, 2005. |
Meila et al., Comparing clusterings—an information based distance, Journal of Multivariate Analysis 98 (2007) 873-895. |
Kim, Jinkyu, et al., Collaborative Analytics for Data Silos, Jun. 2016, IEEE Xplore, pp. 743-754 (Year: 2016). |
Appaloosa Store, "String Similarity Algorithms Compared", Apr. 5, 2018, webpage downloaded on Oct. 20, 2020 from https://medium.com/@appaloosastore/string-similarity-algorithms-compared-3f7b4d12f0ff. |
Banon, Shay, “Geo Location and Search”, elastic blog post, Aug. 16, 2010, webpage found at https://www.elastic.co/blog/geo-location-and-search on Oct. 15, 2019. |
Bottomline Technologies, Bottomline Cyber Fraud & Risk Management:Secure Payments, marketing brochure. |
Brasetvik, Alex, “Elasticsearch from the Bottom up, Part 1”, Elastic, Sep. 16, 2013. Webpage found at https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up on Jun. 17, 2019. |
Co-pending U.S. Appl. No. 13/135,507, filed Jul. 7, 2011. |
Experian, “Fuzzy address searching”, webpage downloaded from https://www.edq.com/glossary/fuzzy-address-searching/ on Oct. 8, 2019. |
Fenz, Dustin, et al., “Efficient Similarity Search in Very Large String Sets”, conference paper, Jun. 2012. |
G. Kou, Y. Peng, Y. Shi, M. Wise, W. Xu, Discovering credit cardholders behavior by multiple criteria linear programming, Annals of Operations Research 135, (2005) 261-274. |
Haydn Shaughnessy, Solving the $190 billion Annual Fraud Problem: More on Jumio, Forbes, Mar. 24, 2011. |
IdentityMing, Accelerated Fintech Compliance and Powerful Online Fraud Prevention Tools, website found at https://identitymindglobal.com/momentum/ on Dec. 12, 2018. |
International Search Report and Written Opinion received for PCT Patent Application No. PCT/IL05/000027, dated Jun. 2, 2005, 8 pages. |
International Search Report and Written Opinion received for PCT Patent Application No. PCT/US17/13148, dated May 19, 2017, 11 pages. |
International Search Report for corresponding International Application No. PCT/US2016/064689 dated Feb. 22, 2017. |
Jeremy Olshan, How my bank tracked me to catch a thief, MarketWatch, Apr. 18, 2015. |
Mitchell, Stuart, et al., “pulp Documentation”, Release 1.4.6, Jan. 27, 2010. |
Postel et al.; “Telnet Protocol Specification” RFC 854; entered into the case on Apr. 18, 2013. |
RodOn, “location extraction with fuzzy matching capabilities”, Blog post on StackOverflow.com, Jul. 8, 2014, webpage downloaded from https://stackoverflow.com/questions/24622693/location-extraction-with-fuzzy-matching-capabilities on Oct. 8, 2019. |
Rosette Text Analytics, “An Overview of Fuzzy Name Matching Techniques”, Blog, Dec. 12, 2017, webpage downloaded from https://www.rosette.com/blog/overview-fuzzy-name-matching-techniques/ on Oct. 15, 2019. |
Samaneh Sorournejad, Zahra Zojaji, Reza Ebrahimi Atani, Amir Hassan Monadjemi, “A Survey of Credit Card Fraud Detection Techniques: Data and Technique Oriented Perspective”, 2016. |
Schulz, Klaus and Stoyan Mihov, “Fast String Correction with Levenshtein-Automata”, IJDAR (2002) 5: 67. https://doi.org/10.1007/s10032-002-0082-8. |
The Telnet Protocol Microsoft Knowledgebase; entered into the case on Apr. 18, 2013. |
Vogler, Raffael, "Comparison of String Distance Algorithms", Aug. 21, 2013, webpage downloaded on Oct. 20, 2020 from https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms. |
Wikil Kwak, Yong Shi, John J. Cheh, and Heeseok Lee, “Multiple Criteria Linear Programming Data Mining Approach: An Application for Bankruptcy Prediction”, : Data Mining and Knowledge Management, Chinese Academy of Sciences Symposium, 2004, LNAI 3327, pp. 164-173, 2004. |
Wikipedia, “Autoencoder”, web page downloaded from http://en.wikipedia.org/wiki/Autoencoder on Dec. 18, 2020. |
Wikipedia, “Damerau-Levenshtein distance”, webpage downloaded on Oct. 20, 2020 from https://en.wikipedia.org/wiki/Damerau-Levenshtein_distance. |
Written Opinion of the International Searching authority for corresponding International Application No. PCT/US2016/064689 dated Feb. 22, 2017. |
Number | Date | Country | |
---|---|---|---|
20230244758 A1 | Aug 2023 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 17864704 | Jul 2022 | US |
Child | 18123529 | US | |
Parent | 16355985 | Mar 2019 | US |
Child | 17864704 | US |