Aspects described herein relate generally to machine learning models, training of machine learning models, and configuring of machine learning models. Further aspects described herein may relate to implementing machine learning models based on training data, or other data, that is received from a plurality of data sources.
Implementing a machine learning model so that it is suitable for its intended purpose may be a time-consuming and difficult process. The time-consuming and difficult nature of implementing a machine learning model may be illustrated by the challenges in training, or otherwise configuring, a machine learning model as the model itself grows in size. For example, training a machine learning model of a particular size may use a volume of training data that is insufficient for training larger machine learning models. Indeed, as machine learning models grow increasingly large, the volume of training data sufficient for training these increasingly large machine learning models may grow exponentially. This increases the difficulty both in gathering an appropriately-sized training set for training a machine learning model and in meeting the demand for the computational power required to perform the training.
Moreover, the time-consuming and challenging nature of implementing a machine learning model may be illustrated by the challenges in processing training data, or other data, received from a plurality of data sources. For example, each data source may have its own procedure for how its training data is to be handled, and these procedures may be of variable complexity. Further, these procedures may require enforcement of data privacy, data security, and/or data confidentiality. In this way, ensuring these procedures are followed brings numerous challenges in training, or otherwise configuring, a machine learning model based on the training data received from a plurality of data sources. The above examples are only a few of the challenges that may illustrate the time-consuming and difficult process of implementing a machine learning model.
The following paragraphs present a simplified summary of various aspects described herein. This summary is not an extensive overview, and is not intended to identify key or critical elements or to delineate the scope of any claim. This summary merely presents some concepts in a simplified form as an introductory prelude to the more detailed description provided below.
Aspects described herein may address the above-mentioned challenges and difficulties, and generally improve training, and configuring of, one or more machine learning models. Further, aspects described herein may address one or more challenges and difficulties in implementing machine learning models based on training data, or other data, received from a plurality of data sources.
Aspects described herein relate to aggregating data records received from a plurality of data sources and selecting, for each of the plurality of data sources, a subset of data from the resulting aggregated data records. The aggregation and selecting processes may be performed in a randomized fashion. Further, the subsets of data may have portions that overlap with each other. Each subset may be used to train a model. Configuration information from any model trained in this way may then be used to configure an aggregated model. The overlap may also be used as basis for configuring the aggregated model. Once the aggregated model is configured, the aggregated model may be used to determine predictions.
The manner in which the data is aggregated and otherwise processed may address one or more challenges and difficulties in implementing machine learning models. For example, the various ways in which data is aggregated may improve data privacy and/or data security. The data, as it is aggregated, may be ordered in a randomized fashion and, due to the randomized fashion, it may be more difficult to determine which source sent particular portions of the resulting aggregated data. As another example, confidential data may be hashed or encrypted such that the confidential data is not directly disclosed to other data sources. The manner in which the selections from the aggregated data are performed may address one or more further challenges and difficulties in implementing machine learning models. For example, the various ways in which subsets of data are selected and then used for training a plurality of models may improve the manner in which larger machine learning models can be configured.
These features, along with many others, are discussed in greater detail below. Corresponding apparatus, systems, and computer-readable media are also within the scope of the disclosure.
The present disclosure is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
In the following description of the various embodiments, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration various embodiments in which aspects of the disclosure may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of the present disclosure. Aspects of the disclosure are capable of other embodiments and of being practiced or being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. Rather, the phrases and terms used herein are to be given their broadest interpretation and meaning. The use of “including” and “comprising” and variations thereof is meant to encompass the items listed thereafter and equivalents thereof as well as additional items and equivalents thereof.
Throughout this disclosure, the phrases “confidential information” and “confidential data” are used and may refer to any information or data that is subject to confidentiality procedures that restrict access and/or restrict disclosure of the confidential information or confidential data. Examples of confidential information or confidential data may include account numbers, social security numbers, and the like. As an example of a confidentiality procedure, the confidential information or confidential data may be prevented from being disclosed to any user, device, or entity that does not have appropriate access rights. A confidentiality procedure may be defined by one or more data policies. For example, a data source may make data records available for training machine learning models or as part of some other data sharing agreement. The data source may make its data records available subject to a particular data policy, and this data policy may indicate, for example, that access to account numbers or other customer information is to be restricted in some manner. A confidentiality procedure may be based on one or more legal or regulatory requirements. For example, social security numbers may be subject to one or more confidentiality procedures based on one or more United States Federal laws or regulations.
By way of introduction, aspects discussed herein may relate to methods and techniques for training, or otherwise configuring, one or more machine learning models based on data received from a plurality of data sources. The one or more machine learning models may include a machine learning model for each data source. In other words, each data source may have its own machine learning model trained according to the methods and techniques described herein. The one or more machine learning models may also include a machine learning model trained, or configured, based on an aggregation of the data received from the plurality of data sources. In this way, the data received from the plurality of data sources may be collected together and used to train, or otherwise configure, a machine learning model. The methods and techniques described herein, and/or various combinations of the features described herein, may improve training, and configuring of, one or more machine learning models. Further, the methods and techniques described herein, and/or various combinations of the features described herein, may improve the ability to implement machine learning models based on training data, or other data, received from a plurality of data sources.
A machine learning model may be referred to interchangeably herein as a model. Throughout this disclosure, the different instances of “training a model” and “configuring a model” are intentional and indicate different processes. For example, training a model may include performing a training algorithm using a particular set of training data. Configuring a model may include determining a configuration of the model based on the configuration of one or more other models and then configuring the model according to the determined configuration. In other words, configuring a model may not involve performing a training algorithm for that model. Instead, training algorithms may be performed for other models. After the training algorithms are complete, the other models can be used as a basis for determining the configuration of the model. For simplicity, throughout this disclosure, a model that is configured (and not trained) will be referred to as an aggregated model. A model that is trained (and not configured) will be referred to as a first model, second model, third model, fourth model, and the like. The details of these methods and techniques, among other aspects, will be described in more detail herein.
For simplicity, the example computing environment 100 depicts the computing platform 110 as receiving data from two data sources. The computing platform 110 receives one or more first data records 105 from a first data source 102 and receives one or more second data records 107 from a second data source 103. The two data sources 102, 103 are shown as one example. As will become apparent based on other examples discussed throughout this disclosure, the exact number of data sources is not limited to two data sources. Moreover, the computing platform 110 may be a data source itself. In this way, the computing platform 110 may be a third data source and may provide its own one or more third data records (not shown) that can be aggregated with the one or more first data records 105 and the one or more second data records 107. In such instances, the computing platform 110 may have its own model (e.g., a third model, not shown) in addition to the aggregated model 125 and may perform processes similar to those discussed in connection with the two data sources 102, 103. Additionally, each of the two data sources 102, 103 is depicted as being a single device. An example of a suitable single device may be a server, a personal computer, a mobile device, an automated teller machine, or the like. The server, personal computer, or mobile device, for example, may be associated with a customer or client of a service provided via the computing platform 110 (e.g., a banking service where the computing platform 110 is operated by a bank). The server, personal computer, or mobile device may be generating data records associated with the customer or client and forwarding them to the computing platform 110 for processing. 
As an alternative to the two data sources 102, 103 each being implemented as a single device, the two data sources 102, 103 may each be implemented on one or more computing devices, one or more computing systems, one or more computing platforms, and/or other arrangement of devices that are configured to perform processes similar to those discussed in connection with the two data sources 102, 103. Computing devices, computing systems, and/or computing platforms of a data source may be interconnected to each other via one or more networks (not shown). The computing platform 110 is depicted as being four devices, but may be implemented as one or more computing devices.
As also depicted, each of the two data sources 102, 103 and the computing platform 110 is depicted as having its own model. The first data source 102 has a first model 116, the second data source 103 has a second model 117, and the computing platform 110 has an aggregated model 125. Each model discussed throughout this disclosure, including those depicted in
The first model 116, the second model 117, and/or the aggregated model 125 may be the same or similar to each other in size. As one example, each of the first model 116, the second model 117, and the aggregated model 125 may include a neural network, and each neural network may have the same number of inputs, layers, and outputs. The first model 116, the second model 117, and/or the aggregated model 125 may be of different sizes. As one example, each of the first model 116, the second model 117, and the aggregated model 125 may include a neural network, and each neural network may include a different number of inputs, layers, and outputs from the other neural networks. As another example, the first model 116 and the second model 117 may include neural networks that have the same number of inputs, layers, and outputs as each other, but the aggregated model 125 may include a larger neural network (e.g., have more inputs, layers, and/or outputs than the neural networks of the first model 116 and the second model 117).
As depicted in
The computing platform 110 and the devices of the two data sources 102, 103 are depicted in
After receiving the first selected data 113 from the computing platform 110, the first data source 102 may train the first model 116 based on the first selected data 113. Similarly, after receiving the second selected data 115 from the computing platform 110, the second data source 103 may train the second model 117 based on the second selected data 115. After training, each of the first model 116 and the second model 117 may be usable to determine, respectively, first prediction data 118 and second prediction data 119. Additionally, after training, both data sources 102, 103 may be able to extract, or otherwise determine, configuration information of the trained models 116, 117. The configuration information may include weights, biases, and/or any other learned or configurable parameter of the trained models 116, 117. The types of parameters that can be included by the configuration information may depend on the type of model being used (e.g., a neural network-based model may have configuration that includes weights and/or biases). The configuration information may be sent to the computing platform 110 to be used as a basis for configuring the aggregated model 125. In this way,
Based on the first configuration information 120 and the second configuration information 121, the computing platform 110 may perform a configuration information aggregation process 110-4 that, in part, determines aggregated configuration information 123 for the aggregated model 125. The aggregated configuration information 123 may include one or more aggregated weights for the aggregated model 125 and/or one or more aggregated biases for the aggregated model 125. The one or more aggregated model weights may, for example, combine the one or more first model weights with the one or more second model weights (e.g., by summing, normalizing, and/or some other process). The one or more aggregated biases may combine the one or more first biases with the one or more second biases (e.g., by summing, by normalizing, by using an OR operator, by using an exclusive-OR operator, and/or by some other process). The computing platform 110 may then configure the aggregated model 125 using the aggregated configuration information 123. After the aggregated model 125 is configured, the aggregated model 125 may be usable to determine aggregated prediction data 126. Additionally, after the aggregated model 125 is configured, the aggregated model 125 and/or the aggregated configuration information 123 may be sent (not shown) to one or more of the data sources 102, 103. In this way, the data sources 102, 103 may be able to make predictions using the aggregated model 125 and/or another model configured using the aggregated configuration information 123.
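One illustrative way a configuration information aggregation process could combine configuration information by summing and normalizing is sketched below. The sketch assumes each model's configuration information is represented as a mapping from a parameter name to a list of parameter values; the parameter names, the specific values, and the choice of element-wise averaging (summing followed by normalizing by the number of contributing models) are illustrative assumptions only.

```python
# Minimal sketch of a configuration information aggregation process:
# weights and biases from two trained models are combined by summing
# corresponding parameters and normalizing by the number of models
# (i.e., element-wise averaging). The parameter names and values are
# illustrative assumptions.

def aggregate_configurations(first_config, second_config):
    """Combine two models' configuration information element-wise."""
    aggregated = {}
    for name in first_config:
        first_params = first_config[name]
        second_params = second_config[name]
        # Sum corresponding parameters, then normalize by the number
        # of contributing models (two in this sketch).
        aggregated[name] = [
            (a + b) / 2 for a, b in zip(first_params, second_params)
        ]
    return aggregated

# Example configuration information from two trained models.
first_configuration = {"weights": [0.25, 0.5, 0.75], "biases": [0.125, -0.125]}
second_configuration = {"weights": [0.75, 0.0, 0.25], "biases": [0.375, 0.125]}

aggregated_configuration = aggregate_configurations(
    first_configuration, second_configuration
)
# aggregated_configuration["weights"] → [0.5, 0.25, 0.5]
# aggregated_configuration["biases"]  → [0.25, 0.0]
```

The averaged configuration information could then be applied to the aggregated model, or sent back to the data sources so that they may configure their own models.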
To begin the examples that combine the example computing environment 100 of
Once collected, or otherwise generated, the one or more first data records 105 and the one or more second data records 107 may be sent from their respective data source 102, 103 to the computing platform 110. After receipt, the computing platform 110 may perform data pre-processing and a randomized data aggregation process 110-1 on the one or more first data records 105 and the one or more second data records 107. The data pre-processing and the randomized data aggregation process 110-1 may, in part, determine aggregated data 111.
As depicted, the formatting of the example aggregated data 205 provides one or more examples of the data pre-processing. For example, if the example aggregated data 205 is compared to the example second data record 203, the number of columns for rows 203-r1 and 203-r2 has changed such that each row of the example aggregated data 205 has the same number of columns. In this way, data pre-processing may include one or more reformatting processes so that each row included in the example aggregated data 205 has the same number of columns.
Reformatting so that each row of the example aggregated data 205 has the same number of columns can be performed in various ways. For example, the reformatting may be performed based on a maximum number of columns, based on the columns of the data records aligning with each other, or based on the columns of the data records not aligning with each other. The reformatting may be performed by way of appending one or more columns to a data record and/or by way of reordering the columns of a data record. Any information needed to determine the maximum number of columns in the example data records 201, 203, and/or whether the columns of the example data records 201, 203 will or will not align may be received from the two data sources 102, 103 and/or input by a user of the computing platform 110.
As depicted, the example aggregated data 205 includes eight columns 205-c1 to 205-c8. The example aggregated data 205 may have eight columns based on that being the maximum number of columns between the example first data record 201 and the example second data record 203. Indeed, the example first data record 201 includes eight columns and the example second data record 203 includes seven columns. Accordingly, to compensate for the differences in the number of columns between the two data records 201, 203, the one or more reformatting processes may append an eighth column to the rows 203-r1, 203-r2 of the example second data record 203. In this way, the one or more reformatting processes may be performed based on a maximum number of columns.
Further, the example aggregated data 205 may have eight columns based on the columns of the example first data record 201 and the example second data record 203 aligning if the rows 203-r1, 203-r2 of the example second data record 203 are appended with an additional column having blank cells. Such alignment may occur if, for example, the data records indicate the same or similar types of transactions. As one particular example, the example first data record 201 may be indicative of banking transactions with a first bank and the example second data record 203 may be indicative of banking transactions with a second bank. The columns 201-c1 to 201-c7 of the example first data record 201 and the columns 203-c1 to 203-c7 of the example second data record 203 may be for the same types of data (e.g., account number, total amount deposited, withdrawal amount, deposit amount, etc.) and may be in the same order. The first bank may track additional information that is not tracked by the second bank and, thus, may include the eighth column 201-c8 of the example first data record 201 for that additional information. Accordingly, to compensate for this additional information being tracked by the first bank, the one or more reformatting processes may append an eighth column of blank cells to the rows 203-r1, 203-r2 of the example second data record 203. In this way, the one or more reformatting processes may be performed based on the columns of the data records aligning with each other.
In some instances, the columns of the data records may align with each other by re-ordering one or more columns (not shown). In such instances, the one or more reformatting processes may re-order the one or more columns so that the columns of the data records align. Such alignment may occur if, for example, the data records indicate the same or similar types of transactions. As one particular example, the example first data record 201 may be indicative of banking transactions with a first bank and the example second data record 203 may be indicative of banking transactions with a second bank. The columns 201-c1 to 201-c7 of the example first data record 201 and the columns 203-c1 to 203-c7 of the example second data record 203 may be for the same types of data (e.g., account number, total amount deposited, withdrawal amount, deposit amount, etc.), but may be in a different order. For example, the fourth column 201-c4 of the example first data record 201 may be for the withdrawal amount, but the fifth column 203-c5 of the example second data record 203 may be for the withdrawal amount. Accordingly, to compensate for this difference in column order for the withdrawal amount, the one or more reformatting processes may modify the column order such that the withdrawal amount is within the same column across all rows 201-r1, 201-r2, 201-r3, 201-r4. In this way, the one or more reformatting processes may be performed based on the columns of the data records aligning with each other.
In some instances, the columns of the data records may not align with each other, even by appending one or more columns to the data record having fewer columns (not shown). In such instances, the one or more reformatting processes may append one or more columns to each data record. For example, if the sixth and seventh columns 201-c6, 201-c7 of the example first data record 201 are for different types of data than the sixth and seventh columns 203-c6, 203-c7 of the example second data record 203, the one or more reformatting processes may append two columns to each row 201-r1, 201-r2, 203-r1, 203-r2 of the example first data record 201 and the example second data record 203, and may modify the rows of one of the two example data records 201, 203 such that data values of the sixth and seventh columns are moved to the two appended columns. This may result in aggregated data (not shown) that includes ten columns. The data records may not align with each other if, for example, the data records indicate different types of transactions. For example, the example first data record 201 may be indicative of banking transactions with a first bank and the example second data record 203 may be indicative of credit card transactions. One or more columns of the example data records 201, 203 may be for different types of data. For example, the fourth column 201-c4 of the example first data record 201 may be for the withdrawal amount, but the fourth column 203-c4 of the example second data record 203 may be for the amount charged to a credit card. Accordingly, to compensate for this difference in types of data, the one or more reformatting processes may append a column to each row 201-r1, 201-r2, 201-r3, 201-r4, and may move the amount charged to a credit card to the appended column. In this way, the one or more reformatting processes may be performed based on the columns of the data records not aligning with each other.
The above examples provide only a few examples of the ways in which the reformatting may occur so that each row of the example aggregated data 205 has the same number of columns. Additional examples may include a combination of the above examples. For example, reformatting may be performed by both appending one or more columns and by re-ordering one or more columns. This may be performed, for example, if the data records are for the same or similar types of transactions, but the data records include different numbers of columns and/or the data records have columns that are in different orders from each other.
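The reformatting processes described above can be sketched as follows, with each data record represented as a list of rows and each row as a list of cells. The sketch combines two of the techniques from the examples above: re-ordering columns into a shared order and appending blank cells up to the maximum number of columns. The specific records, data values, and the column-order mapping are illustrative assumptions only.

```python
# Illustrative sketch of one or more reformatting processes: columns
# are re-ordered into a shared target order, and rows are padded with
# blank cells up to the maximum number of columns across all records.
# The records, values, and column-order mapping are assumptions.

BLANK = ""

def reorder_columns(rows, column_order):
    """Re-order each row's cells; column_order lists the source column
    indices in the desired target order."""
    return [[row[i] for i in column_order] for row in rows]

def pad_rows(rows, max_columns):
    """Append blank cells so every row has max_columns columns."""
    return [row + [BLANK] * (max_columns - len(row)) for row in rows]

# First record: account id, withdrawal amount, deposit amount.
first_rows = [["a", 40, 100], ["b", 15, 60]]
# Second record: withdrawal amount, account id (different order, fewer columns).
second_rows = [[25, "t"], [30, "f"]]

# Re-order the second record's columns to match the first record.
second_rows = reorder_columns(second_rows, [1, 0])
# Pad every row based on the maximum number of columns.
max_columns = max(len(row) for row in first_rows + second_rows)
reformatted = pad_rows(first_rows + second_rows, max_columns)
# reformatted → [["a", 40, 100], ["b", 15, 60], ["t", 25, ""], ["f", 30, ""]]
```

In this sketch, every reformatted row has the same number of columns and the withdrawal amount occupies the same column across all rows, consistent with the alignment examples above.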
Examples of data pre-processing are also provided based on the data values of the cells depicted by the example aggregated data 205. For example, the first column 205-c1 of the example aggregated data 205 includes cells with data values such as H(a), H(t), H(f), and H(p). Each of these represents a hashed data value. In particular, example data values a, t, f, and p have been processed through a hashing algorithm, and the results of the hashing algorithm have been placed in the first column 205-c1 of the example aggregated data 205. In this way, data pre-processing may include hashing one or more data values of one or more cells.
The hashing may be performed, for example, based on one or more confidentiality procedures associated with the example data records 201, 203. As one particular example, one or more confidentiality procedures may indicate that the first columns 201-c1, 203-c1 of the example data records 201, 203 include confidential data (e.g., an account number, credit card number, social security number) and that disclosure of the confidential data should be prevented or otherwise restricted. Accordingly, to prevent the confidential data from being included in the example aggregated data 205, the confidential data of those columns 201-c1, 203-c1 may be hashed. The hashed versions of the confidential data (e.g., H(a), H(t), H(f), and H(p)) may be included as part of the example aggregated data 205 and, thus, the example aggregated data 205 may not include the confidential data itself (e.g., a, t, f, p). The hashing algorithm used to generate the hashed versions (e.g., H(a), H(t), H(f), and H(p)) may be a one-way function such that the hashed versions cannot be reversed to reveal the confidential data itself (e.g., a, t, f, p). In this way, data privacy may be improved. The one or more confidentiality procedures may indicate the hashing algorithm that is to be used on the confidential data. Any information needed to determine the one or more confidentiality procedures may be received from the two data sources 102, 103, and/or input by a user of the computing platform 110. The above example where confidential data is hashed provides one example of the ways in which data privacy may be improved. An additional example may include encrypting the confidential data instead of hashing it. In this way, the data pre-processing, by hashing or encrypting confidential data, may include one or more processes that prevent confidential data from being disclosed.
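The hashing of a confidential column can be sketched as follows. SHA-256 is assumed here as the one-way hashing algorithm; the applicable confidentiality procedure could specify a different algorithm, and the example rows and column position are illustrative assumptions only.

```python
# Minimal sketch of hashing confidential data values during data
# pre-processing. SHA-256 is an assumed one-way hashing algorithm;
# the confidentiality procedure may indicate a different one.

import hashlib

def hash_value(value):
    """Return a one-way hash H(value) of a confidential data value."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def hash_column(rows, column_index):
    """Replace a confidential column's cells with hashed versions."""
    hashed_rows = []
    for row in rows:
        row = list(row)  # copy so the original record is untouched
        row[column_index] = hash_value(row[column_index])
        hashed_rows.append(row)
    return hashed_rows

# Example rows where the first column holds a confidential identifier.
rows = [["a", 100], ["t", 250]]
hashed = hash_column(rows, 0)
# hashed[0][0] is H("a"), a 64-character hex digest; the original
# identifier "a" no longer appears in the data to be aggregated.
```

Because SHA-256 is a one-way function, the hashed versions cannot practically be reversed to reveal the underlying values, while identical values still hash to identical digests (which may be useful for matching records across sources).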
The above example of hashing or encrypting confidential data is only one example of how confidential data can be prevented from being disclosed. Additional data protection or anonymization techniques can be used in addition to or alternatively from the hashing or encrypting mentioned above. Tokenization, data masking, pseudonymization, generalization, data swapping, data perturbation are all additional examples of techniques that could be used in addition to or alternatively from hashing or encrypting. As one example, tokenization may include replacing the confidential data with an identifier and storing, separate from the data record, a mapping between the identifier and the confidential data that can be used to recover the confidential data if needed. The models of the data sources may be trained using data that includes the identifier. In this way, the data pre-processing may include one or more of these additional or alternative techniques.
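The tokenization technique mentioned above can be sketched as follows. The identifier format and the in-memory mapping are illustrative assumptions; in practice, the mapping between identifiers and confidential data would be stored separately from the data records, for example in a secure store.

```python
# Illustrative tokenization sketch: confidential data values are
# replaced with opaque identifiers, and a mapping kept separate from
# the data records allows recovery of the confidential data if
# needed. The identifier format is an assumption.

import itertools

class Tokenizer:
    def __init__(self):
        self._counter = itertools.count(1)
        self._token_to_value = {}  # kept separate from the records
        self._value_to_token = {}

    def tokenize(self, value):
        """Replace a confidential value with a stable identifier."""
        if value not in self._value_to_token:
            token = f"TKN-{next(self._counter):06d}"
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def recover(self, token):
        """Recover the confidential value from its identifier."""
        return self._token_to_value[token]

tokenizer = Tokenizer()
# e.g., a social security number subject to a confidentiality procedure
token = tokenizer.tokenize("123-45-6789")
# The data records (and any models trained on them) carry only the
# token; tokenizer.recover(token) returns the original value.
```

Unlike a one-way hash, tokenization is reversible by the party holding the mapping, which supports confidentiality procedures that require the confidential data to be recoverable later.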
The data pre-processing may include additional processes not explicitly shown by the examples of
Examples of the randomized data record aggregation process are also provided by the example aggregated data 205. As depicted, the example aggregated data 205 includes the rows 201-r1, 201-r2, 203-r1, 203-r2 from the example first data record 201 and the example second data record 203. The order in which the rows have been included as part of the example aggregated data 205, however, has been randomized. In this way, the randomized process has resulted in the second row 203-r2 of the example second data record 203 being placed between the first row 201-r1 and the second row 201-r2 of the example first data record 201. The randomized process further resulted in the first row 203-r1 of the example second data record 203 being placed after the second row 201-r2 of the example first data record 201. This ordering is only one example of the randomization that may occur as the result of the randomized data record aggregation process. For example, the rows could be ordered differently (e.g., row 203-r1, followed by row 201-r2, followed by row 201-r1, and followed by row 203-r2). Alternatively or additionally, the columns of the data records may be randomized, which may result in the columns being ordered in a randomized fashion.
The randomization of the order in which the rows or columns are included as part of the example aggregated data 205 may improve data privacy. For example, by randomizing the order, data received from the two data sources 102, 103 may be mixed together in such a way that it may be more difficult to determine which source sent a particular piece of data. Indeed, by randomizing the order of the rows 201-r1, 201-r2, 203-r1, 203-r2 from the example first data record 201 and the example second data record 203, it may be more difficult to determine which source sent a particular row of data in the example aggregated data 205 as compared to some alternative processes, insofar as there is no pre-set location where the rows 201-r1, 201-r2, 203-r1, 203-r2 can be found within the example aggregated data 205. Compare the lack of a pre-set location in the randomized process to an alternative process that always appends data received from the second data source 103 directly after the data received from the first data source 102. Using this alternative process, there would be a pre-set location to look for the data received from the two data sources 102, 103 and, thus, it may be easier to determine which source sent a particular row of data.
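A randomized data record aggregation process of the kind described above can be sketched as follows, with each row represented as a list of cells. The optional seed parameter is an illustrative assumption added only so the shuffle can be reproduced during testing; in practice the shuffle could be unseeded.

```python
# Minimal sketch of a randomized data record aggregation process:
# rows from the data sources are combined into a single collection
# and shuffled, so there is no pre-set location revealing which
# source sent a particular row.

import random

def randomized_aggregation(first_rows, second_rows, seed=None):
    """Combine rows from both sources in a randomized order."""
    aggregated = list(first_rows) + list(second_rows)
    random.Random(seed).shuffle(aggregated)
    return aggregated

# Rows labeled after the example rows discussed above.
first_rows = [["201-r1"], ["201-r2"]]
second_rows = [["203-r1"], ["203-r2"]]
aggregated = randomized_aggregation(first_rows, second_rows)
# The same four rows appear in the result, but in a randomized order.
```

Note that the sketch shuffles only the row order; randomizing the column order, as mentioned above, could be added in the same fashion by shuffling a list of column indices and re-ordering each row accordingly.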
After determining the aggregated data 111, the computing platform 110 may, based on the aggregated data 111, determine first selected data 113 and second selected data 115. The determination of the first selected data 113 may be based on a first selecting process 110-2 and the determination of the second selected data 115 may be based on a second selecting process 110-3. Each of the first selecting process 110-2 and the second selecting process 110-3 may select a subset of data from the aggregated data 111 that will be used to train a model at one of the two data sources 102, 103. Accordingly, the first selected data 113 may include a first subset of data from the aggregated data 111 and the second selected data 115 may include a second subset of data from the aggregated data 111. For example, the computing platform 110 may first determine which data sources will be sent selected data. The computing platform 110 may determine to send selected data to all or only some of the data sources. For example, as depicted in
The above example may illustrate when all data sources are sent selected data. As an alternative example, a third data source (not shown) may have sent data records in addition to the two depicted data sources 102, 103. The computing platform 110 may determine to not send selected data to the third data source. This determination may be performed, for example, by a randomized process (e.g., randomly determining to send or not send to each data source); by a data source not having sent greater than a threshold number of data records (e.g., the third data source may have sent fewer than two data records); by a periodic schedule (e.g., the third data source may be sent selected data every other time); by a data source's availability via a network (e.g., the third data source may be offline when selected data is to be sent and thus not sent selected data, but portions of the third data source's data records may still be processed by the other data sources based on those portions' inclusion in the selected data); or by some other criteria.
As depicted, the example first selected data 207 and the example second selected data 209 are formatted in rows and columns. The example first selected data 207 has four rows 207-r1 to 207-r4 and eight columns 207-c1 to 207-c8. The example second selected data 209 has four rows 209-r1 to 209-r4 and eight columns 209-c1 to 209-c8. The number of rows and columns in each of the example first selected data 207 and the example second selected data 209 may be the same as the number of rows and columns in the example aggregated data 205. The order of the rows and columns in each of the example first selected data 207 and the example second selected data 209 may be the same as the order of the rows and columns in the example aggregated data 205.
As depicted, the example first selected data 207 includes a first subset of data that was selected from the example aggregated data 205 by the first selecting process 110-2. Accordingly, the example first selected data 207 includes the data value, in the same row and column, for any cell of the example aggregated data 205 that was selected by the first selecting process 110-2. For example, because the cell at the first row and the first column of the example aggregated data 205 was selected, the example first selected data 207 includes the data value H(a) in the cell at its first column 207-c1 and its first row 207-r1. The example first selected data 207 includes a blank cell for any cell that was not selected by the first selecting process 110-2. For example, because the cell at the first row and the second column of the example aggregated data 205 was not selected, the example first selected data 207 includes a blank cell at its second column 207-c2 and its first row 207-r1. If the example first selected data 207 is compared to the example aggregated data 205, the example first selected data 207 is shown as having more blank cells than the example aggregated data 205. In this way, the first selecting process 110-2 may result in the example first selected data 207 excluding, by way of one or more blank cells, one or more cells of the example aggregated data 205.
As depicted, the example second selected data 209 includes a second subset of data that was selected from the example aggregated data 205 by the second selecting process 110-3. Accordingly, the example second selected data 209 includes the data value, in the same row and column, for any cell of the example aggregated data 205 that was selected by the second selecting process 110-3. For example, because the cell at the first row and the third column of the example aggregated data 205 was selected, the example second selected data 209 includes the data value b in the cell at its third column 209-c3 and its first row 209-r1. The example second selected data 209 includes a blank cell, or a cell having no data, for any cell that was not selected by the second selecting process 110-3. For example, because the cell at the first row and the first column of the example aggregated data 205 was not selected, the example second selected data 209 includes a blank cell at its first column 209-c1 and its first row 209-r1. If the example second selected data 209 is compared to the example aggregated data 205, the example second selected data 209 is shown as having more blank cells than the example aggregated data 205. In this way, the second selecting process 110-3 may result in the example second selected data 209 excluding, by way of one or more blank cells, one or more cells of the example aggregated data 205.
If the example first selected data 207 is compared to the example second selected data 209, the comparison shows that examples 207, 209 include some cells that were selected by both the first selecting process 110-2 and the second selecting process 110-3. In other words, both the example first selected data 207 and the example second selected data 209 include data values in the same cells. For example, the example first selected data 207 and the example second selected data 209 both include the data value d in the cell at the first row and sixth column. Any cell that is included in both the first selected data 207 and the second selected data 209 may be referred to as an overlapping cell.
If the example first selected data 207 is compared to the example second selected data 209, the comparison shows that the examples 207, 209 include some cells that were selected by one, but not both, of the first selecting process 110-2 and the second selecting process 110-3. In other words, the example first selected data 207 includes one or more first cells that are blank in the example second selected data 209. The example second selected data 209 includes one or more second cells that are blank in the example first selected data 207. For example, the example first selected data 207 includes the data value c in the cell at the first row and fifth column, while the example second selected data 209 has a blank cell at the first row and fifth column. As another example, the example second selected data 209 includes the data value H(t) in the cell at the second row and first column, while the example first selected data 207 has a blank cell at the second row and first column. Any cell that is included in one of the first selected data 207 and the second selected data 209, but not the other, may be referred to as a non-overlapping cell.
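As a non-limiting illustration, the identification of overlapping and non-overlapping cells described above might be sketched as follows, where each selected-data grid is represented as rows of values with None marking a blank cell (this representation is an assumption for illustration):

```python
def classify_cells(first_selected, second_selected):
    """Illustrative sketch: given two same-shaped grids where None marks a
    blank cell, return the (row, col) positions of overlapping cells and of
    the non-overlapping cells unique to each grid."""
    overlapping, first_only, second_only = set(), set(), set()
    for r, (row_a, row_b) in enumerate(zip(first_selected, second_selected)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if a is not None and b is not None:
                overlapping.add((r, c))       # selected by both processes
            elif a is not None:
                first_only.add((r, c))        # selected by the first only
            elif b is not None:
                second_only.add((r, c))       # selected by the second only
    return overlapping, first_only, second_only
```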
If the example first selected data 207 is compared to the example second selected data 209, the comparison shows that the examples 207, 209 can be combined together to regenerate the example aggregated data 205. In other words, each cell of the example aggregated data 205 is included in at least one of the example first selected data 207 and the example second selected data 209. In this way, the first selecting process 110-2 and the second selecting process 110-3 may result in selected data 207, 209 that, together, are usable to regenerate the example aggregated data 205 from which they were selected.
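By way of a non-limiting sketch, one way two selecting processes could partition a grid so that every cell lands in at least one subset (with some overlapping cells), and one way the subsets could be recombined, is shown below. The overlap probability and randomized assignment are illustrative assumptions:

```python
import random

def split_with_overlap(aggregated, overlap_prob=0.25, rng=None):
    """Illustrative sketch: randomly assign each cell to the first subset,
    the second subset, or both, so that every cell is selected at least once."""
    rng = rng or random.Random(0)
    first = [[None] * len(row) for row in aggregated]
    second = [[None] * len(row) for row in aggregated]
    for r, row in enumerate(aggregated):
        for c, value in enumerate(row):
            roll = rng.random()
            if roll < overlap_prob:               # overlapping cell: both
                first[r][c] = second[r][c] = value
            elif roll < (1 + overlap_prob) / 2:   # non-overlapping: first only
                first[r][c] = value
            else:                                 # non-overlapping: second only
                second[r][c] = value
    return first, second

def regenerate(first, second):
    """Combine the two selected-data grids back into the aggregated data."""
    return [[a if a is not None else b for a, b in zip(ra, rb)]
            for ra, rb in zip(first, second)]
```

Because every cell is placed in at least one subset, regenerating the aggregated data always succeeds in this sketch.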
The above discussion regarding the example first selected data 207 and the example second selected data 209 provides a few examples regarding how the selecting processes 110-2 and 110-3 can be performed. Table I provides a summary of those examples as well as additional examples of how the selecting processes 110-2 and 110-3 can be performed. In particular, each row of Table I provides an example selecting process and a description of selected data that may result from the example selecting process. Each example of Table I may be used on its own as a basis for selecting process 110-2, 110-3. Each example of Table I may be combined with one or more other examples of Table I and used as a basis for selecting process 110-2, 110-3. Additionally, the two selecting processes 110-2, 110-3 may be different from each other insofar as each uses a different combination of the examples listed in Table I.
After determining the first selected data 113, the computing platform 110 may send the first selected data 113 to the first data source 102. After receiving the first selected data 113, the first data source 102 may train the first model 116 using the first selected data 113. Any suitable training technique may be used. After the first model 116 is trained, the first model 116 may be used to determine first prediction data 118. The first prediction data 118 may be one or more predictions of user behavior based on the first selected data 113 being a subset of data selected from data indicative of banking and credit card transactions. Additionally, after the first model 116 is trained, the first data source 102 may determine first configuration information 120 for the first model 116 (e.g., one or more first model weights and/or one or more first biases) and may send the first configuration information 120 to the computing platform 110. Further, by sending the first configuration information 120 to the computing platform 110, the first data source 102 may be able to prevent the computing platform 110 from being aware of the training algorithm used to train the first model 116.
After determining the second selected data 115, the computing platform 110 may send the second selected data 115 to the second data source 103. After receiving the second selected data 115, the second data source 103 may train the second model 117 using the second selected data 115. Any suitable training technique may be used and may be a different training technique than was used to train the first model 116. After the second model 117 is trained, the second model 117 may be used to determine second prediction data 119. The second prediction data 119 may be one or more predictions of user behavior based on the second selected data 115 being selected from data indicative of banking and credit card transactions. Additionally, after the second model 117 is trained, the second data source 103 may determine second configuration information 121 for the second model 117 (e.g., one or more second model weights and/or one or more second biases) and may send the second configuration information 121 to the computing platform 110. Further, by sending the second configuration information 121 to the computing platform 110, the second data source 103 may be able to prevent the computing platform 110 from being aware of the training algorithm used to train the second model 117.
The models 116, 117 of the two data sources 102, 103 may or may not have restricted access. For example, if access is not restricted to the first model 116, any application or user associated with the first data source 102 may access the first model 116, the first configuration information 120, and/or the first prediction data 118. If access is restricted to the first model 116, only a single application executed by the first data source 102 may have access to the first model 116, the first configuration information 120, and/or the first prediction data 118. Further, the single application may prevent the first model 116, the first configuration information 120, and/or the first prediction data 118 from being accessed, or used, by any other application or user associated with the first data source 102. Indeed, the single application may only allow for the first configuration information 120 to be sent to the computing platform 110.
After receiving the first configuration information 120 and the second configuration information 121, the computing platform 110 may perform a configuration information aggregation process 110-4 that, in part, determines aggregated configuration information 123 for the aggregated model 125 (e.g., one or more aggregated model weights and/or one or more aggregated biases). The configuration information 123 may, for example, combine the first configuration information 120 with the second configuration information 121 using various aggregation techniques (e.g., by summing, normalizing, and/or some other process). The configuration information aggregation process 110-4 may be based on any indications of overlapping and non-overlapping cells. The indications of overlapping and non-overlapping cells may be used to increase or decrease the significance of a model parameter (e.g., increase or decrease the significance of a model weight and/or a bias). For example, if the first selected data 113 and the second selected data 115 have greater than a threshold number of overlapping cells, model weights may be reduced so that they have less influence over the configuration of the aggregated model 125. This reduction may result, for example, because the data sources are regionally-redundant systems that perform redundant processing of transactions, causing duplication of the transactions across the two sources. The impact of the duplication can be lessened by reducing the model weights. As another example, if the first selected data 113 and the second selected data 115 have greater than a threshold number of overlapping cells, model weights may be increased so they have more influence over the configuration of the aggregated model 125. This increase may result, for example, because the data records include many repeated transactions between the same accounts (e.g., transactions representing a monthly bill for the same monthly cost).
The impact of the repeated transactions can be increased by increasing the model weights. The computing platform 110 may then configure the aggregated model 125 using the aggregated configuration information 123. After the aggregated model 125 is configured, the aggregated model 125 may be usable to determine aggregated prediction data 126. The aggregated prediction data 126 may be one or more predictions of user behavior based on the aggregated data 111 being indicative of banking and credit card transactions.
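As a non-limiting illustration of the overlap-based weight adjustment described above, the following sketch scales model weights up or down when the number of overlapping cells exceeds a threshold. The scaling factors and the redundancy flag are assumptions chosen for illustration, not values taken from the described system:

```python
def adjust_weights(weights, overlap_count, threshold, redundant_sources):
    """Illustrative sketch: scale model weights when the selected data sets
    share more than a threshold number of overlapping cells."""
    if overlap_count <= threshold:
        return list(weights)  # overlap within tolerance; weights unchanged
    # Hypothetical factors: halve the weights when the overlap reflects
    # duplicated transactions from regionally-redundant sources; boost them
    # when the overlap reflects genuinely repeated transactions.
    factor = 0.5 if redundant_sources else 1.5
    return [w * factor for w in weights]
```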
Having discussed the example computing environment 100 of
The example process flow begins at 310, where the computing platform 309 may perform one or more registration processes with the first, second, third, and fourth data sources. A registration process may be performed for each data source that will be in communication with the computing platform 309. A registration process may, among other things, identify a data source as being available for receiving selected data, identify one or more computing devices as being associated with a data source, and may provide information related to the data records that the data source will be sending to the computing platform 309. In this way, the computing platform 309 may be able to, for example, determine whether data records align with each other or not, and may be able to send selected data to any registered data source. As such, a registration process may include one or more communications between the computing platform 309 and any of the one or more first computing devices 301, the one or more second computing devices 303, the one or more third computing devices 305, and the one or more fourth computing devices 307. For example, as part of a registration process for the first data source, the one or more first computing devices 301 may send, to the computing platform 309, a registration request for the first data source. The registration request may include address information for the one or more first computing devices 301, may include an identifier for the first data source, and may include information indicative of the data records the first data source may send to the computing platform 309.
The information indicative of the data records may include, for example, an indication of the types of data included by the data records (e.g., a data record includes data indicative of transactions using a first credit card); an indication of the format of the data records (e.g., a number of the rows and/or columns); information indicative of the order in which the data records include the data (e.g., an indication of the order of the columns); an indication of whether the data records include confidential data; an indication of one or more confidentiality procedures associated with the first data source; and the like. Based on the registration request, the computing platform 309 may update a data structure indicative of the registered data sources for later use. Additionally, the computing platform 309 may send data for initializing the model at the first data source. For example, the computing platform 309 may send an untrained model to the one or more first computing devices 301 and this untrained model may be used as the first data source's model (e.g., the first model 116 of
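As a non-limiting sketch, a registration request and the data structure indicative of registered data sources might be represented as follows. The field names and class names below are illustrative assumptions, not terms from the specification:

```python
from dataclasses import dataclass, field

@dataclass
class RegistrationRequest:
    """Illustrative sketch of fields a registration request might carry."""
    source_id: str                      # identifier for the data source
    address: str                        # address information for its devices
    column_order: list                  # order of the columns in its records
    has_confidential_data: bool = False
    confidentiality_procedures: list = field(default_factory=list)

class SourceRegistry:
    """Data structure indicative of the registered data sources."""
    def __init__(self):
        self._sources = {}

    def register(self, request: RegistrationRequest):
        # Record the data source as available for receiving selected data.
        self._sources[request.source_id] = request

    def is_registered(self, source_id: str) -> bool:
        return source_id in self._sources
```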
After the data sources have been registered, the example flow may continue at 311-317, which depicts the data sources sending data records to the computing platform 309. In particular, at 311, the one or more first computing devices 301 may send, to the computing platform 309, one or more first data records associated with the first data source. At 313, the one or more second computing devices 303 may send, to the computing platform 309, one or more second data records associated with the second data source. At 315, the one or more third computing devices 305 may send, to the computing platform 309, one or more third data records associated with the third data source. At 317, the one or more fourth computing devices 307 may send, to the computing platform 309, one or more fourth data records associated with the fourth data source. Each of the one or more first, second, third, and fourth data records may be the same as, or similar to, the data records discussed in connection with
At 319, the computing platform 309 may perform data pre-processing on one or more of the first, second, third, and fourth data records. The data pre-processing may include hashing or encrypting confidential data to prevent the confidential data from being disclosed and/or may include one or more validity processes to determine whether data values of the data records are valid. The data pre-processing may be performed the same as, or similar to, the data pre-processing that was discussed in connection with
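By way of a non-limiting illustration, the data pre-processing described above (hashing confidential data and checking validity) might be sketched as follows. The choice of SHA-256, the truncated digest, the H(...) notation, and the column names are assumptions for illustration:

```python
import hashlib

CONFIDENTIAL_COLUMNS = {"account_number"}  # assumed confidential columns

def preprocess_record(record):
    """Illustrative sketch: hash confidential fields so they are not
    disclosed, and drop values that fail a simple validity check."""
    cleaned = {}
    for column, value in record.items():
        if value is None or value == "":
            continue  # validity process: skip empty or missing values
        if column in CONFIDENTIAL_COLUMNS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()
            cleaned[column] = f"H({digest[:8]})"  # hashed, not plaintext
        else:
            cleaned[column] = value
    return cleaned
```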
The example flow continues at 321 of
At 323-329 of
At 331 of
As one example, the computing platform 309 may determine that the first selected data and the second selected data have one or more overlapping cells. The computing platform 309 may also determine that the third selected data and the fourth selected data are without the one or more overlapping cells and/or without any overlapping cells. The computing platform 309 may also determine that the first selected data has one or more first non-overlapping cells, that the second selected data has one or more second non-overlapping cells, that the third selected data has one or more third non-overlapping cells, and that the fourth selected data has one or more fourth non-overlapping cells.
At 333, the computing platform 309 may store one or more indications of the overlapping and/or the non-overlapping cells. These indications may be stored for later use by the computing platform 309. As depicted by the example flow, this storing is performed as a process separate from the four selecting processes.
Continuing the example of 331, the computing platform 309 may store an indication that the first selected data and the second selected data have the one or more overlapping cells. The computing platform 309 may store an indication that the third and fourth selected data are without the one or more overlapping cells and/or without any overlapping cells. The computing platform 309 may store an indication of the one or more first non-overlapping cells, the one or more second non-overlapping cells, the one or more third non-overlapping cells, and the one or more fourth non-overlapping cells.
At 335, 337, 339, and 341 of
At 336, 338, 340, and 342 of
At 343-346, the four data sources may determine the configuration information of their models. Accordingly, at 343, the one or more first computing devices 301 may determine first configuration information of the first model (e.g., one or more first model weights and/or one or more first biases). At 344, the one or more second computing devices 303 may determine second configuration information of the second model (e.g., one or more second model weights and/or one or more second biases). At 345, the one or more third computing devices 305 may determine third configuration information of the third model (e.g., one or more third model weights and/or one or more third biases). At 346, the one or more fourth computing devices 307 may determine fourth configuration information of the fourth model (e.g., one or more fourth model weights and/or one or more fourth biases). Each of the determined configuration information may be the same as, or similar to, the configuration information 120, 121 of
At 347-353 of
The example flow does not show the four data sources as determining predictions using their models. For example, the example flow does not explicitly show the one or more first computing devices 301 as determining first prediction data using the first model. This may be considered as a way to illustrate that the four data sources have restricted access to their models. In this way, each of the data sources may be prevented from using their models to make predictions.
At 355, the computing platform 309 may determine, based on a configuration information aggregation process, aggregated configuration information for an aggregated model (e.g., one or more aggregated model weights and/or one or more aggregated biases). This determination may be performed based on the first, second, third, and fourth configuration information. For example, the one or more first model weights, the one or more second model weights, the one or more third model weights, and the one or more fourth model weights may be summed together to determine the one or more aggregated model weights. The one or more first biases, the one or more second biases, the one or more third biases, and the one or more fourth biases may be processed based on an OR operator or an exclusive-OR operator to determine the one or more aggregated biases. This determination may also be performed based on any indication of overlapping and/or non-overlapping cells, which were stored at 333 of the example flow. The configuration information aggregation process may be the same as, or similar to, the configuration information aggregation process discussed in connection with
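As a non-limiting sketch of the aggregation just described, the following sums corresponding model weights across the data sources and combines corresponding biases with an exclusive-OR operator. Because exclusive-OR is a bitwise operation, the biases in this sketch are assumed to be integer-coded values, which is an illustrative assumption:

```python
def aggregate_configuration(weight_sets, bias_sets):
    """Illustrative sketch: sum corresponding model weights from each data
    source, and XOR corresponding integer-coded biases."""
    # Sum the i-th weight across all sources to get the i-th aggregated weight.
    aggregated_weights = [sum(ws) for ws in zip(*weight_sets)]
    # XOR the i-th bias across all sources to get the i-th aggregated bias.
    aggregated_biases = []
    for bs in zip(*bias_sets):
        acc = 0
        for b in bs:
            acc ^= b
        aggregated_biases.append(acc)
    return aggregated_weights, aggregated_biases
```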
At 357, the computing platform 309 may configure the aggregated model using the aggregated configuration information. This configuration of the aggregated model may be performed the same as, or similar to, the manner in which the aggregated model 125 of
At 359, the computing platform 309 may determine, based on the aggregated model, aggregated prediction data. This determination may be performed the same as, or similar to the manner in which the aggregated model 125 of
At 361, the computing platform 309 may send, based on the aggregated prediction data, one or more messages. These messages may, for example, provide indications of the aggregated prediction data to various entities. These entities may be any of the four data sources or some other computing device that is not associated with any of the four data sources. As such, the example flow illustrates each of the four data sources being sent the one or more messages.
In view of the example flow of
At step 405, the one or more computing devices may receive, from each of a plurality of data sources, a data record, resulting in a plurality of data records associated with the plurality of data sources. Each data record may be the same as, or similar to, the data records discussed in connection with
At step 410, the one or more computing devices may perform data pre-processing on the plurality of data records. The data pre-processing may be the same as, or similar to, the data pre-processing discussed in connection with
At step 415, the one or more computing devices may determine, based on a randomized data aggregation process, aggregated data. The randomized data aggregation process may be the same as, or similar to, the randomized aggregation process discussed in connection with
At step 420, the one or more computing devices may perform, for each of the plurality of data sources, a selecting process on the aggregated data. Performance of these selecting processes may result in selected data for each of the plurality of data sources (e.g., first selected data for a first data source, second selected data for a second data source, and the like). The selecting processes may be the same as, or similar to, the selecting processes discussed in connection with
Additionally, as part of one or more of the selecting processes performed at step 420, the one or more computing devices may determine and store indications of overlapping and/or non-overlapping cells. This determination may be performed the same as, or similar to, the determination of overlapping and/or non-overlapping cells discussed in connection with
At step 425, the one or more computing devices may send the selected data for each of the plurality of data sources. In this way, each of the plurality of data sources may be sent its associated selected data. For example, the one or more computing devices may send first selected data to a first data source and may send second selected data to a second data source. This sending may be performed the same as, or similar to, the sending of selected data discussed in connection with
At step 430, the one or more computing devices may receive, from each of the plurality of data sources, model weights. This receiving may result in the one or more computing devices receiving a plurality of model weights associated with the plurality of data sources. Indeed, the plurality of model weights may include one or more model weights for each of the plurality of data sources (e.g., one or more first model weights for a first data source, one or more second model weights for a second data source, and the like). The receiving and the model weights may be the same as, or similar to, those discussed in connection with
At step 435, the one or more computing devices, may determine, based on the plurality of model weights and a model weight aggregation process, one or more aggregated model weights for an aggregated model. The model weight aggregation process and the aggregated model may be the same as, or similar to, the manner in which model weights are aggregated during the configuration information aggregation process discussed in connection with
At step 440, the one or more computing devices may configure the aggregated model using the one or more aggregated model weights. This configuration may be performed the same as, or similar to, the manner in which model weights are used to configure the aggregated models of
The computing device 501 may operate in a standalone environment or a networked environment. As shown in
As seen in
Devices 505, 507, 509 may have similar or different architecture as described with respect to computing device 501. Those of skill in the art will appreciate that the functionality of computing device 501 (or device 505, 507, 509) as described herein may be spread across multiple data processing devices, for example, to distribute processing load across multiple computers, to segregate transactions based on geographic location, user access level, quality of service (QoS), etc. For example, devices 501, 505, 507, 509, and others may operate in concert to provide parallel computing features in support of the operation of control logic 525.
One or more aspects discussed herein may be embodied in computer-usable or readable data and/or computer-executable instructions, such as in one or more program modules, executed by one or more computers or other devices as described herein. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device. The modules may be written in a source code programming language that is subsequently compiled for execution, or may be written in a scripting language such as (but not limited to) HTML or XML. The computer executable instructions may be stored on a computer readable medium such as a hard disk, optical disk, removable storage media, solid state memory, RAM, etc. As will be appreciated by one of skill in the art, the functionality of the program modules may be combined or distributed as desired in various embodiments. In addition, the functionality may be embodied in whole or in part in firmware or hardware equivalents such as integrated circuits, field programmable gate arrays (FPGA), and the like. Particular data structures may be used to more effectively implement one or more aspects discussed herein, and such data structures are contemplated within the scope of computer executable instructions and computer-usable data described herein. Various aspects discussed herein may be embodied as a method, a computing device, a data processing system, or a computer program product.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in any claim is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing any claim or any of the appended claims.