Unless otherwise indicated, the subject matter described in this section is not prior art to the claims of the present application and is not admitted as being prior art by inclusion in this section.
Federated learning is a machine learning (ML) paradigm that enables multiple parties to jointly train an ML model on training data that is distributed across the parties while keeping the data samples local to each party secret/private. For example, consider a scenario in which three organizations O1, O2, and O3 hold local datasets D1, D2, and D3 respectively and would like to train a global ML model M on the aggregation of D1, D2, and D3. Federated learning provides protocols that allow O1, O2, and O3 to achieve this without revealing the contents of D1, D2, and D3 to each other or to any other entity.
There are two types of federated learning that are distinguished by the structure of the local datasets held by the parties: sample-partitioned (also known as horizontal or homogeneous) federated learning and feature-partitioned (also known as vertical or heterogeneous) federated learning. With sample-partitioned federated learning, the parties' local datasets share a consistent feature set (i.e., data schema) and contain different data samples pertaining to that consistent data schema. For instance, in the example above with organizations O1, O2, and O3, datasets D1, D2, and D3 may share a data schema comprising features [SSN, name, age] and each dataset may include data samples for different individuals with appropriate values for these features (e.g., [111-11-1111, “Bob Smith”, 39], [222-22-2222, “Ann Jones”, 54], etc.).
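For concreteness, a minimal sketch of what sample-partitioned data might look like is shown below, using plain Python structures purely for illustration; nothing here is mandated by the present disclosure, and the third example individual is an assumption added for the sketch.

```python
# Illustrative only: three parties hold rows conforming to the same schema
# [SSN, name, age]; each party's rows stay on that party's premises.
schema = ["SSN", "name", "age"]

d1 = [("111-11-1111", "Bob Smith", 39)]   # local to organization O1
d2 = [("222-22-2222", "Ann Jones", 54)]   # local to organization O2
d3 = [("333-33-3333", "Raj Patel", 27)]   # local to organization O3 (assumed example)

# Conceptually, the global model is trained on the union of these rows, but
# no party ever shares its rows with the others or with a central server.
logical_training_set = d1 + d2 + d3        # never materialized in practice
print(len(logical_training_set))           # 3
```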
With feature-partitioned federated learning, the parties' local datasets include different features and thus have different data schemas, but also include at least one common feature that can be used to associate (i.e., join) the data samples across the local datasets and thereby tie those data samples together for training purposes. For instance, in the example above dataset D1 may include the data schema [SSN, name, age], dataset D2 may include the data schema [SSN, height, weight, eye color, hair color], and dataset D3 may include the data schema [SSN, credit score, household income]. In this case, during the training of global ML model M, the data samples in D1, D2, and D3 can be joined using common feature SSN in order to create composite data samples that include all of the features of D1, D2, and D3 for each unique SSN value.
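To make the notion of composite data samples concrete, the sketch below joins the three example schemas on the common SSN feature. It uses pandas purely for illustration and the specific feature values are assumptions; in actual feature-partitioned federated learning the association would be computed without pooling the raw rows in one place.

```python
import pandas as pd

# Illustrative local datasets matching the example schemas above.
d1 = pd.DataFrame({"SSN": ["111-11-1111"], "name": ["Bob Smith"], "age": [39]})
d2 = pd.DataFrame({"SSN": ["111-11-1111"], "height": [180], "weight": [82],
                   "eye_color": ["brown"], "hair_color": ["black"]})
d3 = pd.DataFrame({"SSN": ["111-11-1111"], "credit_score": [710],
                   "household_income": [95000]})

# Composite data sample per unique SSN value, carrying all features of D1-D3.
composite = d1.merge(d2, on="SSN").merge(d3, on="SSN")
print(list(composite.columns))
# ['SSN', 'name', 'age', 'height', 'weight', 'eye_color', 'hair_color',
#  'credit_score', 'household_income']
```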
A significant challenge with implementing federated learning in the real world is that, due to its decentralized nature and the need to maintain data privacy, it is difficult for potential parties to discover/understand what types of training data are available across an alliance of such entities and to manage the use/contribution of their respective local datasets for specific training tasks. This is particularly problematic for feature-partitioned federated learning because the data schemas across parties will differ and thus one party will not know (and cannot easily infer) the features that may be present in the local datasets of other parties.
In the following description, for purposes of explanation, numerous examples and details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to one skilled in the art that certain embodiments can be practiced without some of these details or can be practiced with modifications or equivalents thereof.
Embodiments of the present disclosure are directed to a collaborative data schema management system for federated learning, referred to herein as “federated data manager” (FDM). Among other things, FDM enables the members of a federated learning alliance (e.g., organizations, data scientists, etc.) to (1) propose data schemas for use by the alliance, (2) identify and bind local datasets to proposed schemas, (3) create, based on the proposed schemas, training datasets for addressing various ML tasks, and (4) control, for each training dataset, which of the local datasets bound to that training dataset (and thus, which alliance members) will actually participate in the training of a particular ML model. Significantly, FDM enables these features while ensuring that the contents (i.e., data samples) of the members' local datasets remain hidden from each other, thereby preserving the privacy of that data.
In other embodiments, alliance members 104(1)-(N) may be companies in different industries with local datasets that have different data schemas, but some degree of feature overlap. For example, alliance member 104(1) may be an e-commerce company with a local database storing the purchase histories of its customers and alliance member 104(2) may be a bank with a local database storing the financial records (e.g., account balances, etc.) of those same (and other) customers. In these embodiments, alliance members 104(1)-(N) may collaborate with each other to train ML models via feature-partitioned federated learning, such that each member contributes local training data that conforms to a different data schema but can be joined with the other members' data via at least one common feature (e.g., customer identifier (ID)).
As noted in the Background section, a significant challenge with implementing federated learning across a federation/alliance like alliance 102 is that, due to its decentralized nature and the need to maintain data privacy, it is difficult for alliance members 104(1)-(N) to discover what types of training data are available across the alliance and to manage the use/contribution of their respective local datasets for specific training tasks.
To address the foregoing and other similar issues, environment 100 implements a novel data schema management system for federated learning—shown as federated data manager (FDM) 116—that is composed of an FDM server 110 and FDM database 112 running on central server 106 and an FDM client 114 running on the local computer system(s) of each alliance member 104. Generally speaking, FDM 116 enables alliance members 104(1)-(N) to collaboratively propose data schemas (in the form of “data schema objects”) for use by federated learning alliance 102; identify and bind (in the form of “schema data bindings”) their local datasets to the proposed schemas; create/propose, based on the proposed schemas, training datasets (in the form of “training dataset objects”) for addressing various ML tasks; and control which local datasets/alliance members actually participate in specific training runs via the training dataset objects. This is all achieved while keeping the alliance members' local datasets at their respective local premises, thereby preserving data privacy. Accordingly, FDM 116 solves the data management challenges of federated learning (and in particular, feature-partitioned federated learning) in a structured and secure fashion.
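One way to visualize the four kinds of records that FDM 116 manages is sketched below. The field names and types are assumptions chosen for illustration, not the actual FDM data model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataSchemaObject:
    schema_id: str              # unique ID selected by FDM server 110
    features: List[str]         # e.g., ["SSN", "name", "age"]
    description: str = ""

@dataclass
class SchemaDataBinding:
    binding_id: str
    schema_id: str              # the data schema object being bound
    dataset_ref: str            # connection endpoint only, never the data samples
    owner_member: str           # alliance member that holds the local dataset
    participate_in_dataset: bool = False   # corresponds to the participateInDataset flag

@dataclass
class TrainingDatasetObject:
    dataset_id: str
    schema_ids: List[str]       # one schema (sample-partitioned) or several (feature-partitioned)
    join_feature: str = ""      # common feature used to join schemas, if any

@dataclass
class TrainingConfigObject:
    config_id: str
    dataset_id: str                    # the training dataset object to train against
    selected_binding_ids: List[str]    # schema data bindings chosen for this training run
```

Because a schema data binding carries only a reference to a local dataset, the data samples themselves never leave the owning member's premises.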
Starting with steps 202 and 204, one or more alliance members 104(1)-(N) can propose data schemas for use within federated learning alliance 102 and FDM server 110 can create and store data schema objects corresponding to the proposed data schemas in FDM database 112. As mentioned previously, a data schema is a set of features (also known as attributes or columns) that represents the types of data that are part of each data sample in a dataset. For instance, an example data schema S1 may include the features [SSN, name, age] where SSN and name are strings and age is an integer.
At step 206, one or more alliance members 104(1)-(N) can query the data schema objects that have been proposed/created and can identify local datasets that match (i.e., include the same or similar features as) those data schema objects. The alliance members can then instruct FDM server 110 to create associations (i.e., schema data bindings) between the matching (local dataset, data schema object) pairs (step 208). In this way, the alliance members can preliminarily “contribute” their local datasets to the proposed data schemas for federated learning purposes. In response, FDM server 110 can create and store the schema data bindings in FDM database 112 (step 210). Note that one data schema object can be bound to multiple local datasets (either from the same or different alliance members), which means that the data schema object can have multiple potential data sources.
For example, an alliance member 104(1) may query data schema S1 noted above, determine that S1 has the same features as a local dataset D1, and instruct FDM server 110 to create a first schema data binding B1 for S1 that binds it to D1. Similarly, an alliance member 104(2) may query data schema S1, determine that S1 has the same features as a local dataset D2, and instruct FDM server 110 to create a second schema data binding B2 for S1 that binds it to D2. Significantly, each schema data binding created and stored at step 210 can include a reference to its corresponding local dataset (e.g., connection endpoint information), but not the actual content (i.e., data samples) of that local dataset. This ensures that those data samples remain on the premises of each member 104 and thus are not revealed to central server 106 or the other alliance members.
At step 212, a data scientist associated with federated learning alliance 102 can query the data schema objects in FDM database 112 and propose a training dataset for solving an ML task that includes some subset of the data schema objects. As used herein, this “data scientist” is any individual or entity that can identify ML tasks and propose training datasets for addressing the identified tasks. For example, in one set of embodiments the data scientist may be a person (or group of people) affiliated with federated learning alliance 102, such as employees of one or more alliance members. In another set of embodiments, this data scientist may comprise one or more automated programs/agents.
If the ML task identified by the data scientist involves training an ML model via sample-partitioned federated learning, this proposed training dataset can include a reference to a single data schema object which comprises the feature set to be included in the training dataset. Alternatively, if the ML task involves training of an ML model via feature-partitioned federated learning, this proposed training dataset can include references to multiple data schema objects, as well as a “join” feature/column that is common to all of the data schema objects and is intended to join those data schemas together. At step 214, FDM server 110 can create and store a training dataset object corresponding to the proposed training dataset in FDM database 112.
Each alliance member 104 can thereafter query the training dataset object (step 216) and check whether it has a local dataset bound to a data schema object in the training dataset object via a previously-created schema data binding (step 218). If the answer is yes, the alliance member can choose to participate (or not participate) in the training of an ML model using that training dataset object with its local dataset (step 220). If the alliance member does choose to participate, FDM server 110 can update the corresponding schema data binding in FDM database 112 with an appropriate flag/indicator (e.g., a participateInDataset flag) (not shown).
Once the alliance members have chosen their participation preferences with respect to the training dataset object, the data scientist that originally proposed the training dataset can select, from among the schema data bindings with the participateInDataset flag set to true, a subset of those schema data bindings to actually use (and thus participate) in the training of a particular ML model M via federated learning (step 222). For example, if the training dataset object includes a reference to a data schema object S1 and S1 has two different schema data bindings B1 and B2 with participateInDataset=true (indicating that the alliance members owning B1 and B2 have chosen to participate with these bindings), the data scientist can select B1 alone, B2 alone, or both B1 and B2 for use in training ML model M. In response, FDM server 110 can create and store a training configuration object in FDM database 112 that includes references to the training dataset object and the schema data bindings selected at step 222 (step 224).
Finally, the data scientist or some other entity can initiate training of the specified ML model M using the training configuration object (step 226) and workflow 200 can end.
The remaining sections of this disclosure provide additional details for implementing certain portions of high-level workflow 200 according to various embodiments, such as application programming interface (API) invocations and other actions that may be performed by FDM clients 114(1)-(N) and FDM server 110 for creating data schema objects, schema data bindings, training dataset objects, and training configuration objects. It should be appreciated that workflow 200 is provided for illustration purposes only and that various modifications to this workflow are possible.
Starting with step 302, the FDM client can invoke an API exposed by the FDM server for proposing a new data schema (e.g., ProposeSchema( )) for use within federated learning alliance 102. In one set of embodiments, this ProposeSchema API can take as input parameters specifying the metadata of the proposed data schema (e.g., its set of features and their respective data types).
At step 304, the FDM server can receive the API invocation and can select a unique ID for the proposed data schema. FDM server can then create and store a new data schema object in FDM database 112 with the selected ID and the schema metadata provided with the API invocation (step 306).
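A hedged illustration of steps 302-306 follows; the stub class, method signature, and stored fields are assumptions made for the sketch rather than an actual FDM API definition.

```python
import uuid

# Minimal in-memory stand-in for FDM server 110 and FDM database 112.
class FDMServerStub:
    def __init__(self):
        self.schemas = {}   # plays the role of FDM database 112

    def propose_schema(self, features, description=""):
        schema_id = f"schema-{uuid.uuid4().hex[:8]}"     # step 304: select a unique ID
        self.schemas[schema_id] = {                      # step 306: persist the object
            "features": features,
            "description": description,
        }
        return schema_id

server = FDMServerStub()
# Step 302: an alliance member's FDM client proposes data schema S1.
s1_id = server.propose_schema(
    features=[{"name": "SSN", "type": "string"},
              {"name": "name", "type": "string"},
              {"name": "age", "type": "integer"}],
    description="Basic person identity features")
print(s1_id)   # e.g., "schema-3fa2c1d0"
```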
Starting with step 402, the FDM client can identify a local dataset of the alliance member whose features generally match the features of data schema object S (e.g., include the same or similar/compatible feature names/descriptions and data types). For example, the identified local dataset can include a feature set that is identical to, or is a superset of, the feature set of S.
At step 404, the FDM client can invoke an API exposed by the FDM server for binding the identified local dataset to data schema object S (e.g., BindDataToSchema( )). In one set of embodiments, this BindDataToSchema API can take as input parameters identifying data schema object S and providing a reference to the identified local dataset (e.g., connection endpoint information) rather than the dataset's actual data samples.
At step 406, the FDM server can receive the API invocation. FDM server can then create and store a new schema data binding in FDM database 112 with the metadata provided with the API invocation (step 408). In certain embodiments, the created schema data binding can also include other fields, such as the participateInDataset flag mentioned previously.
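A plausible shape for such an invocation is sketched below; the field names and the endpoint URL are illustrative assumptions only. Consistent with the workflow above, the request carries a reference to the local dataset rather than its data samples.

```python
# Hypothetical BindDataToSchema request payload (illustrative field names).
bind_request = {
    "schemaId": "schema-3fa2c1d0",    # data schema object S being bound
    "datasetRef": "postgresql://member1.internal:5432/datasets/d1",  # endpoint only
    "ownerMember": "alliance-member-104-1",
    # No data samples are included; they remain on the member's premises.
}
print(bind_request["schemaId"])
```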
Starting with step 502, the FDM client can invoke an API exposed by the FDM server for proposing a new training dataset that includes the data schemas embodied by [S1, . . . , Sm] (e.g., ProposeDataset( )) for use within federated learning alliance 102. In one set of embodiments, this ProposeDataset API can take as input parameters identifying data schema objects [S1, . . . , Sm] and, in the case of feature-partitioned federated learning, the common join feature used to associate those data schemas.
At step 504, the FDM server can receive the API invocation. FDM server can then create and store a new training dataset object T in FDM database 112 with the metadata provided with the API invocation (step 506).
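As before, a hedged sketch of what the proposal might carry is shown below (field names assumed for illustration). For a feature-partitioned task the request lists several data schema objects plus the common join feature, whereas a sample-partitioned task would list a single schema.

```python
# Hypothetical ProposeDataset request for a feature-partitioned ML task.
propose_dataset_request = {
    "schemaIds": ["S1", "S2", "S3"],   # data schema objects to combine
    "joinFeature": "SSN",              # common feature used to join the schemas
    "description": "Composite dataset for an example credit-risk model",
}
# The FDM server would respond with the unique ID of training dataset object T.
print(propose_dataset_request["schemaIds"])
```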
At step 508, the FDM client can query, from the FDM server, the schema data bindings associated with each data schema object [S1, . . . , Sm] in training dataset object T. The FDM client can further check the participateInDataset flag of each queried schema data binding to determine whether the alliance member that owns the schema data binding has agreed to participate in federated learning using T (step 510).
Upon checking the participateInDataset flags, the FDM client can invoke an API exposed by the FDM server for creating a new training configuration that includes a selected subset of the schema data bindings with participateInDataset=true (e.g., CreateTrainingConfig( )) (step 512). This selected subset comprises schema data bindings that the owner of the FDM client (e.g., a data scientist) has determined should be used in the training of a particular ML model M. In one set of embodiments, the CreateTrainingConfig API can take as input parameters identifying training dataset object T and the selected subset of schema data bindings.
At step 514, the FDM server can receive the API invocation and can select a unique ID for the training configuration. Finally, FDM server can create and store a new training configuration object in FDM database 112 with the selected ID and the metadata provided with the API invocation (step 516). Although not shown, once the training configuration object is created, it can be used to initiate training of a specified ML model. As part of this training process, the schema data bindings that are referenced within the training configuration object can be used to automatically identify the alliance members and corresponding local datasets that will participate in the training.
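The selection logic of steps 508-516 can be pictured with the short sketch below, using plain Python data; the binding fields and IDs follow the examples earlier in this disclosure and are otherwise assumptions.

```python
# Schema data bindings queried for the data schema objects in training dataset T.
bindings = [
    {"id": "B1", "schemaId": "S1", "owner": "member-1", "participateInDataset": True},
    {"id": "B2", "schemaId": "S1", "owner": "member-2", "participateInDataset": True},
    {"id": "B3", "schemaId": "S2", "owner": "member-3", "participateInDataset": False},
]

# Step 510: keep only bindings whose owners have agreed to participate in T.
willing = [b for b in bindings if b["participateInDataset"]]

# Step 512: the data scientist picks the subset to actually use for ML model M.
selected_ids = [b["id"] for b in willing if b["id"] in {"B1", "B2"}]

# Steps 514-516: the FDM server assigns a unique ID and persists a training
# configuration object referencing T and the selected schema data bindings.
training_config = {"id": "config-0001", "datasetId": "T", "bindingIds": selected_ids}
print(training_config)   # {'id': 'config-0001', 'datasetId': 'T', 'bindingIds': ['B1', 'B2']}
```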
Certain embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. For example, these operations can require physical manipulation of physical quantities—usually, though not necessarily, these quantities take the form of electrical or magnetic signals, where they (or representations of them) are capable of being stored, transferred, combined, compared, or otherwise manipulated. Such manipulations are often referred to in terms such as producing, identifying, determining, comparing, etc. Any operations described herein that form part of one or more embodiments can be useful machine operations.
Further, one or more embodiments can relate to a device or an apparatus for performing the foregoing operations. The apparatus can be specially constructed for specific required purposes, or it can be a generic computer system comprising one or more general purpose processors (e.g., Intel or AMD x86 processors) selectively activated or configured by program code stored in the computer system. In particular, various generic computer systems may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations. The various embodiments described herein can be practiced with other computer system configurations including handheld devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
Yet further, one or more embodiments can be implemented as one or more computer programs or as one or more computer program modules embodied in one or more non-transitory computer readable storage media. The term non-transitory computer readable storage medium refers to any storage device, based on any existing or subsequently developed technology, that can store data and/or computer programs in a non-transitory state for access by a computer system. Examples of non-transitory computer readable media include a hard drive, network attached storage (NAS), read-only memory, random-access memory, flash-based nonvolatile memory (e.g., a flash memory card or a solid state disk), persistent memory, NVMe device, a CD (Compact Disc) (e.g., CD-ROM, CD-R, CD-RW, etc.), a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The non-transitory computer readable media can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.
Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations can be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component can be implemented as separate components.
As used in the description herein and throughout the claims that follow, “a,” “an,” and “the” includes plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The above description illustrates various embodiments along with examples of how aspects of particular embodiments may be implemented. These examples and embodiments should not be deemed to be the only embodiments and are presented to illustrate the flexibility and advantages of particular embodiments as defined by the following claims. Other arrangements, embodiments, implementations, and equivalents can be employed without departing from the scope hereof as defined by the claims.
The present application is a continuation of International Application No. PCT/CN2021/137855 filed Dec. 14, 2021, entitled “Collaborative Data Schema Management for Federated Learning,” the entire contents of which are incorporated herein by reference for all purposes.