This application claims the benefit of European Application No. 20306478.7, filed Dec. 1, 2020, the entire content of which is hereby incorporated by reference.
This invention relates generally to machine learning and more particularly to administering a federated machine learning network.
Machine Learning (ML) is a promising field with many applications; organizations of all sizes practice ML, from individual researchers to the largest companies in the world. In doing so, ML processes consume extremely large amounts of data: ML models require large amounts of data to learn efficiently from examples, and more data often leads to better predictive performance, which measures the quality of an ML model. Usually, different sources, such as users, patients, measuring devices, etc., produce data in a decentralized way. This distribution of sources makes it difficult for a single source to have enough data to train accurate models. Currently, the standard methodology for ML is to gather data in a central database. However, this centralization raises important ethical questions which ultimately could limit the potential social benefits of ML.
However, data used for training models can be sensitive. In the case of personal data, which are explicitly related to an individual, the privacy of individuals is at stake. Personal data is particularly useful and valuable in the modern economy: with personal data it is possible to personalize services, which has brought much added value to certain applications. This can involve significant risks if the data are not used in the interest of the individual. Not only should personal data be secured from potential attackers, but their use by the organization collecting them should also be transparent and aligned with user expectations. Beyond privacy, data can also be sensitive when it has economic value. Such information is often confidential, and data owners want to control who accesses it; examples range from classified information and industrial secrets to strategic data that can give an edge in a competitive market. From a tooling perspective, preserving privacy and preserving confidentiality are similar problems; they differ mostly in that the latter is not covered by regulation.
Thus, there is a tradeoff between predictive performance improvement on the one hand and data privacy and confidentiality on the other. ML always needs more data, but data tends to be increasingly protected. The centralization paradigm, in which a single actor gathers all data on its own infrastructure, is reaching its limits.
A relevant way to resolve this tradeoff lies in distributed computing and remote execution of ML tasks. In this approach, the data themselves never leave their nodes. In ML, this includes federated learning: each dataset is stored on a node in a network, and only the algorithms and predictive models are exchanged between nodes. This immediately raises the question of potential information leaks in these exchanged quantities, including the trained model. Research on ML security and privacy has increased significantly in recent years, covering topics from model inversion and membership attacks to model extraction. A residual risk is that data controllers still have to trust a central service that orchestrates the federated learning and distributes models and metadata across the network.
A method and apparatus of a device that trains a model is described. In an exemplary embodiment, the device creates a loop network between a central aggregating node and a set of one or more worker nodes, where the loop network communicatively couples the central aggregating node and the set of one or more worker nodes. The device further receives and broadcasts a model training request from one of the nodes in the loop network to one or more other nodes in the loop network.
In a further embodiment, a device that evaluates a model is described. In one embodiment, the device creates a loop network between a central aggregating node and a set of one or more worker nodes, where the loop network communicatively couples the central aggregating node and the set of one or more worker nodes. In addition, the device receives and broadcasts a model evaluation request for the model from the central aggregating node to one or more worker nodes.
Other methods and apparatuses are also described.
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements.
A method and apparatus of a device that creates a loop network for training a model is described. In the following description, numerous specific details are set forth to provide thorough explanation of embodiments of the present invention. It will be apparent, however, to one skilled in the art, that embodiments of the present invention may be practiced without these specific details. In other instances, well-known components, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification do not necessarily all refer to the same embodiment.
In the following description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. “Coupled” is used to indicate that two or more elements, which may or may not be in direct physical or electrical contact with each other, co-operate or interact with each other. “Connected” is used to indicate the establishment of communication between two or more elements that are coupled with each other.
The processes depicted in the figures that follow are performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), software (such as is run on a general-purpose computer system or a dedicated machine), or a combination of both. Although the processes are described below in terms of some sequential operations, it should be appreciated that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
The terms “server,” “client,” and “device” are intended to refer generally to data processing systems rather than specifically to a particular form factor for the server, client, and/or device.
A method and apparatus of a device that creates a loop network for training a model is described. In one embodiment, the device acts as a master node that couples to a set of central aggregators and a set of worker nodes over a master node network. In one embodiment, the master node allows the set of central aggregators and the set of worker nodes to communicate with the master node for the purposes of orchestrating loop networks, but the worker nodes are not visible to the central aggregators via the master node network. For example and in one embodiment, the Internet Protocol (IP) addresses of the worker nodes are kept private from the central aggregators, so that the central aggregators cannot contact the worker nodes via the master node network. In one embodiment, the central aggregator manages the training of an untrained machine learning model using one or more worker nodes. Each worker node includes a training data set and can use an algorithm and training plan furnished by the central aggregator.
The issue is connecting the worker nodes with the central aggregator. Because the training data can be quite valuable, each worker node will wish to maintain the privacy of this data. Thus, the worker nodes do not want to be needlessly exposed on a network, which poses a problem for a central aggregator that wants to make use of a worker node. The device, or master node, can match worker nodes with a central aggregator by receiving requests from the central aggregators to train a model and matching these requests with the availability of various worker nodes. In one embodiment, the master node can post the central aggregator request, where each interested worker node can request to be part of training for the central aggregator. With the requests from the worker nodes, the master node creates a loop network that includes the central aggregator and the relevant worker nodes, so that the central aggregator can start and manage the training of the machine learning model. In one embodiment, the central aggregator can send the algorithm and the model (along with other data) to each of the worker nodes, so that the worker nodes do not expose their training data while training the machine learning model.
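For illustration only, the posting and volunteering exchange described above might be sketched as follows. The class and method names (MasterNode, post_request, volunteer, create_loop_network) are assumptions of this sketch and are not drawn from the embodiments themselves.

```python
# Minimal sketch of the posting/volunteering flow orchestrated by a master node.
# All names here are hypothetical; the embodiments do not prescribe an API.
import uuid


class MasterNode:
    def __init__(self):
        self.posted_requests = {}   # request_id -> request details
        self.volunteers = {}        # request_id -> list of worker ids
        self.loop_networks = {}     # loop_id -> (aggregator_id, [worker ids])

    def post_request(self, aggregator_id, requirements):
        """A central aggregator posts a model training request."""
        request_id = str(uuid.uuid4())
        self.posted_requests[request_id] = {
            "aggregator": aggregator_id,
            "requirements": requirements,
        }
        self.volunteers[request_id] = []
        return request_id

    def volunteer(self, request_id, worker_id):
        """An interested worker node asks to take part in the posted work."""
        self.volunteers[request_id].append(worker_id)

    def create_loop_network(self, request_id):
        """Couple the requesting aggregator with the volunteering workers."""
        request = self.posted_requests[request_id]
        loop_id = str(uuid.uuid4())
        self.loop_networks[loop_id] = (request["aggregator"],
                                       list(self.volunteers[request_id]))
        return loop_id


# Example: one aggregator request matched with two volunteering workers.
master = MasterNode()
req = master.post_request("aggregator-A", {"model_type": "classifier"})
master.volunteer(req, "worker-1")
master.volunteer(req, "worker-2")
loop = master.create_loop_network(req)
print(master.loop_networks[loop])
```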
In addition to creating the loop networks, the master node can monitor the loop network, update the software on the central aggregator and the worker nodes, and can communicate information from one node to another node.
In one embodiment, each of the workers 102A-N receives the untrained model 112 and performs an algorithm to train the model 112. For example and in one embodiment, a worker 102A-N includes training data 104A-N and a training process 106A-N that is used to train the model. Training a machine learning model can be done in a variety of ways. In one embodiment, an untrained machine learning model includes initial weights, which are used to predict a set of output data. Using this output data, an optimization step is performed and the weights are updated. This process happens iteratively until a predefined stopping criterion is met. With a trained model computed by the worker 102A-N, the worker 102A-N sends the trained model 114 back to the central aggregator 108. The central aggregator 108 receives the different trained models from the different workers 102A-N and aggregates the different trained models into a single trained model. While in one embodiment the central aggregator 108 outputs this model as the final trained model 114 that can be used for predictive calculations, in another embodiment, depending on the quality of the resulting model as well as other predefined criteria, the central aggregator 108 can send this model back to the different workers 102A-N to repeat the above steps.
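The train-and-aggregate cycle described above might be illustrated with a minimal federated averaging sketch, assuming a simple linear model trained by gradient descent and plain weight averaging as the aggregation rule; the embodiments do not limit the model type or the aggregation method.

```python
# Illustrative federated round: each worker updates the model on its own data,
# and the central aggregator averages the resulting weights (FedAvg-style).
# The gradient step and averaging rule are assumptions for this sketch only.
import numpy as np


def local_training(weights, features, labels, lr=0.1, steps=10):
    """One worker's training: a few gradient steps of linear regression."""
    w = weights.copy()
    for _ in range(steps):
        preds = features @ w
        grad = features.T @ (preds - labels) / len(labels)
        w -= lr * grad
    return w


def aggregate(trained_parts):
    """Central aggregator combines the trained model parts by averaging."""
    return np.mean(trained_parts, axis=0)


rng = np.random.default_rng(0)
untrained = np.zeros(3)                      # initial weights of the untrained model
true_w = np.array([1.0, -2.0, 0.5])

# Each worker holds its own private training data; only weights are exchanged.
worker_data = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    worker_data.append((X, y))

model = untrained
for round_ in range(5):                      # repeat until a stopping criterion is met
    parts = [local_training(model, X, y) for X, y in worker_data]
    model = aggregate(parts)

print(model)  # approaches true_w without any worker revealing its data
```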
In one embodiment, the system 100 works because the central aggregator 108 knows about and controls each of the worker nodes 102A-N, as the central aggregator 108 and the workers 102A-N are part of the same organization. For example and in one embodiment, the central aggregator 108 can be part of a company that produces an operating system for mobile devices and each of the worker nodes 102A-N is one of those mobile devices. Thus, the central aggregator 108 knows what training data each of the workers 102A-N has (or at least the type of training data that each worker 102A-N has). However, this type of system does not preserve the privacy of the data stored on the worker nodes 102A-N.
A different type of model training scenario can be envisioned in which the central aggregator does not know the type of training data each worker node has, or possibly even of the existence of a worker node. In one embodiment, an entity with a worker node may not want to expose the worker node (and its training data) to the central aggregator, or to another device in general. However, that worker node is available to train a model. What is needed is a coordinating device or service that matches requested model training work from a central aggregator with available worker nodes while preserving data privacy for each worker node. As noted above, the training data can be quite valuable in an economic or privacy sense. In one embodiment, a federated learning network is designed that includes a master node that is used to administer a loop network of a set of one or more worker nodes and a central aggregator. In this embodiment, the loop network is a network formed by the set of one or more worker nodes and a central aggregator for the purpose of training a model requested by the central aggregator. The master node administers this network by determining who can participate in the network (e.g., checking prerequisites and adding or removing partners for this network). In addition, the master node monitors the network traffic and operations of the loop network, maintains and updates the software of the worker nodes and the central aggregator, and/or communicates information to the worker nodes and the central aggregator.
In one embodiment, each of the worker nodes 202A-N includes training data 204A-N, a training process 206A-N, and a loop process 208A-N. In this embodiment, each of the training data 204A-N is a separate set of data that can be used to train a model (e.g., such as the training data 104A-N as illustrated in
In one embodiment, each of the central aggregators 210A-B includes untrained models 212A-B, which are models that are waiting to be trained using the training data of the worker nodes 202A-N. Once these models are trained, the central aggregators 210A-B store the trained models 214A-B, which can be used for predictive calculations. In addition, each of the central aggregators 210A-B includes a master node loop process 218A-B, which is a process that communicates with the master node 220, where the master node loop process 218A-B configures the corresponding central aggregator 210A-B using configuration information supplied by the master node 220. In addition, the master node loop process 218A-B responds to queries for information from the master node 220. While in one embodiment two central aggregators and one master node are illustrated, in alternate embodiments there can be a greater or lesser number of central aggregators and/or master nodes.
The master node 220, in one embodiment, administers the creation of one or more loop networks 224A-B, where each of the loop networks is used to communicatively couple one of the central aggregators 210A-B with a set of one or more worker nodes 202A-N. For example and in one embodiment, in
For example and in one embodiment, central aggregator 210A sends a request for model training work to the master node 220, where the master node 220 posts the request. Worker nodes 202A-B respond to the posted request by indicating that these nodes are willing to perform the requested work. In response, the master node 220 creates loop network 224A that communicatively couples worker nodes 202A-B with central aggregator 210A. With the loop network created, the central aggregator 210A can start the model training process as described in
As another example and embodiment, central aggregator 210B sends a request for model training work to the master node 220, where the master node 220 posts this request. Worker node 202N responds to the posted request by indicating that this node is willing to perform the requested work. In response, the master node 220 creates loop network 224B that communicatively couples worker node 202N with central aggregator 210B. With the loop network 224B created, the central aggregator 210B can start the model training process as described in
In one embodiment, a worker node 202A-202N or central aggregator node 210A-B receives and broadcasts a model training request to other nodes on that loop network 224A-B. In this embodiment, one of the nodes of the loop network receives a model training request (e.g., from the central aggregating node of the loop network or from a user node associated with the loop network). This node then broadcasts this training request to one, some, or all of the other nodes in the loop network. For example and in one embodiment, worker node 202B receives a model training request for a model from an external node (e.g., a user node), where this worker node 202B is part of the loop network 224A. The worker node 202B broadcasts this request to other nodes in the loop network 224A (e.g., worker node 202A and/or central aggregator 210A).
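As a minimal sketch of the broadcast behavior described above, a node in a loop network might forward a received training request to its peers as follows; the node registry and method names are hypothetical and are used only for illustration.

```python
# Sketch of a loop-network node broadcasting a received model training request
# to the other nodes of its loop network. Names are illustrative only.
class LoopNode:
    def __init__(self, name):
        self.name = name
        self.peers = []          # other nodes in the same loop network
        self.inbox = []

    def join(self, other):
        """Communicatively couple this node with another node of the loop."""
        self.peers.append(other)
        other.peers.append(self)

    def receive_request(self, request, sender=None):
        """Accept a training request and broadcast it to the other loop nodes."""
        self.inbox.append(request)
        for peer in self.peers:
            if peer is not sender:
                peer.inbox.append(request)     # forward to peers


# Example: worker_b receives a request from outside the loop and broadcasts it.
worker_a = LoopNode("worker 202A")
worker_b = LoopNode("worker 202B")
aggregator = LoopNode("central aggregator 210A")
worker_b.join(worker_a)
worker_b.join(aggregator)

worker_b.receive_request({"model": "classifier", "action": "train"})
print([node.inbox for node in (worker_a, worker_b, aggregator)])
```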
In a further embodiment, the master node 220 can monitor the loop network and the nodes of this network (e.g., monitoring the central aggregator and the worker nodes of this loop network). Monitoring the network and nodes is further described in
For example and in one embodiment, each of the worker nodes can be associated with a hospital that gathers a set of data from patients, tests, trials, and/or other sources. This set of data can be used to train one or more models for use by pharmaceutical companies. This data, however, can be sensitive from a regulatory and/or economic perspective, and the hospital would want a mechanism to keep this data private. The central aggregator can be associated with a pharmaceutical company, which would want to use one or more worker nodes to train a model. In this example, using a loop network created by the master node allows a pharmaceutical company to train a model while keeping the data of the worker nodes private.
As described above, the master node administers the creation and maintenance of loop networks.
At block 304, process 300 monitors the existing loop networks. In one embodiment, process 300 monitors the loop network and the nodes of this network (e.g., by monitoring the central aggregator and the worker nodes of this loop network). Monitoring the network and nodes is further described in
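One possible, purely illustrative way for the master node to monitor its existing loop networks is to periodically gather a status report from each node, assuming a simple status-reporting interface that the embodiments do not themselves prescribe.

```python
# Hedged sketch of loop-network monitoring: the master node gathers a status
# report from each node of each loop network it administers.
# The report fields and node interface are assumptions for illustration.
import time


class MonitoredNode:
    def __init__(self, name):
        self.name = name

    def status(self):
        """Report a minimal health snapshot for this node."""
        return {"node": self.name, "alive": True, "timestamp": time.time()}


def monitor_loop_networks(loop_networks):
    """Collect a status snapshot for every node of every loop network."""
    snapshot = {}
    for loop_id, nodes in loop_networks.items():
        snapshot[loop_id] = [node.status() for node in nodes]
    return snapshot


loops = {
    "loop-224A": [MonitoredNode("central aggregator 210A"),
                  MonitoredNode("worker 202A"), MonitoredNode("worker 202B")],
    "loop-224B": [MonitoredNode("central aggregator 210B"),
                  MonitoredNode("worker 202N")],
}
print(monitor_loop_networks(loops))
```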
As described above, the master node can create one or more loop networks by matching worker nodes with requesting central aggregators.
Process 400 receives requests from worker nodes for the model training work from one or more central aggregator requests at block 406. In one embodiment, the model training can be a supervised machine learning training process that uses the training data from each of the worker nodes to train the model. In another embodiment, the model training can be a different type of model training. Process 400 matches the worker nodes to the central aggregator at block 408. In one embodiment, process 400 selects matching worker nodes by matching worker node characteristics with the requirements of the central aggregator request, including, but not limited to, model type, data requirements, and training requirements. Thus, each central aggregator will have a set of one or more worker nodes to use for training the model.
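The matching at block 408 might be sketched as a simple filter over advertised worker characteristics; the particular characteristics (model type, data type, number of samples) and their dictionary-based representation are assumptions of this illustration.

```python
# Sketch of matching worker nodes to a central aggregator request by comparing
# the request's requirements against each worker's advertised characteristics.
# Field names (model_type, data_type, min_samples) are illustrative assumptions.
def matches(request, worker):
    """Return True if a worker's characteristics satisfy the request."""
    return (worker["model_type"] == request["model_type"]
            and worker["data_type"] == request["data_type"]
            and worker["num_samples"] >= request["min_samples"])


def match_workers(request, candidate_workers):
    """Return the worker nodes whose characteristics satisfy the request."""
    return [w["id"] for w in candidate_workers if matches(request, w)]


request = {"model_type": "classifier", "data_type": "imaging", "min_samples": 1000}
workers = [
    {"id": "worker-202A", "model_type": "classifier", "data_type": "imaging", "num_samples": 5000},
    {"id": "worker-202B", "model_type": "classifier", "data_type": "imaging", "num_samples": 1200},
    {"id": "worker-202N", "model_type": "regressor", "data_type": "tabular", "num_samples": 800},
]
print(match_workers(request, workers))   # ['worker-202A', 'worker-202B']
```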
At block 410, process 400 sets up a loop network for each central aggregator and a corresponding set of worker nodes. In one embodiment, process 400 sends configuration commands to the central aggregator to configure the central aggregator to use the corresponding set of one or more worker nodes at its disposal for training of the model. In one embodiment, process 400 sends information that can include connection information and algorithm information. In this embodiment, the connection information can include one or more Internet Protocol (IP) addresses. Alternatively, the connection information can further include one or more pseudonym IP addresses, where a routing mechanism routes network traffic through the master node such that the real IP addresses are obfuscated, and the master node can then match the pseudonym IP addresses to the real IP addresses. In a further alternative, virtual private network (VPN)-like methods can also be used to secure the connections. In one embodiment, the algorithm information explains which algorithm should be run with which dataset and in which order (e.g., the compute plan). In addition, process 400 sends configuration command(s) to each of the one or more worker nodes for this loop network. In one embodiment, process 400 can configure each of the worker nodes with the same or similar information used to configure the central aggregator. For example and in one embodiment, process 400 can send connection and algorithm information to each of the worker nodes. In this example, the same compute plan can be shared with the central aggregator and the worker nodes. With the central aggregator and the worker nodes configured, the loop network is created and the central aggregator can begin the process of using the worker nodes to train the model.
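The configuration sent at block 410 might, for example, resemble the following sketch, in which the master node hands each party pseudonym addresses that only it can resolve, together with a shared compute plan; the payload layout and field names are assumptions of this illustration, not the embodiments' own format.

```python
# Sketch of the configuration a master node might send when setting up a loop
# network: pseudonym addresses that only the master node can resolve to real
# IP addresses, plus a compute plan shared by the aggregator and the workers.
# The structure and field names are assumptions of this illustration.
import uuid

real_addresses = {
    "aggregator-210A": "203.0.113.10",
    "worker-202A": "198.51.100.21",
    "worker-202B": "198.51.100.22",
}

# The master node keeps the pseudonym -> real IP mapping private, so traffic
# routed through it can be delivered without exposing worker addresses.
pseudonyms = {name: f"pseudo-{uuid.uuid4().hex[:8]}" for name in real_addresses}
pseudonym_to_ip = {p: real_addresses[n] for n, p in pseudonyms.items()}

compute_plan = [
    {"step": 1, "algorithm": "local_training", "dataset": "local",
     "nodes": ["worker-202A", "worker-202B"]},
    {"step": 2, "algorithm": "aggregate", "dataset": None,
     "nodes": ["aggregator-210A"]},
]

configuration = {
    "loop_network": "224A",
    "connections": dict(pseudonyms),   # pseudonym addresses only
    "compute_plan": compute_plan,      # same plan shared with every party
}

print(configuration)
print(pseudonym_to_ip)                 # retained only by the master node
```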
With the loop network created and the central aggregator managing the model training using the worker nodes, the master node can monitor the existing loop networks.
The master node can further be used to maintain the software used for loop networks on the worker nodes and/or the central aggregator.
In one embodiment, users of a federated learning environment do not have access to information beyond the machine learning results (e.g., the result of the trained model), because users of the federated learning network are shielded from the network used for the federated learning. In this embodiment, the master node can provide a flow of information from the master node to other nodes in order to display this exported information. For example and in one embodiment, this exported information could cover opportunities for a worker node to connect to other networks, information for maintenance, propositions for services, and/or other types of scenarios. In addition, and in one embodiment, the master node can communicate information to/from a device external to the master node network to a node within the master node network, where the information is channeled through the master node. In another embodiment, the master node can communicate information from a loop node to another loop node in another loop network.
In one embodiment, with a loop network setup, the loop network can train a model using the worker nodes of the loop network.
At block 804, process 800 receives the trained model part from each of the worker nodes in the loop network. In one embodiment, each worker node sends the trained model part back to the central aggregator. In this embodiment, the training data of each worker node is not revealed to the central aggregator, as this training data remains private to the corresponding worker node. In another embodiment, if the object includes a request to evaluate a model, process 800 receives the evaluation parts from each of the worker nodes used for the evaluation process. Process 800 assembles the trained model at block 806. The trained model is forwarded to the original requestor of the trained model. In one embodiment, process 800 assembles (or aggregates) the trained model parts from each of the worker nodes in the set of one or more worker nodes in the central aggregator node. In this embodiment, the aggregation can be a secure aggregation, where the secure aggregation blocks access by the central aggregator node to the individual updated model parts. Alternatively, process 800 can assemble the received evaluation parts from the worker nodes used for the model evaluation process.
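One well-known way to realize such a secure aggregation is pairwise masking, in which each pair of workers shares a random mask that cancels in the sum, so the central aggregator learns only the aggregate. The sketch below assumes this masking scheme for illustration; the embodiments do not prescribe a particular secure aggregation protocol.

```python
# Illustrative secure aggregation via pairwise masking: each ordered pair of
# workers shares a random mask that cancels in the sum, so the central
# aggregator sees only masked updates and recovers just their aggregate.
# The masking scheme is an assumption for this sketch.
import numpy as np

rng = np.random.default_rng(42)
updates = {                      # each worker's locally trained model part
    "worker-202A": np.array([0.9, -2.1, 0.4]),
    "worker-202B": np.array([1.1, -1.9, 0.6]),
    "worker-202N": np.array([1.0, -2.0, 0.5]),
}
workers = sorted(updates)

# Pairwise masks: the first worker of each pair adds the mask, the second
# subtracts the same mask, so all masks cancel when the updates are summed.
masks = {w: np.zeros(3) for w in workers}
for i, wi in enumerate(workers):
    for wj in workers[i + 1:]:
        shared = rng.normal(size=3)
        masks[wi] += shared
        masks[wj] -= shared

masked_updates = {w: updates[w] + masks[w] for w in workers}   # sent to aggregator

# The aggregator cannot read any single update, but the masks cancel in the sum.
aggregate = sum(masked_updates.values()) / len(workers)
print(aggregate)          # equals the plain average of the individual updates
```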
As shown in
The mass storage 911 is typically a magnetic hard drive or a magnetic optical drive or an optical drive or a DVD RAM or a flash memory or other types of memory systems, which maintain data (e.g. large amounts of data) even after power is removed from the system. Typically, the mass storage 911 will also be a random access memory although this is not required. While
Portions of what was described above may be implemented with logic circuitry such as a dedicated logic circuit or with a microcontroller or other form of processing core that executes program code instructions. Thus processes taught by the discussion above may be performed with program code such as machine-executable instructions that cause a machine that executes these instructions to perform certain functions. In this context, a “machine” may be a machine that converts intermediate form (or “abstract”) instructions into processor specific instructions (e.g., an abstract execution environment such as a “virtual machine” (e.g., a Java Virtual Machine), an interpreter, a Common Language Runtime, a high-level language virtual machine, etc.), and/or, electronic circuitry disposed on a semiconductor chip (e.g., “logic circuitry” implemented with transistors) designed to execute instructions such as a general-purpose processor and/or a special-purpose processor. Processes taught by the discussion above may also be performed by (in the alternative to a machine or in combination with a machine) electronic circuitry designed to perform the processes (or a portion thereof) without the execution of program code.
The present invention also relates to an apparatus for performing the operations described herein. This apparatus may be specially constructed for the required purpose, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), RAMs, EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
A machine readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine readable medium includes read only memory (“ROM”); random access memory (“RAM”); magnetic disk storage media; optical storage media; flash memory devices; etc.
An article of manufacture may be used to store program code. An article of manufacture that stores program code may be embodied as, but is not limited to, one or more memories (e.g., one or more flash memories, random access memories (static, dynamic or other)), optical disks, CD-ROMs, DVD ROMs, EPROMs, EEPROMs, magnetic or optical cards or other type of machine-readable media suitable for storing electronic instructions. Program code may also be downloaded from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a propagation medium (e.g., via a communication link (e.g., a network connection)).
The preceding detailed descriptions are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the tools used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be kept in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “posting,” “creating,” “receiving,” “computing,” “exchanging,” “processing,” “configuring,” “augmenting,” “sending,” “assembling,” “monitoring,” “gathering,” “updating,” “pushing,” “aggregating,” “broadcasting,” “communicating,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the operations described. The required structure for a variety of these systems will be evident from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
The foregoing discussion merely describes some exemplary embodiments of the present invention. One skilled in the art will readily recognize from such discussion, the accompanying drawings and the claims that various modifications can be made without departing from the spirit and scope of the invention.
Foreign Application Priority Data: Application No. 20306478.7, filed Dec. 1, 2020, EP (regional).
PCT Filing Data: Filing Document PCT/US2021/061417, filed Dec. 1, 2021, WO.