Technology blog on large-scale federated learning: System Design

Time: 2020-10-27

Abstract

Federated learning is a distributed machine learning approach that trains models on large amounts of decentralized data residing on mobile devices. Putting it into practice, however, raises many problems, so the question of how to design such a system naturally arises. Based on TensorFlow, this post introduces the system design of federated learning on mobile devices, summarizes several challenges and their solutions, and discusses open problems and future directions.

1 Introduction

A basic design decision for a federated learning infrastructure is whether to build on asynchronous or synchronous training algorithms.

Although asynchronous training was used to great effect in earlier deep learning systems such as DistBelief, there has been a steady trend toward synchronous, large-batch training, out of which Federated Averaging emerged. At the same time, differential privacy and Secure Aggregation essentially require some notion of synchronization.

For these reasons, the system focuses on support for synchronous rounds while mitigating the potential synchronization overhead through several techniques described later. The system is therefore designed to run a large family of SGD-style algorithms as well as Federated Averaging.
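To make this concrete, here is a minimal, self-contained sketch of one synchronous Federated Averaging round in NumPy. The client structure, model, and hyperparameters are simplified illustrations, not the production implementation.

```python
# Minimal sketch of one synchronous Federated Averaging round (illustrative only,
# not the production implementation). Each "client" holds a local dataset and
# runs a few SGD steps on a linear model; the server takes a weighted average.
import numpy as np

def client_update(weights, x, y, lr=0.1, epochs=5):
    """Run local SGD on a least-squares objective; return (new_weights, n_examples)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w, len(y)

def federated_averaging_round(global_weights, client_data):
    """Aggregate client updates, weighting each by its number of local examples."""
    total, weighted_sum = 0, np.zeros_like(global_weights)
    for x, y in client_data:
        w, n = client_update(global_weights, x, y)
        weighted_sum += n * w
        total += n
    return weighted_sum / total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    # Simulate a few hundred devices, each with a small local dataset.
    clients = []
    for _ in range(200):
        x = rng.normal(size=(20, 2))
        y = x @ true_w + 0.1 * rng.normal(size=20)
        clients.append((x, y))
    w = np.zeros(2)
    for _ in range(10):
        w = federated_averaging_round(w, clients)
    print("learned weights:", w)  # should approach [2.0, -1.0]
```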

The main problems to be solved are as follows:

  • Device availability correlates with the local data distribution in complex ways (for example, time-zone dependence).
  • Device connectivity is unreliable, and execution is frequently interrupted.
  • Orchestrating lock-step execution across devices whose availability varies and whose storage and compute resources are limited.

2 Protocol


2.1 Basic Concepts

The main participants in the protocol are devices (currently Android phones) and the FL server (a cloud-based distributed service).

FL population: identified by a globally unique name that names the learning problem or application being addressed.

FL task: a specific computation for an FL population, such as training with a given set of hyperparameters, or evaluation of a trained model on local device data.

A device announces to the server that it is ready to run an FL task for a given FL population.

When tens of thousands of devices announce their availability to the server within a given time window, the server typically selects a few hundred of them and asks them to execute a specific FL task. This interaction between the selected devices and the server is called a round, and the devices remain connected to the server for the duration of the round.

FL plan: contains a TensorFlow graph and instructions for how to execute it.

The FL server tells the selected devices the FL plan. Once the round is established, the server sends each participant the current global model parameters and any other necessary state as an FL checkpoint (the serialized state of a TensorFlow session). Each participant then performs a local computation based on the global state and its local dataset, and sends its update back to the server in the form of an FL checkpoint. The server incorporates these updates into its global state, and the process repeats.

2.2 Phases

The communication protocol enables the devices to advance the global, singleton model of an FL population between rounds, where each round consists of the three phases shown in Figure 1.

Selection: Devices that meet the eligibility criteria (charging and connected to an unmetered network) periodically check in to the server by opening a bidirectional stream. The stream is used to track liveness and coordinate multi-step communication. The server selects a subset of the connected devices (typically a few hundred per round) based on certain goals, such as the optimal number of participating devices. If a device is not selected for participation, the server responds with instructions to reconnect at a later time.

Configuration: The server configures itself based on the selected aggregation mechanism (for example, Secure Aggregation) for the chosen devices, and sends the FL plan and an FL checkpoint containing the global model to each device.

Reporting: The server waits for the participating devices to report their updates. As updates arrive, the server aggregates them using Federated Averaging and tells the reporting devices when to reconnect (see also Section 2.3). If enough devices report in time, the round completes successfully and the server updates its global model; otherwise, the round is abandoned.

The protocol has a certain tolerance for devices that fail to report in time or stop responding.
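The three phases can be pictured as a simple server-side round loop. The following sketch is illustrative only; the device behavior, goal counts, and thresholds are invented, not taken from the production system.

```python
# Illustrative sketch of one protocol round (selection -> configuration ->
# reporting) with tolerance for stragglers and drop-outs. Device behavior,
# goal counts, and thresholds are hypothetical.
import random

def run_round(checked_in_devices, goal_count, min_report_fraction=0.8):
    # Selection: pick a subset of checked-in devices; the rest are told to
    # reconnect later (pace steering).
    if len(checked_in_devices) < goal_count:
        return None, "abandoned: not enough devices checked in"
    participants = random.sample(checked_in_devices, goal_count)

    # Configuration + reporting: each device either returns an update or
    # drops out (interrupted execution, lost connectivity, lost eligibility).
    updates = [device() for device in participants]
    updates = [u for u in updates if u is not None]

    # The round commits only if enough devices reported in time.
    if len(updates) >= min_report_fraction * goal_count:
        return sum(updates) / len(updates), "committed"
    return None, "abandoned: too many drop-outs"

if __name__ == "__main__":
    # Each simulated device reports a toy scalar "update" with 90% probability.
    devices = [lambda: random.gauss(1.0, 0.1) if random.random() < 0.9 else None
               for _ in range(1000)]
    result, status = run_round(devices, goal_count=300)
    print(status, result)
```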

2.3 Pace Steering

Pace steering is a flow-control mechanism that regulates the pattern of device connections. It enables the FL server to scale down to small FL populations as well as up to very large ones.

Pace steering is based on the simple mechanism of the server suggesting to each device an optimal time window for reconnecting.

For small FL populations, pace steering is used to ensure that a sufficient number of devices connect to the server simultaneously. This matters both for the rate of task progress and for the security properties of the Secure Aggregation protocol. The server uses a stateless probabilistic algorithm that requires no extra device/server communication to steer rejected devices to reconnect.

For large FL populations, pace steering randomizes device check-in times to avoid a thundering herd effect, and instructs devices to connect only as frequently as needed to run all scheduled FL tasks.

Pace steering also takes into account the diurnal oscillation in the number of active devices, and adjusts the corresponding time windows to avoid excessive activity during peak hours without hurting FL performance at other times.
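The exact pace-steering algorithm is not spelled out here; the sketch below only illustrates the general idea of a stateless, server-suggested reconnection window. The formula and parameters are invented for illustration.

```python
# Illustrative sketch of pace steering: the server suggests a reconnection
# window without keeping per-device state. The formula and parameters are
# invented; the production scheme is a stateless probabilistic algorithm that
# is not described in detail in this post.
import random

def suggest_reconnect_window(population_size, devices_needed_per_hour,
                             expected_checkin_rate=0.1):
    """Return (min_delay_s, max_delay_s) for a device's next check-in.

    population_size:         estimated number of eligible devices
    devices_needed_per_hour: how many check-ins the scheduled FL tasks require
    expected_checkin_rate:   fraction of the population eligible at a given hour
    """
    expected_checkins = population_size * expected_checkin_rate
    # Small populations: ask devices to come back soon, so enough are connected
    # at the same time. Large populations: spread check-ins out so the server
    # sees roughly as many devices as it needs (and no thundering herd).
    hours = max(1.0, expected_checkins / max(1.0, devices_needed_per_hour))
    base = hours * 3600
    return 0.5 * base, 1.5 * base  # jittered window to randomize arrival times

if __name__ == "__main__":
    lo, hi = suggest_reconnect_window(population_size=5_000_000,
                                      devices_needed_per_hour=10_000)
    print(f"reconnect in {random.uniform(lo, hi) / 3600:.1f} hours")
```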

3 Device


The device's first responsibility in on-device learning is to maintain a local repository of data collected for model training and evaluation. Applications store their data in an example store that they expose to the FL runtime through an API. Applications should limit the total storage used for this purpose and automatically expire old data after its time-to-live has elapsed. Data stored on the device may be exposed to malware or to physical compromise of the phone, so applications should follow data-security best practices, including encrypting data at rest where the platform supports it.

When the FL server publishes a task, the FL runtime accesses the corresponding example store to compute a model update, or to evaluate model quality on held-out data.
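A minimal sketch of what an application-side example store with size bounds and TTL-based expiry might look like; the class name, API, and storage format are hypothetical, not the actual on-device API.

```python
# Hypothetical sketch of an on-device example store with a storage bound and
# TTL-based expiry. The class name, API, and storage format are invented for
# illustration; the real API is provided by the FL runtime.
import time
from collections import deque

class ExampleStore:
    def __init__(self, max_examples=10_000, ttl_days=30):
        self._examples = deque()
        self._max_examples = max_examples
        self._ttl_s = ttl_days * 24 * 3600

    def add(self, example):
        """Append a training example (e.g. a serialized tf.Example) with a timestamp."""
        self._examples.append((time.time(), example))
        if len(self._examples) > self._max_examples:   # bound total storage
            self._examples.popleft()

    def read_for_training(self):
        """Drop expired examples, then return the rest for the FL runtime to consume."""
        cutoff = time.time() - self._ttl_s
        while self._examples and self._examples[0][0] < cutoff:
            self._examples.popleft()
        return [example for _, example in self._examples]

if __name__ == "__main__":
    store = ExampleStore(max_examples=3, ttl_days=30)
    for text in ("hello", "world", "foo", "bar"):
        store.add({"typed_text": text})
    print(store.read_for_training())  # oldest example evicted by the size bound
```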

The overall process includes the following steps:

Programmatic configuration: An application configures the FL runtime by providing its FL population name and registering its example stores. The most important requirement for on-device training is that the (ML) model on the user's device must avoid any negative impact on the user experience, including on data usage or battery life. The FL runtime therefore asks the job scheduler to invoke it only when the phone is idle, charging, and connected to an unmetered network. Once training has started, if these conditions are no longer met, the FL runtime aborts and releases the allocated resources (a sketch of this eligibility gate appears at the end of this section).

Task invocation: When invoked by the job scheduler in a separate process, the FL runtime contacts the FL server to announce that it is ready to run tasks for the given FL population. The server decides whether any FL task is available for the device and returns either an FL plan or a suggested time to check in again.

Reporting: After executing an FL plan, the FL runtime reports the computed updates and metrics to the server and cleans up any temporary resources.

Multi-tenancy: The implementation provides a multi-tenant architecture, supporting training of multiple FL populations within the same app (or service). This allows coordination between multiple training activities and avoids overloading the device with many simultaneous training sessions.

Attestation: Devices participate in FL anonymously, so without verifying user identity it is necessary to defend against attacks that could affect the FL result. This is done using Android's remote attestation mechanism, which helps ensure that only genuine devices and applications participate in FL and provides some protection against data poisoning via compromised devices. Other forms of model manipulation, such as content farms using uncompromised phones to steer the model, are also potential areas of concern that are out of scope here.
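Returning to the programmatic-configuration step above: the training conditions (idle, charging, unmetered network) amount to a simple gate that is checked before and during training. The following is a hypothetical sketch; the state object and its fields are invented.

```python
# Hypothetical sketch of the eligibility gate described above: train only when
# the phone is idle, charging, and on an unmetered network, and abort as soon
# as any condition stops holding. DeviceState and its fields are invented.
from dataclasses import dataclass

@dataclass
class DeviceState:
    is_idle: bool
    is_charging: bool
    on_unmetered_network: bool

def eligible_for_training(state: DeviceState) -> bool:
    return state.is_idle and state.is_charging and state.on_unmetered_network

def run_training(plan_steps, get_device_state):
    """Execute training steps, aborting if the device leaves the eligible state."""
    for step in plan_steps:
        if not eligible_for_training(get_device_state()):
            return "aborted"          # the FL runtime releases its resources here
        step()
    return "completed"

if __name__ == "__main__":
    state = DeviceState(is_idle=True, is_charging=True, on_unmetered_network=True)
    print(run_training([lambda: None] * 3, lambda: state))  # completed
    state.is_charging = False
    print(run_training([lambda: None] * 3, lambda: state))  # aborted
```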

4 Server

The design of the FL server is driven by the need to operate over many orders of magnitude of population size and other scales. The server must handle FL populations ranging from tens of devices (during development) to hundreds of millions, and rounds with participant counts ranging from tens of devices to tens of thousands. Similarly, the updates collected and communicated in each round can range in size from kilobytes to tens of megabytes. Finally, because devices mostly check in when idle and charging, the traffic into and out of any given geographic region can vary dramatically over the course of a day.

4.1 Actor Model

The FL server is designed around the actor programming model. Actors are a universal primitive for concurrent computation that use message passing as their only communication mechanism. Each actor processes its stream of messages/events strictly sequentially, which leads to a simple programming model. Running multiple instances of actors of the same type scales naturally to large numbers of processors/machines. An actor can make local decisions, send messages to other actors, or create more actors dynamically. Depending on functional and scalability requirements, actor instances can be co-located on the same process/machine or distributed across data centers in multiple geographic regions, using either explicit or automatic configuration mechanisms. Creating and placing fine-grained, short-lived actor instances just for the duration of a given FL task enables dynamic resource management and load-balancing decisions.
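For readers unfamiliar with the actor model, the following minimal sketch shows the core idea of actors that drain a mailbox strictly sequentially and communicate only by message passing. It is a generic illustration, not the FL server's implementation.

```python
# Generic illustration of the actor model: each actor owns a mailbox (queue)
# and processes messages strictly in order on its own thread, communicating
# with other actors only by sending messages. Not the FL server's code.
import queue
import threading
import time

class Actor:
    def __init__(self, name):
        self.name = name
        self._mailbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def send(self, message):
        self._mailbox.put(message)

    def _loop(self):
        while True:
            message = self._mailbox.get()
            if message is None:          # poison pill: stop the actor
                break
            self.receive(message)

    def receive(self, message):          # override in subclasses
        raise NotImplementedError

class Aggregator(Actor):
    def __init__(self, name):
        super().__init__(name)
        self.total, self.count = 0.0, 0

    def receive(self, message):
        self.total += message            # accumulate a device "update"
        self.count += 1

if __name__ == "__main__":
    agg = Aggregator("aggregator-0")
    for update in (0.1, 0.2, 0.3):
        agg.send(update)
    agg.send(None)                       # stop after draining the mailbox
    time.sleep(0.1)                      # let the actor thread finish
    print(agg.total / agg.count)         # mean of the received updates
```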

4.2 Architecture


Coordinators: the top-level actors that enable global synchronization and drive rounds in lock step. There are multiple coordinators, each responsible for one FL population of devices. A coordinator registers its address and the FL population it manages in a shared locking service, so there is always exactly one owner for every FL population that is reachable by the other actors in the system, in particular the selectors. The coordinator receives information about how many devices are connected to each selector and instructs them how many devices to accept for participation, based on which FL tasks are scheduled. Coordinators spawn master aggregators to manage the rounds of each FL task.

Selectors: responsible for accepting and forwarding device connections. They periodically receive information from the coordinator about how many devices are needed for each FL population, which they use to make local decisions about whether to accept each device. After the master aggregator and its set of aggregators are spawned, the coordinator instructs the selectors to forward a subset of their connected devices to the aggregators, which lets the coordinator efficiently allocate devices to FL tasks regardless of how many devices are available. The approach also allows selectors to be globally distributed (close to the devices) and limits communication with the remote coordinator.

Master aggregators: manage the rounds of each FL task. In order to scale with the number of devices and the update size, they make dynamic decisions to spawn one or more aggregators to which work is delegated.

No information about a round is written to persistent storage until it is fully aggregated by the master aggregator; all actors keep their state in memory. These ephemeral actors improve scalability by removing the latency normally incurred by distributed storage. In-memory aggregation also removes the possibility of attacks that target persistent, per-device update logs in the data center, because no such logs exist.

4.3 Pipelining

While the Selection, Configuration, and Reporting phases within a round are sequential, the Selection phase does not depend on any input from a previous round. This enables a latency optimization: the Selection phase of the next protocol round is run in parallel with the Configuration/Reporting phases of the current round. The system architecture supports such pipelining without extra complexity, since the parallelism arises simply from the selector actors running the selection process continuously.

4.4 Failure Modes

In all failure modes, the system continues to make progress, either by completing the current round or by restarting from the result of a previous round. In many cases, the loss of an actor does not prevent the round from succeeding. For example, if an aggregator or a selector crashes, only the devices connected to that actor are lost. If the master aggregator fails, the current round of the FL task it manages fails but is then restarted by the coordinator. Finally, if the coordinator dies, the selector layer detects this and respawns it; because coordinators are registered in the shared locking service, this happens exactly once.

5 Analytics

The interaction between devices and the server involves many moving parts and fail-safes. Moreover, much of the platform's activity takes place on devices that are neither controlled by us nor directly accessible.

We therefore rely on analytics to understand what is actually happening in the field and to monitor device health statistics. On the device side we perform computationally expensive operations and must avoid wasting the phone's battery or bandwidth, or degrading its performance. To ensure this, several activity and health parameters are logged to the cloud, for example how often and for how long the device ran the FL runtime. These log entries contain no personally identifiable information (PII). They are aggregated and displayed on dashboards for analysis and fed into automatic time-series monitors that trigger alerts on substantial deviations.

We also log an event for every state a device goes through during a training round, and use these logs to generate ASCII visualizations of the state-transition sequences across all devices. Counts of these sequences are charted on our dashboards, which lets us quickly distinguish between different types of problems.
For example, the sequence "check in, download plan, start training, end training, start upload, error" appears as -V[]+*, while the shorter sequence "check in, download plan, start training, error" appears as -V[*. The first indicates that a model trained successfully but the upload of the result failed (a network problem), while the second indicates that training broke right after loading the model (a model problem).
On the server side, similar information is collected, such as how many devices were accepted and rejected per round, the timing of each phase of the round, throughput in terms of uploaded and downloaded data, errors, and so on.
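A sketch of the sequence-counting idea, assuming a hypothetical mapping from device events to single-character symbols; the post only hints at the real encoding.

```python
# Sketch of counting per-device state-transition sequences. The event-to-symbol
# mapping below is hypothetical; the real encoding is only hinted at above
# (e.g. a failed upload rendering as "-V[]+*").
from collections import Counter

SYMBOLS = {             # hypothetical mapping, for illustration only
    "check_in": "-",
    "download_plan": "V",
    "start_training": "[",
    "end_training": "]",
    "start_upload": "+",
    "error": "*",
}

def encode(events):
    return "".join(SYMBOLS[e] for e in events)

def count_sequences(per_device_events):
    """Count how many devices produced each state-transition sequence."""
    return Counter(encode(events) for events in per_device_events)

if __name__ == "__main__":
    logs = [
        ["check_in", "download_plan", "start_training", "end_training",
         "start_upload", "error"],                                  # upload failed
        ["check_in", "download_plan", "start_training", "error"],   # model broke
        ["check_in", "download_plan", "start_training", "error"],
    ]
    for sequence, count in count_sequences(logs).most_common():
        print(count, sequence)
```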

Federated training should not affect the user experience, so neither device nor server failures have an immediate negative impact on users. However, failures to operate properly can have secondary consequences that degrade the device's usefulness. Device utility is mission critical to users, and such degradations are difficult to pinpoint and easy to misattribute. Using accurate analytics to keep federated training from negatively affecting device utility is therefore an important part of our engineering and risk-mitigation costs.

6 Secure Aggregation

Secure Aggregation is a secure multi-party computation protocol that uses encryption so that the server cannot inspect the update from any individual device, and can only reveal the sum after a sufficient number of updates have been received. Secure Aggregation can be deployed as a privacy enhancement of the FL service, protecting against additional threats within the data center by ensuring that individual device updates remain encrypted even in memory. Formally, Secure Aggregation protects against attackers who may have access to the memory of aggregator instances. Importantly, summation is the only aggregation needed for model evaluation, SGD, and Federated Averaging.

Secure Aggregation is a four-round interactive protocol that can optionally be enabled during the Reporting phase of a given FL round. In each protocol round, the server gathers messages from all devices in the FL round and then uses the set of device messages to compute an independent response to return to each device. The protocol is designed to be robust to a large fraction of devices dropping out before it completes. The first two rounds constitute a Prepare phase, in which shared secrets are established; devices that drop out during this phase will not have their updates included in the final aggregation. The third round constitutes a Commit phase, in which devices upload cryptographically masked model updates and the server accumulates the sum of these masked updates. All devices that complete this round have their model updates included in the protocol's final aggregate; otherwise, the entire aggregation fails. The last round constitutes a Finalization phase, in which devices reveal enough of their cryptographic secrets to allow the server to unmask the aggregated model update. Not all devices that committed need to complete this round; as long as enough of the devices that started the protocol survive through Finalization, the protocol succeeds.
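To make the masking idea concrete, here is a toy sketch of the core trick behind Secure Aggregation: each pair of devices agrees on a random mask that one adds and the other subtracts, so individual uploads look random while the masks cancel in the sum. The real protocol's key agreement, secret sharing, and drop-out handling are omitted.

```python
# Toy sketch of the masking idea at the heart of Secure Aggregation: pairwise
# random masks cancel in the sum, so the server learns only the total. Key
# agreement, secret sharing, and drop-out recovery from the real four-round
# protocol are omitted; this is purely illustrative.
import numpy as np

def masked_updates(updates, seed=0):
    """Return per-device masked updates whose sum equals the sum of the originals."""
    rng = np.random.default_rng(seed)
    n = len(updates)
    masked = [u.astype(float).copy() for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            # In the real protocol this mask is derived from a shared secret
            # agreed between devices i and j, unknown to the server.
            mask = rng.normal(size=updates[0].shape)
            masked[i] += mask     # device i adds the pairwise mask
            masked[j] -= mask     # device j subtracts it, so it cancels in the sum
    return masked

if __name__ == "__main__":
    updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
    blinded = masked_updates(updates)
    print("one masked upload (looks random):", blinded[0])
    print("server's sum of masked uploads:  ", sum(blinded))   # ~= [9, 12]
    print("true sum of raw updates:         ", sum(updates))
```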

Several costs of the Secure Aggregation protocol grow quadratically with the number of users, most notably the computational cost for the server. In practice, this limits the maximum size of a Secure Aggregation instance to hundreds of users. To avoid limiting the number of users who may participate in each round of federated computation, an instance of Secure Aggregation is run on each aggregator actor (see Figure 3) to aggregate the inputs from that aggregator's devices into an intermediate sum; FL tasks define a parameter k so that all updates are securely aggregated over groups of size at least k. The master aggregator then further aggregates the intermediate results into the final aggregate for the round, without Secure Aggregation.

7 Tools And Workflow


Compared with the standard workflow of a model engineer working on centrally collected data, on-device training poses several novel challenges. First, individual training examples are not directly observable, which requires tooling for working with proxy data in testing and simulation (Section 7.1). Second, models cannot be run interactively, but must instead be compiled into FL plans that are deployed via the FL server (Section 7.2). Finally, because FL plans run on real devices, the infrastructure must automatically verify model resource consumption and runtime compatibility (Section 7.3). Model engineers working with the FL system primarily use a set of Python interfaces and tools to define, test, and deploy TensorFlow-based FL tasks to mobile devices via the FL server. The workflow of an FL model engineer is shown in Figure 4 and described below.

7.1 Modeling and Simulation

Model engineers begin by defining, in Python, the FL tasks they want to run for a given FL population. Federated learning and evaluation tasks are declared by providing TensorFlow functions that map input tensors to output metrics such as loss or accuracy.
During development, model engineers can use sample test data or other proxy data as inputs; after deployment, the inputs are provided from the device's example store via the FL runtime.

The role of the modeling infrastructure is to let model engineers use these libraries to build and test the corresponding FL tasks, so that they can focus on their model rather than on the plan language. FL tasks are validated against test data and expectations provided by the engineers, similar in spirit to unit tests. FL task tests are ultimately required in order to deploy the model, as described in Section 7.3 below.
Task configuration is also written in Python and includes runtime parameters, such as the optimal number of devices per round, as well as model hyperparameters, such as the learning rate. FL tasks may be defined in groups: for example, to evaluate a grid search over learning rates. When multiple FL tasks are deployed for one FL population, the FL service selects among them using a dynamic strategy, which allows alternating between training and evaluation of a single model, or A/B comparison between models.
Initial hyperparameter exploration is sometimes done in simulation using proxy data. Proxy data matches the shape of the on-device data but is drawn from a different distribution; for example, text from Wikipedia can serve as proxy data for text typed on a mobile keyboard. Our modeling tools allow FL tasks to be deployed against a simulated FL server and a fleet of cloud jobs that emulate devices on a large proxy dataset. The simulation runs the same code that runs on devices and communicates with the server through the simulated infrastructure. Simulation can scale to a large number of devices and is sometimes used to pretrain models on proxy data before they are refined in the field via FL.
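A hypothetical sketch of running an FL task in simulation on proxy data, focusing on how a proxy dataset might be sharded into simulated devices; the function names and task structure are invented, not the actual Python tooling.

```python
# Hypothetical sketch of simulating an FL task on proxy data. The functions and
# task structure are invented for illustration; the real tooling compiles the
# task into an FL plan rather than running Python on devices. Requires NumPy.
import numpy as np

def make_simulated_devices(proxy_x, proxy_y, num_devices):
    """Split a proxy dataset into shards, one per simulated device."""
    idx = np.array_split(np.arange(len(proxy_y)), num_devices)
    return [(proxy_x[i], proxy_y[i]) for i in idx]

def local_step(w, x, y, lr=0.1):
    """One SGD pass on a linear least-squares model (the 'model function')."""
    return w - lr * 2.0 * x.T @ (x @ w - y) / len(y)

def simulate(task_rounds, devices_per_round, devices, dim, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for r in range(task_rounds):
        chosen = rng.choice(len(devices), size=devices_per_round, replace=False)
        local = [local_step(w, *devices[i]) for i in chosen]
        w = np.mean(local, axis=0)                 # simulated server aggregation
        # Evaluation "task": report a loss metric over all proxy data.
        loss = np.mean([np.mean((x @ w - y) ** 2) for x, y in devices])
        print(f"round {r}: loss={loss:.4f}")
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(4000, 3))
    y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=4000)
    devices = make_simulated_devices(x, y, num_devices=400)
    simulate(task_rounds=5, devices_per_round=50, devices=devices, dim=3)
```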

7.2 Plan Generation

Each FL task is associated with an FL plan. Plans are generated automatically from the combination of model and configuration supplied by the model engineer. In data-center training, the information encoded in an FL plan would typically be represented by a Python program that orchestrates a TensorFlow graph. However, Python is not executed directly on the server or on devices; the purpose of the FL plan is to describe the desired orchestration independently of Python.
An FL plan consists of two parts: one for the device and one for the server. The device portion of the FL plan includes the TensorFlow graph itself, selection criteria for training data in the example store, instructions on how to batch the data and how many epochs to run, and labels identifying the nodes in the graph that represent particular computations, such as loading and saving weights. The server portion contains the aggregation logic, which is encoded in a similar way. Our libraries automatically split the computation of the provided model into the part that runs on the device and the part (aggregation) that runs on the server.
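The two-part structure of an FL plan can be summarized as a configuration object. The field names below are illustrative only, inferred from the description above rather than from the real plan format.

```python
# Illustrative summary of the two halves of an FL plan as described above.
# Field names are invented; the real plan is a serialized TensorFlow graph
# bundle plus metadata, not a Python dataclass.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DevicePlan:
    tf_graph: bytes                      # serialized TensorFlow graph to execute
    example_selection_criteria: str      # which examples to read from the example store
    batch_size: int
    num_epochs: int
    node_labels: Dict[str, str] = field(default_factory=dict)  # e.g. weight load/save ops

@dataclass
class ServerPlan:
    aggregation_graph: bytes             # aggregation logic, encoded similarly

@dataclass
class FLPlan:
    device: DevicePlan
    server: ServerPlan

# Example: a plan for a hypothetical keyboard-suggestion task.
plan = FLPlan(
    device=DevicePlan(tf_graph=b"...", example_selection_criteria="typed_text",
                      batch_size=32, num_epochs=1,
                      node_labels={"restore": "load_weights", "save": "save_weights"}),
    server=ServerPlan(aggregation_graph=b"..."),
)
```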

7.3 Versioning, Testing, Deployment

Model engineers working in the federated system can work efficiently and safely, starting or stopping multiple experiments per day. But since each FL task could potentially hog RAM on devices or be incompatible with the TensorFlow version running there, engineers rely on the FL system's versioning, testing, and deployment infrastructure for automated safety checks. The server does not accept an FL plan translated from an FL task for deployment unless certain conditions are met. First, it must be built from auditable, peer-reviewed code. Second, it must bundle test data for each FL task, and those tests must pass in simulation. Third, the resources consumed during testing must be within the safe range expected for the target population. Finally, the FL task tests must pass on every version of the TensorFlow runtime that the FL task claims to support, which is verified by testing the task's FL plan in an Android emulator.
Versioning is a specific challenge for on-device machine learning. In contrast to data-center training, where the TensorFlow runtime and graphs can be rebuilt as needed, devices may be running a version of the TensorFlow runtime that is many months older than the one used by modelers generating FL plans today. For example, the old runtime may be missing a particular TensorFlow operator, or an operator's signature may have changed in an incompatible way. The FL infrastructure deals with this by generating versioned FL plans for each task. Each versioned FL plan is derived from the default (unversioned) FL plan by transforming its computation graph to achieve compatibility with a deployed TensorFlow version. Versioned and unversioned plans must pass the same release tests and are therefore treated as semantically equivalent. We encounter roughly three incompatible changes that can be fixed by graph transformation every three months, and a smaller number that require more complex workarounds.

7.4 Metrics

Once an FL task has been accepted for deployment, the appropriate (versioned) plan can be served to devices at check-in. At the end of an FL round, the round's aggregate model parameters and metrics are written to a server storage location chosen by the model engineer. Materialized model metrics are annotated with additional data, including metadata such as the name of the source FL task, the FL round number within that task, and other basic operational data. The metrics themselves are summaries of the device reports within a round, via approximate order statistics and moments such as the mean. The FL system provides analysis tools for model engineers to load these metrics into standard Python numerical data-science packages for visualization and exploration.
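A small sketch of the kind of exploration described above, using pandas; the column names and values are made up, and in practice the metrics would be loaded from the server-side storage location chosen by the model engineer.

```python
# Sketch of exploring materialized per-round metrics with pandas. The column
# names and values are made up; in practice metrics are loaded from the
# server-side storage location chosen by the model engineer.
import pandas as pd

metrics = pd.DataFrame({
    "task_name": ["next_word_train"] * 3,
    "round": [1, 2, 3],
    "loss_mean": [5.1, 4.3, 3.9],           # mean of per-device reported loss
    "accuracy_p50": [0.08, 0.11, 0.13],     # approximate median across devices
    "participants": [320, 305, 331],
})

task = metrics[metrics["task_name"] == "next_word_train"]
print(task[["round", "loss_mean", "accuracy_p50"]])
# task.plot(x="round", y=["loss_mean", "accuracy_p50"], subplots=True)  # optional plot
```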

8 Applications

  • On-device item ranking
  • Content suggestions for on-device keyboards
  • Next word prediction

9 Operational Profile


This section gives a brief overview of some key operational metrics of the deployed FL system, which has been running production workloads for over a year. These numbers are examples only, since FL has not yet been applied to a diverse enough set of applications to provide a complete characterization. Furthermore, all data was collected while operating the production system rather than under controlled conditions. Many of the performance metrics here depend on device and network speed (which can vary by region), on global model and update size (which are application-specific), and on the number of samples per round and the computational complexity per sample.
The FL system is designed to scale with the number and size of FL populations, potentially into the billions. Currently, the system handles a cumulative FL population of approximately 10 million active devices per day, spanning several different applications.
As mentioned earlier, due to device eligibility and pace steering, only a fraction of the devices are connected to the server at any point in time. In practice, we have observed up to 10,000 devices participating simultaneously. Notably, the number of participating devices depends on the (local) time of day (Figure 5): devices are more likely to be idle and charging at night, and therefore more likely to participate. For a US-centric population, we observed a four-fold difference in the number of participating devices over a 24-hour period.

On average, between 6% and 10% of devices drop out of a round due to computation errors, network failures, or changes in eligibility. To compensate for this drop-out and to allow stragglers to be discarded, the server typically selects 130% of the target number of devices for initial participation. This parameter can be tuned based on the empirical distribution of device reporting times and the target number of stragglers to ignore.
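A quick sanity check of the 130% over-selection figure against the reported 6% to 10% drop-out rate, using the numbers from this section; the margin available for discarding stragglers is just the arithmetic remainder.

```python
# Arithmetic check of the over-selection figure using the numbers above:
# select 130% of the reporting target, lose 6-10% of devices, and use the
# remainder as margin for discarding stragglers.
target_reports = 100                      # devices the round needs to commit
selected = int(1.3 * target_reports)      # 130% over-selection -> 130 devices
for drop_rate in (0.06, 0.10):
    surviving = selected * (1 - drop_rate)
    print(f"drop-out {drop_rate:.0%}: {surviving:.0f} devices remain, "
          f"{surviving - target_reports:.0f} stragglers can be discarded")
```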

10 Future Work

  • Bias
  • Convergence Time
  • Bandwidth
  • Federated Computation

References

[1] Bonawitz K, Eichner H, Grieskamp W, et al. Towards federated learning at scale: System design[J]. arXiv preprint arXiv:1902.01046, 2019.