The distributed machine learning framework based on TensorFlow is finally out!



Distributed TensorFlow

This directory contains the implementation of the distributed TensorFlow runtime, which uses gRPC as the underlying library for inter-process communication.

Quick start

First, you need to build a TensorFlow server executable (grpc_tensorflow_server) and a gRPC-based client. Currently, both can only be built from source, but they will be included in future binary releases. You can build the server with the following commands:

# CPU-only build.
$ bazel build -c opt //tensorflow/core/distributed_runtime/rpc:grpc_tensorflow_server

# GPU build.
$ bazel build -c opt --config=cuda //tensorflow/core/distributed_runtime/rpc:grpc_tensorflow_server

If you build the Python package from the latest source code, it automatically includes a gRPC-based client. If you are using a previously released binary version, you need to rebuild and reinstall it following the installation instructions. Once you have built the distributed TensorFlow components, you can start a server and check that your installation works:

# Start a TensorFlow server as a single-process "cluster".
$ bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
    --cluster_spec='local|localhost:2222' --job_name=local --task_index=0 &

Then start the Python interpreter and create a session:

$ python
>>> import tensorflow as tf
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> sess = tf.Session("grpc://localhost:2222")
>>> sess.run(c)
'Hello, distributed TensorFlow!'

Cluster definition

The command-line arguments to grpc_tensorflow_server define the membership of the cluster. The --cluster_spec flag determines the set of processes in the cluster as a list of jobs, each of which contains a list of task endpoints. All processes in the cluster must be started with the same --cluster_spec. For example:

--cluster_spec='...'                                                Available tasks
local|localhost:2222                                                /job:local/task:0
local|localhost:2222;localhost:2223                                 /job:local/task:0
                                                                    /job:local/task:1
worker|worker0:2222;worker1:2222;worker2:2222,ps|ps0:2222;ps1:2222  /job:worker/task:0
                                                                    /job:worker/task:1
                                                                    /job:worker/task:2
                                                                    /job:ps/task:0
                                                                    /job:ps/task:1
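The mapping from a --cluster_spec string to task names can be sketched in plain Python. This is only an illustration of the format used above (jobs separated by commas, a "|" between the job name and its hosts, hosts separated by semicolons); the helper names parse_cluster_spec and task_names are hypothetical and are not part of TensorFlow.

```python
def parse_cluster_spec(spec):
    """Return {job_name: [host:port, ...]} for a --cluster_spec string."""
    cluster = {}
    for job in spec.split(","):
        # "worker|worker0:2222;worker1:2222" -> name "worker", two hosts.
        name, hosts = job.split("|", 1)
        cluster[name] = hosts.split(";")
    return cluster


def task_names(spec):
    """List the /job:<name>/task:<index> names that the spec defines."""
    return [
        "/job:%s/task:%d" % (name, index)
        for name, hosts in parse_cluster_spec(spec).items()
        for index in range(len(hosts))
    ]


if __name__ == "__main__":
    spec = "worker|worker0:2222;worker1:2222;worker2:2222,ps|ps0:2222;ps1:2222"
    for task in task_names(spec):
        print(task)
```

Each task index is simply the position of the host within its job's list, which is why every process must be given the same spec.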

The --job_name and --task_index flags indicate which task runs in the current process. Specifically, --job_name=local --task_index=0 means that the process will be identified as /job:local/task:0, and all TensorFlow devices in that process will have names starting with that prefix.

Manually specifying these flags can be tedious, especially for a large cluster. We are developing tools for launching tasks programmatically, for example with a cluster manager such as Kubernetes. If there is a particular cluster manager that you would like to see supported, please raise a GitHub issue.
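Until such tools exist, the per-task command lines can be generated with a short script. The following is a minimal sketch, assuming the server binary sits at the bazel-bin path used in the quick start; the function launch_commands is hypothetical, and actually running each command on its host (over ssh, a scheduler, and so on) is left out.

```python
# Path taken from the quick-start example; adjust it for your build.
SERVER_BINARY = "bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server"


def launch_commands(cluster_spec):
    """Return one (host:port, command) pair per task in the cluster spec."""
    commands = []
    for job in cluster_spec.split(","):
        job_name, hosts = job.split("|", 1)
        for task_index, host in enumerate(hosts.split(";")):
            command = "%s --cluster_spec='%s' --job_name=%s --task_index=%d" % (
                SERVER_BINARY, cluster_spec, job_name, task_index)
            commands.append((host, command))
    return commands


if __name__ == "__main__":
    for host, command in launch_commands("local|localhost:2222;localhost:2223"):
        print("# run on %s" % host)
        print(command)
```

Every generated command carries the same --cluster_spec and differs only in --job_name and --task_index, which matches the requirement described above.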

Annotating distributed devices in your model

To place an operation on a particular process, you can use the same tf.device() function that is used to specify whether ops run on the CPU or GPU. For example:

with tf.device("/job:ps/task:0"):
  weights_1 = tf.Variable(...)
  biases_1 = tf.Variable(...)
with tf.device("/job:ps/task:1"):
  weights_2 = tf.Variable(...)
  biases_2 = tf.Variable(...)
with tf.device("/job:worker/task:7"):
  input, labels = ...
  layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
  logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
  # ...
  train_op = ...

with tf.Session("grpc://worker7:2222") as sess:
  for _ in range(10000):
    sess.run(train_op)

In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow automatically inserts the necessary data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).
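With more layers than the two shown, writing each device string by hand becomes repetitive. One possible approach, sketched below, is to assign layers to ps tasks in round-robin order and pass the resulting string to tf.device(). The helper ps_device_for_layer is hypothetical and the round-robin policy is an assumption of this sketch, not something the runtime does for you.

```python
def ps_device_for_layer(layer_index, num_ps_tasks):
    """Return the ps device string for a zero-based layer index (round-robin)."""
    if num_ps_tasks < 1:
        raise ValueError("num_ps_tasks must be at least 1")
    return "/job:ps/task:%d" % (layer_index % num_ps_tasks)


if __name__ == "__main__":
    # With two ps tasks this reproduces the placement in the example above:
    # the first layer's variables on task 0, the second layer's on task 1.
    for layer_index in range(4):
        print(layer_index, ps_device_for_layer(layer_index, num_ps_tasks=2))
```

The weights and biases of one layer share a device here, as in the example, so a worker fetches both from the same ps task.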

Replicated Computation

A common training configuration (data-parallel training) uses multiple tasks in a worker job to train the same model, with the shared parameters hosted on one or more tasks in a ps job. Each task typically runs on a different machine. There are many ways to specify this structure in TensorFlow, and we plan to provide simpler ways to do so in the future. The main approaches are:

  • Build a single graph containing one set of parameters (in tf.Variable nodes pinned to /job:ps) and multiple copies of the model pinned to different tasks in /job:worker. Each copy of the model can have a different train_op, and one or more client threads can call sess.run(train_ops[i]) for each worker i. This approach uses a single tf.Session whose target is one of the workers in the cluster.

  • As above, but where the gradients from all workers are averaged. See the CIFAR-10 multi-GPU trainer for an example of this form of replication. This implements synchronous training.

  • The "distributed trainer" approach uses multiple graphs, one per worker, where each graph contains one set of parameters (pinned to /job:ps) and one copy of the model. The container mechanism is used to share variables between the graphs: when each variable is constructed, the optional container argument is given the same value in every copy of the graph. For large models this can be more efficient, because each graph is smaller.
    This approach uses multiple tf.Session objects, one per worker process, where the target of each session is a different worker. The tf.Session objects can all be created in a single Python client or spread across multiple clients.
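The synchronous variant above, where the gradients from all workers are averaged before a single update is applied, can be illustrated without TensorFlow. The sketch below is plain Python with a hypothetical one-parameter model y = w * x and made-up data shards; it shows only the arithmetic of averaging, not how TensorFlow distributes it.

```python
def gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, learning_rate):
    """Each worker computes a gradient; the average is applied to w once."""
    grads = [gradient(w, shard) for shard in shards]
    return w - learning_rate * sum(grads) / len(grads)


if __name__ == "__main__":
    # Two hypothetical workers, each holding a shard of data from y = 3 * x.
    shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
    w = 0.0
    for _ in range(100):
        w = synchronous_step(w, shards, learning_rate=0.05)
    print("learned w:", w)
```

In the asynchronous variants, each worker would instead apply its own gradient to the shared parameter as soon as it is ready, without waiting for the others.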


Client
A typical client builds a TensorFlow graph and uses a tensorflow::Session to interact with the cluster. Clients are usually written in Python or C++. A single client can interact with multiple TensorFlow servers at the same time (see "Replicated Computation" above), and a single server can serve multiple clients.

Cluster
A TensorFlow cluster comprises one or more TensorFlow servers, divided into a set of named jobs, each of which comprises a list of tasks. A cluster is typically dedicated to a particular high-level objective, such as training a neural network using many machines in parallel.

Job
A job comprises a list of tasks that typically serve a common purpose. For example, a job named ps (for "parameter server") typically hosts nodes that store and update variables, while a job named worker typically hosts stateless nodes that perform compute-intensive work. The tasks in a job typically run on different machines.

Master service
An RPC service that provides remote access to a set of distributed devices. The master service implements the tensorflow::Session interface and is responsible for coordinating work across one or more worker services.

Task
A task typically corresponds to a single TensorFlow server process. It belongs to a particular job and has a unique index within that job's list of tasks.

TensorFlow server
A process running grpc_tensorflow_server. It is a member of a cluster and exposes a master service and a worker service.

Worker service
An RPC service that executes part of a TensorFlow graph.