Add parameter server train & side-car eval on k8s #182

Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
FROM tensorflow/tensorflow:nightly

RUN apt-get update && \
apt-get install -y python3 python3-pip

RUN pip3 install absl-py && \
pip3 install portpicker

# Install git and vim
RUN apt-get update && \
apt-get install -y git vim

RUN git clone --single-branch --branch benchmark https://github.com/tensorflow/models.git && \
mv models tensorflow_models && \
git clone https://github.com/tensorflow/model-optimization.git && \
mv model-optimization tensorflow_model_optimization

COPY resnet_cifar_ps_strategy.py /

ENV PYTHONPATH "${PYTHONPATH}:/:/tensorflow_models"
CMD ["python", "/resnet_cifar_ps_strategy.py"]
167 changes: 167 additions & 0 deletions distribution_strategy/parameter_server_training/README.md
@@ -0,0 +1,167 @@
# Parameter Server Training Using Distribution Strategies

This directory provides an example of running parameter server training with Distribution Strategies.

Please first read the [documentation](https://www.tensorflow.org/tutorials/distribute/parameter_server_training) of Distribution Strategy for parameter server training. We also assume that readers of this page are familiar with [Google Cloud](https://cloud.google.com/) and its [Kubernetes Engine](https://cloud.google.com/kubernetes-engine/).

This directory contains the following files:

- `kubernetes/template.yaml.jinja`: Jinja template used for generating Kubernetes manifests
- `kubernetes/render_template.py`: script for rendering the Jinja template
- `Dockerfile.resnet_cifar_ps_strategy`: a Dockerfile to build the model image
- `resnet_cifar_ps_strategy.py`: a ResNet example using the CIFAR-10 dataset for parameter server training

## Prerequisites

1. First you need to have a Google Cloud project. Either create a new project or use an existing one.

2. Install the
[gcloud command-line tools](https://cloud.google.com/functions/docs/quickstart)
on your system, then log in and set your project and zone.

3. Install [Docker](https://docs.docker.com/get-docker/) for your system.

4. Install kubectl:

```bash
gcloud components install kubectl
```

5. Start a Kubernetes cluster, either with the `gcloud` command shown below or with the
[GKE](https://cloud.google.com/kubernetes-engine/) web UI. Using more CPUs or nodes may require increasing your CPU [quotas](https://cloud.google.com/compute/quotas#requesting_additional_quota).

```bash
gcloud container clusters create <cluster_name> --zone=us-west1-a --num-nodes=6 --machine-type=e2-standard-4
```

6. Set the context for `kubectl` so that it knows which cluster to use (`kubectl config get-contexts` lists the available context names):

```bash
kubectl config use-context <cluster_name>
```

7. Create a
[service account](https://cloud.google.com/compute/docs/access/service-accounts)
and download its key file in JSON format. Assign the Storage Admin role for
[Google Cloud Storage](https://cloud.google.com/storage/) to this service account:

```bash
gcloud iam service-accounts create <service_account_id> --display-name="<display_name>"
```

```bash
gcloud projects add-iam-policy-binding <project_id> \
--member="serviceAccount:<service_account_id>@<project_id>.iam.gserviceaccount.com" \
--role="roles/storage.admin"
```

8. Create a Kubernetes secret from the JSON key file of your service account:

```bash
kubectl create secret generic credential --from-file=key.json=<path_to_json_file>
```
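Before creating the secret, it can help to confirm that the downloaded file really is a service account key. The following is a minimal sketch, not part of this example: `check_service_account_key` is a hypothetical helper, and the list of required fields is an assumption about the JSON key format rather than a complete schema.

```python
import json

# Assumption: fields expected in a service account key file in JSON format.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(text):
    """Return a list of problems found in the key file contents (empty if none)."""
    try:
        key = json.loads(text)
    except json.JSONDecodeError as err:
        return ["not valid JSON: {}".format(err)]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]
    problems = ["missing field: " + f for f in REQUIRED_FIELDS if f not in key]
    # A present but different "type" suggests this is some other kind of credential.
    if key.get("type") not in (None, "service_account"):
        problems.append("unexpected type: {!r}".format(key["type"]))
    return problems
```

An empty result only means the file has the expected shape; it does not verify that the key is valid or that the role binding succeeded.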

9. Enable the GCR ([Google Container Registry](https://cloud.google.com/container-registry)) service for your project using either the GCP web UI or the gcloud tool:

```bash
gcloud services enable containerregistry.googleapis.com
```

10. Configure Docker to authenticate with Container Registry:

```bash
gcloud auth configure-docker
```

## How to run the example

1. Create three buckets for model data, checkpoints and training logs using either the GCP web UI or the gsutil tool (included with the gcloud tools you installed above):

```bash
gsutil mb gs://<bucket_name>
```
You will use these bucket names to modify `data_dir`, `checkpoint_dir` and `train_log_dir` in step #4.
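
As a small illustration of how the three bucket names map onto the script's flags, here is a sketch that assembles the argument list. `gcs_uri` and `build_cmdline_args` are hypothetical helpers, and the bucket names in the usage comment are the defaults that appear in the template, not names you are expected to own.

```python
def gcs_uri(bucket_name):
    """Return the gs:// URI for a bucket, with a trailing slash."""
    return "gs://{}/".format(bucket_name.strip("/"))


def build_cmdline_args(data_bucket, checkpoint_bucket, train_log_bucket):
    """Return the flags for data_dir, checkpoint_dir and train_log_dir."""
    return [
        "--data_dir=" + gcs_uri(data_bucket),
        "--checkpoint_dir=" + gcs_uri(checkpoint_bucket),
        "--train_log_dir=" + gcs_uri(train_log_bucket),
    ]


# Example with the template's default bucket names:
# build_cmdline_args("cifar10-data", "cifar10-ckpt", "cifar10-train-log")
```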


2. Download the CIFAR-10 data and place it in your `data_dir` bucket. Head to the [ResNet in TensorFlow](https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10) directory to obtain the CIFAR-10 data. Alternatively, you can use this [direct link](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) to download and extract the data yourself.

```bash
python cifar10_download_and_extract.py
```

Upload the contents of the `cifar-10-batches-bin` directory to your `data_dir` bucket:

```bash
gsutil -m cp cifar-10-batches-bin/* gs://<your_data_dir>/
```
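
If you prefer to download and extract the data yourself from the direct link, something along these lines could work. This is a sketch using only the Python standard library; `extract_archive` and `download_and_extract` are hypothetical helpers and are not the `cifar10_download_and_extract.py` script mentioned above.

```python
import os
import tarfile
import urllib.request

CIFAR10_URL = "https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz"


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the extracted file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [member.name for member in tar.getmembers() if member.isfile()]
        tar.extractall(dest_dir)
    return [os.path.join(dest_dir, name) for name in names]


def download_and_extract(dest_dir, url=CIFAR10_URL):
    """Download the archive at url into dest_dir and unpack it there."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive_path)
    return extract_archive(archive_path, dest_dir)


# Example (performs a network download):
# download_and_extract(".")
```

The extracted files can then be uploaded with the `gsutil` command shown above.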

3. Now let's build the Docker image:

```bash
docker build --no-cache -t resnet_cifar_ps_strategy:v1 -f Dockerfile.resnet_cifar_ps_strategy .
```

Then push the image to
[Google Container Registry](https://cloud.google.com/container-registry/):

```bash
docker tag resnet_cifar_ps_strategy:v1 gcr.io/<project_id>/resnet_cifar_ps_strategy:v1
docker push gcr.io/<project_id>/resnet_cifar_ps_strategy:v1
```

4. Modify the variables in `template.yaml.jinja`. You may want to change `name`,
`image`, `train_log_dir`, `script` and `cmdline_args`.

* `name`: a name for your cluster, e.g. "my-parameter-server-example".
* `image`: the name of your Docker image.
* `worker_replicas`: number of worker pods.
* `ps_replicas`: number of parameter server pods.
* `num_gpus_per_worker`: number of GPUs per worker (this does not apply to this example, since the parameter server distribution strategy does not have GPU support yet).
* `has_coordinator`: flag for creating the coordinator job.
* `has_eval`: flag for creating the evaluator job (set to `False` in the default template in order to use inline distributed evaluation; setting it to `True` enables side-car evaluation).
* `has_tensorboard`: flag for creating the TensorBoard job.
* `script`: the script in the Docker image to run.
* `train_log_dir`: the directory used for logging training accuracy.
* `cmdline_args`: the command line arguments passed to the `script`.
* `credential_secret_json`: the filename that was registered to Kubernetes as a secret.
* `credential_secret_key`: the name of the Kubernetes secret used for storing
your service account key.
* `port`: the port for all tasks, including TensorBoard.
* `use_node_port`: flag for using NodePort as the Service type. When this flag is set to `True`, the Jinja template generates an Ingress for TensorBoard only. Setting this flag to `False` uses LoadBalancer for all pods, assigning them external IPs (which may be limited by your public IP address quota).
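
For every pod except TensorBoard, the template passes the cluster layout through a `TF_CONFIG` environment variable derived from `name`, `port`, `worker_replicas`, `ps_replicas` and `has_coordinator`. The sketch below mirrors that structure in Python to show what a rendered value looks like. `build_tf_config` is a hypothetical helper for illustration only; the actual value is produced by the `tf_config` macro in the template.

```python
import json


def build_tf_config(name, port, worker_replicas, ps_replicas, has_coordinator,
                    task_type, task_id):
    """Build a TF_CONFIG JSON string shaped like the template's tf_config macro."""

    def hosts(job, count):
        # Each task is reachable through a Service named <name>-<job>-<index>.
        return ["{}-{}-{}:{}".format(name, job, i, port) for i in range(count)]

    cluster = {"worker": hosts("worker", worker_replicas)}
    if ps_replicas > 0:
        cluster["ps"] = hosts("ps", ps_replicas)
    if has_coordinator:
        cluster["chief"] = hosts("chief", 1)
    # The template renders the task index as a quoted string, so this does too.
    task = {"type": task_type, "index": str(task_id)}
    return json.dumps({"cluster": cluster, "task": task})


# Example with the template's default values, for the second parameter server:
# build_tf_config("resnet-cifar-ps-strategy-example", 5000, 5, 2, True, "ps", 1)
```

Note that evaluator and TensorBoard tasks are not listed in the `cluster` section; only workers, parameter servers and the chief are.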

5. Start the training and evaluation on the cluster.

You may want to verify the generated Kubernetes manifests by running the following:

```bash
cd kubernetes
python render_template.py template.yaml.jinja | kubectl create -f - --dry-run=client
```

After making sure that the above command succeeds, you can start the cluster (removing the dry-run flag):

```bash
python render_template.py template.yaml.jinja | kubectl create -f -
```

You'll see that your cluster has started training. You can inspect the logs of
the workers or use TensorBoard to watch your model training.

```bash
kubectl get pods
```

```bash
kubectl logs -f <pod_id>
```
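
As a cross-check on the manifests rendered in this step, the following sketch estimates how many resources of each kind the template should produce for a given set of variables, and counts the top-level `kind:` lines in rendered output. Both functions are hypothetical helpers; the counting rule (one Service and one Deployment per task, plus one Ingress when TensorBoard is enabled together with `use_node_port`) is read off the template and would need updating if the template changes.

```python
def expected_resource_counts(worker_replicas, ps_replicas, has_coordinator,
                             has_eval, has_tensorboard, use_node_port):
    """Return the expected number of manifests per kind, omitting zero counts."""
    tasks = (worker_replicas + ps_replicas + int(has_coordinator)
             + int(has_eval) + int(has_tensorboard))
    counts = {
        "Service": tasks,
        "Deployment": tasks,
        "Ingress": int(has_tensorboard and use_node_port),
    }
    return {kind: n for kind, n in counts.items() if n}


def count_kinds(manifest_text):
    """Count top-level 'kind:' lines in a rendered multi-document manifest."""
    counts = {}
    for line in manifest_text.splitlines():
        if line.startswith("kind:"):
            kind = line.split(":", 1)[1].strip()
            counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Comparing `count_kinds` applied to the output of `render_template.py` against `expected_resource_counts` for your settings is a quick way to spot a mistyped variable before creating anything on the cluster.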

6. You can find the TensorBoard service's public IP address on the Services & Ingress page of GKE, and access TensorBoard in your browser at http://<tensorboard_ip> (or http://<tensorboard_ip>:5000 if you have set `use_node_port` to `False`).

The training accuracy graph should look similar to the following:

![Training accuracy - TensorBoard](images/tf-dist-ps-tensorboard.png)
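
The two URL forms above follow from the `use_node_port` setting: with the Ingress in front of TensorBoard no port is needed in the URL, while with a LoadBalancer Service you connect to the template's `port` directly. A minimal sketch of that rule, where `tensorboard_url` is a hypothetical helper and the default port is the template's default:

```python
def tensorboard_url(ip, use_node_port, port=5000):
    """Return the TensorBoard URL for the given external IP address."""
    if use_node_port:
        # TensorBoard is reached through the Ingress, so no port is given.
        return "http://{}".format(ip)
    # TensorBoard is reached through its LoadBalancer Service on `port`.
    return "http://{}:{}".format(ip, port)
```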

7. Destroy the cluster:

```bash
gcloud container clusters delete <cluster_name>
```

@@ -0,0 +1,12 @@
#!/usr/bin/env python

from __future__ import print_function

import jinja2
import sys

if len(sys.argv) != 2:
    print("usage: {} [template-file]".format(sys.argv[0]), file=sys.stderr)
    sys.exit(1)
with open(sys.argv[1], "r") as f:
    print(jinja2.Template(f.read()).render())
@@ -0,0 +1,150 @@
{%- set name = "resnet-cifar-ps-strategy-example" -%}
{%- set image = "gcr.io/tensorflow-experimental/resnet_cifar_ps_strategy:v1" -%}
{%- set worker_replicas = 5 -%}
{%- set ps_replicas = 2 -%}
{%- set num_gpus_per_worker = 0 -%}
{%- set has_coordinator = True -%}
{%- set has_eval = False -%}
{%- set has_tensorboard = True -%}
{%- set script = "/resnet_cifar_ps_strategy.py" -%}
{%- set train_log_dir = "gs://cifar10-train-log/" -%}
{%- set cmdline_args = [
"--data_dir=gs://cifar10-data/",
"--checkpoint_dir=gs://cifar10-ckpt/",
"--train_log_dir=" + train_log_dir
] -%}
{%- set credential_secret_json = "key.json" -%}
{%- set credential_secret_key = "credential" -%}
{%- set port = 5000 -%}
{%- set use_node_port = True -%}


{%- set replicas = {
"worker": worker_replicas,
"ps": ps_replicas,
"chief": has_coordinator|int,
"evaluator": has_eval|int,
"tensorboard": has_tensorboard|int
} -%}

{%- macro worker_hosts() -%}
  {% for i in range(worker_replicas) %}
    \"{{ name }}-worker-{{ i }}:{{ port }}\"{%- if not loop.last -%},{%- endif -%}
  {% endfor %}
{%- endmacro -%}

{%- macro ps_hosts() -%}
  {% for i in range(ps_replicas) %}
    \"{{ name }}-ps-{{ i }}:{{ port }}\"{%- if not loop.last -%},{%- endif -%}
  {% endfor %}
{%- endmacro -%}

{%- macro tf_config(task_type, task_id) -%}
{
  \"cluster\": {
    \"worker\": [{{ worker_hosts() }}]
    {%- if ps_replicas > 0 %},
    \"ps\": [{{ ps_hosts() }}
    ]{% endif %}
    {%- if has_coordinator %},
    \"chief\": [
      \"{{ name }}-chief-0:{{ port }}\"
    ]
    {%- endif %}
  },
  \"task\": {
    \"type\": \"{{ task_type }}\",
    \"index\": \"{{ task_id }}\"
  }
}
{%- endmacro -%}

{% for job in ["chief", "worker", "ps", "evaluator", "tensorboard"] -%}
{%- for i in range(replicas[job]) -%}
{% if job == "tensorboard" and use_node_port %}
kind: Ingress
apiVersion: networking.k8s.io/v1beta1
metadata:
  name: tensorboard-ingress
spec:
  backend:
    serviceName: {{ name }}-{{ job }}-{{ i }}
    servicePort: {{ port }}
---
{% endif -%}
kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  type: {{ 'NodePort' if use_node_port else 'LoadBalancer' }}
  selector:
    name: {{ name }}
    job: {{ job }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
{%- if use_node_port %}
    targetPort: {{ port }}
{%- endif %}
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  replicas: 1
  selector:
    matchLabels:
      name: {{ name }}
      job: {{ job }}
      task: "{{ i }}"
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ job }}
        task: "{{ i }}"
    spec:
      containers:
{%- if job == "tensorboard" %}
      - name: tensorflow
        image: tensorflow/tensorflow
{%- else %}
      - name: tensorflow
        image: {{ image }}
{%- endif %}
        env:
{%- if job != "tensorboard" %}
        - name: TF_CONFIG
          value: "{{ tf_config(job, i) }}"
{%- endif %}
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: "/var/secrets/google/{{ credential_secret_json }}"
        ports:
        - containerPort: {{ port }}
{%- if job == "tensorboard" %}
        command:
        - "tensorboard"
        args:
        - "--logdir={{ train_log_dir }}"
        - "--port={{ port }}"
        - "--host=0.0.0.0"
{%- else %}
        command:
        - "python"
        - "{{ script }}"
        {%- for cmdline_arg in cmdline_args %}
        - "{{ cmdline_arg }}"
        {%- endfor -%}
{%- endif %}
        volumeMounts:
        - name: credential
          mountPath: /var/secrets/google
      volumes:
      - name: credential
        secret:
          secretName: {{ credential_secret_key }}
---
{% endfor %}
{%- endfor -%}