Move image and mention flags in Marathon README. (tensorflow#14)
Yuefeng Zhou authored and jhseu committed Oct 27, 2016
1 parent f3de16f commit bf5076d
Showing 5 changed files with 61 additions and 74 deletions.
21 changes: 18 additions & 3 deletions README.md
@@ -42,8 +42,7 @@ flags.DEFINE_string("worker_hosts", None,
flags.DEFINE_string("job_name", None, "job name: worker or ps")
```

Then, start your server. Since worker and parameter server (ps) tasks usually share a common program, parameter servers, which only store variables, should stop at this point, so they are joined with the server.

```python
# Construct the cluster and start the server
server = tf.train.Server(
    cluster_spec, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
  server.join()
```

@@ -68,7 +67,9 @@ intend on doing. The most common form is between-graph replication.

In this mode, each worker separately constructs the exact same graph. Each
worker then runs the graph in isolation, only sharing gradients with the
parameter servers. This setup is illustrated by the following diagram; note that each dashed box indicates a task.

![Diagram for Between-graph replication](images/between-graph_replication.png "Between-graph Replication")

You must explicitly set the device before graph construction for this mode of
training. The following code snippet illustrates this:
@@ -83,3 +84,17 @@
```python
with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:%d" % FLAGS.task_index,
    cluster=cluster_spec)):
  # Build the model: losses, gradients, and the training op go here.

# Run the TensorFlow graph.
```
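How "Run the TensorFlow graph" looks depends on your training loop. As a minimal sketch, not taken from this repository's examples, and assuming `server`, `FLAGS`, a `train_op`, and a `global_step` tensor from the graph built above, TensorFlow programs of this era typically drove between-graph workers with `tf.train.Supervisor`:

```python
# A hedged sketch: the Supervisor handles chief duties (variable
# initialization, checkpointing) and hands each worker a session.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         logdir="/tmp/train_logs")
with sv.managed_session(server.target) as sess:
  step = 0
  while not sv.should_stop() and step < 10000:
    # Every worker runs the same graph; only the variables placed on
    # the parameter servers are shared between workers.
    _, step = sess.run([train_op, global_step])
```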

### Requirements To Run the Examples

To run our examples, [Jinja templates](http://jinja.pocoo.org/) must be installed:

```sh
# On Ubuntu
sudo apt-get install python-jinja2

# On most other platforms
sudo pip install Jinja2
```

Jinja is used for template expansion. There are other framework-specific requirements; please refer to the README of each framework.
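If you are unsure whether Jinja is already available, a quick sanity check is to import the module (version output will vary):

```sh
# Confirm the jinja2 module is importable and print its version.
python -c "import jinja2; print(jinja2.__version__)"
```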
File renamed without changes
File renamed without changes
11 changes: 1 addition & 10 deletions kubernetes/README.md
@@ -10,16 +10,7 @@ Kubernetes.
[Google Container Engine](https://cloud.google.com/container-engine/) if you
want to quickly set up a Kubernetes cluster from scratch.

2. [Jinja templates](http://jinja.pocoo.org/) must be installed. See the [top-level README](../README.md) for installation instructions.

## Steps to running the job

103 changes: 42 additions & 61 deletions marathon/README.md
@@ -1,89 +1,70 @@
# Running Distributed TensorFlow on Mesos/Marathon

## Prerequisites
Before you start, you need a Mesos cluster with Marathon installed, and with the Docker Containerizer and Mesos-DNS enabled. It is also preferable to set up shared storage such as HDFS in the cluster. All of these can be easily installed and configured with the help of [DC/OS](https://dcos.io/docs/1.7/administration/installing/custom/gui/). Make a note of the Marathon master target, the DNS domain, and the HDFS namenode, which are needed to bring up the TensorFlow cluster.

## Write your training program
This section covers how to write your training program and build your Docker image.

1. Write your own training program. This program must accept `worker_hosts`, `ps_hosts`, `job_name`, and `task_index` as command-line flags, which are then parsed to build the `ClusterSpec`. After that, the task either joins the server (for ps tasks) or starts building graphs (for worker tasks). Please refer to the [main page](../README.md) for code snippets and a description of between-graph replication. An example can be found in `docker/mnist_replica.py`:

```python
import tensorflow as tf

flags = tf.app.flags

# Flags for configuring the task
flags.DEFINE_integer("task_index", None,
                     "Worker task index, should be >= 0. task_index=0 is "
                     "the master worker task that performs the variable "
                     "initialization")
flags.DEFINE_string("ps_hosts", None,
                    "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("worker_hosts", None,
                    "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("job_name", None, "job name: worker or ps")

FLAGS = flags.FLAGS
```

These flags are parsed into a `ClusterSpec` at the beginning, and the TensorFlow server is started before training begins:

```python
# Construct the cluster and start the server
ps_spec = FLAGS.ps_hosts.split(",")
worker_spec = FLAGS.worker_hosts.split(",")

cluster_spec = tf.train.ClusterSpec({
    "ps": ps_spec,
    "worker": worker_spec})

server = tf.train.Server(
    cluster_spec, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
  server.join()
```

This code is included in the example located in `docker/mnist_replica.py`.

The worker task and parameter server task usually share a common program. Therefore, in the training program, a parameter server task simply joins the server, while a worker task builds the graph and starts its session. This is the typical setup for between-graph replication training, illustrated in the following diagram; note that each dashed box indicates a task.

![Diagram for Between-graph replication](../images/between-graph_replication.png "Between-graph Replication")

If a large training input is needed by the training program, we recommend copying your data to shared storage first and then pointing each worker at the data; a sketch of staging the data appears after this list. You may want to add a flag called `data_dir`; please refer to the [Add Command-line Flags](#add-command-line-flags) section.

2. Write your own Dockerfile, which simply copies your training program into the image and optionally specifies an entrypoint. An example is located in `docker/Dockerfile`, or `docker/Dockerfile.hdfs` if you need HDFS support. TensorBoard can also use the same image, but with a different entrypoint.

3. Build your Docker image and push it to a Docker repository:

```sh
cd docker
docker build -t <image_name> -f Dockerfile.hdfs .
# Use gcloud docker push instead if on Google Container Registry.
docker push <image_name>
```
Please refer to the [docker images](../docker/README.md) page for best practices on building Docker images.
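As recommended in step 1, large training inputs should live on shared storage. The following is a minimal, hypothetical sketch of staging data on HDFS; the namenode address and paths are placeholders:

```sh
# Stage the training data on HDFS so every worker can read it via --data_dir.
hadoop fs -mkdir -p hdfs://namenode/data_dir
hadoop fs -put ./mnist-data hdfs://namenode/data_dir
```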

## Generate Marathon Config
The Marathon config is generated from a Jinja template; you customize your own cluster configuration in the template's header.

1. Copy over the template file:

```sh
cp marathon/template.json.jinja mycluster.json.jinja
```

2. Edit the `mycluster.json.jinja` file. You need to specify `name`, `image_name`, and `train_dir`, and you can optionally change the number of worker and ps replicas (an illustrative header appears after this list). `train_dir` must point to a directory on shared storage if you would like to use TensorBoard or sharded checkpoints.
3. Generate the Marathon json config:

```sh
python render_template.py mycluster.json.jinja > mycluster.json
```
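For illustration, the header of `mycluster.json.jinja` sets these values with Jinja `set` statements. The variable names below come from step 2; the values are hypothetical:

```
{# Hypothetical values; adjust for your own cluster. #}
{%- set name = "mytf" %}
{%- set image_name = "registry.example.com/tf-mnist:latest" %}
{%- set train_dir = "hdfs://namenode/train_dir" %}
```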

## Start the TensorFlow cluster
To start the cluster, simply post the Marathon JSON config file to the Marathon master target, which is `marathon.mesos:8080` by default:

```sh
curl -i -H 'Content-Type: application/json' -d @mycluster.json http://marathon.mesos:8080/v2/groups
```

To make sure your cluster is running the training program correctly, navigate to the DC/OS web console and look at the stdout or stderr of the chief worker. The `mnist_replica.py` example prints the loss for each step and the final loss when training is done.

![Screenshot of the chief worker](../images/chief_worker_stdout.png "Screenshot of the chief worker")

If TensorBoard is enabled, navigate to `tensorboard.marathon.mesos:6006` with your browser or find out its IP address from the DC/OS web console.
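Deployment status can also be checked from the command line. The following is a hedged sketch using Marathon's standard REST API, the same endpoint family as the POST above:

```sh
# List the deployed apps and inspect their task counts and health.
curl -s http://marathon.mesos:8080/v2/apps | python -m json.tool
```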


## Add Command-line Flags

Let's suppose you would like to add a flag called `data_dir` to the rendered config. Before rendering the template, make the following changes:

1. Add a variable in the header of `mycluster.json.jinja`:
```
{%- set data_dir = "hdfs://namenode/data_dir" %}
```

2. Add the flag to the `args` section of the template:
```
# replace "args": ["--worker_hosts", ...] with
"args": ["--data_dir", "{{ data_dir }}", "--worker_hosts", ...]
```
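With the header value from step 1, the rendered `mycluster.json` would then contain an `args` entry along these lines (the trailing `...` stands for the remaining flags):

```
"args": ["--data_dir", "hdfs://namenode/data_dir", "--worker_hosts", ...]
```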
