# Run jobs
## Illumination correction
### Single node
To compute illumination functions directly on the EC2 node, run the contents of `cp_docker_commands.txt` for each plate:
```sh
for PLATE_ID in $(cat ${PLATES}); do
  parallel -a ../../batchfiles/${BATCH_ID}/${PLATE_ID}/illum/cp_docker_commands.txt
done
```
If this is run on the EC2 node, this is the resulting structure of the `illum` output of `illum.cppipe` on EFS (one level below `workspace`). Files for only `SQ00015167` are shown.
```
└── 2016_04_01_a549_48hr_batch1
└── illum
└── SQ00015167
├── SQ00015167_IllumAGP.mat
├── SQ00015167_IllumDNA.mat
├── SQ00015167_IllumER.mat
├── SQ00015167_IllumMito.mat
├── SQ00015167_IllumRNA.mat
└── SQ00015167.stderr
```
Sync this folder to S3, maintaining the same structure. If you used DCP to run this pipeline (discussed below), the files will have been stored directly on S3, in which case there's no need to do a sync.
```sh
cd ~/efs/${PROJECT_NAME}/
aws s3 sync ${BATCH_ID}/illum/${PLATE_ID} s3://${BUCKET}/projects/${PROJECT_NAME}/${BATCH_ID}/illum/${PLATE_ID}
```
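When multiple plates were processed, the sync can be looped over the plate list. A minimal dry-run sketch (the leading `echo` prints each command instead of executing it, and the plate list, bucket, and project names below are placeholders):

```shell
# Dry-run sketch: prints one `aws s3 sync` command per plate.
# Remove the leading `echo` to actually execute the syncs.
# PLATES, BUCKET, PROJECT_NAME, and BATCH_ID hold placeholder values here.
PLATES=$(mktemp)
printf 'SQ00015167\nSQ00015168\n' > "${PLATES}"
BUCKET=example-bucket
PROJECT_NAME=example-project
BATCH_ID=2016_04_01_a549_48hr_batch1
for PLATE_ID in $(cat ${PLATES}); do
  echo aws s3 sync ${BATCH_ID}/illum/${PLATE_ID} \
    s3://${BUCKET}/projects/${PROJECT_NAME}/${BATCH_ID}/illum/${PLATE_ID}
done
```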
### DCP
Edit the config files `illum_config.py` and `illum_config.json` in `cellpainting_scripts/dcp_config_files/` as needed. Then copy the config to the DCP directory and set up the compute environment:
```sh
cd ~/efs/${PROJECT_NAME}/workspace/software/Distributed-CellProfiler/
pyenv shell 2.7.12
cp ../cellpainting_scripts/dcp_config_files/illum_config.py config.py
fab setup
```
Submit jobs and start the cluster, then monitor:
```sh
parallel \
python run.py submitJob \
~/efs/${PROJECT_NAME}/workspace/batchfiles/${BATCH_ID}/{1}/illum/dcp_config.json :::: ${PLATES}
python run.py \
startCluster \
../cellpainting_scripts/dcp_config_files/illum_config.json
# Run this in a tmux session. Replace `APP_NAME` with the value of APP_NAME in `illum_config.py`
python run.py monitor files/APP_NAMESpotFleetRequestId.json
```
## Quality control
### Process QC results into a database for CPA
```sh
cd ~/efs/${PROJECT_NAME}/workspace/software/
git clone https://[email protected]/cytomining/cytominer-database
cd cytominer-database
pyenv shell 3.5.1
pip install -e .
cd ~/efs/${PROJECT_NAME}/workspace
mkdir qc
cytominer-database ingest \
  ~/bucket/projects/${PROJECT_NAME}/workspace/qc/${BATCH_ID}/results \
  sqlite:///qc/${BATCH_ID}_QC.sqlite \
  -c software/cytominer-database/cytominer_database/config/config_default.ini \
  --no-munge
rsync qc/${BATCH_ID}_QC.sqlite ~/bucket/projects/${PROJECT_NAME}/workspace/qc/${BATCH_ID}_QC.sqlite
```
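Before downloading, it can be worth a quick sanity check that the ingest produced a populated database. A sketch, assuming the `sqlite3` command-line tool is installed:

```shell
# Sanity check on the ingested database (assumes the sqlite3 CLI is available).
sqlite3 qc/${BATCH_ID}_QC.sqlite ".tables"                        # list ingested tables
sqlite3 qc/${BATCH_ID}_QC.sqlite "SELECT COUNT(*) FROM Image;"   # expect one row per site
```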
You can then download the database to your local machine. To update the S3 image paths to your local image paths, configure and execute the following SQL statements (DB Browser for SQLite, for example, lets you do this easily in its GUI). You need only specify the parts of the paths that differ, not the whole path.
```sql
UPDATE Image
SET PathName_OrigBrightfield= REPLACE(PathName_OrigBrightfield, '/home/ubuntu/bucket/projects/s3/path/to/files/', '/local/path/to/files/')
WHERE PathName_OrigBrightfield LIKE '%/home/ubuntu/bucket/projects/%';
UPDATE Image
SET PathName_OrigAGP= REPLACE(PathName_OrigAGP, '/home/ubuntu/bucket/projects/s3/path/to/files/', '/local/path/to/files/')
WHERE PathName_OrigAGP LIKE '%/home/ubuntu/bucket/projects/%';
UPDATE Image
SET PathName_OrigDNA= REPLACE(PathName_OrigDNA, '/home/ubuntu/bucket/projects/s3/path/to/files/', '/local/path/to/files/')
WHERE PathName_OrigDNA LIKE '%/home/ubuntu/bucket/projects/%';
UPDATE Image
SET PathName_OrigER= REPLACE(PathName_OrigER, '/home/ubuntu/bucket/projects/s3/path/to/files/', '/local/path/to/files/')
WHERE PathName_OrigER LIKE '%/home/ubuntu/bucket/projects/%';
UPDATE Image
SET PathName_OrigMito= REPLACE(PathName_OrigMito, '/home/ubuntu/bucket/projects/s3/path/to/files/', '/local/path/to/files/')
WHERE PathName_OrigMito LIKE '%/home/ubuntu/bucket/projects/%';
UPDATE Image
SET PathName_OrigRNA= REPLACE(PathName_OrigRNA, '/home/ubuntu/bucket/projects/s3/path/to/files/', '/local/path/to/files/')
WHERE PathName_OrigRNA LIKE '%/home/ubuntu/bucket/projects/%';
```
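Since the six statements differ only in the channel name, one option is to generate them with a small shell loop rather than editing each by hand. A sketch (the channel list matches the columns above; the two path prefixes are placeholders to adjust):

```shell
# Emit one UPDATE statement per channel; pipe the output into sqlite3 to apply,
# e.g. `... | sqlite3 your_database.sqlite`. Both prefixes are placeholders.
S3_PREFIX='/home/ubuntu/bucket/projects/s3/path/to/files/'
LOCAL_PREFIX='/local/path/to/files/'
for CHANNEL in OrigBrightfield OrigAGP OrigDNA OrigER OrigMito OrigRNA; do
  cat <<EOF
UPDATE Image
SET PathName_${CHANNEL} = REPLACE(PathName_${CHANNEL}, '${S3_PREFIX}', '${LOCAL_PREFIX}')
WHERE PathName_${CHANNEL} LIKE '%/home/ubuntu/bucket/projects/%';
EOF
done
```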
Windows users must then also execute the following statements to convert the forward slashes in the paths to backslashes.
```sql
UPDATE Image
SET PathName_OrigBrightfield= REPLACE(PathName_OrigBrightfield, '/', '\')
WHERE PathName_OrigBrightfield LIKE '%/%';
UPDATE Image
SET PathName_OrigAGP= REPLACE(PathName_OrigAGP, '/', '\')
WHERE PathName_OrigAGP LIKE '%/%';
UPDATE Image
SET PathName_OrigDNA= REPLACE(PathName_OrigDNA, '/', '\')
WHERE PathName_OrigDNA LIKE '%/%';
UPDATE Image
SET PathName_OrigER= REPLACE(PathName_OrigER, '/', '\')
WHERE PathName_OrigER LIKE '%/%';
UPDATE Image
SET PathName_OrigMito= REPLACE(PathName_OrigMito, '/', '\')
WHERE PathName_OrigMito LIKE '%/%';
UPDATE Image
SET PathName_OrigRNA= REPLACE(PathName_OrigRNA, '/', '\')
WHERE PathName_OrigRNA LIKE '%/%';
```
You can now configure your CPA properties file with the name of your new database and perform the QC.
For more information on this process, see the CellProfiler/tutorials repo.
## Analysis
### Single node
To run the analysis pipeline directly on the EC2 node, run the contents of `cp_docker_commands.txt` for each plate:
```sh
for PLATE_ID in $(cat ${PLATES}); do
  parallel -a ../../batchfiles/${BATCH_ID}/${PLATE_ID}/analysis/cp_docker_commands.txt
done
```
If this is run on the EC2 node, this is the resulting structure of `analysis`, containing the output of `analysis.cppipe`, on EFS (one level below `workspace`). Files for only `SQ00015167` are shown.
```
└── analysis
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
└── analysis
└── A01-1
├── Cells.csv
├── Cytoplasm.csv
├── Experiment.csv
├── Image.csv
├── Nuclei.csv
└── outlines
├── A01_s1--cell_outlines.png
└── A01_s1--nuclei_outlines.png
```
`A01-1` is site 1 of well A01. In a 384-well plate, there will be 384\*9 such folders.
Note that the file `Experiment.csv` may get created one level above `A01-1`, i.e., directly under the plate's `analysis` folder (see https://github.com/CellProfiler/CellProfiler/issues/1110).
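A quick way to confirm that every site completed is to count the per-site folders for a plate. A sketch, assuming the directory layout above:

```shell
# Count per-site output folders for one plate; for a 384-well plate imaged at
# 9 sites per well, expect 3456 (= 384*9). BATCH_ID/PLATE_ID as set earlier.
find analysis/${BATCH_ID}/${PLATE_ID}/analysis -mindepth 1 -maxdepth 1 -type d | wc -l
```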
Sync this folder to S3, maintaining the same structure. If you used DCP to run this pipeline (discussed below), the files will have been stored directly on S3, in which case there's no need to do a sync.
```sh
cd ~/efs/${PROJECT_NAME}/workspace/
aws s3 sync analysis/${BATCH_ID}/${PLATE_ID}/analysis s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/analysis/${BATCH_ID}/${PLATE_ID}/analysis/
```
### DCP
Edit the config files `analysis_config.py` and `analysis_config.json` in `cellpainting_scripts/dcp_config_files/` as needed.
At the very least, replace the strings `VAR_AWS_ACCOUNT_NUMBER`, `VAR_AWS_BUCKET`, `VAR_SUBNET_ID`, `VAR_GROUP_ID`, and `VAR_KEYNAME` with appropriate values.
You can do so using `sed`:
```sh
cd cellpainting_scripts/dcp_config_files/
for CONFIG_FILE in analysis_config.py analysis_config.json illum_config.py illum_config.json; do
  sed -i "s/VAR_AWS_ACCOUNT_NUMBER/NNNNNNNNNNN/g" ${CONFIG_FILE}
  sed -i "s/VAR_AWS_BUCKET/name-of-s3-bucket/g" ${CONFIG_FILE}
  sed -i "s/VAR_SUBNET_ID/subnet-NNNNNNNN/g" ${CONFIG_FILE}
  sed -i "s/VAR_GROUP_ID/sg-NNNNNNNN/g" ${CONFIG_FILE}
  sed -i "s/VAR_KEYNAME/filename-of-key-file-without-extension/g" ${CONFIG_FILE}
done
cd ..
```
Copy `analysis_config.py` to the DCP directory and set up the compute environment.
```sh
cd ~/efs/${PROJECT_NAME}/workspace/software/Distributed-CellProfiler/
pyenv shell 2.7.12
cp ../cellpainting_scripts/dcp_config_files/analysis_config.py config.py
fab setup
```
Submit jobs and start the cluster, then monitor:
```sh
parallel \
python run.py submitJob \
~/efs/${PROJECT_NAME}/workspace/batchfiles/${BATCH_ID}/{1}/analysis/dcp_config.json :::: ${PLATES}
python run.py \
startCluster \
../cellpainting_scripts/dcp_config_files/analysis_config.json
# Run this in a tmux session. Replace `APP_NAME` with the value of APP_NAME in `analysis_config.py`
python run.py monitor files/APP_NAMESpotFleetRequestId.json
```