# Create profiles
## Create database backend
Run creation of the SQLite backend as well as aggregation of measurements into per-well profiles.
This process can be very slow when the files are read directly from s3fs/EFS, so we recommend
downloading the CSV files locally first and then ingesting.
To do so, recreate the folder structure on EBS and then run `collate.R`.
```sh
mkdir -p ~/ebs_tmp/${PROJECT_NAME}/workspace/software
cd ~/ebs_tmp/${PROJECT_NAME}/workspace/software
if [ -d cytominer_scripts ]; then rm -rf cytominer_scripts; fi
git clone https://github.com/broadinstitute/cytominer_scripts.git
cd cytominer_scripts
pyenv local 3.5.1
```
The command below
first calls `cytominer-database ingest` to create the SQLite backend, and then `aggregate.R`
to create per-well profiles. Once complete, all files are uploaded to S3 and the local cache is deleted.
```sh
mkdir -p ../../log/${BATCH_ID}/
parallel \
--max-procs ${MAXPROCS} \
--eta \
--joblog ../../log/${BATCH_ID}/collate.log \
--results ../../log/${BATCH_ID}/collate \
--files \
--keep-order \
./collate.R \
-b ${BATCH_ID} \
--plate {1} \
-c ingest_config.ini \
--tmpdir ~/ebs_tmp \
-d \
-r s3://${BUCKET}/projects/${PROJECT_NAME}/workspace :::: ${PLATES}
```
*Troubleshooting tip:* For pipelines that use FlagImage to skip the measurement modules when an image fails QC, the failed images will have `Image.csv` files with fewer columns than the rest (because columns corresponding to the aggregated measurements will be absent). The ingest command will show an sqlite warning: `expected X columns but found Y - filling the rest with NULL`. This is expected behavior.
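If you want to confirm which sites were affected, a quick column count across the cached `Image.csv` headers will surface the short ones. This is a minimal sketch, assuming the per-site CSVs were cached under `analysis/${BATCH_ID}` in `~/ebs_tmp`; adjust the path to wherever your analysis output lives.
```sh
# List the column count of each Image.csv header; QC-failed sites show fewer
# columns than the rest. The path below is an assumption; adjust as needed.
find ~/ebs_tmp/${PROJECT_NAME}/workspace/analysis/${BATCH_ID} -name Image.csv \
  -exec sh -c 'echo "$(head -1 "$1" | tr "," "\n" | wc -l) $1"' _ {} \; | sort -n | head
```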
This is the resulting structure of `backend` on S3 (one level below `workspace`) for `SQ00015167`:
```
└── backend
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
├── SQ00015167.csv
└── SQ00015167.sqlite
```
`SQ00015167.sqlite` is the per-cell data and `SQ00015167.csv` is the aggregated per-well data.
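If you want to poke at the per-cell backend directly, `sqlite3` works on a local copy. A minimal sketch, assuming the usual cytominer-database table layout for CellProfiler output (`Image`, `Cells`, `Cytoplasm`, `Nuclei`); adjust the table name if your pipeline differs.
```sh
# Download one backend and count its per-cell rows (table name is an assumption)
aws s3 cp s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/backend/${BATCH_ID}/SQ00015167/SQ00015167.sqlite /tmp/
sqlite3 /tmp/SQ00015167.sqlite "SELECT count(*) FROM Cells;"
```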
Copy these files from S3 to EFS to continue with the rest of the processing. The command below excludes the large `.sqlite` files, so only the per-well CSVs are synced.
```sh
aws s3 sync --exclude "*.sqlite" s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/backend/${BATCH_ID}/ ~/efs/${PROJECT_NAME}/workspace/backend/${BATCH_ID}/
```
Do a quick check to view how many rows are present in each of the aggregated per-well files.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../backend/${BATCH_ID}/{1}/{1}.csv :::: ${PLATES}
```
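For a 384-well plate, expect 385 lines per file (one header plus one row per well), assuming every well was imaged and none were dropped.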
## Annotate
First, get the metadata for the plates. This should have been created beforehand and made available in S3.
```sh
aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/ ~/efs/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/
```
This is the resulting structure of the metadata folder on EFS (one level below `workspace`):
```
└── metadata
└── 2016_04_01_a549_48hr_batch1
├── barcode_platemap.csv
└── platemap
└── C-7161-01-LM6-006.txt
```
`2016_04_01_a549_48hr_batch1` is the batch name; the plates (and all related data) are arranged under batches, as seen in the structure above.
`barcode_platemap.csv` is structured as shown below. `Assay_Plate_Barcode` and `Plate_Map_Name` are currently the only mandatory columns (they are used to join the metadata of the plate map with each assay plate).
Each unique entry in `Plate_Map_Name` should have a corresponding tab-separated `.txt` file under `platemap` (e.g. `C-7161-01-LM6-006.txt`).
```
Assay_Plate_Barcode,Plate_Map_Name
SQ00015167,C-7161-01-LM6-006
```
The tab-separated files are plate maps and are structured like this:
(This is the typical format followed by Broad Chemical Biology Platform)
```
plate_map_name well_position broad_sample mg_per_ml mmoles_per_liter solvent
C-7161-01-LM6-006 A07 BRD-K18895904-001-16-1 3.12432000000000016 9.99999999999999999 DMSO
C-7161-01-LM6-006 A08 BRD-K18895904-001-16-1 1.04143999999919895 3.33333333333076923 DMSO
C-7161-01-LM6-006 A09 BRD-K18895904-001-16-1 0.347146666668001866 1.11111111111538462 DMSO
```
NOTE:
- `plate_map_name` should be identical to the name of the file (without extension).
- `plate_map_name` and `well_position` are mandatory columns.
- If you have two sets of plates that share a platemap but are plated with different cell lines, create one plate map file per cell line (e.g. `C-7161-01-LM6-006_A549.txt`), set `plate_map_name` in each file to its file name (without extension), add a `cell_id` column populated with the cell line name (e.g. `A549`), and reflect the new names in `barcode_platemap.csv`.
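Before annotating, it is worth verifying that every plate map referenced in `barcode_platemap.csv` actually exists under `platemap/`. A small sketch using `csvcut` (from csvkit, which this workflow already relies on):
```sh
# Flag any Plate_Map_Name that lacks a matching platemap .txt file
csvcut -c Plate_Map_Name ~/efs/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/barcode_platemap.csv | \
  tail -n +2 | sort -u | \
  while read -r pm; do
    [ -f ~/efs/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/platemap/"${pm}".txt ] || \
      echo "missing platemap: ${pm}"
  done
```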
Next, append the metadata to the aggregated per-well data. You may choose to additionally append columns from another source ("EXTERNAL_METADATA" below), which you can specify using the `-j` flag.
```sh
cd ~/efs/${PROJECT_NAME}/workspace/software/cytominer_scripts
EXTERNAL_METADATA=../../metadata/${BATCH_ID}/cell_painting_dataset_cmap_annotations_moa.csv
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/annotate.log \
--results ../../log/${BATCH_ID}/annotate \
--files \
--keep-order \
./annotate.R \
-b ${BATCH_ID} \
-p {1} \
-d \
-j ${EXTERNAL_METADATA} \
-m chemical :::: ${PLATES}
```
This is the resulting structure of `backend` on EFS (one level below `workspace`) for `SQ00015167`:
```
└── backend
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
├── SQ00015167_augmented.csv
└── SQ00015167.csv
```
`SQ00015167_augmented.csv` is the aggregated per-well data, annotated with metadata.
Do a quick check to view how many rows are present in each of the annotated per-well files.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../backend/${BATCH_ID}/{1}/{1}_augmented.csv :::: ${PLATES}
```
## Normalize
Use all wells on the plate to normalize each feature. By default, this performs robust z-scoring per feature. The default input is the annotated per-well data.
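For reference, the robust z-score centers each feature on the median and scales by the median absolute deviation (both computed over the reference wells selected by the `-s` filter), rather than the mean and standard deviation:

$$x' = \frac{x - \operatorname{median}(x)}{\operatorname{MAD}(x)}$$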
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/normalize.log \
--results ../../log/${BATCH_ID}/normalize \
--files \
--keep-order \
./normalize.R \
-b ${BATCH_ID} \
-p {1} \
-s \"Metadata_broad_sample_type != \'\'\'dummy\'\'\'\" :::: ${PLATES}
```
NOTE:
- Don't escape the quotes when running the command outside `parallel`, i.e., use `-s "Metadata_broad_sample_type != '''dummy'''"` directly.
- To use a different reference distribution for computing the median and m.a.d. for z-scoring, change the filter specified with the `-s` flag.
This is the resulting structure of `backend` on EFS (one level below `workspace`) for `SQ00015167`:
```
└── backend
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
├── SQ00015167_augmented.csv
├── SQ00015167.csv
└── SQ00015167_normalized.csv
```
`SQ00015167_normalized.csv` is the robust z-scored (normalized) per-well data.
Do a quick check to view how many rows are present in each of the normalized per-well files.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../backend/${BATCH_ID}/{1}/{1}_normalized.csv :::: ${PLATES}
```
## Select variables
Create samples to use for variable selection by sampling some wells from each replicate. Below, this is done by sampling 2 entire replicate plates per plate map; use `-n` to specify the number of replicate plates used to create the sample.
Samples are created for both normalized and unnormalized data, because the variable selection techniques may require both.
```sh
mkdir -p ../../parameters/${BATCH_ID}/sample/
# sample normalized data
./sample.R \
-b ${BATCH_ID} \
-f "_normalized.csv$" \
-n 2 \
-o ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_normalized_sample.feather
# sample unnormalized data
./sample.R \
-b ${BATCH_ID} \
-f "_augmented.csv$" \
-n 2 \
-o ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_augmented_sample.feather
```
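To sanity-check the samples, read the feather files back and print their dimensions. A minimal sketch, assuming the `feather` R package is installed (cytominer_scripts uses it to write these files):
```sh
Rscript -e "print(dim(feather::read_feather('../../parameters/${BATCH_ID}/sample/${BATCH_ID}_normalized_sample.feather')))"
```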
Make a list of variables to be preserved after `replicate_correlation` variable selection is performed.
```sh
./preselect.R \
-b ${BATCH_ID} \
-i ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_normalized_sample.feather \
-r replicate_correlation \
-s "Metadata_broad_sample_type == '''trt'''" \
-n 2
```
Make a list of variables to be preserved after `correlation_threshold` variable selection is performed.
```sh
./preselect.R \
-b ${BATCH_ID} \
-i ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_normalized_sample.feather \
-r correlation_threshold
```
Make a list of variables to be preserved after `variance_threshold` variable selection is performed.
```sh
./preselect.R \
-b ${BATCH_ID} \
-i ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_augmented_sample.feather \
-r variance_threshold \
-s "Metadata_broad_sample_type == '''control'''"
```
Some variables have previously been identified as noisy or non-informative. Create a list of variables that excludes them.
```sh
# manually remove some features
mkdir -p ../../parameters/${BATCH_ID}/variable_selection/
echo "variable" > ../../parameters/${BATCH_ID}/variable_selection/manual.txt
head -1 \
../../backend/${BATCH_ID}/${SAMPLE_PLATE_ID}/${SAMPLE_PLATE_ID}.csv \
|tr "," "\n"|grep -v Meta|grep -E -v 'Granularity_14|Granularity_15|Granularity_16|Manders|RWC' >> \
../../parameters/${BATCH_ID}/variable_selection/manual.txt
```
ALTERNATIVE: You may have already performed these steps for a different batch of data, and want to simply copy the parameters to this batch:
```sh
mkdir -p ../../parameters/${BATCH_ID}/variable_selection/
REFERENCE_BATCH_ID=2018_02_23_LKCP_DBG
aws s3 sync \
s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/parameters/${REFERENCE_BATCH_ID}/ \
~/efs/${PROJECT_NAME}/workspace/parameters/${REFERENCE_BATCH_ID}/
rsync -arzv ../../parameters/${REFERENCE_BATCH_ID}/variable_selection/ ../../parameters/${BATCH_ID}/variable_selection/
```
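At this point it can help to eyeball what each method produced. A hedged sketch, assuming each preselect run wrote a one-column text file (like `manual.txt` above) under `variable_selection/`:
```sh
# Count candidate variables per method; the file naming is an assumption
# based on the -r arguments used above
wc -l ../../parameters/${BATCH_ID}/variable_selection/*.txt
```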
The previous steps only create a list of variables to be preserved for each variable selection method.
To actually apply variable selection, we compute the intersection of all these variable lists, then preserve only those columns of the normalized per-well data.
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/select.log \
--results ../../log/${BATCH_ID}/select \
--files \
--keep-order \
./select.R \
-b ${BATCH_ID} \
-p {1} \
-r variance_threshold,replicate_correlation,correlation_threshold,manual :::: ${PLATES}
```
This is the resulting structure of `backend` on EFS (one level below `workspace`) for `SQ00015167`:
```
└── backend
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
├── SQ00015167_augmented.csv
├── SQ00015167.csv
├── SQ00015167_normalized.csv
└── SQ00015167_normalized_variable_selected.csv
```
`SQ00015167_normalized_variable_selected.csv` is the variable-selected version of the normalized per-well data.
Do a quick check to view how many rows are present in each of the variable-selected per-well files.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../backend/${BATCH_ID}/{1}/{1}_normalized_variable_selected.csv :::: ${PLATES}
```
## Aggregate replicates
Combine replicate plates of each plate map by averaging (mean).
```sh
mkdir -p ../../collated/${BATCH_ID}/
mkdir -p ../../scratch/${BATCH_ID}/
PLATE_MAPS=../../scratch/${BATCH_ID}/plate_maps.txt
csvcut -c Plate_Map_Name \
../../metadata/${BATCH_ID}/barcode_platemap.csv | \
tail -n +2|sort|uniq > \
${PLATE_MAPS}
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/collapse.log \
--results ../../log/${BATCH_ID}/collapse \
--keep-order \
./collapse.R \
-b ${BATCH_ID} \
-m {1} \
-f _normalized_variable_selected.csv \
-o ../../collated/${BATCH_ID}/{1}_collapsed.csv :::: ${PLATE_MAPS}
```
This is the resulting structure of `collated` on EFS (one level below `workspace`) for `2016_04_01_a549_48hr_batch1`:
```
└── collated
└── 2016_04_01_a549_48hr_batch1
└── C-7161-01-LM6-006_collapsed.csv
```
`C-7161-01-LM6-006_collapsed.csv` is the replicate-averaged data for plate map `C-7161-01-LM6-006`.
Do a quick check to view how many rows are present in the replicate-averaged data for each plate map.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../collated/${BATCH_ID}/{1}_collapsed.csv :::: ${PLATE_MAPS}
```
Combine all averaged profiles in the batch into a single file.
```sh
csvstack \
../../collated/${BATCH_ID}/*_collapsed.csv > \
../../collated/${BATCH_ID}/${BATCH_ID}_collapsed.csv
```
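As with the per-plate-map files, a quick row count confirms the stack succeeded; since `csvstack` emits a single header, expect the sum of the per-file row counts minus one header line per additional input file.
```sh
wc -l ../../collated/${BATCH_ID}/${BATCH_ID}_collapsed.csv
```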
## Audit
Audit each plate map for replicate reproducibility
```sh
mkdir -p ../../audit/${BATCH_ID}/
```
Audit only treated wells
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/audit.log \
--results ../../log/${BATCH_ID}/audit \
--files \
--keep-order \
./audit.R \
-b ${BATCH_ID} \
-m {1} \
-f _normalized_variable_selected.csv \
-s \"Metadata_broad_sample_type == \'\'\'trt\'\'\'\" \
-o ../../audit/${BATCH_ID}/{1}_audit.csv \
-l ../../audit/${BATCH_ID}/{1}_audit_detailed.csv \
-p Metadata_Plate_Map_Name,Metadata_moa,Metadata_pert_id,Metadata_broad_sample,Metadata_mmoles_per_liter,Metadata_Well :::: ${PLATE_MAPS}
```
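To skim the results, `csvlook` (also part of csvkit) renders one plate map's audit summary as a table; here the first plate map name is pulled from the `${PLATE_MAPS}` list built earlier:
```sh
csvlook ../../audit/${BATCH_ID}/$(head -1 ${PLATE_MAPS})_audit.csv
```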
Audit only control wells, i.e., how well do control wells in the same position correlate?
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/audit_control.log \
--results ../../log/${BATCH_ID}/audit_control \
--files \
--keep-order \
./audit.R \
-b ${BATCH_ID} \
-m {1} \
-f _normalized_variable_selected.csv \
-s \"Metadata_broad_sample_type == \'\'\'control\'\'\'\" \
-o ../../audit/${BATCH_ID}/{1}_audit_control.csv \
-l ../../audit/${BATCH_ID}/{1}_audit_control_detailed.csv \
-p Metadata_Well :::: ${PLATE_MAPS}
```
## Convert to other formats
Convert per-plate CSV files to GCT
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/csv2gct_backend.log \
--results ../../log/${BATCH_ID}/csv2gct_backend \
--files \
--keep-order \
./csv2gct.R \
../../backend/${BATCH_ID}/{1}/{1}_{2}.csv \
-o ../../backend/${BATCH_ID}/{1}/{1}_{2}.gct :::: ${PLATES} ::: augmented normalized normalized_variable_selected
```
Convert per-plate-map (collapsed) CSV files to GCT
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/csv2gct_collapsed.log \
--results ../../log/${BATCH_ID}/csv2gct_collapsed \
--files \
--keep-order \
./csv2gct.R \
../../collated/${BATCH_ID}/{1}_collapsed.csv \
-o ../../collated/${BATCH_ID}/{1}_collapsed.gct :::: ${PLATE_MAPS}
```
Convert the combined batch-level collapsed CSV to GCT
```sh
./csv2gct.R \
../../collated/${BATCH_ID}/${BATCH_ID}_collapsed.csv \
-o ../../collated/${BATCH_ID}/${BATCH_ID}_collapsed.gct
```
## Upload data
### Sync to S3
```sh
parallel \
aws s3 sync \
../../{1}/${BATCH_ID}/ \
s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/{1}/${BATCH_ID}/ ::: audit backend batchfiles collated load_data_csv log metadata parameters scratch
```
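Since this sync touches many prefixes at once, a first pass with `--dryrun` (a standard `aws s3 sync` flag) is a cheap way to confirm what would be uploaded before committing:
```sh
# Dry run on a subset of prefixes; nothing is actually transferred
parallel \
  aws s3 sync --dryrun \
  ../../{1}/${BATCH_ID}/ \
  s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/{1}/${BATCH_ID}/ ::: audit backend collated
```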
### Sync down from S3 onto a machine
Specify location for syncing
```sh
BROAD_NFS=/cmap/imaging
```
Set variables
```sh
BUCKET=imaging-platform-dev
PROJECT_NAME=2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad
BATCH_ID=2016_04_01_a549_48hr_batch1
```
Sync the files
```sh
echo audit backend batchfiles collated load_data_csv log metadata parameters scratch | \
tr " " "\n" |
xargs -I % \
aws s3 sync \
--exclude "*.sqlite" \
s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/%/${BATCH_ID}/ \
${BROAD_NFS}/${PROJECT_NAME}/workspace/%/${BATCH_ID}/
```