# Create profiles
## Create database backend
Run creation of the SQLite backend as well as aggregation of measurements into per-well profiles.
This process can be very slow when the files are read directly from s3fs/EFS, so we recommend
downloading the CSV files locally first and then ingesting.
To do so, recreate the folder structure on EBS and then run `collate.R`.
```sh
mkdir -p ~/ebs_tmp/${PROJECT_NAME}/workspace/software
cd ~/ebs_tmp/${PROJECT_NAME}/workspace/software
if [ -d cytominer_scripts ]; then rm -rf cytominer_scripts; fi
git clone https://github.com/broadinstitute/cytominer_scripts.git
cd cytominer_scripts
pyenv local 3.5.1
```
The command below
first calls `cytominer-database ingest` to create the SQLite backend, and then `aggregate.R`
to create per-well profiles. Once complete, all files are uploaded to S3 and the local cache is deleted.
```sh
mkdir -p ../../log/${BATCH_ID}/
parallel \
--max-procs ${MAXPROCS} \
--eta \
--joblog ../../log/${BATCH_ID}/collate.log \
--results ../../log/${BATCH_ID}/collate \
--files \
--keep-order \
./collate.R \
-b ${BATCH_ID} \
--plate {1} \
-c ingest_config.ini \
--tmpdir ~/ebs_tmp \
-d \
-r s3://${BUCKET}/projects/${PROJECT_NAME}/workspace :::: ${PLATES}
```
*Troubleshooting tip:* For pipelines that use FlagImage to skip the measurement modules when an image fails QC, the failed images will have `Image.csv` files with fewer columns than the rest (because columns corresponding to the aggregated measurements will be absent). The ingest command will show an sqlite warning: `expected X columns but found Y - filling the rest with NULL`. This is expected behavior.
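If you want to confirm which sites were affected, a quick column count across the cached `Image.csv` headers will surface the short ones. This is a minimal sketch, assuming the per-site CSVs were cached under `analysis/${BATCH_ID}` in `~/ebs_tmp`; adjust the path to wherever your analysis output lives.
```sh
# List the column count of each Image.csv header; QC-failed sites show fewer
# columns than the rest. The path below is an assumption; adjust as needed.
find ~/ebs_tmp/${PROJECT_NAME}/workspace/analysis/${BATCH_ID} -name Image.csv \
  -exec sh -c 'echo "$(head -1 "$1" | tr "," "\n" | wc -l) $1"' _ {} \; | sort -n | head
```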
This is the resulting structure of `backend` on S3 (one level below `workspace`) for `SQ00015167`:
```
└── backend
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
├── SQ00015167.csv
└── SQ00015167.sqlite
```
`SQ00015167.sqlite` is the per-cell data and `SQ00015167.csv` is the aggregated per-well data.
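If you want to poke at the per-cell backend directly, `sqlite3` works on a local copy. A minimal sketch, assuming the usual cytominer-database table layout for CellProfiler output (`Image`, `Cells`, `Cytoplasm`, `Nuclei`); adjust the table name if your pipeline differs.
```sh
# Download one backend and count its per-cell rows (table name is an assumption)
aws s3 cp s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/backend/${BATCH_ID}/SQ00015167/SQ00015167.sqlite /tmp/
sqlite3 /tmp/SQ00015167.sqlite "SELECT count(*) FROM Cells;"
```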
Copy these files from S3 to EFS to continue with the rest of the processing. The command below excludes the large `.sqlite` files, so only the per-well CSVs are synced.
```sh
aws s3 sync --exclude "*.sqlite" s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/backend/${BATCH_ID}/ ~/efs/${PROJECT_NAME}/workspace/backend/${BATCH_ID}/
```
Do a quick check to view how many rows are present in each of the aggregated per-well files.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../backend/${BATCH_ID}/{1}/{1}.csv :::: ${PLATES}
```
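For a 384-well plate, expect 385 lines per file (one header plus one row per well), assuming every well was imaged and none were dropped.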
## Annotate
First, get the metadata for the plates. This should have been created beforehand and made available in S3.
```sh
aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/ ~/efs/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/
```
This is the resulting structure of the metadata folder on EFS (one level below `workspace`):
```
└── metadata
└── 2016_04_01_a549_48hr_batch1
├── barcode_platemap.csv
└── platemap
└── C-7161-01-LM6-006.txt
```
`2016_04_01_a549_48hr_batch1` is the batch name; the plates (and all related data) are arranged under batches, as seen in the structure above.
`barcode_platemap.csv` is structured as shown below. `Assay_Plate_Barcode` and `Plate_Map_Name` are currently the only mandatory columns (they are used to join the metadata of the plate map with each assay plate).
Each unique entry in `Plate_Map_Name` should have a corresponding tab-separated `.txt` file under `platemap` (e.g. `C-7161-01-LM6-006.txt`).
```
Assay_Plate_Barcode,Plate_Map_Name
SQ00015167,C-7161-01-LM6-006
```
The tab-separated files are plate maps and are structured like this:
(This is the typical format followed by Broad Chemical Biology Platform)
```
plate_map_name well_position broad_sample mg_per_ml mmoles_per_liter solvent
C-7161-01-LM6-006 A07 BRD-K18895904-001-16-1 3.12432000000000016 9.99999999999999999 DMSO
C-7161-01-LM6-006 A08 BRD-K18895904-001-16-1 1.04143999999919895 3.33333333333076923 DMSO
C-7161-01-LM6-006 A09 BRD-K18895904-001-16-1 0.347146666668001866 1.11111111111538462 DMSO
```
NOTE:
- `plate_map_name` should be identical to the name of the file (without extension).
- `plate_map_name` and `well_position` are mandatory columns.
- If you have two sets of plates that share a platemap but are plated with different cell lines, create one plate map file per cell line (e.g. `C-7161-01-LM6-006_A549.txt`), set `plate_map_name` in each file to its file name (without extension), add a `cell_id` column populated with the cell line name (e.g. `A549`), and reflect the new names in `barcode_platemap.csv`.
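Before annotating, it is worth verifying that every plate map referenced in `barcode_platemap.csv` actually exists under `platemap/`. A small sketch using `csvcut` (from csvkit, which this workflow already relies on):
```sh
# Flag any Plate_Map_Name that lacks a matching platemap .txt file
csvcut -c Plate_Map_Name ~/efs/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/barcode_platemap.csv | \
  tail -n +2 | sort -u | \
  while read -r pm; do
    [ -f ~/efs/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}/platemap/"${pm}".txt ] || \
      echo "missing platemap: ${pm}"
  done
```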
Next, append the metadata to the aggregated per-well data. You may choose to additionally append columns from another source ("EXTERNAL_METADATA" below), which you can specify using the `-j` flag.
```sh
cd ~/efs/${PROJECT_NAME}/workspace/software/cytominer_scripts
EXTERNAL_METADATA=../../metadata/${BATCH_ID}/cell_painting_dataset_cmap_annotations_moa.csv
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/annotate.log \
--results ../../log/${BATCH_ID}/annotate \
--files \
--keep-order \
./annotate.R \
-b ${BATCH_ID} \
-p {1} \
-d \
-j ${EXTERNAL_METADATA} \
-m chemical :::: ${PLATES}
```
This is the resulting structure of `backend` on EFS (one level below `workspace`) for `SQ00015167`:
```
└── backend
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
├── SQ00015167_augmented.csv
└── SQ00015167.csv
```
`SQ00015167_augmented.csv` is the aggregated per-well data, annotated with metadata.
Do a quick check to view how many rows are present in each of the annotated per-well files.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../backend/${BATCH_ID}/{1}/{1}_augmented.csv :::: ${PLATES}
```
## Normalize
Use all wells on the plate to normalize each feature. By default, this performs robust z-scoring per feature. The default input is the annotated per-well data.
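For reference, the robust z-score centers each feature on the median and scales by the median absolute deviation (both computed over the reference wells selected by the `-s` filter), rather than the mean and standard deviation:

$$x' = \frac{x - \operatorname{median}(x)}{\operatorname{MAD}(x)}$$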
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/normalize.log \
--results ../../log/${BATCH_ID}/normalize \
--files \
--keep-order \
./normalize.R \
-b ${BATCH_ID} \
-p {1} \
-s \"Metadata_broad_sample_type != \'\'\'dummy\'\'\'\" :::: ${PLATES}
```
NOTE:
- Don't escape the quotes when running the command outside `parallel`, i.e., use `-s "Metadata_broad_sample_type != '''dummy'''"` directly.
- To use a different reference distribution for computing the median and m.a.d. for z-scoring, change the filter specified with the `-s` flag.
This is the resulting structure of `backend` on EFS (one level below `workspace`) for `SQ00015167`:
```
└── backend
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
├── SQ00015167_augmented.csv
├── SQ00015167.csv
└── SQ00015167_normalized.csv
```
`SQ00015167_normalized.csv` is the robust z-scored (normalized) per-well data.
Do a quick check to view how many rows are present in each of the normalized per-well files.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../backend/${BATCH_ID}/{1}/{1}_normalized.csv :::: ${PLATES}
```
## Select variables
Create samples to use for variable selection by sampling some wells from each replicate. Below, this is done by sampling 2 entire replicate plates per plate map; use `-n` to specify the number of replicate plates used to create the sample.
Samples are created for both normalized and unnormalized data, because the variable selection techniques may require both.
```sh
mkdir -p ../../parameters/${BATCH_ID}/sample/
# sample normalized data
./sample.R \
-b ${BATCH_ID} \
-f "_normalized.csv$" \
-n 2 \
-o ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_normalized_sample.feather
# sample unnormalized data
./sample.R \
-b ${BATCH_ID} \
-f "_augmented.csv$" \
-n 2 \
-o ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_augmented_sample.feather
```
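To sanity-check the samples, read the feather files back and print their dimensions. A minimal sketch, assuming the `feather` R package is installed (cytominer_scripts uses it to write these files):
```sh
Rscript -e "print(dim(feather::read_feather('../../parameters/${BATCH_ID}/sample/${BATCH_ID}_normalized_sample.feather')))"
```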
Make a list of variables to be preserved after `replicate_correlation` variable selection is performed.
```sh
./preselect.R \
-b ${BATCH_ID} \
-i ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_normalized_sample.feather \
-r replicate_correlation \
-s "Metadata_broad_sample_type == '''trt'''" \
-n 2
```
Make a list of variables to be preserved after `correlation_threshold` variable selection is performed.
```sh
./preselect.R \
-b ${BATCH_ID} \
-i ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_normalized_sample.feather \
-r correlation_threshold
```
Make a list of variables to be preserved after `variance_threshold` variable selection is performed.
```sh
./preselect.R \
-b ${BATCH_ID} \
-i ../../parameters/${BATCH_ID}/sample/${BATCH_ID}_augmented_sample.feather \
-r variance_threshold \
-s "Metadata_broad_sample_type == '''control'''"
```
Some variables have previously been identified as noisy or non-informative. Create a list of variables that excludes them.
```sh
# manually remove some features
mkdir -p ../../parameters/${BATCH_ID}/variable_selection/
echo "variable" > ../../parameters/${BATCH_ID}/variable_selection/manual.txt
head -1 \
../../backend/${BATCH_ID}/${SAMPLE_PLATE_ID}/${SAMPLE_PLATE_ID}.csv \
|tr "," "\n"|grep -v Meta|grep -E -v 'Granularity_14|Granularity_15|Granularity_16|Manders|RWC' >> \
../../parameters/${BATCH_ID}/variable_selection/manual.txt
```
ALTERNATIVE: You may have already performed these steps for a different batch of data, and want to simply copy the parameters to this batch:
```sh
mkdir -p ../../parameters/${BATCH_ID}/variable_selection/
REFERENCE_BATCH_ID=2018_02_23_LKCP_DBG
aws s3 sync \
s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/parameters/${REFERENCE_BATCH_ID}/ \
~/efs/${PROJECT_NAME}/workspace/parameters/${REFERENCE_BATCH_ID}/
rsync -arzv ../../parameters/${REFERENCE_BATCH_ID}/variable_selection/ ../../parameters/${BATCH_ID}/variable_selection/
```
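At this point it can help to eyeball what each method produced. A hedged sketch, assuming each preselect run wrote a one-column text file (like `manual.txt` above) under `variable_selection/`:
```sh
# Count candidate variables per method; the file naming is an assumption
# based on the -r arguments used above
wc -l ../../parameters/${BATCH_ID}/variable_selection/*.txt
```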
The previous steps only create a list of variables to be preserved for each variable selection method.
To actually apply variable selection, we compute the intersection of all these variable lists, then preserve only those columns of the normalized per-well data.
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/select.log \
--results ../../log/${BATCH_ID}/select \
--files \
--keep-order \
./select.R \
-b ${BATCH_ID} \
-p {1} \
-r variance_threshold,replicate_correlation,correlation_threshold,manual :::: ${PLATES}
```
This is the resulting structure of `backend` on EFS (one level below `workspace`) for `SQ00015167`:
```
└── backend
└── 2016_04_01_a549_48hr_batch1
└── SQ00015167
├── SQ00015167_augmented.csv
├── SQ00015167.csv
├── SQ00015167_normalized.csv
└── SQ00015167_normalized_variable_selected.csv
```
`SQ00015167_normalized_variable_selected.csv` is the variable-selected version of the normalized per-well data.
Do a quick check to view how many rows are present in each of the variable-selected per-well files.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../backend/${BATCH_ID}/{1}/{1}_normalized_variable_selected.csv :::: ${PLATES}
```
## Aggregate replicates
Combine replicate plates of each plate map by averaging (mean).
```sh
mkdir -p ../../collated/${BATCH_ID}/
mkdir -p ../../scratch/${BATCH_ID}/
PLATE_MAPS=../../scratch/${BATCH_ID}/plate_maps.txt
csvcut -c Plate_Map_Name \
../../metadata/${BATCH_ID}/barcode_platemap.csv | \
tail -n +2|sort|uniq > \
${PLATE_MAPS}
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/collapse.log \
--results ../../log/${BATCH_ID}/collapse \
--keep-order \
./collapse.R \
-b ${BATCH_ID} \
-m {1} \
-f _normalized_variable_selected.csv \
-o ../../collated/${BATCH_ID}/{1}_collapsed.csv :::: ${PLATE_MAPS}
```
This is the resulting structure of `collated` on EFS (one level below `workspace`) for `2016_04_01_a549_48hr_batch1`:
```
└── collated
└── 2016_04_01_a549_48hr_batch1
└── C-7161-01-LM6-006_collapsed.csv
```
`C-7161-01-LM6-006_collapsed.csv` is the replicate-averaged data for plate map `C-7161-01-LM6-006`.
Do a quick check to view how many rows are present in the replicate-averaged data for each plate map.
```sh
parallel \
--no-run-if-empty \
--keep-order \
wc -l ../../collated/${BATCH_ID}/{1}_collapsed.csv :::: ${PLATE_MAPS}
```
Combine all averaged profiles in the batch into a single file.
```sh
csvstack \
../../collated/${BATCH_ID}/*_collapsed.csv > \
../../collated/${BATCH_ID}/${BATCH_ID}_collapsed.csv
```
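As with the per-plate-map files, a quick row count confirms the stack succeeded; since `csvstack` emits a single header, expect the sum of the per-file row counts minus one header line per additional input file.
```sh
wc -l ../../collated/${BATCH_ID}/${BATCH_ID}_collapsed.csv
```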
## Audit
Audit each plate map for replicate reproducibility
```sh
mkdir -p ../../audit/${BATCH_ID}/
```
Audit only treated wells
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/audit.log \
--results ../../log/${BATCH_ID}/audit \
--files \
--keep-order \
./audit.R \
-b ${BATCH_ID} \
-m {1} \
-f _normalized_variable_selected.csv \
-s \"Metadata_broad_sample_type == \'\'\'trt\'\'\'\" \
-o ../../audit/${BATCH_ID}/{1}_audit.csv \
-l ../../audit/${BATCH_ID}/{1}_audit_detailed.csv \
-p Metadata_Plate_Map_Name,Metadata_moa,Metadata_pert_id,Metadata_broad_sample,Metadata_mmoles_per_liter,Metadata_Well :::: ${PLATE_MAPS}
```
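To skim the results, `csvlook` (also part of csvkit) renders one plate map's audit summary as a table; here the first plate map name is pulled from the `${PLATE_MAPS}` list built earlier:
```sh
csvlook ../../audit/${BATCH_ID}/$(head -1 ${PLATE_MAPS})_audit.csv
```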
Audit only control wells, i.e., how well do control wells in the same position correlate?
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/audit_control.log \
--results ../../log/${BATCH_ID}/audit_control \
--files \
--keep-order \
./audit.R \
-b ${BATCH_ID} \
-m {1} \
-f _normalized_variable_selected.csv \
-s \"Metadata_broad_sample_type == \'\'\'control\'\'\'\" \
-o ../../audit/${BATCH_ID}/{1}_audit_control.csv \
-l ../../audit/${BATCH_ID}/{1}_audit_control_detailed.csv \
-p Metadata_Well :::: ${PLATE_MAPS}
```
## Convert to other formats
Convert per-plate CSV files to GCT
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/csv2gct_backend.log \
--results ../../log/${BATCH_ID}/csv2gct_backend \
--files \
--keep-order \
./csv2gct.R \
../../backend/${BATCH_ID}/{1}/{1}_{2}.csv \
-o ../../backend/${BATCH_ID}/{1}/{1}_{2}.gct :::: ${PLATES} ::: augmented normalized normalized_variable_selected
```
Convert per-plate-map (collapsed) CSV files to GCT
```sh
parallel \
--no-run-if-empty \
--eta \
--joblog ../../log/${BATCH_ID}/csv2gct_collapsed.log \
--results ../../log/${BATCH_ID}/csv2gct_collapsed \
--files \
--keep-order \
./csv2gct.R \
../../collated/${BATCH_ID}/{1}_collapsed.csv \
-o ../../collated/${BATCH_ID}/{1}_collapsed.gct :::: ${PLATE_MAPS}
```
Convert the combined batch-level collapsed CSV to GCT
```sh
./csv2gct.R \
../../collated/${BATCH_ID}/${BATCH_ID}_collapsed.csv \
-o ../../collated/${BATCH_ID}/${BATCH_ID}_collapsed.gct
```
## Upload data
### Sync to S3
```sh
parallel \
aws s3 sync \
../../{1}/${BATCH_ID}/ \
s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/{1}/${BATCH_ID}/ ::: audit backend batchfiles collated load_data_csv log metadata parameters scratch
```
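Since this sync touches many prefixes at once, a first pass with `--dryrun` (a standard `aws s3 sync` flag) is a cheap way to confirm what would be uploaded before committing:
```sh
# Dry run on a subset of prefixes; nothing is actually transferred
parallel \
  aws s3 sync --dryrun \
  ../../{1}/${BATCH_ID}/ \
  s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/{1}/${BATCH_ID}/ ::: audit backend collated
```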
### Sync down from S3 onto a machine
Specify location for syncing
```sh
BROAD_NFS=/cmap/imaging
```
Set variables
```sh
BUCKET=imaging-platform-dev
PROJECT_NAME=2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad
BATCH_ID=2016_04_01_a549_48hr_batch1
```
Sync the files
```sh
echo audit backend batchfiles collated load_data_csv log metadata parameters scratch | \
tr " " "\n" |
xargs -I % \
aws s3 sync \
--exclude "*.sqlite" \
s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/%/${BATCH_ID}/ \
${BROAD_NFS}/${PROJECT_NAME}/workspace/%/${BATCH_ID}/
```