pipeline_chipseq.py and reproducibility #68

Open
dievsky opened this issue Sep 28, 2018 · 15 comments

dievsky commented Sep 28, 2018

We want to create a fully reproducible Docker-based pipeline for the upcoming SPAN paper. It will be based on the existing script pipeline_chipseq.py.

dievsky self-assigned this Sep 28, 2018
dievsky commented Sep 28, 2018

I've downloaded a few FASTQ files to /mnt/stripe/bio/raw-data/geo-samples/GSE103714/fastq.

Problem: the script requires two paths, path_to_directory and path_to_indexes, but doesn't document what each should contain. It seems that it can generate the indexes on its own.
Running: python /mnt/stripe/washu/pipeline_chipseq.py /mnt/stripe/bio/raw-data/geo-samples/GSE103714 /mnt/stripe/bio/raw-data/geo-samples/GSE103714 mm9

dievsky commented Sep 28, 2018

The script successfully downloaded the mm9 genome in .fa format, but failed to create BAM/SAM alignments.
The bowtie log shows that the index is missing; apparently, the current script doesn't generate bowtie indexes.
I've generated the bowtie indexes manually via bowtie-build and rerun the pipeline.
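For the record, the manual index build was along these lines (a sketch; the exact genome file name and index prefix are assumptions):

    # build classic bowtie (.ebwt) indexes from the downloaded genome; adjust paths as needed
    cd /mnt/stripe/bio/raw-data/geo-samples/GSE103714
    bowtie-build mm9/mm9.fa mm9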

dievsky commented Sep 28, 2018

It took some guessing to determine where the pipeline wants the indexes. Apparently, they go at the top level of path_to_indexes (the genome itself was downloaded to path_to_indexes/mm9).
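In other words, the layout the pipeline seems to expect is roughly the following (classic bowtie .ebwt index naming assumed; shown for our GSE103714 directory):

    ls /mnt/stripe/bio/raw-data/geo-samples/GSE103714
    # fastq/  mm9/  mm9.1.ebwt  mm9.2.ebwt  mm9.3.ebwt  mm9.4.ebwt  mm9.rev.1.ebwt  mm9.rev.2.ebwt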
The pipeline then failed with a "cannot open file '/home/user/phantompeakqualtools/run_spp.R': No such file or directory" message. The file is indeed absent; it's probably on franklin.

olegs commented Sep 28, 2018

Actually, the pipeline should generate the indices automatically; we should investigate why it failed.

dievsky commented Oct 2, 2018

The investigation of the bowtie-build failure was successful. It turned out that the script expects the immediate FASTQ-containing directory as an argument, not its parent.
Another curious bug: if the provided directory ends with /, the script creates a child folder _bams instead of a sibling folder fastq_bams. I'll add appropriate input sanitization (see the sketch below).
Also, there's no /home/user/phantompeakqualtools on either rosalind or franklin.
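The sanitization itself is trivial; in shell terms it amounts to stripping the trailing slash before deriving sibling folder names (variable names here are illustrative, the actual fix goes into the pipeline script):

    # make ".../fastq/" behave the same as ".../fastq"
    WORK_DIR="${WORK_DIR%/}"
    BAMS_DIR="${WORK_DIR}_bams"   # now a sibling ".../fastq_bams", not a child ".../fastq/_bams"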

dievsky added a commit that referenced this issue Oct 2, 2018
dievsky commented Oct 2, 2018

Cloned phantompeakqualtools from https://github.com/crazyhottommy/phantompeakqualtools to rosalind.

olegs commented Oct 2, 2018

I wonder if you have read https://github.com/JetBrains-Research/washu/blob/master/README.md.

dievsky commented Oct 2, 2018

Sure haven't! Thanks for the info.

dievsky commented Oct 2, 2018

Tinkering with the Docker image.
There is a permission problem: pipeline_chipseq.py requires writable directories. Docker mounts the data directories while keeping the ownership and permission info, so inside the container they belong to a user with a bare numeric UID such as 1003 and can't be written to unless the permissions are 777. And since this is a mounted filesystem, even the root user can't override this.
If we expect the Docker run to at least produce persistent results that can be examined later, we should deal with this somehow. One idea is https://denibertovic.com/posts/handling-permissions-with-docker-volumes/ (sketched below).
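The gist of that approach: the container starts as root, the entrypoint creates a user whose UID is passed in from the host, and then drops privileges via gosu. A minimal sketch, assuming gosu is installed in the image (names and defaults are illustrative, not our actual setup yet):

    #!/bin/bash
    # entrypoint.sh: create an unprivileged user matching the host UID, then run the command as that user
    USER_ID=${LOCAL_USER_ID:-9001}   # pass the host UID via: docker run -e LOCAL_USER_ID=$(id -u) ...
    useradd --shell /bin/bash -u "$USER_ID" -o -c "" -m user
    export HOME=/home/user
    exec gosu user "$@"              # files written by the pipeline now belong to the host user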

dievsky commented Oct 8, 2018

Implemented permissions workaround; it seems to work flawlessly.

Another interesting issue is how the pipeline determines which steps were already completed.
For example, if the pipeline is relaunched after creating BAM files, the following situation occurs:

  • the generated BAM files reside happily in the fastq_bams folder;
  • the pipeline (bowtie.sh:72, to be more precise) looks for them in $WORK_DIR, which is fastq, doesn't find them there, and spends several hours and several hundred GB generating them anew;
  • the pipeline then dies horribly because it can't move fastq/*.bam to fastq_bams/: the target files already exist.

It would naturally be better if the pipeline determined completed steps correctly or, at the very least, didn't crash after half a day of repeating them.
We could probably add a more adequate skipping condition to the pipeline_chipseq Python script; a sketch of the idea is below.
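Something like the following check, expressed here at the bowtie.sh level ($FILE and $NAME are illustrative; the real fix may end up in pipeline_chipseq.py instead):

    # skip alignment if a BAM for this FASTQ already exists, either locally or in the moved-away folder
    NAME=$(basename "$FILE" .fastq)
    if [[ -f "$WORK_DIR/${NAME}.bam" || -f "${WORK_DIR}_bams/${NAME}.bam" ]]; then
        echo "Skipping $NAME: BAM already present"
        continue
    fi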

dievsky commented Oct 8, 2018

bam_qc.sh first checks whether bedtools is in PATH (and exits if it isn't), and only then loads the appropriate module. This is most likely a bug; presumably the order should be reversed (see below).
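Sketch of the fix (the module name is a guess; the point is only the ordering):

    # load the module first...
    module load bedtools2
    # ...and only then verify that bedtools is actually on PATH
    if ! command -v bedtools >/dev/null 2>&1; then
        echo "ERROR: bedtools not found in PATH" >&2
        exit 1
    fi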

olegs commented Oct 8, 2018

I'd say that the pipeline wasn't designed to recover correctly after failures.
bam_qc.sh: it's a bug, for sure.

dievsky commented Oct 10, 2018

Fixed bam_qc.sh.
chipseq_pipeline.sh in Docker finished successfully (though a missing wget prevented the RSEG script from downloading the deadzones; I've since added wget to the Docker image). I'll do a clean launch now; if it succeeds, we can add labels and tuning to the pipeline.
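(For reference, the wget fix is just the usual package install; assuming a Debian/Ubuntu-based image, the added Dockerfile RUN line boils down to:)

    apt-get update && apt-get install -y --no-install-recommends wget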

dievsky commented Oct 10, 2018

A weird error popped up:

[INFO   ]         multiqc : This is MultiQC v1.0.dev0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/data/fastq'
[INFO   ]          fastqc : Found 4 reports
[INFO   ]         multiqc : Report      : multiqc_report.html
Traceback (most recent call last):
  File "/opt/conda/envs/bio/bin/multiqc", line 4, in <module>
    __import__('pkg_resources').run_script('multiqc==1.0.dev0', 'multiqc')
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(code, namespace, namespace)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/multiqc-1.0.dev0-py2.7.egg/EGG-INFO/scripts/multiqc", line 581, in <module>
    multiqc()
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/multiqc-1.0.dev0-py2.7.egg/EGG-INFO/scripts/multiqc", line 448, in multiqc
    os.makedirs(config.data_dir)
  File "/opt/conda/envs/bio/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/multiqc_data'

I haven't seen it before. It didn't cause the pipeline to crash, though.
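If I read this right, the second multiqc pass is run with / as the working directory, so it tries to create ./multiqc_data at the filesystem root, which the unprivileged user can't write to. A possible workaround (not verified against the pipeline scripts; multiqc's standard -o/--outdir flag assumed) would be to direct the output at the writable data directory:

    # run multiqc from (and into) the writable mounted directory instead of /
    cd /data/fastq && multiqc -o /data/fastq /data/fastq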

olegs commented Oct 16, 2018

Same here:

Batch fastqc /data/fastq
LOCAL running TASK: /tmp/qsub.CYOTpPNxNKFt.sh LOG: /data/fastq/SRR6929776_1_fastqc.log
FILE: fastq/./SRR6929776_1.fastq; TASK:
LOCAL running TASK: /tmp/qsub.hS2m7CfGimDI.sh LOG: /data/fastq/SRR6929777_1_fastqc.log
FILE: fastq/./SRR6929777_1.fastq; TASK:
LOCAL running TASK: /tmp/qsub.iPUpd7ADaDcc.sh LOG: /data/fastq/SRR6929776_2_fastqc.log
FILE: fastq/./SRR6929776_2.fastq; TASK:
LOCAL running TASK: /tmp/qsub.1gS4UaCT7Nfv.sh LOG: /data/fastq/SRR6929777_2_fastqc.log
FILE: fastq/./SRR6929777_2.fastq; TASK:
LOCAL waiting for tasks...
Done. LOCAL waiting for tasks
Processing multiqc for: /data/fastq
[INFO   ]         multiqc : This is MultiQC v1.0.dev0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/data/fastq/fastqc'
[INFO   ]          fastqc : Found 4 reports
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete
Done. Batch fastqc /data/fastq
[INFO   ]         multiqc : This is MultiQC v1.0.dev0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/data/fastq'
[INFO   ]          fastqc : Found 4 reports
[INFO   ]         multiqc : Report      : multiqc_report.html
Traceback (most recent call last):
  File "/opt/conda/envs/bio/bin/multiqc", line 4, in <module>
    __import__('pkg_resources').run_script('multiqc==1.0.dev0', 'multiqc')
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(code, namespace, namespace)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/multiqc-1.0.dev0-py2.7.egg/EGG-INFO/scripts/multiqc", line 581, in <module>
    multiqc()
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/envs/bio/lib/python2.7/site-packages/multiqc-1.0.dev0-py2.7.egg/EGG-INFO/scripts/multiqc", line 448, in multiqc
    os.makedirs(config.data_dir)
  File "/opt/conda/envs/bio/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/multiqc_data'
