Make getting started section more concise (#90)
* Make getting started section more concise

* Update user guide

* Update getting started section
grst authored Jan 27, 2025
1 parent b79814b commit 09eaf78
Showing 7 changed files with 86 additions and 66 deletions.
136 changes: 70 additions & 66 deletions docs/getting_started.md
@@ -1,8 +1,11 @@
# Getting started

This section will guide you through the most important features of dso and show how to work with a dso project.

## `dso init` -- Initialize a project

`dso init` initializes a new project in your current directory. In the context of DSO, a project is a structured environment where data science workflows are organized and managed.
`dso init` initializes a new project in your current directory. In the context of DSO, a project is a structured
folder where input files, scripts and output files are organized and managed.

To initialize a project use the following command:

@@ -15,11 +18,11 @@ It creates the root directory of your project with all the necessary configurati

## `dso create` -- Add folders or stages to your project

We consider a _stage_ an individual step in your analysis, usually a script with defined inputs and outputs.
We consider a _stage_ an individual step in your analysis, usually a script/notebook with defined inputs and outputs.
Stages can be organized in _folders_ with arbitrary structures. `dso create` initializes folders and stages
from predefined templates. We recommend naming stages with a numeric prefix, e.g. `01_` to declare the
order of scripts, but this is not a requirement. Currently, two stage templates have been implemented that
use either a quarto document or bash script to conduct the analysis.
order of scripts, but this is not a requirement. Currently, two stage templates are available that
use either a quarto document or a bash script to conduct the analysis.

```bash
cd test_project
Expand Down Expand Up @@ -47,22 +50,28 @@ stage

## Configuration files

The config files in a _project_, _folder_, or _stage_ are the cornerstone of any reproducible analysis, serving as a single point of truth. Additionally, using config files reduces the modification time needed for making _project_/_folder_-wide changes.
YAML-based config files in a _project_, _folder_, or _stage_ serve as a single point of truth for all input files, output files or parameters.
For this purpose, configurations can be defined at each level of your project in a `params.in.yaml` file.
Using `dso compile-config`, the `params.in.yaml` files are compiled into `params.yaml` files with two features:

Config files are designed to contain all necessary parameters, input, and output files that should be consistent across the analyses. For this purpose, configurations can be defined at each level of your project in a `params.in.yaml` file. These configurations are then transferred into the `params.yaml` files when using `dso compile-config`.
- _inheritance_: All variables defined in `params.in.yaml` files in any parent directory will be included.
- _templating_: Variables can be composed using [jinja2 syntax](https://jinja.palletsprojects.com/en/stable/templates/#variables), e.g. `foo: "{{ bar }}_version2"`.

A `params.yaml` file consolidates configurations from `params.in.yaml` files located in its parent directories, as well as from the `params.in.yaml` file in its own directory. For your analysis, reading in the `params.yaml` of the respective stage gives you then access to all the configurations.
Therefore, you only need to read in a single `params.yaml` file in each stage.
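
The templating feature can be pictured with a small sketch. It is not how dso is implemented: dso uses jinja2, whereas this stand-in only handles plain `{{ name }}` placeholders with a regular expression. The function name `render` and the value `results` are made up for illustration.

```python
import re


def render(value, variables):
    """Replace each {{ name }} placeholder with the matching variable."""

    def substitute(match):
        # match.group(1) is the variable name between the braces
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, value)


# Hypothetical content of a params.in.yaml file, already parsed into a dict
params_in = {"bar": "results", "foo": "{{ bar }}_version2"}

# Only string values can contain placeholders; other values pass through
compiled = {
    key: render(val, params_in) if isinstance(val, str) else val
    for key, val in params_in.items()
}
print(compiled["foo"])
```

In this sketch, `foo` is composed from `bar`, so changing `bar` in one place would update every variable derived from it.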

The following diagram displays the inheritance of configurations:

```{eval-rst}
.. image:: ../img/dso-yaml-inherit.png
:width: 60%
:width: 80%
```

### Writing configuration files
<p></p>
Variables defined in subfolders override those defined in a parent directory. To ensure that, despite inheritance,
paths are always relative to each compiled `params.yaml` file, relative paths need to be preceded with `!path`.
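
As a rough illustration of these two rules, the sketch below merges parameter dictionaries from the outermost to the innermost directory and re-expresses a path relative to a stage directory. It is a simplified model under stated assumptions, not dso's implementation: the helper names `compile_params` and `relative_to_stage` are hypothetical, the merge is shallow (how dso treats nested keys is not covered here), and the directory names are examples.

```python
import posixpath


def compile_params(layers):
    """Merge params from outermost to innermost directory; later layers win."""
    merged = {}
    for layer in layers:
        merged.update(layer)  # shallow merge, an assumption of this sketch
    return merged


def relative_to_stage(path_in_parent, parent_dir, stage_dir):
    """Re-express a path written relative to parent_dir as relative to stage_dir."""
    absolute = posixpath.join(parent_dir, path_in_parent)
    return posixpath.relpath(absolute, start=stage_dir)


# Hypothetical project-level and stage-level parameters
project = {"p_adjusted": 0.1, "remove_outliers": True}
stage = {"p_adjusted": 0.05}

merged = compile_params([project, stage])
print(merged)

# A path declared in the project-level file, seen from a stage folder
stage_path = relative_to_stage(
    "metadata/metadata.csv", "project", "project/01_preprocessing"
)
print(stage_path)
```

The stage-level `p_adjusted` replaces the project-level value, while `remove_outliers` is inherited unchanged. The path example shows why a marker such as `!path` is needed: the same file has a different relative path depending on which compiled `params.yaml` refers to it.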

To define your configurations in the `params.in.yaml` files, please adhere to the yaml syntax. Due to the implemented configuration inheritance, relative paths need to be resolved within each **folder** or **stage**. Therefore, relative paths need to be specified with `!path`.
### Example

An example `params.in.yaml` can look as follows:

@@ -73,81 +82,63 @@ thresholds:
p_adjusted: 0.1

samplesheet: !path "01_preprocessing/input/samplesheet.txt"

metadata_file: !path "metadata/metadata.csv"

file_with_abs_path: "/data/home/user/typical_analysis_data_set.csv"

remove_outliers: true

exclude_samples:
- sample_1
- sample_2
- sample_6
- sample_42
```

### Compiling `params.yaml` files

All `params.yaml` files are automatically generated using:

```bash
dso compile-config
```

### Overwriting Parameters

When multiple `params.in.yaml` files (such as those at the project, folder, or stage level) contain the same configuration, the value specified at the more specific level (e.g., stage) takes precedence over the value set at the broader level (e.g., project). This makes the analysis adaptable and enhances modifiability across the project.

## Implementing a stage

A stage is a single step in your analysis and usually generates some kind of output data from input data. The input data can also be supplied by previous stages. To create a stage, use the `dso create stage` command and select either the _bash_ or _quarto_ template as a starting-point.

The essential files of a stage are:

- `dvc.yaml`: The DVC configuration file that defines your data pipelines, dependencies, and outputs.
- `params.yaml`: Auto-generated configuration file.
- `params.yaml`: Auto-generated configuration file (never modify this by hand!).
- `params.in.yaml`: Modifiable configuration file containing stage-specific configurations.
- `src/<stage_name>.qmd`(optional): A Quarto file containing your script that runs the analysis for this stage.
- `src/<stage_name>.qmd` (when using `quarto` template): A Quarto file containing your script that runs the analysis for this stage.

### dvc.yaml

The `dvc.yaml` file contains information about the parameters, inputs, outputs, and commands used and executed in your stage.

#### Configuring the `dvc.yaml`

Configurations stored in the `params.yaml` of a stage can be directly used within the `dvc.yaml`:
Variables stored in the `params.yaml` of a stage can be directly used within the `dvc.yaml`:

```yaml
stages:
01_preprocessing:
# Parameters used in this stage, defined in params.yaml
# Parameters used in this stage, defined in params.yaml (NO need for ${...}!!)
params:
- dso
- thresholds
# Dependencies required for this stage, can be defined in the params.yaml (define with ${...})
# Dependencies required for this stage, can be defined in params.yaml (use with ${...})
deps:
- src/01_preprocessing.qmd
- ${file_with_abs_path}
- ${samplesheet}
- ${ file_with_abs_path }
- ${ samplesheet }

# Outputs generated by this stage
# Outputs generated by this stage, can be defined in params.yaml (use ${ ... })
outs:
- output
- report/01_preprocessing.html
```

### Quarto Stage
#### Quarto Stage

By default, a Quarto stage includes the following cmd in the `dvc.yaml` file:
By default, a Quarto stage includes the following `cmd` section in the `dvc.yaml` file:

```
# Command to render the Quarto script and move the HTML report to the report folder
cmd:
- dso exec quarto .
```

### Bash Stage
#### Bash Stage

A Bash stage, by default, does not include an additional script. Bash code can be directly embedded in the `dvc.yaml` file:

@@ -161,19 +152,22 @@ A Bash stage, by default, does not include an additional script. Bash code can b
EOF
```

### Accessing Files and Configurations with R and Python

You can easily access files and configurations using either the DSO R-package or the Python module.
### Accessing files and parameters from R or Python

A convenient way of accessing files and configurations of your project is to use the DSO R-package or the Python module.
You can easily access files and configurations using either the DSO R-package or the Python module:

For Python, refer to the [Python Usage Page](python_usage.md).

For R, refer to the [R Package Page](https://boehringer-ingelheim.github.io/dso-r/).
- For Python, refer to the [Python Usage Page](python_usage.md).
- For R, refer to the [R Package Page](https://boehringer-ingelheim.github.io/dso-r/).

## `dso repro` -- Reproducing all stages

To execute or reproduce a stage, folder, or project use `dso repro`. `dso repro` is a wrapper around `dvc repro` and builds the config files before reproducing the complete or a part of the analyses pipeline.
To execute (or _reproduce_, in dvc-speak) all scripts in a project, use `dso repro`.
`dso repro` is a wrapper around `dvc repro` that

- compiles all configuration files
- computes the checksums of all input files and compares them against the previously executed version
- executes all stages with modified inputs
- generates `dvc.lock` files that document all input files, output files and their versions.
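
The checksum comparison in the list above can be sketched as follows. This is a conceptual model only, not the code that dvc or dso runs: the helper names `file_checksum` and `needs_rerun` are hypothetical, SHA-256 is chosen here for illustration, and the `recorded` dictionary merely stands in for the information kept in a `dvc.lock` file.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path):
    """Return a hex digest of the file's contents (SHA-256 in this sketch)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rerun(inputs, recorded):
    """A stage is stale if any input's checksum differs from the recorded one."""
    return any(recorded.get(str(p)) != file_checksum(p) for p in inputs)


with tempfile.TemporaryDirectory() as tmp:
    samplesheet = Path(tmp) / "samplesheet.txt"
    samplesheet.write_text("sample_1\n")

    # Record the checksum as it was when the stage last ran
    recorded = {str(samplesheet): file_checksum(samplesheet)}
    unchanged = needs_rerun([samplesheet], recorded)

    # Modifying the input changes its checksum, so the stage becomes stale
    samplesheet.write_text("sample_1\nsample_2\n")
    changed = needs_rerun([samplesheet], recorded)

print(unchanged, changed)
```

Comparing content checksums rather than timestamps means a stage is only re-executed when an input actually changed.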

Several command options are available and are detailed in the [dvc repro documentation](https://dvc.org/doc/command-reference/repro). The most common usages are detailed below:

@@ -194,29 +188,23 @@ dso repro -s subfolder/my_stage/dvc.yaml
dso repro -s -f subfolder/my_stage/dvc.yaml
```

## Syncing Changes with Remote
## Tracking and syncing files with DVC

To ensure your data and code are synchronized with the remote storage and repository, follow these steps:
Code is tracked with git as with any other project, while data is tracked by `dvc` and synced with a [dvc remote](https://dvc.org/doc/user-guide/data-management/remote-storage#remote-storage). DVC stores `.dvc` or `dvc.lock` files in the git repository
that reference file versions associated with each commit.

### Add a Remote Data Storage

To ensure you can always revert to previous data versions, add a remote storage for your DVC-controlled data. Use the `dvc remote add` command to specify a remote directory where the version-controlled files will be stored.

We recommend creating a directory in a suitable long-term storage location. Use the `-d` (default) option of `dvc remote add` to set this directory as the default remote storage:

```bash
# Create a directory for storing version-controlled files
mkdir /long/term/storage/project1/DVC_STORAGE
```{eval-rst}
.. image:: ../img/dso-dvc-remotes.png
:width: 80%
# Execute within the project directory to define the remote storage
dvc remote add -d <remote_name> /long/term/storage/project1/DVC_STORAGE
```

### Track Data with DVC

In your DSO project, all outputs are automatically controlled by DVC, ensuring that your data is versioned and managed efficiently. This setup helps maintain reproducibility and consistency across your analysis.
<p></p>

When you have input data that was not generated within your pipeline, you need to add them to your DSO project. Use `dvc add` to track such files with DVC. This command creates an associated `.dvc` file and automatically appends the tracked file to `.gitignore`. The `.dvc` file acts as a placeholder for the original file and should be tracked by Git.
In your DSO project, all outputs are automatically controlled by DVC, ensuring that your data is versioned.
When you have input data that was not generated within your pipeline, you need to add it to your DSO project.
Use `dvc add` to track such files with DVC. This command creates an associated `.dvc` file and automatically appends
the tracked file to `.gitignore`. The `.dvc` file acts as a placeholder for the original file and should be tracked by git.

This command is particularly useful when data is generated outside of your DSO project but is used within your analysis, such as metadata or preprocessed data.

@@ -228,9 +216,25 @@ dvc add <directoryname/filename>
dvc add metadata/external_clinical_annotation.csv
```

### Push Changes to Remote
### Syncing data with a remote

To ensure your collaborators can access files you added or results you generated, you need to sync
data with a [dvc remote](https://dvc.org/doc/user-guide/data-management/remote-storage#remote-storage). DVC supports
many backends, for instance S3 buckets, or a folder on a shared file system.

Use the `dvc remote add` command to specify a remote directory where the version-controlled files will be stored.
With the `-d` (default) option, dvc sets this directory as the default remote storage:

```bash
# Create a directory for storing version-controlled files
mkdir /path/on/shared/filesystem/project1/DVC_STORAGE

# Execute within the project directory to define the remote storage
dvc remote add -d <remote_name> /path/on/shared/filesystem/project1/DVC_STORAGE
```

After tracking your data with DVC and committing your changes locally, you need to push these changes to both the remote storage and your Git repository. This ensures that your data and metadata are safely backed up and accessible to collaborators.
After tracking your data with DVC and committing your changes locally, you need to push these changes to both
the remote storage and your Git repository.

Here’s how to do it:

1 change: 1 addition & 0 deletions docs/index.md
@@ -8,6 +8,7 @@
:caption: User Guide
getting_started.md
user_guide/config_files.md
```

```{toctree}
15 changes: 15 additions & 0 deletions docs/user_guide/config_files.md
@@ -0,0 +1,15 @@
# Configuration files

TODO

## Compiling `params.yaml` files

All `params.yaml` files are automatically generated using:

```bash
dso compile-config
```

## Overwriting Parameters

When multiple `params.in.yaml` files (such as those at the project, folder, or stage level) contain the same configuration, the value specified at the more specific level (e.g., stage) takes precedence over the value set at the broader level (e.g., project). This makes the analysis adaptable and enhances modifiability across the project.
Binary file added img/dso-dvc-remotes.png
Binary file modified img/dso-yaml-inherit.png
Binary file removed img/dso_tools.pptx
Binary file not shown.
Binary file added img/figures.pptx
Binary file not shown.
