Update user guide (#93)
* Update template documentation

* Add linting section

* Add UV documentation

* WIP params_files

* Update params file user guide
grst authored Jan 28, 2025
1 parent 92ce961 commit 8c1d840
Showing 9 changed files with 236 additions and 50 deletions.
3 changes: 2 additions & 1 deletion docs/conf.py
@@ -54,7 +54,8 @@
# Add any Sphinx extension module names here, as strings.
# They can be extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
"myst_nb",
"myst_parser",
"sphinx_design",
"sphinx_copybutton",
"sphinx.ext.autodoc",
"sphinx.ext.intersphinx",
48 changes: 16 additions & 32 deletions docs/getting_started.md
@@ -44,54 +44,38 @@ Stages have the following pre-defined folder-structure. This folder system aims
stage
|-- input # contains Input Data
|-- src # contains Analysis Script(s)
|-- output # contains TLF - Outputs generated by Analysis Scripts
|-- output # contains outputs generated by Analysis Scripts
|-- report # contains HTML Report generated by Analysis Scripts
```

For more information about project and stage templates, click [here](user_guide/templates.md).

## Configuration files

YAML-based config files in a _project_, _folder_, or _stage_ serve as a single point of truth for all input files, output files or parameters.
For this purpose, configurations can be defined at each level of your project in a `params.in.yaml` file.
Using `dso compile-config` the params.in.yaml files are compiled into `params.yaml` with two features:
DSO recommends putting all parameters for your analysis (e.g. paths to input or output files, thresholds, etc.)
in YAML-based configuration files. You can add a `params.in.yaml` file at any level of a DSO project (_project_, _folder_, or _stage_).

- _inheritance_: All variables defined in `params.in.yaml` files in any parent directory will be included.
- _templating_: Variables can be composed using [jinja2 syntax](https://jinja.palletsprojects.com/en/stable/templates/#variables), e.g. `foo: "{{ bar }}_version2"`.
By running

```bash
dso compile-config
```

Therefore, you only need to read in a single `params.yaml` file in each stage.
`params.in.yaml` files will be _compiled_ into `params.yaml` files. Compilation offers the following advantages:

The following diagram displays the inheritance of configurations:
- _inheritance_: All variables defined in `params.in.yaml` files in any parent directory will be included.
- _templating_: Variables can be composed using [jinja2 syntax](https://jinja.palletsprojects.com/en/stable/templates/#variables), e.g. `foo: "{{ bar }}_version2"`.
- _path resolving_: Paths are always relative to each compiled `params.yaml` file, no matter where they were defined.

```{eval-rst}
.. image:: ../img/dso-yaml-inherit.png
.. image:: img/dso-yaml-inherit.png
    :width: 80%
```

<p></p>
Variables defined in subfolders override those defined in a parent directory. To ensure that, despite inheritance,
paths are always relative to each compiled `params.yaml` file, relative paths need to be preceded with `!path`.

### Example

An example `params.in.yaml` can look as follows:

```bash
thresholds:
  fc: 2
  p_value: 0.05
  p_adjusted: 0.1

samplesheet: !path "01_preprocessing/input/samplesheet.txt"
metadata_file: !path "metadata/metadata.csv"
file_with_abs_path: "/data/home/user/typical_analysis_data_set.csv"

remove_outliers: true

exclude_samples:
  - sample_1
  - sample_6
  - sample_42
```
For more details, please refer to [Configuration files](user_guide/params_files.md).

## Implementing a stage

1 change: 0 additions & 1 deletion docs/index.md
@@ -10,7 +10,6 @@
getting_started.md
user_guide/templates.md
user_guide/params_files.md
user_guide/accessing_stage_configs.md
user_guide/dvc.md
user_guide/pre_commit.md
user_guide/uv.md
5 changes: 0 additions & 5 deletions docs/user_guide/accessing_stage_configs.md

This file was deleted.

16 changes: 16 additions & 0 deletions docs/user_guide/linting.md
@@ -1,3 +1,19 @@
# Linting

Linting is a static file analysis designed to detect common pitfalls with dso/dvc projects. The DSO linter
needs further development. Please check [#5](https://github.com/Boehringer-Ingelheim/dso/issues/5) for the progress.

## Configuration

TODO

## Linting rules

```{eval-rst}
.. module:: dso._lint
.. autosummary::
    :toctree: ../generated

    DSO001
```
166 changes: 159 additions & 7 deletions docs/user_guide/params_files.md
@@ -1,17 +1,169 @@
# Params files
# Configuration files

This section is the reference guide for configuration files. Explain inheritance, jinja 2 etc. in detail.
YAML-based config files in a _project_, _folder_, or _stage_ serve as a single point of truth for all input files, output files or parameters.
For this purpose, configurations can be defined at each level of your project in a `params.in.yaml` file.
Using `dso compile-config` the `params.in.yaml` files are compiled into `params.yaml` with the following features:

TODO
- _inheritance_: All variables defined in `params.in.yaml` files in any parent directory will be included.
- _templating_: Variables can be composed using [jinja2 syntax](https://jinja.palletsprojects.com/en/stable/templates/#variables), e.g. `foo: "{{ bar }}_version2"`.
- _path resolving_: Paths are always relative to each compiled `params.yaml` file, no matter where they were defined.

## Compiling `params.yaml` files
Therefore, you only need to [read in](#accessing-stage-config) a single `params.yaml` file in each stage.

All `params.yaml` files are automatically generated using:
## Compiling configuration files

To generate a `params.yaml` file for each `params.in.yaml` file, use:

```bash
dso compile-config
```

## Overwriting Parameters
`params.yaml` files are not tracked by git. Never modify a `params.yaml` file by hand, because it will be overwritten.
In folders without a `params.in.yaml` file, no `params.yaml` file will be generated.

## Inheritance

The following diagram displays the inheritance of configurations:

```{eval-rst}
.. image:: ../img/dso-yaml-inherit.png
    :width: 80%
```

<p></p>

DSO leverages [hiyapyco](https://github.com/zerwes/hiyapyco) with `method=METHOD_MERGE` and `none_behavior=NONE_BEHAVIOR_OVERRIDE`
to implement inheritance. This means:

- Values in a `params.in.yaml` file at a deeper level (e.g. stage) take precedence over values in a parent folder.
- Values are added to existing lists.
- Dictionary entries are added to existing dictionaries.
- To exclude an inherited parameter, set the variable to `null`.
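
The merge rules above can be illustrated with a small standard-library sketch. It is not hiyapyco's implementation; the function name `merge_params` is hypothetical, and the code only mimics the behaviour described in this section (child values win, lists are extended, dictionaries are merged, `null` overrides).

```python
from copy import deepcopy


def merge_params(parent, child):
    """Merge `child` parameters into `parent` parameters; the child wins on conflicts."""
    result = deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Dictionary entries are added to the inherited dictionary.
            result[key] = merge_params(result[key], value)
        elif isinstance(value, list) and isinstance(result.get(key), list):
            # Child list items are appended to the inherited list.
            result[key] = result[key] + value
        else:
            # Scalars, and an explicit None (YAML null), replace the inherited value.
            result[key] = deepcopy(value)
    return result


# Values as they would be parsed from a project-level and a stage-level file.
project = {
    "thresholds": {"fc": 2, "p_value": 0.05},
    "remove_outliers": True,
    "exclude_samples": ["sample_1", "sample_6"],
}
stage = {
    "thresholds": {"fc": 3, "p_adjusted": 0.1},
    "remove_outliers": None,
    "exclude_samples": ["sample_42"],
}

merged = merge_params(project, stage)
print(merged)
```

Here `merged["thresholds"]` contains `fc: 3` from the stage together with the inherited `p_value`, and `remove_outliers` ends up as `None`.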

## Templating

Templating is also implemented by [hiyapyco](https://github.com/zerwes/hiyapyco), using the `interpolate=True` flag.
This allows variables to be composed using [jinja2 syntax](https://jinja.palletsprojects.com/en/stable/templates/#variables), e.g. `foo: "{{ bar }}_version2"`.
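
To make the substitution concrete, here is a minimal sketch that uses a regular expression as a stand-in for jinja2. It only handles plain `{{ name }}` placeholders that refer to top-level string parameters; the helper `render` and the value `"analysis"` are hypothetical and chosen for illustration only.

```python
import re

# Matches simple placeholders such as "{{ bar }}" (no filters, no nested keys).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(value, params):
    """Replace each {{ name }} placeholder with the matching top-level parameter."""
    return PLACEHOLDER.sub(lambda match: str(params[match.group(1)]), value)


params = {"bar": "analysis", "foo": "{{ bar }}_version2"}
rendered = {
    key: render(value, params) if isinstance(value, str) else value
    for key, value in params.items()
}
print(rendered["foo"])  # analysis_version2
```

Real jinja2 templates support much more (filters, expressions, nested lookups), so treat this only as an illustration of the idea.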

## Defining paths

To ensure that, despite inheritance, paths are always relative to each compiled `params.yaml` file, relative paths need to be preceded with `!path`, e.g.:

```yaml
samplesheet: !path "01_preprocessing/input/samplesheet.txt"
```
DSO supports compiling paths into absolute and relative paths. Relative paths are relative to the location of
each compiled `params.yaml` file. By default, DSO uses relative paths. To enable absolute paths, see
[configuration](../cli_configuration.md#project-specific-settings----pyprojecttoml). To learn
how to work with relative paths in Python/R scripts see [python usage](../python_usage.md) and [R usage](https://boehringer-ingelheim.github.io/dso-r/).
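
The effect of relative path resolution can be sketched with the standard library. This is an illustration of the idea rather than DSO's actual code: the function `resolve_path` and the `/project` directory layout are assumptions made for the example.

```python
import posixpath


def resolve_path(path, defined_in, compiled_in):
    """Re-express `path` (relative to `defined_in`) relative to `compiled_in`."""
    absolute = posixpath.join(defined_in, path)
    return posixpath.relpath(absolute, start=compiled_in)


# A path declared in /project/params.in.yaml ...
root_view = resolve_path("metadata/metadata.csv", "/project", "/project")
# ... and the same path as seen from /project/stage/params.yaml.
stage_view = resolve_path("metadata/metadata.csv", "/project", "/project/stage")

print(root_view)   # metadata/metadata.csv
print(stage_view)  # ../metadata/metadata.csv
```

A path that is not marked with `!path` would be copied unchanged into the stage-level file and would then point to the wrong location.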

## Example

Consider a project with two `params.in.yaml` files, one at the project root
and one in a stage subfolder.

::::{grid} 1 1 2 2

:::{grid-item-card} `/params.in.yaml`

```yaml
thresholds:
  fc: 2
  p_value: 0.05
metadata_file: !path "metadata/metadata.csv"
dataset_name: typical_analysis
file_with_abs_path: "/data/home/user/{{ dataset_name }}_data_set.csv"
remove_outliers: true
exclude_samples:
  - sample_1
  - sample_6
```

:::

:::{grid-item-card} `/stage/params.in.yaml`

```yaml
thresholds:
  fc: 3
  p_adjusted: 0.1
samplesheet: !path "01_preprocessing/input/samplesheet.txt"
remove_outliers: null
exclude_samples:
  - sample_42
```

:::
::::

This results in the following **compiled `params.yaml` files**:

::::{grid} 1 1 2 2

:::{grid-item-card} `/params.yaml`

```yaml
thresholds:
  fc: 2
  p_value: 0.05
metadata_file: metadata/metadata.csv
dataset_name: typical_analysis
file_with_abs_path: /data/home/user/typical_analysis_data_set.csv
remove_outliers: true
exclude_samples:
  - sample_1
  - sample_6
```

:::
:::{grid-item-card} `/stage/params.yaml`

```yaml
thresholds:
  fc: 3
  p_value: 0.05
  p_adjusted: 0.1
metadata_file: ../metadata/metadata.csv
dataset_name: typical_analysis
file_with_abs_path: /data/home/user/typical_analysis_data_set.csv
remove_outliers:
exclude_samples:
  - sample_1
  - sample_6
  - sample_42
samplesheet: 01_preprocessing/input/samplesheet.txt
```

:::
::::

## Accessing stage config

To ensure that `dso` correctly reruns stages when dependencies have changed, it is really important
to declare all input files/params in `dvc.yaml`. `dso compile-config` generates `params.yaml` files that,
in principle, you can read in with a YAML parser in a programming language of your choice.
However, we **recommend that you use one of the following interfaces to access the stage configuration**.
These interfaces ensure that you will have access only to the parameters declared in the `dvc.yaml` file as
either input, parameter, or output. This ensures that you cannot forget to declare a parameter that you actually
use in your analysis.

- `read_params` [in R](https://boehringer-ingelheim.github.io/dso-r/)
- `read_params` [in Python](../python_usage.md)
- `dso get-config` from the terminal.

When multiple `params.in.yaml` files (such as those at the project, folder, or stage level) contain the same configuration, the value specified at the more specific level (e.g., stage) takes precedence over the value set at the broader level (e.g., project). This makes the analysis adaptable and enhances modifiability across the project.
`dso get-config` prints the filtered params file for a given stage to STDOUT. This makes it really easy to
call it from other languages as a system call. In fact, this is what `read_params` in R and Python are doing under the hood.
5 changes: 3 additions & 2 deletions docs/user_guide/templates.md
@@ -1,6 +1,7 @@
# Project and stage templates

DSO provides a templating engine that allows you to quickly bootstrap a project (`dso init`), folder, or stages (`dso create`).
DSO provides a templating engine that allows you to [quickly bootstrap a project](../getting_started.md#dso-init----initialize-a-project)
(`dso init`), folder, or stages (`dso create`).
Templates are based on [jinja2](https://jinja.palletsprojects.com/en/stable/templates/).

## Available templates
@@ -28,7 +29,7 @@ Universal license.
## Using custom template libraries

Currently, dso only supports the internal templates mentioned above. However, we plan to add support for custom
stage templates soon. This enables some interesting use-cases:
stage templates soon ([#9](https://github.com/Boehringer-Ingelheim/dso/issues/9)). This enables some interesting use-cases:

- Organization-specific templates: Use templates that make it easier to comply with internal processes or apply
corporate design.
39 changes: 38 additions & 1 deletion docs/user_guide/uv.md
@@ -1,3 +1,40 @@
# `uv` integration

TODO
[`uv`](https://docs.astral.sh/uv/) is an ultrafast Python package and project manager. Every dso
project is also a `uv` project, so you can use all the features described in the uv [working with projects](https://docs.astral.sh/uv/guides/projects/)
documentation.

Integration with `uv` serves two main purposes:

- Freeze the version of `dso` per project to ensure reproducibility in the future, even if dso behavior changes.
  This feature is a work in progress, see also [installation](../cli_installation.md#freezing-the-dso-version-within-a-project).
- Provide a Python virtual environment for all Python stages in the project.

Using a separate virtual environment for each project is considered good practice to ensure reproducibility and
to avoid dependency conflicts. `uv` makes this very easy.

To add dependencies, edit the `dependencies` section in `pyproject.toml` or use

```bash
uv add <some_package>
```

to add and install a package in one step.

By using

```bash
uv sync
```

all requested packages are installed into the local `.venv` directory. At the same time, a `uv.lock` file
is created that pins the exact version of each package. This file is tracked by git, which means
every collaborator will get exactly the same environment if they run `uv sync` on their machine.

To run a script within the virtual environment, use

```bash
uv run ./some_script.py
```

All DSO Python stages use the virtual environment by default.
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -48,11 +48,12 @@ optional-dependencies.doc = [
"ipykernel",
"ipython",
"matplotlib",
"myst-nb>=1.1",
"myst-parser",
"sphinx>=4",
"sphinx-autodoc-typehints",
"sphinx-book-theme>=1",
"sphinx-copybutton",
"sphinx-design",
"sphinxcontrib-bibtex>=1",
"sphinxext-opengraph",
]
