Update kedro data datalog integration docs
antonymilne committed Feb 12, 2025
1 parent 1fcb7d7 commit 542e04a
Showing 2 changed files with 87 additions and 21 deletions.
15 changes: 14 additions & 1 deletion vizro-core/docs/pages/user-guides/data.md
@@ -38,7 +38,20 @@ graph TD
| Can be refreshed while dashboard is running | No | Yes |
| Production-ready | Yes | Yes |

If you have a [Kedro](https://kedro.org/) project or would like to use the [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to manage your data independently of a Kedro project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md). This offers helper functions to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) as dynamic data in the Vizro data manager.
If you have a [Kedro](https://kedro.org/) project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md) to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) data to the Vizro data manager.

!!! note "Kedro Data Catalog as a data source registry"

Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](kedro-data-catalog.md#create-a-kedro-data-catalog) as a YAML registry of your dashboard's data sources. This separates configuration of your data sources from your app's code. Here is an example `catalog.yaml` file:

```yaml
motorbikes:
  type: pandas.CSVDataset
  filepath: s3://your_bucket/data/motorbikes.csv
  load_args:
    sep: ","
    na_values: [NA]
```

## Static data

93 changes: 73 additions & 20 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
@@ -1,6 +1,8 @@
# How to integrate Vizro with Kedro Data Catalog
# How to integrate Vizro with the Kedro Data Catalog

This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. For Pandas datasets registered in a Kedro data catalog, Vizro provides a convenient way to visualize them.
This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. For Pandas datasets registered in a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html), Vizro provides a convenient way to visualize them.

Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](#create-a-kedro-data-catalog) to manage your dashboard's data sources.

## Installation

@@ -10,67 +12,118 @@ If you already have Kedro installed then you do not need to install any extra de
pip install vizro[kedro]
```

Vizro is currently compatible with `kedro>=0.19.0` and works with dataset factories for `kedro>=0.19.9`.
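
The version bounds above can be checked programmatically. The following is a minimal sketch using only the standard library; the helper names `parse_release` and `kedro_support` are hypothetical and not part of Vizro or Kedro, and pre-release suffixes such as `rc1` are ignored for simplicity.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string: str) -> tuple[int, ...]:
    """Extract the leading numeric release (e.g. "0.19.9") as a 3-part tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Cannot parse version: {version_string!r}")
    parts = [int(part) for part in match.group().split(".")]
    # Pad short versions such as "0.19" so that tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def kedro_support(version_string: str) -> dict[str, bool]:
    """Mirror the bounds stated above: kedro>=0.19.0, and >=0.19.9 for dataset factories."""
    release = parse_release(version_string)
    return {
        "compatible": release >= (0, 19, 0),
        "dataset_factories": release >= (0, 19, 9),
    }


try:
    print(kedro_support(version("kedro")))
except PackageNotFoundError:
    print("kedro is not installed")
```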

## Create a Kedro Data Catalog

You can create a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to be a YAML registry of your dashboard's data sources. If you have a Kedro project then you will already have this file; if you would like to use the Kedro Data Catalog outside a Kedro project then you need to create it. In this case, you should save your `catalog.yaml` file to the same directory as your `app.py`.

The Kedro Data Catalog separates configuration of your data sources from your app's code. Here is an example `catalog.yaml` file that illustrates some of the features of the Kedro Data Catalog.

```yaml
cars: # (1)!
  type: pandas.CSVDataset # (2)!
  filepath: cars.csv

motorbikes:
  type: pandas.CSVDataset
  filepath: s3://your_bucket/data/motorbikes.csv # (3)!
  load_args: # (4)!
    sep: ","
    na_values: [NA]

trains:
  type: pandas.ExcelDataset
  filepath: trains.xlsx
  load_args:
    sheet_name: [Sheet1, Sheet2, Sheet3]

trucks:
  type: pandas.ParquetDataset
  filepath: trucks.parquet
  load_args:
    columns: [name, gear, disp, wt]
    categories: list
    index: name
```
1. The [minimum details needed](https://docs.kedro.org/en/stable/data/data_catalog.html#the-basics-of-catalog-yml) for a Kedro Data Catalog entry are the data source name (`cars`), the type of data (`type`), and the file's location (`filepath`).
2. Vizro supports all [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) datasets. This includes, for example, CSV, Excel and Parquet files.
3. Kedro supports a [variety of data stores](https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-filepath) including local file systems, network file systems and cloud object stores.
4. You can [pass data loading arguments](https://docs.kedro.org/en/stable/data/data_catalog.html#load-save-and-filesystem-arguments) to specify how to load the data source.

For more details, refer to Kedro's [introduction to the Data Catalog](https://docs.kedro.org/en/stable/data/data_catalog.html) and their [collection of YAML examples](https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html).
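
To make the structure of a catalog entry concrete, here is a small sketch that works on the parsed form of a `catalog.yaml` file, that is, the kind of dictionary a YAML parser would return. It is an illustration only: the helper names and the `notes` entry are hypothetical, the `pandas.` prefix check is a simplification of how Vizro decides which datasets it supports, and some Kedro dataset types legitimately do not use `filepath`.

```python
# Parsed form of a trimmed catalog.yaml (what a YAML parser would return).
catalog_config = {
    "cars": {"type": "pandas.CSVDataset", "filepath": "cars.csv"},
    "motorbikes": {
        "type": "pandas.CSVDataset",
        "filepath": "s3://your_bucket/data/motorbikes.csv",
        "load_args": {"sep": ",", "na_values": ["NA"]},
    },
    "trains": {
        "type": "pandas.ExcelDataset",
        "filepath": "trains.xlsx",
        "load_args": {"sheet_name": ["Sheet1", "Sheet2", "Sheet3"]},
    },
    # Hypothetical entry added for illustration: not a pandas dataset, no filepath.
    "notes": {"type": "text.TextDataset"},
}

# Assumption: the minimum details for a file-based entry, as described above.
REQUIRED_KEYS = ("type", "filepath")


def missing_keys(entry: dict) -> list[str]:
    """Return the required keys that are absent from a catalog entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


def pandas_entry_names(config: dict) -> list[str]:
    """Return the names of entries whose type uses the short `pandas.` prefix."""
    return [
        name
        for name, entry in config.items()
        if str(entry.get("type", "")).startswith("pandas.")
    ]


problems = {
    name: missing_keys(entry)
    for name, entry in catalog_config.items()
    if missing_keys(entry)
}

print(pandas_entry_names(catalog_config))
print(problems)
```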

## Use datasets from the Kedro Data Catalog

`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
Vizro provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) in the module `vizro.integrations.kedro`. This supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro `catalog`, the general pattern to add datasets to the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```

This imports all datasets of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog` into the Vizro `data_manager`.
This registers in the Vizro `data_manager` all data sources of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog`. You can now [reference the data source](data.md#reference-by-name) by name. Given the [above `catalog.yaml` file](#create-a-kedro-data-catalog), you could use the data source names `"cars"`, `"motorbikes"`, `"trains"` and `"trucks"`, for example with `px.scatter("cars", ...)`.

!!! note

Data sources imported from Kedro in this way are [dynamic data](data.md#dynamic-data). This means that the data can be refreshed while your dashboard is running. For example, if you execute a Kedro pipeline run then the latest data can be shown in your dashboard without restarting it.
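
The idea behind dynamic data can be sketched without Vizro or Kedro: what gets registered is a loader function rather than a fixed table, so each call picks up the latest contents of the underlying file. In this sketch the `registry` dictionary is a hypothetical stand-in for the Vizro data manager, and `make_csv_loader` is an illustrative helper, not Vizro's implementation.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable

Rows = list[dict[str, str]]


def make_csv_loader(path: Path) -> Callable[[], Rows]:
    """Return a zero-argument loader that re-reads the CSV file on every call."""

    def load() -> Rows:
        with path.open(newline="", encoding="utf-8") as file:
            return list(csv.DictReader(file))

    return load


# Hypothetical stand-in for the data manager: a mapping from name to loader.
registry: dict[str, Callable[[], Rows]] = {}

with tempfile.TemporaryDirectory() as tmp:
    cars_csv = Path(tmp) / "cars.csv"
    cars_csv.write_text("name,gear\nmazda,4\n", encoding="utf-8")

    # Register the loader, not the data itself.
    registry["cars"] = make_csv_loader(cars_csv)
    first = registry["cars"]()

    # Simulate a pipeline run that rewrites the file while the app is running.
    cars_csv.write_text("name,gear\nmazda,4\nvolvo,5\n", encoding="utf-8")
    second = registry["cars"]()

print(len(first), len(second))
```

Because the loader is called again on each load, the second call sees the extra row without the registry being rebuilt.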

The `catalog` variable may have been created in a number of different ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.catalog_from_project` to generate a `catalog` given the path to a Kedro project.
1. Data Catalog configuration file (`catalog.yaml`), [created as described above](#create-a-kedro-data-catalog). This generates a `catalog` variable independently of a Kedro project using [`DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).
1. Kedro project path. Vizro exposes a helper function `catalog_from_project` to generate a `catalog` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `catalog`.
1. Data Catalog configuration file (`catalog.yaml`). This can create a `catalog` entirely independently of a Kedro project using [`kedro.io.DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog into the Vizro data manager"
=== "app.py (Kedro project path)"
=== "app.py (Data Catalog configuration file)"
```python
from pathlib import Path

from kedro.io import DataCatalog # (1)!
import yaml
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8"))) # (2)!
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
1. Kedro's [experimental `KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature) would also work.
2. The contents of `catalog.yaml` are [described above](#create-a-kedro-data-catalog).

=== "app.py (Kedro project path)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```

=== "app.py (Data Catalog configuration file)"
=== "app.ipynb (Kedro Jupyter session)"
```python
from pathlib import Path

from kedro.io import DataCatalog
import yaml

from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8")))

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```



### Use dataset factories

To add datasets that are defined using a [Kedro dataset factory](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), `datasets_from_catalog` needs to resolve dataset patterns against explicit datasets. Given a Kedro `pipelines` dictionary, you should specify a `pipeline` argument as follows:
@@ -83,7 +136,7 @@ kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__

The `pipelines` variable may have been created in the following ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate a `pipelines` given the path to a Kedro project.
1. Kedro project path. Vizro exposes a helper function `pipelines_from_project` to generate a `pipelines` dictionary given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.
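
To see why a pipeline is needed for dataset factories, note that a factory pattern such as `{name}_data` does not name any concrete dataset by itself; it only becomes a dataset once it is matched against explicit dataset names, which a pipeline supplies. The following standard-library sketch illustrates that matching step. It is not how Kedro or Vizro implement resolution, and the helper names, pattern and dataset names are all hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern[str]:
    """Turn a pattern like "{name}_data" into a regex with named groups."""
    # re.split with a capturing group alternates: literal, placeholder, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
        for index, part in enumerate(parts)
    )
    return re.compile(regex)


def resolve(pattern: str, entry: dict, dataset_names: list[str]) -> dict[str, dict]:
    """Create a concrete catalog entry for each dataset name that matches the pattern."""
    regex = pattern_to_regex(pattern)
    resolved = {}
    for name in dataset_names:
        match = regex.fullmatch(name)
        if match:
            resolved[name] = {
                key: value.format(**match.groupdict()) if isinstance(value, str) else value
                for key, value in entry.items()
            }
    return resolved


# Hypothetical factory entry and the explicit dataset names used by a pipeline.
factory_entry = {"type": "pandas.CSVDataset", "filepath": "data/01_raw/{name}_data.csv"}
pipeline_dataset_names = ["cars_data", "model_output"]

resolved = resolve("{name}_data", factory_entry, pipeline_dataset_names)
print(resolved)
```

Only `cars_data` matches the pattern, so it is the only dataset that gets a concrete entry; without the list of names from a pipeline there would be nothing to resolve.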

The full code for these different cases is given below.
