[Feat] Enable datasets_from_catalog to return factory-based datasets #1001

Merged · 17 commits · Feb 12, 2025
39 changes: 39 additions & 0 deletions vizro-core/changelog.d/20250208_114146_4648633+gtauzin.md
@@ -0,0 +1,39 @@
<!--
A new scriv changelog fragment.

Uncomment the section that is right (remove the HTML comment wrapper).
-->

<!--
### Highlights ✨

- A bullet item for the Highlights ✨ category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
<!--
### Removed

- A bullet item for the Removed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
### Added

- Kedro integration function `datasets_from_catalog` can now handle dataset factories. ([#1001](https://github.com/mckinsey/vizro/pull/1001))

### Changed

- Bump optional dependency lower bound to `kedro>=0.19.9`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))

<!--
### Deprecated

- A bullet item for the Deprecated category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->

<!--
### Security

- A bullet item for the Security category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
2 changes: 1 addition & 1 deletion vizro-core/docs/pages/explanation/authors.md
@@ -10,7 +10,7 @@

<!-- vale off -->

[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).
[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Guillaume Tauzin](https://github.com/gtauzin), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).

with thanks to Sam Bourton and Kevin Staight for sponsorship, inspiration and guidance,

2 changes: 1 addition & 1 deletion vizro-core/docs/pages/explanation/faq.md
@@ -95,7 +95,7 @@ Any attempt at a high-level explanation must rely on an oversimplification that

All are great entry points to the world of data apps. If you prefer a top-down scripting style, then Streamlit is a powerful approach. If you prefer full control and customization over callbacks and layouts, then Dash is a powerful approach. If you prefer a configuration approach with in-built best practices, and the potential for customization and scalability through Dash, then Vizro is a powerful approach.

For a more detailed comparison, it may help to visit the introductory articles of [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose, and could be the best tool of choice.
For a more detailed comparison, it may help to read introductory articles about [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://blog.streamlit.io/streamlit-101-python-data-app/) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose.

## How does Vizro compare with Python packages and business intelligence (BI) tools?

62 changes: 54 additions & 8 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
@@ -12,7 +12,7 @@ pip install vizro[kedro]

## Use datasets from the Kedro Data Catalog

`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
@@ -39,20 +39,21 @@
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)

catalog = kedro_integration.catalog_from_project("/path/to/kedro/project")

for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
```python
from vizro.managers import data_manager


for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.py (Data Catalog configuration file)"
@@ -66,6 +67,51 @@

catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8")))

for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

### Use dataset factories

To add datasets that are defined using a [Kedro dataset factory](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), `datasets_from_catalog` needs to resolve dataset patterns against explicit datasets. Given a Kedro `pipelines` dictionary, you should specify a `pipeline` argument as follows:

```python
kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__"]) # (1)!
```

1. You can specify the name of your pipeline, for example `pipelines["my_pipeline"]`, or even combine multiple pipelines with `pipelines["a"] + pipelines["b"]`. The Kedro `__default__` pipeline is what runs by default with the `kedro run` command.
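For reference, a dataset factory entry in `catalog.yaml` is a quoted pattern containing `{}` placeholders rather than an explicit dataset name. The following fragment is an illustrative sketch (the `{dataset_name}` placeholder and file path are made up for this example, not taken from any real project):

```yaml
# Illustrative factory pattern: any dataset whose name ends in "#csv",
# e.g. "companies#csv", resolves to a pandas.CSVDataset with a matching filepath.
"{dataset_name}#csv":
  type: pandas.CSVDataset
  filepath: ./{dataset_name}.csv
```

A name like `companies#csv` only appears as a concrete dataset once it is referenced by a pipeline, which is why `datasets_from_catalog` needs the `pipeline` argument to resolve it.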

The `pipelines` variable may have been created in one of the following ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate a `pipelines` dictionary given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog with dataset factories into the Vizro data manager"
    === "app.py (Kedro project path)"
        ```python
        from vizro.integrations import kedro as kedro_integration
        from vizro.managers import data_manager


        project_path = "/path/to/kedro/project"
        catalog = kedro_integration.catalog_from_project(project_path)
        pipelines = kedro_integration.pipelines_from_project(project_path)

        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
            catalog, pipeline=pipelines["__default__"]
        ).items():
            data_manager[dataset_name] = dataset_loader
        ```

    === "app.ipynb (Kedro Jupyter session)"
        ```python
        from vizro.managers import data_manager


        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
            catalog, pipeline=pipelines["__default__"]
        ).items():
            data_manager[dataset_name] = dataset_loader
        ```
12 changes: 3 additions & 9 deletions vizro-core/hatch.toml
@@ -3,13 +3,6 @@
[[envs.all.matrix]]
python = ["3.9", "3.10", "3.11", "3.12", "3.13"]

[envs.all.overrides]
# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
matrix.python.features = [
{value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
]

[envs.changelog]
dependencies = ["scriv"]
detached = true
@@ -37,6 +30,7 @@ dependencies = [
"pyhamcrest",
"gunicorn"
]
features = ["kedro"]
installer = "uv"

[envs.default.env-vars]
@@ -133,9 +127,9 @@ extra-dependencies = [
"dash==2.18.0",
"plotly==5.24.0",
"pandas==2.0.0",
"numpy==1.23.0" # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
"numpy==1.23.0", # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
"kedro==0.19.9"
]
features = ["kedro"]
python = "3.9"

[publish.index]
2 changes: 1 addition & 1 deletion vizro-core/pyproject.toml
@@ -36,7 +36,7 @@ requires-python = ">=3.9"

[project.optional-dependencies]
kedro = [
"kedro>=0.17.3",
"kedro>=0.19.9",
"kedro-datasets" # no longer a dependency of kedro for kedro>=0.19.2
]

4 changes: 2 additions & 2 deletions vizro-core/src/vizro/integrations/kedro/__init__.py
@@ -1,3 +1,3 @@
from ._data_manager import catalog_from_project, datasets_from_catalog
from ._data_manager import catalog_from_project, datasets_from_catalog, pipelines_from_project

__all__ = ["catalog_from_project", "datasets_from_catalog"]
__all__ = ["catalog_from_project", "datasets_from_catalog", "pipelines_from_project"]
42 changes: 33 additions & 9 deletions vizro-core/src/vizro/integrations/kedro/_data_manager.py
@@ -3,25 +3,49 @@

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.io import DataCatalog
from kedro.io import CatalogProtocol
from kedro.pipeline import Pipeline

from vizro.managers._data_manager import pd_DataFrameCallable


def catalog_from_project(
    project_path: Union[str, Path], env: Optional[str] = None, extra_params: Optional[dict[str, Any]] = None
) -> DataCatalog:
) -> CatalogProtocol:
    bootstrap_project(project_path)
    with KedroSession.create(
        project_path=project_path, env=env, save_on_close=False, extra_params=extra_params
    ) as session:
        return session.load_context().catalog


def datasets_from_catalog(catalog: DataCatalog) -> dict[str, pd_DataFrameCallable]:
    datasets = {}
    for name in catalog.list():
        dataset = catalog._get_dataset(name, suggest=False)
        if "pandas" in dataset.__module__:
            datasets[name] = dataset.load
    return datasets
def pipelines_from_project(project_path: Union[str, Path]) -> dict[str, Pipeline]:
    bootstrap_project(project_path)
    from kedro.framework.project import pipelines

    return pipelines


def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Optional[Pipeline] = None) -> dict[str, pd_DataFrameCallable]:
    # This doesn't include things added to the catalog at run time, but that is ok for our purposes.
    config_resolver = catalog.config_resolver
    kedro_datasets = config_resolver.config.copy()

    if pipeline is not None:
        # Go through all dataset names that weren't in the catalog and try to resolve them. Those that cannot be
        # resolved give an empty dictionary and are ignored.
        for dataset_name in set(pipeline.datasets()) - set(kedro_datasets):
            if dataset_config := config_resolver.resolve_pattern(dataset_name):
                kedro_datasets[dataset_name] = dataset_config

    vizro_data_sources = {}

    for dataset_name, dataset_config in kedro_datasets.items():
        # The "type" key always exists because patterns that resolve to an empty dictionary were filtered out above.
        if "pandas" in dataset_config["type"]:
            # TODO: in future, update to use lambda: catalog.load(dataset_name) instead of _get_dataset,
            # but need to check whether it works with caching.
            dataset = catalog._get_dataset(dataset_name, suggest=False)
            vizro_data_sources[dataset_name] = dataset.load

    return vizro_data_sources
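The resolution logic above can be sketched with a small self-contained stand-in. The regex-based `resolve_pattern` and all dataset names below are hypothetical, standing in for Kedro's config resolver and a real catalog; only the filtering flow mirrors `datasets_from_catalog`:

```python
import re

# Explicit catalog entries, as they would appear in config_resolver.config.
explicit = {"pandas_excel": {"type": "pandas.ExcelDataset"}}


def resolve_pattern(name):
    # Hypothetical resolver for a single "{name}#csv" factory pattern.
    match = re.fullmatch(r"(?P<stem>.+)#csv", name)
    if not match:
        return {}  # unresolved names give an empty dict and are skipped
    return {"type": "pandas.CSVDataset", "filepath": f"./{match['stem']}.csv"}


# Dataset names referenced by the pipeline, including one factory-based name.
pipeline_datasets = {"pandas_excel", "something#csv", "not_in_catalog"}

# Resolve pipeline dataset names that are not already explicit entries.
kedro_datasets = dict(explicit)
for dataset_name in pipeline_datasets - set(kedro_datasets):
    if dataset_config := resolve_pattern(dataset_name):
        kedro_datasets[dataset_name] = dataset_config

# Keep only pandas-type datasets, mirroring the `"pandas" in dataset_config["type"]` check.
vizro_data_sources = {
    name: config for name, config in kedro_datasets.items() if "pandas" in config["type"]
}
print(sorted(vizro_data_sources))  # → ['pandas_excel', 'something#csv']
```

Note how `not_in_catalog` is silently dropped: it matches no pattern, so the resolver returns an empty dict, which is falsy in the walrus-operator check.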
@@ -1,7 +1,15 @@
companies:
type: pandas.JSONDataset
filepath: companies.json
"{pandas_factory}#csv":
type: pandas.CSVDataset
filepath: ./{pandas_factory}.csv

reviews:
pandas_excel:
type: pandas.ExcelDataset
filepath: pandas_excel.xlsx

pandas_parquet:
type: pandas.ParquetDataset
filepath: pandas_parquet.parquet

not_dataframe:
type: pickle.PickleDataset
filepath: reviews.pkl
filepath: pickle.pkl
@@ -3,23 +3,47 @@
import types
from pathlib import Path

import kedro.pipeline as kp
import pytest
import yaml

kedro = pytest.importorskip("kedro")

from kedro.io import DataCatalog # noqa: E402

from vizro.integrations.kedro import datasets_from_catalog # noqa: E402


@pytest.fixture
def catalog_path():
    return Path(__file__).parent / "fixtures/test_catalog.yaml"


def test_datasets_from_catalog(catalog_path):
    catalog = DataCatalog.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))
    assert "companies" in datasets_from_catalog(catalog)
    assert isinstance(datasets_from_catalog(catalog), dict)
    assert isinstance(datasets_from_catalog(catalog)["companies"], types.MethodType)
from kedro.io import DataCatalog, KedroDataCatalog

from vizro.integrations.kedro import datasets_from_catalog


@pytest.fixture(params=[DataCatalog, KedroDataCatalog])
def catalog(request):
    catalog_class = request.param
    catalog_path = Path(__file__).parent / "fixtures/test_catalog.yaml"
    return catalog_class.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))


def test_datasets_from_catalog(catalog):
    datasets = datasets_from_catalog(catalog)
    assert isinstance(datasets, dict)
    assert set(datasets) == {"pandas_excel", "pandas_parquet"}
    for dataset in datasets.values():
        assert isinstance(dataset, types.MethodType)


def test_datasets_from_catalog_with_pipeline(catalog):
    pipeline = kp.pipeline(
        [
            kp.node(
                func=lambda *args: None,
                inputs=[
                    "pandas_excel",
                    "something#csv",
                    "not_dataframe",
                    "not_in_catalog",
                    "pandas_parquet",
                    "parameters",
                    "params:z",
                ],
                outputs=None,
            ),
        ]
    )

    datasets = datasets_from_catalog(catalog, pipeline=pipeline)
    assert set(datasets) == {"pandas_excel", "pandas_parquet", "something#csv"}