[Feat] Enable datasets_from_catalog to return factory-based datasets #1001

Merged · 17 commits · Feb 12, 2025
39 changes: 39 additions & 0 deletions vizro-core/changelog.d/20250208_114146_4648633+gtauzin.md
@@ -0,0 +1,39 @@
<!--
A new scriv changelog fragment.

Uncomment the section that is right (remove the HTML comment wrapper).
-->

<!--
### Highlights ✨

- A bullet item for the Highlights ✨ category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
<!--
### Removed

- A bullet item for the Removed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
### Added

- Kedro integration function `datasets_from_catalog` can now handle dataset factories. ([#1001](https://github.com/mckinsey/vizro/pull/1001))

### Changed

- Bump optional dependency lower bound to `kedro>=0.19.9`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))

<!--
### Deprecated

- A bullet item for the Deprecated category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->

<!--
### Security

- A bullet item for the Security category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
2 changes: 1 addition & 1 deletion vizro-core/docs/pages/explanation/authors.md
@@ -10,7 +10,7 @@

<!-- vale off -->

[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).
[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Guillaume Tauzin](https://github.com/gtauzin), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).

with thanks to Sam Bourton and Kevin Staight for sponsorship, inspiration and guidance,

2 changes: 1 addition & 1 deletion vizro-core/docs/pages/explanation/faq.md
@@ -95,7 +95,7 @@ Any attempt at a high-level explanation must rely on an oversimplification that

All are great entry points to the world of data apps. If you prefer a top-down scripting style, then Streamlit is a powerful approach. If you prefer full control and customization over callbacks and layouts, then Dash is a powerful approach. If you prefer a configuration approach with in-built best practices, and the potential for customization and scalability through Dash, then Vizro is a powerful approach.

For a more detailed comparison, it may help to visit the introductory articles of [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose, and could be the best tool of choice.
For a more detailed comparison, it may help to read introductory articles about [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://blog.streamlit.io/streamlit-101-python-data-app/) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose.

## How does Vizro compare with Python packages and business intelligence (BI) tools?

62 changes: 54 additions & 8 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
@@ -12,7 +12,7 @@ pip install vizro[kedro]

## Use datasets from the Kedro Data Catalog

`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
@@ -39,20 +39,21 @@
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)

catalog = kedro_integration.catalog_from_project("/path/to/kedro/project")

for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
```python
from vizro.managers import data_manager


for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.py (Data Catalog configuration file)"
@@ -66,6 +67,51 @@

catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8")))

for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

### Use dataset factories

To add datasets that are defined using a [Kedro dataset factory](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), `datasets_from_catalog` needs to resolve dataset patterns against explicit datasets. Given a Kedro `pipelines` dictionary, you should specify a `pipeline` argument as follows:

```python
kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__"]) # (1)!
```

1. You can specify the name of your pipeline, for example `pipelines["my_pipeline"]`, or even combine multiple pipelines with `pipelines["a"] + pipelines["b"]`. The Kedro `__default__` pipeline is what runs by default with the `kedro run` command.
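For reference, a dataset factory entry in `catalog.yaml` is a quoted pattern containing `{}` placeholders rather than an explicit dataset name. The following fragment is an illustrative sketch (the `{dataset_name}` placeholder and file path are made up for this example, not taken from any real project):

```yaml
# Illustrative factory pattern: any dataset whose name ends in "#csv",
# e.g. "companies#csv", resolves to a pandas.CSVDataset with a matching filepath.
"{dataset_name}#csv":
  type: pandas.CSVDataset
  filepath: ./{dataset_name}.csv
```

A name like `companies#csv` only appears as a concrete dataset once it is referenced by a pipeline, which is why `datasets_from_catalog` needs the `pipeline` argument to resolve it.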

The `pipelines` variable may have been created in one of the following ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate a `pipelines` dictionary given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog with dataset factories into the Vizro data manager"
    === "app.py (Kedro project path)"
        ```python
        from vizro.integrations import kedro as kedro_integration
        from vizro.managers import data_manager


        project_path = "/path/to/kedro/project"
        catalog = kedro_integration.catalog_from_project(project_path)
        pipelines = kedro_integration.pipelines_from_project(project_path)

        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
            catalog, pipeline=pipelines["__default__"]
        ).items():
            data_manager[dataset_name] = dataset_loader
        ```

    === "app.ipynb (Kedro Jupyter session)"
        ```python
        from vizro.managers import data_manager


        for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
            catalog, pipeline=pipelines["__default__"]
        ).items():
            data_manager[dataset_name] = dataset_loader
        ```
12 changes: 3 additions & 9 deletions vizro-core/hatch.toml
@@ -3,13 +3,6 @@
[[envs.all.matrix]]
python = ["3.9", "3.10", "3.11", "3.12", "3.13"]

[envs.all.overrides]
# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
matrix.python.features = [
{value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
]

[envs.changelog]
dependencies = ["scriv"]
detached = true
@@ -37,6 +30,7 @@ dependencies = [
"pyhamcrest",
"gunicorn"
]
features = ["kedro"]
installer = "uv"

[envs.default.env-vars]
@@ -133,9 +127,9 @@ extra-dependencies = [
"dash==2.18.0",
"plotly==5.24.0",
"pandas==2.0.0",
"numpy==1.23.0" # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
"numpy==1.23.0", # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
"kedro==0.19.9"
]
features = ["kedro"]
python = "3.9"

[publish.index]
2 changes: 1 addition & 1 deletion vizro-core/pyproject.toml
@@ -36,7 +36,7 @@ requires-python = ">=3.9"

[project.optional-dependencies]
kedro = [
"kedro>=0.17.3",
"kedro>=0.19.9",
"kedro-datasets" # no longer a dependency of kedro for kedro>=0.19.2
]

4 changes: 2 additions & 2 deletions vizro-core/src/vizro/integrations/kedro/__init__.py
@@ -1,3 +1,3 @@
from ._data_manager import catalog_from_project, datasets_from_catalog
from ._data_manager import catalog_from_project, datasets_from_catalog, pipelines_from_project

__all__ = ["catalog_from_project", "datasets_from_catalog"]
__all__ = ["catalog_from_project", "datasets_from_catalog", "pipelines_from_project"]
42 changes: 33 additions & 9 deletions vizro-core/src/vizro/integrations/kedro/_data_manager.py
@@ -3,25 +3,49 @@

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.io import DataCatalog
from kedro.io import CatalogProtocol
from kedro.pipeline import Pipeline

from vizro.managers._data_manager import pd_DataFrameCallable


def catalog_from_project(
    project_path: Union[str, Path], env: Optional[str] = None, extra_params: Optional[dict[str, Any]] = None
) -> DataCatalog:
) -> CatalogProtocol:
    bootstrap_project(project_path)
    with KedroSession.create(
        project_path=project_path, env=env, save_on_close=False, extra_params=extra_params
    ) as session:
        return session.load_context().catalog


def datasets_from_catalog(catalog: DataCatalog) -> dict[str, pd_DataFrameCallable]:
    datasets = {}
    for name in catalog.list():
        dataset = catalog._get_dataset(name, suggest=False)
        if "pandas" in dataset.__module__:
            datasets[name] = dataset.load
    return datasets
def pipelines_from_project(project_path: Union[str, Path]) -> dict[str, Pipeline]:
    bootstrap_project(project_path)
    from kedro.framework.project import pipelines

    return pipelines


def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Optional[Pipeline] = None) -> dict[str, pd_DataFrameCallable]:
    # This doesn't include things added to the catalog at run time, but that is ok for our purposes.
    config_resolver = catalog.config_resolver
    kedro_datasets = config_resolver.config.copy()

    if pipeline is not None:
        # Go through all dataset names that weren't in the catalog and try to resolve them. Those that cannot be
        # resolved give an empty dictionary and are ignored.
        for dataset_name in set(pipeline.datasets()) - set(kedro_datasets):
            if dataset_config := config_resolver.resolve_pattern(dataset_name):
                kedro_datasets[dataset_name] = dataset_config

    vizro_data_sources = {}

    for dataset_name, dataset_config in kedro_datasets.items():
        # The "type" key always exists because patterns that resolve to an empty dictionary were filtered out above.
        if "pandas" in dataset_config["type"]:
            # TODO: in future, update to use lambda: catalog.load(dataset_name) instead of _get_dataset,
            # but need to check whether it works with caching.
            dataset = catalog._get_dataset(dataset_name, suggest=False)
            vizro_data_sources[dataset_name] = dataset.load

    return vizro_data_sources
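The resolution logic above can be sketched with a small self-contained stand-in. The regex-based `resolve_pattern` and all dataset names below are hypothetical, standing in for Kedro's config resolver and a real catalog; only the filtering flow mirrors `datasets_from_catalog`:

```python
import re

# Explicit catalog entries, as they would appear in config_resolver.config.
explicit = {"pandas_excel": {"type": "pandas.ExcelDataset"}}


def resolve_pattern(name):
    # Hypothetical resolver for a single "{name}#csv" factory pattern.
    match = re.fullmatch(r"(?P<stem>.+)#csv", name)
    if not match:
        return {}  # unresolved names give an empty dict and are skipped
    return {"type": "pandas.CSVDataset", "filepath": f"./{match['stem']}.csv"}


# Dataset names referenced by the pipeline, including one factory-based name.
pipeline_datasets = {"pandas_excel", "something#csv", "not_in_catalog"}

# Resolve pipeline dataset names that are not already explicit entries.
kedro_datasets = dict(explicit)
for dataset_name in pipeline_datasets - set(kedro_datasets):
    if dataset_config := resolve_pattern(dataset_name):
        kedro_datasets[dataset_name] = dataset_config

# Keep only pandas-type datasets, mirroring the `"pandas" in dataset_config["type"]` check.
vizro_data_sources = {
    name: config for name, config in kedro_datasets.items() if "pandas" in config["type"]
}
print(sorted(vizro_data_sources))  # → ['pandas_excel', 'something#csv']
```

Note how `not_in_catalog` is silently dropped: it matches no pattern, so the resolver returns an empty dict, which is falsy in the walrus-operator check.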
@@ -1,7 +1,15 @@
companies:
type: pandas.JSONDataset
filepath: companies.json
"{pandas_factory}#csv":
type: pandas.CSVDataset
filepath: ./{pandas_factory}.csv

reviews:
pandas_excel:
type: pandas.ExcelDataset
filepath: pandas_excel.xlsx

pandas_parquet:
type: pandas.ParquetDataset
filepath: pandas_parquet.parquet

not_dataframe:
type: pickle.PickleDataset
filepath: reviews.pkl
filepath: pickle.pkl
@@ -3,23 +3,47 @@
import types
from pathlib import Path

import kedro.pipeline as kp
import pytest
import yaml

kedro = pytest.importorskip("kedro")

from kedro.io import DataCatalog # noqa: E402

from vizro.integrations.kedro import datasets_from_catalog # noqa: E402


@pytest.fixture
def catalog_path():
    return Path(__file__).parent / "fixtures/test_catalog.yaml"


def test_datasets_from_catalog(catalog_path):
    catalog = DataCatalog.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))
    assert "companies" in datasets_from_catalog(catalog)
    assert isinstance(datasets_from_catalog(catalog), dict)
    assert isinstance(datasets_from_catalog(catalog)["companies"], types.MethodType)
from kedro.io import DataCatalog, KedroDataCatalog

from vizro.integrations.kedro import datasets_from_catalog


@pytest.fixture(params=[DataCatalog, KedroDataCatalog])
def catalog(request):
    catalog_class = request.param
    catalog_path = Path(__file__).parent / "fixtures/test_catalog.yaml"
    return catalog_class.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))


def test_datasets_from_catalog(catalog):
    datasets = datasets_from_catalog(catalog)
    assert isinstance(datasets, dict)
    assert set(datasets) == {"pandas_excel", "pandas_parquet"}
    for dataset in datasets.values():
        assert isinstance(dataset, types.MethodType)


def test_datasets_from_catalog_with_pipeline(catalog):
    pipeline = kp.pipeline(
        [
            kp.node(
                func=lambda *args: None,
                inputs=[
                    "pandas_excel",
                    "something#csv",
                    "not_dataframe",
                    "not_in_catalog",
                    "pandas_parquet",
                    "parameters",
                    "params:z",
                ],
                outputs=None,
            ),
        ]
    )

    datasets = datasets_from_catalog(catalog, pipeline=pipeline)
    assert set(datasets) == {"pandas_excel", "pandas_parquet", "something#csv"}