
[Feat] Enable datasets_from_catalog to return factory-based datasets #1001

Merged
merged 17 commits into mckinsey:main from gtauzin:dataset_factory on Feb 12, 2025

Conversation

@gtauzin (Contributor) commented Feb 8, 2025

Description

This PR adds the capability for the Kedro integration to use datasets created with dataset factories.

Fixes #988.
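For illustration, a minimal sketch of the intended usage, assuming the catalog_from_project helper from the existing Vizro Kedro integration and the pipeline argument discussed later in this PR (treat the exact names as assumptions):

from vizro.integrations import kedro as kedro_integration

# Load the catalog from an existing Kedro project.
catalog = kedro_integration.catalog_from_project("/path/to/kedro/project")

# Without a pipeline, only datasets explicitly defined in the catalog are returned.
# Passing a pipeline additionally resolves factory-based (pattern) datasets used in it.
# my_pipeline is a placeholder for a kedro Pipeline object.
datasets = kedro_integration.datasets_from_catalog(catalog, pipeline=my_pipeline)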

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

    • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
    • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
    • I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.
    • I have not referenced individuals, products or companies in any commits, directly or indirectly.
    • I have not added data or restricted code in any commits, directly or indirectly.

@gtauzin (Contributor, Author) left a comment

@antonymilne Just a few comments on the implementation you suggested in #988.

Also, when I ran `hatch run test-unit`, I did not see the kedro_integration tests in the list of tests collected by pytest. Not sure why, but happy to investigate if needed.

Review comments (now outdated/resolved) were left on vizro-core/docs/pages/user-guides/kedro-data-catalog.md and vizro-core/src/vizro/integrations/kedro/_data_manager.py.
@gtauzin (Contributor, Author) commented Feb 8, 2025

The CI is failing because I added a polars dataset to the catalog and polars is not included in the dependencies.

This is an interesting edge case: if I have a large number of datasets in my catalog and only want to load a few of them (all pandas-based) for a Vizro dashboard, I should not be forced to install the dependencies for all the datasets in my catalog. Does that make sense? I feel that datasets_from_catalog should take a list of dataset names as a parameter so we can control which ones are instantiated, as sketched below.
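A hypothetical sketch of that proposal (the dataset_names parameter below does not exist; it only illustrates the idea):

# Hypothetical API: instantiate only the named datasets, so unrelated
# dataset dependencies (e.g. polars) are never imported.
datasets = datasets_from_catalog(catalog, dataset_names=["companies", "reviews"])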

@stichbury (Contributor) left a comment

LGTM from a docs perspective, with just two minor tweaks. Thank you so much @gtauzin for this contribution 🌟🌟🌟🌟🌟

@antonymilne (Contributor) left a comment

Thank you very much for your PR @gtauzin! This looks great 💯 I have just looked through the docs so far (thanks very much for doing those as well!) and left a few comments. I'll review the code and tests when I get a chance later today.

Review comments (now outdated/resolved) were left on vizro-core/changelog.d/20250208_114146_4648633+gtauzin.md and vizro-core/docs/pages/user-guides/kedro-data-catalog.md.
@antonymilne (Contributor) left a comment

Now finished reviewing the code and this looks great, thank you so much! Let me just answer some of your questions from #988 here since they're directly related to the PR:

Given it still seems there could be some uncertainties related to the future API of KedroDataCatalog, should I write tests based on KedroDataCatalog catalogs (as well as DataCatalog catalogs)? That way they'll serve as a reminder to update the code if anything changes.

This would be great if you can be bothered to do it, but I'm not too fussed if you can't. The best way to do it would be to use a parametrized fixture to replace the catalog_path fixture, something like this:

import pytest
import yaml
from pathlib import Path
from kedro.io import DataCatalog, KedroDataCatalog

@pytest.fixture(params=[DataCatalog, KedroDataCatalog])
def catalog(request):
    catalog_class = request.param
    catalog_path = Path(__file__).parent / "fixtures/test_catalog.yaml"
    return catalog_class.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))

# for both tests
def test_...(catalog):
    # as now

Could you please help me double-check your tests?

I made some code suggestions there, but basically the intention is to test all (or at least most) possible cases of dataset definition in one test. Based on my suggestions we now have:

  • pandas_excel: pandas dataset defined in catalog and used in pipeline
  • pandas_parquet: pandas dataset defined in catalog and not used in pipeline
  • something#csv: pandas dataset pattern used in pipeline
  • polars: non-pandas dataset defined in catalog and used in pipeline
  • not_dataframe: non-pandas dataset defined in catalog and used in pipeline
  • not_in_catalog: used in pipeline but not in catalog
  • parameters, params:z: the two different ways to use parameters
So you can see this isn't perfect, since e.g. not_dataframe and polars currently serve the same purpose, and we don't have a case for a pandas dataset pattern that is not used in the pipeline. We don't necessarily need to be completely exhaustive here and cover every possible case, but we should cover the most representative ones - I leave it to you to decide what those are!

The expected behaviour is that datasets_from_catalog without a pipeline gives all pandas datasets defined in the catalog, even if they're not used anywhere, and datasets_from_catalog with a pipeline gives the same PLUS the pattern datasets that are used in the pipeline.
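For concreteness, a sketch of what such a catalog could look like, built from a config dict rather than YAML (the dataset type strings are assumptions based on kedro-datasets naming):

from kedro.io import DataCatalog

# Hypothetical catalog covering the cases listed above.
catalog = DataCatalog.from_config({
    "pandas_excel": {"type": "pandas.ExcelDataset", "filepath": "data/a.xlsx"},
    "pandas_parquet": {"type": "pandas.ParquetDataset", "filepath": "data/b.parquet"},
    "{something}#csv": {"type": "pandas.CSVDataset", "filepath": "data/{something}.csv"},
    "polars": {"type": "polars.CSVDataset", "filepath": "data/c.csv"},
    "not_dataframe": {"type": "pickle.PickleDataset", "filepath": "data/d.pkl"},
})

# Expected: datasets_from_catalog(catalog) returns only the two pandas datasets;
# passing a pipeline that uses "something#csv" resolves that pattern dataset too.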

I think inputs should be inputs=["A", "C1", "B", "Z", "parameters", "params:z"], i.e. "B" instead of "D", as "D" is not defined in the catalog.

This was intentional, as explained above - we should test the case where a dataset is used in the pipeline but not defined in the catalog. In fact, a previous version of this code that I wrote would have crashed on such a case, which is common in practice, and that would not have been caught by these tests if we didn't include it.

Shouldn't test 1 return {"A", "B", "Z"} and test 2 {"A", "B", "Z", "C1"}?

Yes, sorry, that was my mistake! Z shouldn't have been a pandas dataset at all but pickle.PickleDataset, like you did correctly here.

As DataCatalog does not inherit from CatalogProtocol, shouldn't the catalog type hint be something like catalog: DataCatalog | CatalogProtocol? Otherwise I think static type checkers will complain if the function is called with an instance of DataCatalog.

No, this should be fine as just CatalogProtocol. Even though DataCatalog does not inherit from CatalogProtocol, it implements all the methods described by the protocol (duck typing), so mypy should be fine with it.
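A minimal illustration of why this works, with simplified stand-ins rather than kedro's actual classes:

from typing import Protocol

class CatalogProtocol(Protocol):  # simplified stand-in for kedro's CatalogProtocol
    def list(self) -> list[str]: ...

class DataCatalog:  # note: no inheritance from CatalogProtocol
    def list(self) -> list[str]:
        return ["pandas_excel", "pandas_parquet"]

def datasets_from_catalog(catalog: CatalogProtocol) -> list[str]:
    return catalog.list()

# mypy accepts this call because DataCatalog matches the protocol structurally.
print(datasets_from_catalog(DataCatalog()))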

@antonymilne (Contributor) commented

@antonymilne Just a few comments on the implementation you suggested in #988.

Also, when I ran `hatch run test-unit`, I did not see the kedro_integration tests in the list of tests collected by pytest. Not sure why, but happy to investigate if needed.

They are definitely running in CI (you can see this easily because they're failing due to the polars issue):

tests/unit/vizro/integrations/kedro/test_kedro_data_manager.py FF        [ 13%]

You'll note that they actually pass on Python 3.13:
[screenshot: CI checks showing the kedro tests passing on Python 3.13]

The reason is that the kedro tests are skipped on that version: kedro doesn't fully support Python 3.13 yet (at least not the last time I checked). So I guess what's happened for you is that hatch run test-unit created an environment with Python 3.13, which doesn't run the kedro tests. FYI, the hatch environment configuration is done like this in hatch.toml:

[envs.all.overrides]
# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
matrix.python.features = [
  {value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
]

You can check what Python version hatch is using with hatch run python -V.

You should be able to fix this by doing hatch run all.py3.12:unit-test which will run the unit tests in a Python 3.12 environment and hopefully include the kedro tests.

@gtauzin (Contributor, Author) commented Feb 10, 2025

Thanks for the review @antonymilne. I believe I have addressed all your suggestions.

The reason is that the kedro tests are skipped on that version: kedro doesn't fully support Python 3.13 yet (at least not the last time I checked). So I guess what's happened for you is that hatch run test-unit created an environment with Python 3.13, which doesn't run the kedro tests. FYI, the hatch environment configuration is done like this in hatch.toml:

[envs.all.overrides]
# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
matrix.python.features = [
  {value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
]

You can check what Python version hatch is using with hatch run python -V.

You should be able to fix this by doing hatch run all.py3.12:unit-test which will run the unit tests in a Python 3.12 environment and hopefully include the kedro tests.

Thanks for the explanation - it was my first time using hatch (other than as a uv build backend).

It seems it was test-unit rather than unit-test, but that failed as well:

(vizro) ➜  vizro-core git:(dataset_factory) ✗ hatch run all.py3.12:test-unit
ImportError while loading conftest '/home/guillaume/Workspace/vizro/vizro-core/tests/conftest.py'.
tests/conftest.py:4: in <module>
    from vizro import Vizro
E   ModuleNotFoundError: No module named 'vizro'

However, hatch run all.py3.11:test-unit ran fine! Out of curiosity, is it possible to run pytest directly? That would make it easier to iterate quickly.

FYI, kedro now supports Python 3.13 (at least according to the badges on the main README.md).

@gtauzin (Contributor, Author) commented Feb 11, 2025

BTW, I think it might be nice to have those functions all documented in the API reference. Happy to take care of this if you agree!

@antonymilne antonymilne changed the title Enable datasets_from_catalog to return factory-based datasets [Feat] Enable datasets_from_catalog to return factory-based datasets Feb 11, 2025
@antonymilne (Contributor) left a comment

Thank you for all your work on this! CI is doing something a bit strange - not sure what's happening exactly - so I may push a few commits onto this branch. Leave it with me to get a second approval on this and get it merged.

@gtauzin

However, hatch run all.py3.11:test-unit ran fine! Out of curiosity, is it possible to run pytest directly? That would make it easier to iterate quickly.

This is strange - not sure what's happened there, but the way to think of these hatch environments is that they are ephemeral and (thanks to uv) quick and easy to remove and recreate. So if something unexpected happens, your best bet is hatch env remove or hatch env prune and then simply running the hatch run command again, which will recreate your environment in a few seconds.

It is indeed possible to run pytest directly - if you have a look at hatch.toml then you'll see that the test-unit script does exactly this and is really just a very thin wrapper:

test-unit = "pytest tests/unit {args}"

The beauty of hatch is that it will (in theory) seamlessly handle all the dependencies you need to run this in the right environment, so running pytest directly won't really be any easier.

In reality I think you have just been unlucky here, because 99% of the time hatch has made our dev process hugely easier, faster and more reproducible. There is very rarely any inconsistency between CI and what happens on your local machine, I promise! Once you get used to hatch, I can almost guarantee you will love it 😀

FYI, kedro now supports Python 3.13 (at least according to the badges on the main README.md).

Interesting! Let me take a look and see if we can do this. I also realised we need to bump our kedro dependency - I'll do that in this PR too.

BTW, I think it might be nice to have those functions all documented in the API reference. Happy to take care of this if you agree!

Yes, I also thought this! Please do go ahead with it, but in a separate PR so I can try and fix CI in this one and get it merged.

@antonymilne (Contributor) commented

@gtauzin I don't seem to be able to push to your fork - is there something you can do from your side to give me permissions to do so? 🙏 e.g. following instructions on https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork

@gtauzin (Contributor, Author) commented Feb 11, 2025

@gtauzin I don't seem to be able to push to your fork - is there something you can do from your side to give me permissions to do so? 🙏 e.g. following instructions on https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork

Strange, it is checked from what I can see.

Let me give you access to my fork directly.

@gtauzin (Contributor, Author) commented Feb 11, 2025

However, hatch run all.py3.11:test-unit ran fine! Out of curiosity, is it possible to run pytest directly? That would make it easier to iterate quickly.

This is strange - not sure what's happened there, but the way to think of these hatch environments is that they are ephemeral and (thanks to uv) quick and easy to remove and recreate. So if something unexpected happens, your best bet is hatch env remove or hatch env prune and then simply running the hatch run command again, which will recreate your environment in a few seconds.

It is indeed possible to run pytest directly - if you have a look at hatch.toml then you'll see that the test-unit script does exactly this and is really just a very thin wrapper:

test-unit = "pytest tests/unit {args}"

The beauty of hatch is that it will (in theory) seamlessly handle all the dependencies you need to run this in the right environment, so running pytest directly won't really be any easier.

In reality I think you have just been unlucky here, because 99% of the time hatch has made our dev process hugely easier, faster and more reproducible. There is very rarely any inconsistency between CI and what happens on your local machine, I promise! Once you get used to hatch, I can almost guarantee you will love it 😀

I see. I guess this is very similar to tox. I'll look into it, thank you! :)

BTW, I think it might be nice to have those functions all documented in the API reference. Happy to take care of this if you agree!

Yes, I also thought this! Please do go ahead with it, but in a separate PR so I can try and fix CI in this one and get it merged.

Sure, I'll get to it soon!

Thanks for the support throughout this PR!

@antonymilne (Contributor) commented

OK, following 5d33293 this PR is good to merge as far as I'm concerned. There's currently a broken link in an unrelated part of the docs that is breaking the docs build, but I'll fix that somehow or force merge to ignore it.

@gtauzin: I've seen this sort of problem before with github, where allowing maintainers access to your fork in the PR only seems to have very limited effect. I'm not sure what causes it but think it might be a bug on the github side. Have you ever had maintainers push successfully to your forks elsewhere before without needing to grant them access to the whole repo?

Hatch does indeed operate a bit like tox (plus many other tools, all combined into one beautiful whole!). It makes lots of things really simple, quick and reliable for us. For example, just now I tested that our kedro tests pass across all Python versions in less than a minute using hatch run all:test-unit -k kedro. And I can pretty much guarantee that it will do exactly the same in CI and on anyone else's computer.

@gtauzin (Contributor, Author) commented Feb 11, 2025

@gtauzin: I've seen this sort of problem before with github, where allowing maintainers access to your fork in the PR only seems to have very limited effect. I'm not sure what causes it but think it might be a bug on the github side. Have you ever had maintainers push successfully to your forks elsewhere before without needing to grant them access to the whole repo?

Never happened to me before, but I probably have less experience than you.

Hatch does indeed operate a bit like tox (plus many other tools, all combined into one beautiful whole!). It makes lots of things really simple, quick and reliable for us. For example, just now I tested that our kedro tests pass across all Python versions in less than a minute using hatch run all:test-unit -k kedro. And I can pretty much guarantee that it will do exactly the same in CI and on anyone else's computer.

Nice! :)

@antonymilne antonymilne requested a review from petar-qb February 11, 2025 14:40
@antonymilne (Contributor) left a comment

I decided that we should still keep this compatible with older versions of kedro (while enabling the new functionality in the versions of kedro that support it). I'll push some more code to do this, but let's not merge until I've done it.

@petar-qb (Contributor) left a comment

Wow, @gtauzin!! 🚀🏅 Huge thanks for such an amazing PR!

The discussion you had with @antonymilne, both in the issue and in the PR comments, was a great reminder of just how powerful Kedro is. After going through everything, it was super easy to understand the PR code and the reasoning behind it.

I have just two very minor (and totally ignorable) comments. @antonymilne, after you add support for the older Kedro versions, feel free to merge this PR! 🎉

@antonymilne (Contributor) commented

Thanks for the quick review @petar-qb!

@gtauzin I just pushed 0514fe3 which hopefully is the last thing we need to do here. I'll check it over one last time with fresh eyes tomorrow morning and then merge. Thanks again for your work on this!

@antonymilne antonymilne enabled auto-merge (squash) February 12, 2025 10:10
@antonymilne antonymilne merged commit 512962e into mckinsey:main Feb 12, 2025
33 checks passed
@gtauzin gtauzin deleted the dataset_factory branch February 12, 2025 10:24
@astrojuanlu commented
On behalf of the Kedro team, thanks for your contribution @gtauzin and thanks Vizro team for such a swift review!
