
[Feat] Enable datasets_from_catalog to return factory-based datasets #1001

Merged
merged 17 commits into mckinsey:main from gtauzin:dataset_factory on Feb 12, 2025

Conversation

@gtauzin (Contributor) commented Feb 8, 2025

Description

This PR adds the capability for the Kedro integration to use datasets created with dataset factories.

Fixes #988.
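For illustration, a minimal sketch of the intended usage, assuming the catalog_from_project helper from the existing Vizro Kedro integration and the pipeline argument discussed later in this PR (treat the exact names as assumptions):

from vizro.integrations import kedro as kedro_integration

# Load the catalog from an existing Kedro project.
catalog = kedro_integration.catalog_from_project("/path/to/kedro/project")

# Without a pipeline, only datasets explicitly defined in the catalog are returned.
# Passing a pipeline additionally resolves factory-based (pattern) datasets used in it.
# my_pipeline is a placeholder for a kedro Pipeline object.
datasets = kedro_integration.datasets_from_catalog(catalog, pipeline=my_pipeline)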

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

    • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
    • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
    • I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.
    • I have not referenced individuals, products or companies in any commits, directly or indirectly.
    • I have not added data or restricted code in any commits, directly or indirectly.

@gtauzin (Contributor, Author) left a comment

@antonymilne Just a few comments on the implementation you suggested in #988.

Also, when I ran `hatch run test-unit`, I did not see the kedro_integration tests in the list of tests collected by pytest. Not sure why, but happy to investigate if needed.

Review comments (now outdated/resolved) were left on vizro-core/docs/pages/user-guides/kedro-data-catalog.md and vizro-core/src/vizro/integrations/kedro/_data_manager.py.
@gtauzin (Contributor, Author) commented Feb 8, 2025

The CI is failing because I added a polars dataset to the catalog and polars is not included in the dependencies.

This is an interesting edge case: if I have a large number of datasets in my catalog and only want to load a few of them (all pandas-based) for a Vizro dashboard, I should not be forced to install the dependencies for all the datasets in my catalog. Does that make sense? I feel that datasets_from_catalog should take a list of dataset names as a parameter so we can control which ones are instantiated, as sketched below.
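A hypothetical sketch of that proposal (the dataset_names parameter below does not exist; it only illustrates the idea):

# Hypothetical API: instantiate only the named datasets, so unrelated
# dataset dependencies (e.g. polars) are never imported.
datasets = datasets_from_catalog(catalog, dataset_names=["companies", "reviews"])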

@stichbury (Contributor) left a comment

LGTM from a docs perspective, with just two minor tweaks. Thank you so much @gtauzin for this contribution 🌟🌟🌟🌟🌟

@antonymilne (Contributor) left a comment

Thank you very much for your PR @gtauzin! This looks great 💯 I have just looked through the docs so far (thanks very much for doing those as well!) and left a few comments. I'll review the code and tests when I get a chance later today.

Review comments (now outdated/resolved) were left on vizro-core/changelog.d/20250208_114146_4648633+gtauzin.md and vizro-core/docs/pages/user-guides/kedro-data-catalog.md.
@antonymilne (Contributor) left a comment

Now finished reviewing the code and this looks great, thank you so much! Let me just answer some of your questions from #988 here since they're directly related to the PR:

Given it still seems there could be some uncertainties related to the future API of KedroDataCatalog, should I write tests based on KedroDataCatalog catalogs (as well as DataCatalog catalogs)? That way they'll serve as a reminder to update the code if anything changes.

This would be great if you can be bothered to do it, but I'm not too fussed if you can't. The best way to do it would be to use a parametrized fixture to replace the catalog_path fixture, something like this:

import pytest
import yaml
from pathlib import Path
from kedro.io import DataCatalog, KedroDataCatalog

@pytest.fixture(params=[DataCatalog, KedroDataCatalog])
def catalog(request):
    catalog_class = request.param
    catalog_path = Path(__file__).parent / "fixtures/test_catalog.yaml"
    return catalog_class.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))

# for both tests
def test_...(catalog):
    # as now

Could you please help me double-check your tests?

I made some code suggestions there, but basically the intention is to test all (or at least most) possible cases of dataset definition in one test. Based on my suggestions we now have:

  • pandas_excel: pandas dataset defined in catalog and used in pipeline
  • pandas_parquet: pandas dataset defined in catalog and not used in pipeline
  • something#csv: pandas dataset pattern used in pipeline
  • polars: non-pandas dataset defined in catalog and used in pipeline
  • not_dataframe: non-pandas dataset defined in catalog and used in pipeline
  • not_in_catalog: used in pipeline but not in catalog
  • parameters, params:z: the two different ways to use parameters
So you can see this isn't perfect, since e.g. not_dataframe and polars currently serve the same purpose, and we don't have a case for a pandas dataset pattern that is not used in the pipeline. We don't necessarily need to be completely exhaustive here and cover every possible case, but we should cover the most representative ones - I leave it to you to decide what those are!

The expected behaviour is that datasets_from_catalog without a pipeline gives all pandas datasets defined in the catalog, even if they're not used anywhere, and datasets_from_catalog with a pipeline gives the same PLUS the pattern datasets that are used in the pipeline.
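For concreteness, a sketch of what such a catalog could look like, built from a config dict rather than YAML (the dataset type strings are assumptions based on kedro-datasets naming):

from kedro.io import DataCatalog

# Hypothetical catalog covering the cases listed above.
catalog = DataCatalog.from_config({
    "pandas_excel": {"type": "pandas.ExcelDataset", "filepath": "data/a.xlsx"},
    "pandas_parquet": {"type": "pandas.ParquetDataset", "filepath": "data/b.parquet"},
    "{something}#csv": {"type": "pandas.CSVDataset", "filepath": "data/{something}.csv"},
    "polars": {"type": "polars.CSVDataset", "filepath": "data/c.csv"},
    "not_dataframe": {"type": "pickle.PickleDataset", "filepath": "data/d.pkl"},
})

# Expected: datasets_from_catalog(catalog) returns only the two pandas datasets;
# passing a pipeline that uses "something#csv" resolves that pattern dataset too.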

I think inputs should be inputs=["A", "C1", "B", "Z", "parameters", "params:z"], i.e. "B" instead of "D", as "D" is not defined in the catalog.

This was intentional, as explained above - we should test the case where a dataset is used in the pipeline but not defined in the catalog. In fact, a previous version of this code that I wrote would have crashed on such a case, which is common in practice, and that would not have been caught by these tests if we didn't include it.

Shouldn't test 1 return {"A", "B", "Z"} and test 2 {"A", "B", "Z", "C1"}?

Yes, sorry, that was my mistake! Z shouldn't have been a pandas dataset at all but pickle.PickleDataset, like you did correctly here.

As DataCatalog does not inherit from CatalogProtocol, shouldn't the catalog type hint be something like catalog: DataCatalog | CatalogProtocol? Otherwise I think static type checkers will complain if the function is called with an instance of DataCatalog.

No, this should be fine as just CatalogProtocol. Even though DataCatalog does not inherit from CatalogProtocol, it implements all the methods described by the protocol (duck typing), so mypy should be fine with it.
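A minimal illustration of why this works, with simplified stand-ins rather than kedro's actual classes:

from typing import Protocol

class CatalogProtocol(Protocol):  # simplified stand-in for kedro's CatalogProtocol
    def list(self) -> list[str]: ...

class DataCatalog:  # note: no inheritance from CatalogProtocol
    def list(self) -> list[str]:
        return ["pandas_excel", "pandas_parquet"]

def datasets_from_catalog(catalog: CatalogProtocol) -> list[str]:
    return catalog.list()

# mypy accepts this call because DataCatalog matches the protocol structurally.
print(datasets_from_catalog(DataCatalog()))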

@antonymilne (Contributor) commented

@antonymilne Just a few comments on the implementation you suggested in #988.

Also, when I ran `hatch run test-unit`, I did not see the kedro_integration tests in the list of tests collected by pytest. Not sure why, but happy to investigate if needed.

They are definitely running in CI (you can see this easily because they're failing due to the polars issue):

tests/unit/vizro/integrations/kedro/test_kedro_data_manager.py FF        [ 13%]

You'll note that they actually pass on Python 3.13:
[screenshot: CI checks showing the kedro tests passing on Python 3.13]

The reason is that the kedro tests are skipped on that version: kedro doesn't fully support Python 3.13 yet (at least not the last time I checked). So I guess what's happened for you is that hatch run test-unit created an environment with Python 3.13, which doesn't run the kedro tests. FYI, the hatch environment configuration is done like this in hatch.toml:

[envs.all.overrides]
# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
matrix.python.features = [
  {value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
]

You can check what Python version hatch is using with hatch run python -V.

You should be able to fix this by doing hatch run all.py3.12:unit-test which will run the unit tests in a Python 3.12 environment and hopefully include the kedro tests.

@gtauzin (Contributor, Author) commented Feb 10, 2025

Thanks for the review @antonymilne. I believe I have addressed all your suggestions.

The reason is that the kedro tests are skipped on that version: kedro doesn't fully support Python 3.13 yet (at least not the last time I checked). So I guess what's happened for you is that hatch run test-unit created an environment with Python 3.13, which doesn't run the kedro tests. FYI, the hatch environment configuration is done like this in hatch.toml:

[envs.all.overrides]
# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
matrix.python.features = [
  {value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
]

You can check what Python version hatch is using with hatch run python -V.

You should be able to fix this by doing hatch run all.py3.12:unit-test which will run the unit tests in a Python 3.12 environment and hopefully include the kedro tests.

Thanks for the explanation - it was my first time using hatch (other than as a uv build backend).

It seems it was test-unit rather than unit-test, but that failed as well:

(vizro) ➜  vizro-core git:(dataset_factory) ✗ hatch run all.py3.12:test-unit
ImportError while loading conftest '/home/guillaume/Workspace/vizro/vizro-core/tests/conftest.py'.
tests/conftest.py:4: in <module>
    from vizro import Vizro
E   ModuleNotFoundError: No module named 'vizro'

However, hatch run all.py3.11:test-unit ran fine! Out of curiosity, is it possible to run pytest directly? That would make it easier to iterate quickly.

FYI, kedro now supports Python 3.13 (at least according to the badges on the main README.md).

@gtauzin (Contributor, Author) commented Feb 11, 2025

BTW, I think it might be nice to have those functions all documented in the API reference. Happy to take care of this if you agree!

@antonymilne antonymilne changed the title Enable datasets_from_catalog to return factory-based datasets [Feat] Enable datasets_from_catalog to return factory-based datasets Feb 11, 2025
@antonymilne (Contributor) left a comment

Thank you for all your work on this! CI is doing something a bit strange - not sure what's happening exactly - so I may push a few commits onto this branch. Leave it with me to get a second approval on this and get it merged.

@gtauzin

However, hatch run all.py3.11:test-unit ran fine! Out of curiosity, is it possible to run pytest directly? That would make it easier to iterate quickly.

This is strange - not sure what's happened there, but the way to think of these hatch environments is that they are ephemeral and (thanks to uv) quick and easy to remove and recreate. So if something unexpected happens, your best bet is hatch env remove or hatch env prune and then simply running the hatch run command again, which will recreate your environment in a few seconds.

It is indeed possible to run pytest directly - if you have a look at hatch.toml then you'll see that the test-unit script does exactly this and is really just a very thin wrapper:

test-unit = "pytest tests/unit {args}"

The beauty of hatch is that it will (in theory) seamlessly handle all the dependencies you need to run this in the right environment, so running pytest directly won't really be any easier.

In reality I think you have just been unlucky here, because 99% of the time hatch has made our dev process hugely easier, faster and more reproducible. There is very rarely any inconsistency between CI and what happens on your local machine, I promise! Once you get used to hatch, I can almost guarantee you will love it 😀

FYI, kedro now supports Python 3.13 (at least according to the badges on the main README.md).

Interesting! Let me take a look and see if we can do this. I also realised we need to bump our kedro dependency - I'll do that in this PR too.

BTW, I think it might be nice to have those functions all documented in the API reference. Happy to take care of this if you agree!

Yes, I also thought this! Please do go ahead with it, but in a separate PR so I can try and fix CI in this one and get it merged.

@antonymilne (Contributor) commented

@gtauzin I don't seem to be able to push to your fork - is there something you can do from your side to give me permissions to do so? 🙏 e.g. following instructions on https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork

@gtauzin (Contributor, Author) commented Feb 11, 2025

@gtauzin I don't seem to be able to push to your fork - is there something you can do from your side to give me permissions to do so? 🙏 e.g. following instructions on https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork

Strange, it is checked from what I can see.

Let me give you access to my fork directly.

@gtauzin (Contributor, Author) commented Feb 11, 2025

However, hatch run all.py3.11:test-unit ran fine! Out of curiosity, is it possible to run pytest directly? That would make it easier to iterate quickly.

This is strange - not sure what's happened there, but the way to think of these hatch environments is that they are ephemeral and (thanks to uv) quick and easy to remove and recreate. So if something unexpected happens, your best bet is hatch env remove or hatch env prune and then simply running the hatch run command again, which will recreate your environment in a few seconds.

It is indeed possible to run pytest directly - if you have a look at hatch.toml then you'll see that the test-unit script does exactly this and is really just a very thin wrapper:

test-unit = "pytest tests/unit {args}"

The beauty of hatch is that it will (in theory) seamlessly handle all the dependencies you need to run this in the right environment, so running pytest directly won't really be any easier.

In reality I think you have just been unlucky here, because 99% of the time hatch has made our dev process hugely easier, faster and more reproducible. There is very rarely any inconsistency between CI and what happens on your local machine, I promise! Once you get used to hatch, I can almost guarantee you will love it 😀

I see. I guess this is very similar to tox. I'll look into it, thank you! :)

BTW, I think it might be nice to have those functions all documented in the API reference. Happy to take care of this if you agree!

Yes, I also thought this! Please do go ahead with it, but in a separate PR so I can try and fix CI in this one and get it merged.

Sure, I'll get to it soon!

Thanks for the support throughout this PR!

@antonymilne (Contributor) commented

OK, following 5d33293 this PR is good to merge as far as I'm concerned. There's currently a broken link in an unrelated part of the docs that is breaking the docs build, but I'll fix that somehow or force merge to ignore it.

@gtauzin: I've seen this sort of problem before with github, where allowing maintainers access to your fork in the PR only seems to have very limited effect. I'm not sure what causes it but think it might be a bug on the github side. Have you ever had maintainers push successfully to your forks elsewhere before without needing to grant them access to the whole repo?

Hatch does indeed operate a bit like tox (plus many other tools, all combined into one beautiful whole!). It makes lots of things really simple, quick and reliable for us. For example, just now I tested that our kedro tests pass across all Python versions in less than a minute using hatch run all:test-unit -k kedro. And I can pretty much guarantee that it will do exactly the same in CI and on anyone else's computer.

@gtauzin (Contributor, Author) commented Feb 11, 2025

@gtauzin: I've seen this sort of problem before with github, where allowing maintainers access to your fork in the PR only seems to have very limited effect. I'm not sure what causes it but think it might be a bug on the github side. Have you ever had maintainers push successfully to your forks elsewhere before without needing to grant them access to the whole repo?

Never happened to me before, but I probably have less experience than you.

Hatch does indeed operate a bit like tox (plus many other tools, all combined into one beautiful whole!). It makes lots of things really simple, quick and reliable for us. For example, just now I tested that our kedro tests pass across all Python versions in less than a minute using hatch run all:test-unit -k kedro. And I can pretty much guarantee that it will do exactly the same in CI and on anyone else's computer.

Nice! :)

@antonymilne antonymilne requested a review from petar-qb February 11, 2025 14:40
@antonymilne (Contributor) left a comment

I decided that we should still keep this compatible with older versions of kedro (while enabling the new functionality in the versions of kedro that support it). I'll push some more code to do this, but let's not merge until I've done it.

@petar-qb (Contributor) left a comment

Wow, @gtauzin!! 🚀🏅 Huge thanks for such an amazing PR!

The discussion you had with @antonymilne, both in the issue and in the PR comments, was a great reminder of just how powerful Kedro is. After going through everything, it was super easy to understand the PR code and the reasoning behind it.

I have just two very minor (and totally ignorable) comments. @antonymilne, after you add support for the older Kedro versions, feel free to merge this PR! 🎉

@antonymilne (Contributor) commented

Thanks for the quick review @petar-qb!

@gtauzin I just pushed 0514fe3 which hopefully is the last thing we need to do here. I'll check it over one last time with fresh eyes tomorrow morning and then merge. Thanks again for your work on this!

@antonymilne antonymilne enabled auto-merge (squash) February 12, 2025 10:10
@antonymilne antonymilne merged commit 512962e into mckinsey:main Feb 12, 2025
33 checks passed
@gtauzin gtauzin deleted the dataset_factory branch February 12, 2025 10:24
@astrojuanlu commented
On behalf of the Kedro team, thanks for your contribution @gtauzin and thanks Vizro team for such a swift review!
