[Feat] Enable datasets_from_catalog to return factory-based datasets #1001
Conversation
Signed-off-by: Guillaume Tauzin <[email protected]>
@antonymilne Just a few comments on the implementation you suggested in #988.
Also, when I ran `hatch run test-unit`, I did not see the kedro_integration tests in the list of tests collected by pytest. Not sure why, but happy to investigate if needed.
The CI is failing because I added a polars dataset to the catalog and polars is not included in the dependencies. This is an interesting edge case: if I have a large number of datasets in my catalog and I want to load only a few of them (all pandas-based) for a Vizro dashboard, I should not be forced to install the dependencies for every dataset in my catalog. Does that make sense?
LGTM from docs perspective, with just two minor tweaks. Thank you so much @gtauzin for this contribution 🌟🌟🌟🌟🌟
Thank you very much for your PR @gtauzin! This looks great 💯 I just looked through the docs so far (thanks very much for doing those as well!) and left a few comments but it generally looks great. I'll review the code and tests when I get a chance later today.
Now finished reviewing the code and this looks great, thank you so much! Let me just answer some of your questions from #988 here since they're directly related to the PR:
Given it still seems there could be some uncertainties related to the future API of KedroDataCatalog, should I write tests based on KedroDataCatalog catalogs (as well as DataCatalog catalogs)? That way they'll serve as a reminder to update the code if anything changes.
This would be great if you can be bothered to do it but I'm not too fussed if you can't. The best way to do it would be to use a parametrized fixture to replace the catalog_path fixture with something like this:
@pytest.fixture(params=[DataCatalog, KedroDataCatalog])
def catalog(request):
    catalog_class = request.param
    catalog_path = Path(__file__).parent / "fixtures/test_catalog.yaml"
    return catalog_class.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))

# for both tests
def test_...(catalog):
    # as now
Could you please help me double check your tests?
I made some code suggestions there, but basically the intention is to test all (or at least most) possible cases of dataset definition in one test. Based on my suggestions we now have:
- pandas_excel: pandas dataset defined in catalog and used in pipeline
- pandas_parquet: pandas dataset defined in catalog and not used in pipeline
- something#csv: pandas dataset pattern used in pipeline
- polars: non-pandas dataset defined in catalog and used in pipeline
- not_dataframe: non-pandas dataset defined in catalog and used in pipeline
- not_in_catalog: used in pipeline but not in catalog
- parameters, params:z: the two different ways to use parameters
So you can see this isn't perfect since e.g. not_dataframe and polars currently serve the same purpose, and we don't have a case for a pandas dataset pattern not used in pipeline. We don't necessarily need to be completely exhaustive here and cover every possible case but should cover the most representative ones - I leave it to you to decide what those are!
The expected behaviour should be that datasets_from_catalog without pipeline gives all pandas datasets defined in the catalog, even if they're not used anywhere, and datasets_from_catalog with pipeline gives the same PLUS pattern datasets that are used in pipeline.
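To make that expected behaviour concrete, here is a plain-Python sketch of the filtering logic, deliberately using simple dicts and sets instead of a real kedro catalog. The names mirror the illustrative test catalog discussed above; the real implementation in the Vizro kedro integration works on actual catalog and pipeline objects.

```python
# Each catalog entry mapped to the kind of dataset it holds.
catalog = {
    "pandas_excel": "pandas",    # defined in catalog, used in pipeline
    "pandas_parquet": "pandas",  # defined in catalog, not used in pipeline
    "polars": "polars",          # non-pandas, used in pipeline
    "not_dataframe": "pickle",   # non-pandas, used in pipeline
}

# Factory-pattern datasets only materialise when resolved against a pipeline.
pattern_matches = {"something#csv": "pandas"}

pipeline_inputs = {
    "pandas_excel", "something#csv", "polars", "not_dataframe",
    "not_in_catalog", "parameters", "params:z",
}

def datasets_from_catalog(catalog, pipeline_inputs=None):
    # Without a pipeline: every pandas dataset explicitly defined in the catalog.
    datasets = {name for name, kind in catalog.items() if kind == "pandas"}
    # With a pipeline: additionally resolve pandas factory patterns used in it.
    if pipeline_inputs is not None:
        datasets |= {
            name for name, kind in pattern_matches.items()
            if kind == "pandas" and name in pipeline_inputs
        }
    return datasets

print(sorted(datasets_from_catalog(catalog)))
# ['pandas_excel', 'pandas_parquet']
print(sorted(datasets_from_catalog(catalog, pipeline_inputs)))
# ['pandas_excel', 'pandas_parquet', 'something#csv']
```

Note how not_in_catalog, parameters and params:z are simply ignored rather than causing an error, which is exactly the case the tests are meant to guard against.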
I think inputs should be inputs=["A", "C1", "B", "Z", "parameters", "params:z"], i.e. "B" instead of "D", as "D" is not defined in the catalog.
This was intentional as explained above - we should test the case where a dataset is not defined in the catalog. In fact, a previous version of this code that I wrote would have crashed in such a case, which is common in practice, and that would not have been caught by these tests if we didn't include it.
Shouldn't test 1 return {"A", "B", "Z"} and test 2 {"A", "B", "Z", "C1"}?
Yes, sorry, that was my mistake! Z shouldn't have been a pandas dataset at all but pickle.PickleDataset, like you did correctly here.
As DataCatalog does not inherit from CatalogProtocol, shouldn't the catalog type hint be something like catalog: DataCatalog | CatalogProtocol? Otherwise I think static type checks will report an error if the function is called with an instance of DataCatalog.
No, this should be fine as just CatalogProtocol. Even though DataCatalog does not inherit from CatalogProtocol, it implements all the methods described by the protocol (duck typing) and so mypy should be fine with it.
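This structural-typing behaviour of typing.Protocol can be shown with a minimal, self-contained example. The classes here are stand-ins, not the real kedro ones, and CatalogLike is a hypothetical single-method protocol:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class CatalogLike(Protocol):
    """Stand-in for kedro's CatalogProtocol."""
    def list(self) -> list[str]: ...

class MyCatalog:  # note: no inheritance from CatalogLike
    def __init__(self, names):
        self._names = list(names)

    def list(self) -> list[str]:
        return self._names

def dataset_names(catalog: CatalogLike) -> list[str]:
    # mypy accepts MyCatalog here because it structurally
    # matches CatalogLike, even without inheriting from it.
    return catalog.list()

print(dataset_names(MyCatalog(["a", "b"])))    # ['a', 'b']
print(isinstance(MyCatalog([]), CatalogLike))  # True, thanks to @runtime_checkable
```

The @runtime_checkable decorator is only needed for the isinstance check; plain static type checking with mypy works on any Protocol.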
They are definitely running in CI (you can see easily because they're failing due to the polars thing):
You'll note that actually they pass on Python 3.13. The reason for this is that the kedro tests are skipped on that version because kedro doesn't fully support Python 3.13 yet (at least not the last time I checked). So I guess what's happened for you is that …
You can check what Python version hatch is using with … You should be able to fix this by doing …
Thanks for the review @antonymilne. I believe I included all your suggestions.
Thanks for the explanation, it was my first time using … Seems like it was …
However, FYI, kedro now supports Python 3.13 (at least according to the badges on the main README.md).
BTW, I think it might be nice to have those functions all documented in the API reference. Happy to take care of this if you agree!
Thank you for all your work on this! CI is doing something a bit strange, not sure what's happening exactly, but I may push a few commits onto this branch. Leave it with me to get a second approval on this and get it merged.
However, hatch run all.py3.11:test-unit ran fine! Out of curiosity, is it possible to run pytest directly? That would make it easier to iterate quickly.
This is strange - not sure what's happened there, but basically the way to think of these hatch environments is that they are ephemeral and (thanks to uv) quick and easy to remove and recreate. So if something unexpected happens then your best bet is hatch env remove or hatch env prune and then to just do the hatch run command again, which will recreate your environment in a few seconds.
It is indeed possible to run pytest directly - if you have a look at hatch.toml then you'll see that the test-unit script does exactly this and is really just a very thin wrapper:
test-unit = "pytest tests/unit {args}"
The beauty of hatch is that it will (in theory) seamlessly handle all the dependencies you need to run this in the right environment, so running pytest directly won't really be any easier.
In reality I think you have been unlucky here because 99% of the time hatch has made our dev process hugely easier, faster and more reproducible. There is very rarely inconsistency between CI and what happens on your local machine, I promise! Once you get used to hatch then I can almost guarantee you will love it 😀
FYI, kedro now support python 3.13 (at least according to the badges on the main README.md)
Interesting! Let me take a look and see if we can do this. I also realised we need to bump our kedro dependency - I'll do that in this PR too.
BTW, I think it might be nice to have those functions all documented in the API reference. Happy to take care of this if you agree!
Yes, I also thought this! Please do go ahead with it, but in a separate PR so I can try and fix CI in this one and get it merged.
@gtauzin I don't seem to be able to push to your fork - is there something you can do from your side to give me permissions to do so? 🙏 e.g. following instructions on https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork
Strange, it is checked from what I can see. Let me give you access to my fork directly.
I see. I guess this is very similar to tox. I'll look into it, thank you! :)
Sure, I'll get to it soon! Thanks for the support throughout this PR!
OK, following 5d33293 this PR is good to merge as far as I'm concerned. There's currently a broken link in an unrelated bit of the docs breaking the docs build, but I'll fix that somehow or force merge to ignore it.
@gtauzin: I've seen this sort of problem before with GitHub, where allowing maintainers access to your fork in the PR only seems to have a very limited effect. I'm not sure what causes it but think it might be a bug on the GitHub side. Have you ever had maintainers push successfully to your forks elsewhere before without needing to grant them access to the whole repo?
Hatch does indeed operate a bit like tox (plus many other tools, all combined into one beautiful whole!). It makes lots of things really simple, quick and reliable for us, e.g. just now I tested that our kedro tests pass across all Python versions in less than a minute using …
Never happened to me before, but I probably have less experience than you.
Nice! :)
I decided that we should still keep this compatible with older versions of kedro (while enabling the new functionality in versions of kedro that support it). I'll push some more code to do this, but let's not merge until I've done it.
Wow, @gtauzin!! 🚀🏅 Huge thanks for such an amazing PR!
The discussion you had with @antonymilne, both in this issue and the PR comments, was a great reminder of just how powerful Kedro is. After going through everything, it was super clear to understand the PR code and the reasoning behind it.
I have just two very minor (and totally ignorable) comments, and @antonymilne, once you've added support for the older Kedro versions too, feel free to merge this PR! 🎉
Co-authored-by: Petar Pejovic <[email protected]>
On behalf of the Kedro team, thanks for your contribution @gtauzin and thanks Vizro team for such a swift review!
Description
This PR adds the capability to the kedro integration to exploit datasets created using dataset factories.
Fixes #988.