-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated docs with Croissant information. #330
Conversation
Hi @amercader, just moving the conversation from #328 to here. Many thanks for the changes to enable Croissant features as a plugin, and to provide a dedicated endpoint. I synced with these latest changes and tried them out, and all works as expected. I wouldn't have so easily figured out what's needed to enable these new features, so your work is a big help, and of course it then naturally fits in with your desired structure too. I looked at the existing docs, and thought that maybe the Croissant info could better sit in the existing files rather than a new file. Firstly, the schema info naturally sits alongside other schema info in Finally, when producing the new examples just mentioned, I noticed that the Please let me know what you think of the above when you have time. |
Hi @Reikyo thanks for this, see below:
This looks pretty high level so maybe there's something obvious that needs to be tweaked in the profile? This eventually needs to run in an automated test so we make sure the output remains valid in the future. I added the test in 7fd67fe so you can run it locally if you want but it's essentially the same as running the validator, only the test uses a dataset with all properties present. I'm getting errors when providing the
|
After much debugging I found out the issue causing the
|
Hi @amercader - thanks for the continued efforts here, and apologies as I've been busy with another project which has diverted attention lately. Firstly regarding documentation, yes please modify according to your preferred format and breakdown. I hope there's enough info in what I already supplied, just let me know if anything needs further clarification. I'll need to get some further input from others in the Croissant working group to be completely sure of the implementation details to be added to the docs. Secondly I need to spend some time with those tests and validation errors to check through properly. I had a very quick look, and noted that Please bear with me while I look into this, time is also further limited this week due to annual leave, but I'll try and get the needed conversations going at the very least. |
No worries @Reikyo I can completely relate. There's no rush in getting this merged and I'd rather get it right. I'll also spend some time this week to work on docs and look at these errors, so I'm confident we can get it over the line soon. |
@amercader Yes, agree on getting it right from launch, seeing as we've come this far already. I see what you mean with the http vs https issue for schema.org in @context. I'll have a word with the Croissant working group, as that looks like something on which the validator should perhaps be more flexible. With that changed, I then also see other errors as I'm sure you are, which seem to be related to the nested structure that rdflib creates if there's more than one graph node. On a quick look, my first impression was that the validator is demanding a flat structure like {blah1, blah2, blah3} rather than a nested structure like [{blah1}, {blah2}, {blah3}], but the plugin is designed to give the latter (following the schemaorg.py plugin). However, seeing as the plugin and the validator both rely on rdflib, it seems to me that they should both be able to read and write the variants that this library allows for. There could be more to it though, and I've noted that you flagged a possible issue with IDs too. I'll need a block of time to look at this all properly. As mentioned, I'm currently on leave but will aim to give this some focus next week. Thanks for your patience and continued support. |
You are correct, CKAN transforms its metadata into an RDFlib graph and serializes it, ML Croissant takes this serialization and turns it into an RDFlib graph so that should work fine. Turns out the failures were caused by the changes needed in the example and some fixes in the profile logic (36afa7d). I also introduced a change to the processors so you can pass a custom context (2fc9489 + a1390fa) so the validator stops complaining. Anyway, the bottom line is that the validation tests are passing now 🚀 https://github.com/ckan/ckanext-dcat/actions/runs/13199110300 I think the metadata schema and the profile could be tweaked a bit more to align it better with the spec (e.g. hashes are mandatory in Croissant but not in CKAN) but what we have is an excellent first iteration that we can improve in the future. If you can do some more testing on your end that would be great, I'd finish the docs next week. |
@amercader I've had a look your recent changes individually and all makes sense. Thanks also for catching the resource_dict->subresource_dict typo in croissant.py too. In fact that prompted me to review a chunk of things in that area and the associated section of the croissant.yaml schema file. I'll make some tweaks tomorrow and push. As you've seen, the subresources section I introduced is simply to allow people to describe the contents of container files e.g. ZIP files. A subresource can either be a cr:FileObject or a cr:FileSet. Each case has different needs, such as "includes" and "excludes" only being relevant for the latter. I've simply noted this in the help text for users, but it would be great if the presence of fields can actually be toggled based on the selection of another. So, for this example, if a subresource is flagged as cr:FileObject then "includes" and "excludes" fields wouldn't even be shown. Do you know if this kind of behaviour is possible? Regarding certain required aspects of the Croissant spec, yes there are some differences with what comes out of vanilla CKAN when not using the bespoke Croissant schema. But we don't want to force users to change their schema because of this. As long as we have as many fields as possible mapped to a Croissant(-like) output via the profile, in as many cases as possible, then that will still be helpful and support adoption of the standard. |
That sounds great
I was confused by this, as I was not finding any mention of sub resources in the spec. Why can't we use CKAN resources to model the Croissant Resources directly? We can apply the type property (cr:FileObject or a cr:FileSet) and the includes / excludes ones to the resource schema directly and we will avoid the complexity of the subresources. Unless I'm missing something.
This is not well supported out the box in CKAN but I'm sure we can use a custom form snippet for the resource type field that uses some js to hide/show the includes/excludes fields when changing the value
I very much agree with this approach. |
BTW I'm trying to go one level deeper to make this more useful and expose |
@Reikyo my apologies, your comment appeared duplicated and when I deleted one of the copies stupid Github got rid of both. I'm really sorry! |
@amercader Yes I originally tried to post when on the "Preview" tab but it didn't seem to work, so I went back to the "Write" tab and tried again ... but it was just being slow and it posted twice. I did actually delete the duplicate myself, but for some reason it still showed you both even though there was really only one there, so your deletion cleared everything. Basically GitHub was being a bit glitchy, but I saved a copy and here it is: Original message (plus tweaks!): As far as I was seeing, a resource is a distinct uploaded file. If a resource happens to be a container file (ZIP etc.), then it can have subresources which then don't require separate upload, hence aren't resources themselves, and for the purposes of this plugin we only require information about them to enrich the metadata of the associated resource. For a particular resource, entering its subresource info on the resource's own edit page then makes it clear that subresources don't require separate upload, and also allows for automated parent-child identification in construction of the graph, which is what the "containedIn" line is all about in the croissant.py profile. Also, seeing as only the resource itself is uploaded, the hash is only needed there rather than for the individual subresources. I see you've introduced it for subresources too, but I'm not sure this is warranted. Apologies, my understanding does fall down a bit here, but this observation is supported by what's seen in the Croissant spec. For example, in the following snippet from the spec, flores200_dataset.tar.gz is a top-level resource and is the only thing which is actually uploaded, and so the only thing with a hash to verify the download against. It contains a couple of cr:FileSet items and a couple of cr:FileObject items, none of which have hashes but all of which have the containedIn field instead, and all of which are subresources for the purposes of the CKAN plugin. The cr:FileObject items are simple files, and the cr:FileSet items are descriptions about groups of files:
Basically, the unzipped file structure is as follows:
This is, admittedly, if I've understood the spec correctly myself! One area the current approach falls short is that technically a container file can contain another container file, so a subresource can therefore be a container, and itself have subresources. Actually the issue is really just with the containedIn field (I'm thinking aloud here) ... if the user could specify the containedIn field, then they can override the default that would otherwise point to the top-level resource. This would require them to give IDs in order to specify the appropriate relationships, which could be error prone, but it's maybe better than nothing. This is another reason to have all subresource entries be done on the same edit page, rather than over separate edit pages, as then all information can be seen and edited at once for a top-level resource downwards. So to give you a heads-up, the changes I'm looking to make are: Resources
Subresources
|
Thanks for the detailed answer @Reikyo , I'll have a proper read tomorrow and get back to you . IIRC I added the hash to subresources because the validator was moaning about it but I'll double check |
@amercader Just a quick message to say I've made the changes mentioned yesterday and am just working through a couple of final issues. Will update again soon. |
- Added url, id_given_contained_in, size and same_as for subresources - Modified same_as cardinality from MANY to ONE for resources and subresources. The specification states MANY is actually acceptable, but the validator doesn't let this pass for some reason. - Modified help text for better detail and clarity profiles/croissant.py: - Added "excludes" to the custom JSONLD_CONTEXT. Even though this is not present in the specification, it follows the same pattern used for "includes", and ensures the output is as desired. - Removed use of _get_dict_value(). This seemed superfluous given the standard dictionary get() method. - Added a "type" of "fileObject" to top-level resources as standard - Harmonised _resources_graph() and _resource_subresources_graph(), and all the functions they make use of - Changed from CR.containedIn to SCHEMA.containedIn. The former was causing validator issues, but not sure if the latter is technically correct due to referring to places, see schema.org for details. It works here though.
… the Croissant schema
@amercader I've finalised the latest changes and pushed (this PR still includes the doc changes, so of course feel free to do whatever you need to do with those). I found that the hash is needed for FileObjects which are top-levels resources (as expected, as these are the primary upload/downloads), but not for FileObjects which are subresources. The issue with the latter was solved by changing the output nodes from using
In a similar observation, FileSet output nodes used I found that the I found that FileObjects (both top-level resources and subresources) can only have one "Same as" entry, even though the Croissant spec states a cardinality of many, so I temporarily changed these multiple entry fields to single entry. Again, I think either the spec or the validator may need tweaking, they're not in sync. We'll then know which way to finalise this field. I found that use of
we instead have output like:
Although the latter is accepted by the validator, it does make everything more difficult to read. If you have an idea on how we can restrict this behaviour then that would be appreciated. I noticed that you've been looking into the graph structure with changes you made yesterday, so maybe there's something in that area you may already be familiar with. On that note, following your changes the graph creation now seems to be nesting subresources within the resource in the distribution block, whereas ideally all resources and subresources should sit side-by-side at the same level. Again, if you have some idea of how to get this behaviour that would be good. Finally, I included a couple of example output files.
|
@Reikyo thanks Forgot to mention that now subresources make sense, thanks for your explanation. It is true that is a bit cumbersome to deal with them with the standard CKAN support (which just takes resources into account), but it's the best we can do now. See below for more comments:
I'll defer to your judgement here as you know the spec better, we can always change the serialization in future versions
I know, although the fix is easy I found that annoying and created an issue in the croissant repo for it a few days ago :) mlcommons/croissant#804
I agree that the compact form is much nicer but I don't think there's any workaround, that just seems to be how JSON-LD works if the context defines a default I'm happy to either leave it as is with the
I was trying to use framing to get a nicer output where instead of a list of entities under a It's still a valid graph but it's not intuitive for humans so if I can't figure out how to keep resources and sub-resources side by side I'll revert it
That's great. The example Croissant outputs should go to the directory diff --git a/ckanext/dcat/tests/profiles/croissant/test_validate.py b/ckanext/dcat/tests/profiles/croissant/test_validate.py
index e839f29..e1df0a2 100644
--- a/ckanext/dcat/tests/profiles/croissant/test_validate.py
+++ b/ckanext/dcat/tests/profiles/croissant/test_validate.py
@@ -30,6 +30,8 @@ def test_valid_output():
croissant_dict = json.loads(
s.g.serialize(format="json-ld", auto_compact=True, context=JSONLD_CONTEXT)
)
+ with open("graph.jsonld", "w") as f:
+ f.write(json.dumps(croissant_dict))
try:
mlc.Dataset(croissant_dict) Once we settle on all the serialization issues we can save the final example. Thanks for your work on this, we are really close |
@Reikyo I've created #339 to get everything ready to merge this feature and consolidate all changes in a single PR (it includes all your recent changes in this PR) Have a look at the docs and let me know if you are missing something: https://docs.ckan.org/projects/ckanext-dcat/en/croissant/croissant/ |
Closing in favour of #339 |
No description provided.