Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include isArray and arrayShape #807

Merged
merged 23 commits into from
Feb 19, 2025
Merged

Include isArray and arrayShape #807

merged 23 commits into from
Feb 19, 2025

Conversation

ccl-core
Copy link
Contributor

@ccl-core ccl-core commented Feb 12, 2025

Addresses: #649

Update: the final implementation uses:

{
          "@type": "cr:Field",
          "@id": "recordset/fieldname",
          "isArray": true,
          "arrayShape": "1,2,1"
       ...
}

======

Note that this is a bit different than the original proposal in #649 :

{
          "@type": "cr:Field",
          "@id": "recordset/fieldname",
          "isArray": true,
          "arrayShape": [
            1,
            2,
            1
          ]
       ...
}

This is a bit difficult for us to process using mlcroissant... When we create the graph from the jsonld data, we would loose the order information; moreover, as we create a triple for each element in the arrayShape, and triples need to be unique, we would loose dimensions which have the same size.

As an alternative I propose to use a @list, so we keep the order info more nicely and we don't loose data:

{
          "@type": "cr:Field",
          "@id": "recordset/fieldname",
          "isArray": true,
          "arrayShape": {
            "@list": [
              1,
              2,
              1
            ]
          },
       ...
}

WDYT?

In this PR:

Still TODO: Remove repeated from all datasets and use isArray and arrayShape instead.

@ccl-core ccl-core requested a review from a team as a code owner February 12, 2025 13:01
Copy link

github-actions bot commented Feb 12, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@@ -106,6 +113,11 @@
"name": "record_set_plain_text/split",
"description": "Split to which the example belongs to.",
"dataType": "sc:Text",
"references": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: references coming after source was more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree... I don't know why it got changed, maybe it started to order those alphabetically? I could change it back manually, shall I do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An array of strings used to control the order. We shouldn't do it manually but programmatically. But this is not prio at all - so we can probably let it like this if we don't have time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

@benjelloun
Copy link
Contributor

As we discussed offline, arrayShape can be represented more compactly as a string that contains space separated dimensions, e.g., "1 2 -1 3", where -1 corresponds to a dimension where the length is not specified. Even though this approach is encoding structured information in a string, it's unambiguous and easy to parse, so I think it's acceptable. We already do accept such input in other places, e.g., for bounding boxes.

@ccl-core ccl-core requested a review from marcenacp February 18, 2025 08:26
@ccl-core
Copy link
Contributor Author

ccl-core commented Feb 18, 2025

@marcenacp WDYT if I add a "LATEST_VERSION" attr to CroissantVersion and is_latest_version method to Context in /usr/local/google/home/stanzelc/croissant/python/mlcroissant/mlcroissant/_src/core/context.py ? Then we could add the new elements to the context in rdf.py, include 1.1 to the versions we use for testing in /usr/local/google/home/stanzelc/croissant/python/mlcroissant/mlcroissant/_src/tests/versions.py

See example in cb40bea

@ccl-core ccl-core requested a review from marcenacp February 18, 2025 12:52
Copy link
Contributor

@marcenacp marcenacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Just a few nits.

@ccl-core ccl-core merged commit 284d5a0 into main Feb 19, 2025
12 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Feb 19, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants