GSoC 2025: Investigating Schema Normalization #857

Open
Julian opened this issue Jan 16, 2025 · 2 comments
Labels
gsoc Google Summer of Code Project Idea

Comments

@Julian
Member

Julian commented Jan 16, 2025

Brief Description
JSON Schema is a rich language for expressing constraints on JSON data. If we strictly consider JSON Schema validation (rather than any other use of JSON Schema), in many cases there are multiple ways to express the same constraints. For example, the schema:

{
  "oneOf": [
    {"const": "foo"},
    {"const": "bar"}
  ]
}

will have the same validation outcome on all instances as the schema:

{"enum": ["foo", "bar"]}

One might say that the second schema is "better" than the first, in a way that could be made precise.

The same is true for the schemas {"required": ["foo"]} and {"title": "My Schema", "required": ["foo"]}, and one might say the first one is "better" than the second for the purpose of validation.
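To make the two examples above concrete, here is a minimal Python sketch of what such rewrite rules might look like. All names in it (`normalize`, `one_of_consts_to_enum`, `drop_annotations`, `json_equal`) are hypothetical and not taken from any existing tool, the list of annotation keywords is an assumption, and the rules only look at the top level of a schema. Note that the `oneOf` rewrite is only safe when the `const` values are pairwise distinct, since `oneOf` rejects an instance that matches more than one branch.

```python
# Hypothetical sketch: two validation-preserving rewrites, top level only.

# Assumed set of keywords that carry no validation behaviour.
ANNOTATIONS = {"title", "description", "default", "examples",
               "deprecated", "readOnly", "writeOnly", "$comment"}


def json_equal(a, b):
    """JSON-style equality: booleans are not numbers, and 1 equals 1.0."""
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(json_equal(x, y) for x, y in zip(a, b))
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(json_equal(a[k], b[k]) for k in a)
    return type(a) is type(b) and a == b


def drop_annotations(schema):
    """Remove top-level annotation keywords; boolean schemas pass through."""
    if not isinstance(schema, dict):
        return schema
    return {k: v for k, v in schema.items() if k not in ANNOTATIONS}


def one_of_consts_to_enum(schema):
    """Rewrite oneOf-of-const into enum when that is known to be safe."""
    if not isinstance(schema, dict) or "enum" in schema:
        return schema
    branches = schema.get("oneOf")
    if not isinstance(branches, list) or not branches:
        return schema
    # Every branch must be exactly {"const": value}, nothing else.
    if not all(isinstance(b, dict) and set(b) == {"const"} for b in branches):
        return schema
    values = [b["const"] for b in branches]
    # Duplicate values would match two branches, which oneOf rejects.
    distinct = all(not json_equal(x, y)
                   for i, x in enumerate(values) for y in values[i + 1:])
    if not distinct:
        return schema
    rest = {k: v for k, v in schema.items() if k != "oneOf"}
    return {**rest, "enum": values}


def normalize(schema):
    return one_of_consts_to_enum(drop_annotations(schema))


print(normalize({"oneOf": [{"const": "foo"}, {"const": "bar"}]}))
print(normalize({"title": "My Schema", "required": ["foo"]}))
```

A real rule set would also need to recurse into subschemas, taking care not to confuse keyword names with property names under `properties`.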

We can define two schemas to be "equivalent" if any instance is valid under one if and only if it is valid under the other. Given two equivalent schemas S and S', we might wish to define an algorithm that transforms both of them into a single "canonical" or "normal" form, as in the examples above.
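The definition of equivalence quantifies over all instances, so it cannot be checked exhaustively, but it can be probed on a sample. The following is a rough sketch of that idea, using a deliberately tiny validator that only understands `const`, `enum`, `required` and `oneOf`. The names `is_valid` and `disagreements` are made up for illustration, and the equality check is a simplification that treats `1` and `1.0` as different values. A sample with no disagreements does not prove equivalence; a single disagreement does refute it.

```python
import json


def _same(a, b):
    # Simplified equality via canonical JSON text (1 and 1.0 differ here).
    return json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)


def is_valid(schema, instance):
    """Toy validator for a handful of keywords; unknown keywords are ignored."""
    if isinstance(schema, bool):
        return schema
    ok = True
    if "const" in schema:
        ok = ok and _same(schema["const"], instance)
    if "enum" in schema:
        ok = ok and any(_same(v, instance) for v in schema["enum"])
    if "required" in schema and isinstance(instance, dict):
        ok = ok and all(name in instance for name in schema["required"])
    if "oneOf" in schema:
        # Exactly one branch must accept the instance.
        ok = ok and sum(is_valid(s, instance) for s in schema["oneOf"]) == 1
    return ok


def disagreements(s1, s2, instances):
    """Return the sampled instances on which the two schemas disagree."""
    return [i for i in instances if is_valid(s1, i) != is_valid(s2, i)]


sample = ["foo", "bar", "baz", 1, None, {}, {"foo": 1}]
one_of = {"oneOf": [{"const": "foo"}, {"const": "bar"}]}
enum = {"enum": ["foo", "bar"]}
print(disagreements(one_of, enum, sample))
```

In the project itself, the toy validator would presumably be replaced by real implementations, which is where running schemas through Bowtie comes in.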

There are existing attempts to do this for various use cases, but there is no central place where a self-contained set of normalization rules is written down, and no self-contained tool that performs the procedure. Let's try and write a simple one!

Expected Outcomes

  • Investigate the existing implementations of normalization in the wild. There are at least two known ones, one being here.
  • Define a set of normalization rules, with configurability for cases where there are multiple reasonable canonical forms
  • Define a set of test cases for schemas which are equivalent under these rules, and for the target canonical form for each set of schemas
  • Write a Python library which performs the normalization and emits the normalized schema
  • Empirically test our normalization procedure by running normalized schemas through Bowtie and comparing whether a given implementation returns the same results
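One possible shape for the test cases and the configurability mentioned in the list above is sketched below. It is only an assumption about how things could be laid out: `NormalizationCase`, `toy_normalize`, `failures` and the `prefer_const` option are all invented for this example. The choice between `{"const": "foo"}` and `{"enum": ["foo"]}` is used as an instance of two reasonable canonical forms for the same constraint.

```python
from dataclasses import dataclass, field


@dataclass
class NormalizationCase:
    """A group of schemas expected to normalize to one canonical form."""
    description: str
    schemas: list                 # schemas assumed equivalent to each other
    canonical: object             # the expected normal form
    options: dict = field(default_factory=dict)  # normalizer configuration


def toy_normalize(schema, prefer_const=True):
    """Stand-in normalizer: only handles single-value enum versus const."""
    if not isinstance(schema, dict):
        return schema
    out = dict(schema)  # shallow copy, the input is left untouched
    if prefer_const:
        values = out.get("enum")
        if "const" not in out and isinstance(values, list) and len(values) == 1:
            del out["enum"]
            out["const"] = values[0]
    elif "const" in out and "enum" not in out:
        out["enum"] = [out.pop("const")]
    return out


def failures(case, normalize=toy_normalize):
    """Return the schemas in the case that do not reach the canonical form."""
    return [s for s in case.schemas
            if normalize(s, **case.options) != case.canonical]


cases = [
    NormalizationCase(
        description="single-value enum, const preferred",
        schemas=[{"enum": ["foo"]}, {"const": "foo"}],
        canonical={"const": "foo"},
    ),
    NormalizationCase(
        description="single-value enum, enum preferred",
        schemas=[{"enum": ["foo"]}, {"const": "foo"}],
        canonical={"enum": ["foo"]},
        options={"prefer_const": False},
    ),
]
for case in cases:
    print(case.description, failures(case))
```

Keeping each case as plain data (a list of schemas plus one target) would also make it easy to store the cases as JSON files, independent of the Python library.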

Skills Required

  • An existing understanding of JSON Schema's keywords, which can be used to think about areas which might create possible "denormalization" (e.g. keywords which when used together overlap)
  • Familiarity writing Python, and ideally using JSON Schema from Python
  • Experience testing pieces of software by writing test cases, here likely in the form of writing JSON Schema + instance examples
  • Careful diligence in reading and understanding the existing procedures (in the link above, as well as in a number of JSON Schema journal articles) and the ability to compare these pieces of previous work with one another

Mentors
@Julian

Expected Difficulty
medium

Expected Time Commitment
175 hours

@jviotti
Member

jviotti commented Jan 16, 2025

I love this, and I think it has some interesting overlap with my linting proposal: #856. The things that should be normalised can make very good linting rules that we can aim to auto-fix for schemas. If I can help in any way, please count me in.

@benjagm benjagm added the gsoc Google Summer of Code Project Idea label Jan 17, 2025
@Honyii
Contributor

Honyii commented Jan 18, 2025

Thank you for your submission, Julian.
