GSoC 2025: Investigating Schema Normalization #857

Julian · 2025-01-16T15:58:20Z

Brief Description
JSON Schema is a rich language for expressing constraints on JSON data. If we strictly consider JSON Schema validation (rather than any other use of JSON Schema), in many cases there are multiple ways to express the same constraints. For example, the schema:

{
  "oneOf": [
    {"const": "foo"},
    {"const": "bar"}
  ]
}

will have the same validation outcome on all instances as the schema:

{"enum": ["foo", "bar"]}

One might say that the second schema is "better" than the first, in a way that could be made precise.
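
A rewrite of this kind could be sketched in Python roughly as follows. The names `json_equal` and `one_of_consts_to_enum` are hypothetical and not taken from any existing tool. One detail worth noting: the rewrite is only safe when the `const` values are pairwise distinct, since a repeated value would match two `oneOf` branches (and so be rejected) while `enum` would accept it.

```python
def json_equal(a, b):
    """Compare two JSON values, keeping booleans distinct from numbers."""
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b  # 1 and 1.0 are the same JSON number
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(json_equal(x, y) for x, y in zip(a, b))
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(json_equal(a[k], b[k]) for k in a)
    return type(a) is type(b) and a == b


def one_of_consts_to_enum(schema):
    """Rewrite {"oneOf": [{"const": v}, ...]} to {"enum": [v, ...]} when safe.

    Returns the schema unchanged whenever the rule does not apply.
    """
    if not isinstance(schema, dict):
        return schema  # boolean schemas
    branches = schema.get("oneOf")
    if not isinstance(branches, list) or not branches:
        return schema
    # Every branch must be exactly {"const": value}, with no other keywords.
    if not all(isinstance(b, dict) and set(b) == {"const"} for b in branches):
        return schema
    values = [b["const"] for b in branches]
    # A value appearing twice would match two branches and fail oneOf.
    for i, v in enumerate(values):
        if any(json_equal(v, w) for w in values[:i]):
            return schema
    if "enum" in schema:
        return schema  # merging with an existing enum is out of scope here
    rest = {k: v for k, v in schema.items() if k != "oneOf"}
    return {**rest, "enum": values}


print(one_of_consts_to_enum({"oneOf": [{"const": "foo"}, {"const": "bar"}]}))
```

Returning the input unchanged when a precondition fails keeps the rule conservative: it never produces a schema that is not equivalent to its input, at the cost of missing some rewrites.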

The same is true for the schemas {"required": ["foo"]} and {"title": "My Schema", "required": ["foo"]}, and one might say the first one is "better" than the second for the purpose of validation.
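
As a minimal sketch of this second kind of rule, the snippet below drops annotation-only keywords from the top level of a schema. The keyword list is an assumption (the meta-data keywords of draft 2020-12 plus `$comment`), and `strip_annotations` is a hypothetical name. It deliberately does not recurse: a naive recursive walk would, for example, wrongly delete a property named `title` inside `properties`, or touch values inside `const` and `enum`.

```python
# Keywords assumed to annotate without ever affecting validation outcomes.
ANNOTATION_KEYWORDS = frozenset({
    "title", "description", "default", "examples",
    "deprecated", "readOnly", "writeOnly", "$comment",
})


def strip_annotations(schema):
    """Drop annotation keywords from the top level of a schema."""
    if not isinstance(schema, dict):
        return schema  # boolean schemas have nothing to strip
    return {k: v for k, v in schema.items() if k not in ANNOTATION_KEYWORDS}


print(strip_annotations({"title": "My Schema", "required": ["foo"]}))
```

A fuller version would need to know which keywords contain subschemas in order to recurse safely.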

We can define two schemas to be "equivalent" if any instance is valid under one if and only if it is valid under the other. Given two equivalent schemas S and S', we might wish to define an algorithm which transforms both of them into a single form which is "canonical" or "normal", such as the preferred forms above.
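
To make the definition concrete, here is a rough sketch which checks whether two schemas agree on a finite sample of instances. It uses a toy evaluator, `is_valid`, which is hypothetical and understands only `const`, `enum`, `required` and `oneOf` (every other keyword is ignored, as an annotation would be). Agreement on a sample can refute equivalence but never prove it, since equivalence quantifies over all instances.

```python
def json_equal(a, b):
    """Compare two JSON values, keeping booleans distinct from numbers."""
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(json_equal(x, y) for x, y in zip(a, b))
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(json_equal(a[k], b[k]) for k in a)
    return type(a) is type(b) and a == b


def is_valid(schema, instance):
    """Validate against a tiny, illustrative subset of JSON Schema."""
    if isinstance(schema, bool):
        return schema
    if "const" in schema and not json_equal(instance, schema["const"]):
        return False
    if "enum" in schema and not any(json_equal(instance, v) for v in schema["enum"]):
        return False
    # "required" only constrains objects; other instances pass it.
    if "required" in schema and isinstance(instance, dict):
        if any(name not in instance for name in schema["required"]):
            return False
    # "oneOf" demands that exactly one subschema accepts the instance.
    if "oneOf" in schema:
        if sum(is_valid(s, instance) for s in schema["oneOf"]) != 1:
            return False
    return True


def agree_on(schema_a, schema_b, instances):
    """True if both schemas give the same outcome on every sampled instance."""
    return all(is_valid(schema_a, i) == is_valid(schema_b, i) for i in instances)


SAMPLE = ["foo", "bar", "baz", 1, None, {}, {"foo": 1}, []]
one_of = {"oneOf": [{"const": "foo"}, {"const": "bar"}]}
print(agree_on(one_of, {"enum": ["foo", "bar"]}, SAMPLE))
```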

There are existing attempts to do this for various use cases, but no central place where a self-contained set of normalization rules is written down, and no self-contained tool which performs the procedure. Let's try and write a simple one!

Expected Outcomes

Investigate the existing implementations of normalization in the wild. There are at least two known ones, one being here.
Define a set of normalization rules, with configurability for cases where there are multiple reasonable canonical forms
Define a set of test cases for schemas which are equivalent under these rules, and for the target canonical form for each set of schemas
Write a Python library which performs the normalization and emits the normalized schema
Empirically test our normalization procedure by running normalized schemas through Bowtie and comparing whether a given implementation returns the same results
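
To illustrate what "configurability" and a test case might look like, here is a small sketch under invented names. `{"enum": ["foo"]}` and `{"const": "foo"}` are equivalent, and either could reasonably be the canonical form, so the choice is exposed as an option. `Options`, `normalize_singleton` and the `CASE` layout are all hypothetical and do not describe an agreed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Options:
    # Which of two equivalent spellings to treat as canonical.
    prefer_const: bool = True


def normalize_singleton(schema, options=Options()):
    """Rewrite between {"enum": [v]} and {"const": v} per the options."""
    if not isinstance(schema, dict):
        return schema
    if options.prefer_const:
        enum = schema.get("enum")
        if isinstance(enum, list) and len(enum) == 1 and "const" not in schema:
            rest = {k: v for k, v in schema.items() if k != "enum"}
            return {**rest, "const": enum[0]}
    elif "const" in schema and "enum" not in schema:
        rest = {k: v for k, v in schema.items() if k != "const"}
        return {**rest, "enum": [schema["const"]]}
    return schema


# One possible test case shape: a class of equivalent schemas together with
# the canonical form they should all normalize to under the default options.
CASE = {
    "equivalent": [{"enum": ["foo"]}, {"const": "foo"}],
    "canonical": {"const": "foo"},
}

for candidate in CASE["equivalent"]:
    assert normalize_singleton(candidate) == CASE["canonical"]
```

Storing each equivalence class alongside its target form lets the same data drive both the unit tests for the library and the empirical comparison across implementations.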

Skills Required

An existing understanding of JSON Schema's keywords, which can be used to think about areas which might create possible "denormalization" (e.g. keywords which when used together overlap)
Familiarity writing Python, and ideally using JSON Schema from Python
Experience testing pieces of software by writing test cases, here likely in the form of writing JSON Schema + instance examples
Careful diligence in reading and understanding the existing procedures used (in the link above, as well as in a number of JSON Schema journal articles) and the ability to compare these previous approaches with one another

Mentors
@Julian

Expected Difficulty
medium

Expected Time Commitment
175 hours


jviotti · 2025-01-16T16:10:37Z

I love this, and I think it has some interesting overlap with my linting proposal: #856. The things that should be normalised can make very good linting rules that we can aim to auto-fix for schemas. If I can help in any way, please count me in.

Honyii · 2025-01-18T13:14:46Z

Thank you for your submission Julian

benjagm added the gsoc Google Summer of Code Project Idea label Jan 17, 2025
