
Which parties carry what costs of text/turtle changes, and do those outweigh which benefits for whom? #141

Open
RubenVerborgh opened this issue Jan 14, 2025 · 41 comments


@RubenVerborgh
Member

RubenVerborgh commented Jan 14, 2025

Summary

In rdfjs/N3.js#484, I learned that the specifications intend to redefine the set of valid documents under the text/turtle media type (and presumably others).

Such a change might not be possible/desired, or should at least be acknowledged as a breaking change, with a resulting cost/benefit analysis.

Definitions

  • text/turtle as the media type defined by https://www.w3.org/TR/turtle/
  • valid-turtle as the (infinite) set of valid Turtle 1.1 documents
  • invalid-turtle as the (infinite) set of documents that are not in valid-turtle
  • spec-compliant Turtle parser as a piece of software that:
    • for each document in valid-turtle, produces the corresponding set of triples
    • for each document in invalid-turtle, rejects it (possibly with details on the syntax error)

Note here that the above definition includes rejection; the 1.1 specification text does not include it, but its test cases do.

Potential problems

  1. Retroactively changing the definition of text/turtle breaks existing spec-compliant Turtle parsers, as they will incorrectly label valid text/turtle documents as invalid.
  2. There is no way to distinguish Turtle 1.1 from Turtle 1.2.
  • While 1 could be argued away as "1.1 parsers only break on 1.2 Turtle", it's a problem that the parser will not be able to tell you why it breaks. Does it break because it's invalid Turtle 1.1? Does it break because it's valid Turtle 1.2? Does it break because it's invalid Turtle 1.2, despite this document intending to be within the 1.1 subset? i.e., should or shouldn't it have worked with this particular text/turtle document and no other context?
  3. Building on 2, neither new nor old parsers will be able to fully automatically validate Turtle documents, since they need to be told out of band whether to validate for 1.1 or 1.2.
  4. Because of the closed-set nature of text/turtle in the Turtle 1.1 spec, any changes to that set (whether deletions or additions) would contradict the Turtle 1.1 spec itself / make it invalid.
  5. The problem will happen again in RDF 1.3.
  6. As a more specific instance of 5, there is no standards-based way for clients or servers to indicate they only support Turtle 1.1, nor to discover whether recipients support Turtle 1.1 or 1.2 (or 1.3), as Accept: text/turtle does not tell them. Nor does Content-Type: text/turtle tell them whether their parser can handle the contents, and we could be 20 gigabytes in before we notice it doesn't.

Analysis

Unlike formats like HTML, Turtle 1.1 does not contain provisions for upgrading. The specification assumes a closed set of valid documents. We find further evidence in a number of bad test cases (https://www.w3.org/2013/TurtleTests/), which explicitly consider more permissive parsers to be non-compliant.

There is a note in the spec (but only a note, and thus explicitly non-normative):

This specification does not define how Turtle parsers handle non-conforming input documents.

but this non-normative statement is contradicted by the bad test cases, which parsers need to reject in order to produce a compliant report.

Although the considered changes for 1.2 are presumably not in contradiction with those bad cases, the test suite was not designed to be exhaustive. Rather, the 1.1 specification considers text/turtle to be a closed set, and the test cases consider a handful of examples to verify the set is indeed closed.

In particular, no extension points were left open, and this was on purpose.
Therefore, the 1.1 spec is not only defining “Turtle 1.1”, but also strictly finalizing text/turtle.

(The IANA submission's reservation that "The W3C reserves change control over this specifications [sic]." does not change the above arguments.)

Potential solutions

A set of non-mutually exclusive solutions, which each cover part or all of the problem space:

  1. Factual disagreements with the above.

  2. The introduction of a new media type.

  3. The introduction of a new profile on top of the existing text/turtle media type.

  4. A change to the Turtle 1.1 spec that adds extension points or otherwise opens the set of text/turtle.

  5. Syntactical support in Turtle 1.2 for extension and/or versioning.
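As a rough illustration of what options 2 and 3 could look like on the wire (the media type name and profile URI below are invented for the example, not proposals):

Accept: text/turtle-1.2
Accept: text/turtle;profile="https://example.org/profiles/turtle-1.2"

Option 5 would instead carry the indication in-band, e.g. as a first-line directive; see the examples further down the thread.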

@afs
Contributor

afs commented Jan 14, 2025

Thank you for the analysis.

We do have https://www.w3.org/TR/rdf12-turtle/#changes-12 and the matter will be in "RDF 1.2 New".

The WG is discussing levels of conformance.

@afs
Contributor

afs commented Jan 14, 2025

Related:
There are links to specific versions of documents for both RDF and SPARQL:

https://www.w3.org/TR/rdf12-turtle/ -- currently, the 1.2 working draft. This will be the REC when published. Title "RDF 1.2 Turtle".
https://www.w3.org/TR/rdf11-turtle/ -- The RDF 1.1 published standard. Title "RDF 1.1 Turtle".
https://www.w3.org/TR/rdf-turtle/ -- Tracks the latest publication. Currently, 1.1.
https://www.w3.org/TR/turtle/ -- old name, tracks "rdf-turtle".

@RubenVerborgh
Member Author

The WG is discussing levels of conformance.

Interesting; may I suggest a standards-based mechanism for agents to indicate this level? (A media type or profile comes to mind.)

Or would “classic conformance” de facto be the same as effectively only parsing the RDF 1.1 subset (in which case it would be equivalent to one of the points above)? However, this does not seem to be the case, with for instance base directions being added to literals (in which case “classic” might be a confusing/misleading term).

@afs
Contributor

afs commented Jan 15, 2025

[This is not a WG response]

Any approach for versioning can have costs on both the reader-side and the writer-side.

For example, anything in the HTTP header that makes the data consumer's task easier puts a requirement on the data producer. In the same way that RDF 1.2 syntax can appear a long way into the delivered stream for a reader, putting the information in an HTTP header makes the writer's life harder, because the writer may need to see all the data first -- no stream writing without recording which version is in the data, which would also be a producer-side burden.
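As a sketch of that producer-side constraint (the header name and the API below are made up for illustration, not a proposal):

def serialize(triple):
    # Placeholder line-based serializer.
    s, p, o = triple
    return f"{s} {p} {o} .\n"

def stream_response(triples, send_headers, send_chunk):
    # Headers are committed before the first byte of the body is sent...
    send_headers({"Content-Type": "text/turtle"})
    for t in triples:
        # ...so if a 1.2-only construct first appears here, it is too late
        # to have added a hypothetical "RDF-Version: 1.2" header above.
        # The alternatives are buffering the whole stream, or always
        # declaring the most capable version.
        send_chunk(serialize(t))

chunks = []
stream_response([("ex:s", "ex:p", "ex:o")], lambda h: None, chunks.append)
print("".join(chunks))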

One way to publish data is using a web server's support for mapping file extensions to Content-Type headers -- .htaccess (httpd) or types{} (nginx) etc. The same situation appears with data dumps in archives such as zip.

Today, a toolkit may need to "know" to look at the URL to get the content type if no realistic Content-Type header is available.

Given the file extension situation, I think any solution will not help RDF that much. Software will want to handle the static/non-profile/file-extension/... cases anyway. Only a domain-specific (i.e. consumer and producer) deployment can be sure the global rules are in play.

There is a trade-off of whether the long term continued provision of a migration solution is a greater burden than the evolution itself. Such migration should never be withdrawn -- "The web is not versioned".

@RubenVerborgh
Member Author

Thanks, @afs. I want to leave space for others so will be brief, but quickly:

  • Not explicitly indicating feature/version/… support also incurs costs.
  • Your answer covers the case where such indications happen in the message headers; different arguments and trade-offs apply when they happen in the body. As a quick example, a first-line @version 1.2 or @features literal-direction directive (sketched below) would cause a desired fail-fast on 1.1 parsers, and assist 1.2 and future parsers.
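For illustration, such a document could start as follows (the directive names here are hypothetical sketches, not part of any spec):

@version 1.2 .
@prefix ex: <http://example.org/> .
ex:greeting ex:text "שלום"@he--rtl .

A 1.1 parser would reject the unknown first-line directive immediately, instead of failing 20 gigabytes in (or worse, silently misreading the document).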

@niklasl

niklasl commented Jan 16, 2025

Maybe an optional version or feature declaration, to support fail-fast detection? With the implicit being "latest REC". It should perhaps be clearly stated that implementations are required to follow the evolution of the format; with the reciprocal requirement of evolving the format responsibly, aspiring to standardize once "sufficient" implementation coverage has been established. AFAIK, there is a requirement of multiple independent implementations; perhaps that number should be a function of the "cardinality of known deployments" and "how viable it is to upgrade them"? (I know it is a practical impossibility to quantify that on the web scale, but it goes to show awareness of the complexity underlying these judgement calls. And that we (W3C members) have a responsibility to care and cater for cooperative evolution to ensure web interop.)

I think this follows the conventions @afs referenced, which is a trade-off I'm cautiously in agreement with. Defining a new format (mime-type + suffix) is the only other viable option AFAICS; and while that caters for more overlap in deployments, it also induces a certain inertia and growing technical debt. (When is the previous format "sunset"? How is the data quality impacted during the overlap period? How do applications take the difference in expressivity into account?)

I see no practical way around some form of social contracts, as even content negotiation is not merely technical (q=0.9 ...). The most important contract is for publishers to avoid utilizing new features until their consumers have been notified and been able to upgrade; balanced with the need for precision in the domain of discourse among those who already have (we form a web after all).

@RubenVerborgh
Member Author

With the implicit being "latest REC"

There is a trade-off of whether the long term continued provision of a migration solution is a greater burden than the evolution itself.

"The web is not versioned".

The key difference being that—for example—HTTP, HTML, and CSS have explicit behaviors on how to deal with unsupported constructs. HTTP proxies have rules on how to deal with unknown headers, HTTP has version negotiation, HTML has rules for unknown tags and attributes, CSS has rules for unsupported properties and even syntax.

So the Web's ability to be non-versioned is baked into the design of those technologies. Conversely, RDF adopting the non-versioned philosophy does not equate to doing nothing on the feature support/versioning front, but rather being very explicit about how non-versioning is to be made possible.

In summary, not doing anything puts us on neither a versioned nor a non-versioned trajectory. They are not binary opposites; there is a third option, “incompatible with both versioning and non-versioning”, which is the unfortunate default choice.

@lisp

lisp commented Jan 16, 2025

to take a concrete example as precedent, i do not recall that, in the transition from sparql 1.0 to sparql 1.1, the continued use of the same media type designators was problematic.

in what sense, other than the concern about "late failure" for large documents, should that matter for document media types?
the notion that 1.2 documents would be marked may seem attractive, but failing early would still require a change to import control flow.
and that, in a situation where the inability to modify deployed 1.0 resources is central to the problem.

@RubenVerborgh
Member Author

to take a concrete example as precedent, i do not recall that, in the transition from sparql 1.0 to sparql 1.1, the continued use of the same media type designators was problematic.

Apples and oranges.
SPARQL is not a data language, nor is it problem-free.
The context of a query language is very different, including:

  • limited average and typical document length
  • different consequence of failure, with immediate and specific feedback
    • failure is in fact sometimes triggered deliberately for endpoint feature discovery
  • absence of streaming parsing
  • different reuse context: individual queries tend to be sent to specific endpoints

So the upgrade path of SPARQL is much more similar to that of SQL, with similar challenges and non-issues.
Not comparable to that of HTTP, HTML, CSS, RDF.

And quite a pain in practice: one typically needs to know out-of-band what precise SPARQL endpoint software an interface is running, which determines how well certain SPARQL 1.0 or 1.1 features are supported.

In contrast, at least today, text/turtle has been 100% unambiguous since the introduction of the spec.
If anything, let's not go the SPARQL route.

in what sense […] should that matter for document media types?

RDF is about enabling interoperability. Yes, on the semantic level, but not having interoperability on the syntactical level precludes that.

In the pre-1.1 days, “Turtle” had been around as a format for over a decade, and parsers were incompatible with each other. It was quite the nightmare, trying to exchange data or write parsers. There was no established (let alone standard) way of knowing what subset was supported by everyone. The Turtle standard solved this by bringing certainty about what is and isn't text/turtle.

The proposed re-definition of text/turtle without any explicit indication sends us back on a path where parsers may or may not be compatible with a certain Turtle version, and they can't even tell us. We cannot ask servers or clients. We have to know what software they are running. Not exactly the automated interoperability goal.

other than the concern about "late failure" for large documents

One might not even know. One could've parsed a 1.2 document wrongly without ever knowing. One could've rejected or accepted a document based on the wrong assumption (because assumptions are all you have, in band). One doesn't know if downstream systems are compatible with 1.1 or 1.2, because they can't tell.

It's an absolute interoperability nightmare that systems don't even have the words to express what they do and do not support. In a context where we're advocating for semantic interoperability, failing at syntactic interoperability is a serious flaw from a technical and strategic perspective. It adds a serious degree of brittleness, the details of which only a small group of people understand, which carries a major risk of reflecting badly on RDF as a whole for not being a sustainable—let alone interoperable—technology. People will say that RDF doesn't work reliably across systems, and they will be right.

@lisp

lisp commented Jan 16, 2025

SPARQL is not a data language, ...

that may be, unless one is concerned with sparql processors.

@lisp

lisp commented Jan 16, 2025

RDF is about enabling interoperability. Yes, on the semantic level, but not having interoperability on the syntactical level precludes that.

we agree - vehemently.
as much as ambiguous recommendations are not the answer, neither is error signalling and handling.
would graph store protocol endpoint service descriptions provide sufficient information to the architectures which you envision, in order for them to more effectively control requests?

@Ostrzyciel

2 cents from someone who did implement a non-standard RDF format that has an analogue of Ruben's proposed @version 1.2 or @features literal-direction – it sounds like a nice idea, but implementing a serializer that would reliably set such flags is a pain. You essentially need to predict the future: "will this document need 1.2 features or not?" This may seem like a trivial question if we are dealing with a small piece of metadata on the Web, but is completely impossible if we have something like a database dump or any other long stream of data.

I ended up making the serializer always claim that all features are used by default. Then, it's up to the user to tell the serializer that "this and that" feature won't be needed. This creates an obvious compatibility problem, because parsers will simply refuse to read these files, even though in practice the feature may not be used. I have not found a better solution to this problem. I think this is a sensible compromise for my ugly format, but I would be against this in W3C formats. More details here.
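A minimal sketch of that compromise (hypothetical API, not the actual format's interface):

KNOWN_FEATURES = {"triple-terms", "literal-direction"}

class StreamSerializer:
    def __init__(self, disabled_features=()):
        # Predicting upfront whether a long stream will need a feature
        # is impossible, so by default every known feature is claimed.
        self.features = KNOWN_FEATURES - set(disabled_features)

    def header_line(self):
        # Written before any data; consumers reject the file if they
        # do not support all claimed features, even if none are used.
        return "features: " + ", ".join(sorted(self.features))

print(StreamSerializer().header_line())
print(StreamSerializer(disabled_features=["triple-terms"]).header_line())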

Overall, I think a sensible solution would be to embrace the mess and just live with the fact that RDF formats can evolve. I would also like to ask the WG to kindly consider producing some "best practices" for how to mark that an RDF file is 1.2, in a use case-specific manner. I like the suggestion from @lisp for adding some info in graph store protocol descriptions. I'm also curious if something like a non-mandatory HTTP header would be an option. Or maybe a comment at the start of the file (like a shebang in .sh files) – of course, entirely optional. (disclaimer: I did not think these ideas through, they may be VERY bad)

@HolgerKnublauch

Intuitively to me it sounds like TTL documents that use any of the new features need a new media type and file ending.

@lisp

lisp commented Jan 16, 2025

I'm also curious if something like a non-mandatory HTTP header would be an option.

legacy software will not see them.
placing them such that the control flow of those components will have to be aware of them is not effective.
a service description, based on which a higher-level process can orchestrate operations would be much more effective.

@namedgraph

namedgraph commented Jan 16, 2025

Isn't the situation with Turtle 1.1 and Turtle 1.2 a bit like Turtle and TriG? In both cases the former syntax is a subset of the latter.
With Turtle and TriG we got distinct media types (text/turtle and application/trig to be exact). Why shouldn't the same apply to Turtle 1.2?

@dr0i

dr0i commented Jan 16, 2025

Consuming data which is suddenly turtle 1.2 (coming with the unchanged MIME type text/turtle) that breaks my formerly working turtle parser (say, a widely used library) is like an API break resulting in a non-working program.
So this is bad.
To avoid this, developers provide different versions of libraries over time, marking those with API breaks and those that should be compatible using semantic versioning.
It's unlikely that data deliverers would provide different turtle versions even if there were an HTTP header (or other mechanism) for that.
I ACK the problem, but tend to see it like @niklasl ("I see no practical way around some form of social contracts").
(BTW, even if we only change our data schema, not the RDF version, we call this out as a possible API break to our customers, as even this can break consumers' programs).

@coolharsh55
Copy link

Hi. My thoughts on this from a practicality perspective: I echo Ruben's argument that we should be aiming to support interoperability and backwards compatibility - especially when we know exactly how and why an existing system will break due to new changes. For Turtle, the mime type can be versioned - there is precedent for this if we look at existing mime types.

If we don't version the mime type, existing systems will break. They will need to be updated to support turtle 1.2. There is no way to distinguish between turtle 1.1 and turtle 1.2, so there is no way for them to silently fail or ignore turtle 1.2. There is also no way to fail with context, i.e. "failed because it doesn't handle turtle 1.2" - it will fail equally for valid turtle 1.2 and invalid turtle 1.1. So this is not a trivially fixable change. Not desirable IMO.

If we do version the mime type, existing systems will not break. If they have to support turtle 1.2, then they must change or be updated anyway, since turtle 1.2 requires updates, and hence there is an opportunity for these systems to add the mime type handling change alongside the turtle 1.2 handling changes. It might result in some extra work, and potentially some complex cases around mime type handling. However, we know for sure that existing systems won't break (assuming the mime type is used as intended here), and if they do get an incorrectly assigned mime type then the fix is to use the correct mime type. So this should be the desirable state.

This also brings up the question of what should happen when Turtle 1.3 eventually is required. Again versioning the mime type is an option, but pragmatically, having the version in the document itself is the best forwards-compatible solution and a known best practice. It would be ideal to have it here.

@rubensworks
Member

rubensworks commented Jan 17, 2025

Another important consideration to take into account here is the length of Accept headers when doing requests within a browser.

Long accept headers in browsers are problematic

The Fetch spec (CORS section) limits the value of each CORS-safelisted request header (including the Accept header) to 128 characters.
But even this limit is already causing issues in practice when just taking into account today's RDF media types for content negotiation.

As an example, the Comunica query engine uses the following Accept header by default, which contains 324 characters:

Accept: application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,application/json;q=0.135,text/shaclc;q=0.1,text/shaclc-ext;q=0.05

Hence, when we do these requests in a browser, we must truncate this Accept header to 128 characters, which causes some (valid) RDF media types to not even be requested from the server.
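A sketch of the kind of trimming this forces on clients, cutting at media-range boundaries rather than mid-token:

def trim_accept(header, limit=128):
    kept, length = [], 0
    for part in (p.strip() for p in header.split(",")):
        extra = len(part) + (1 if kept else 0)  # +1 for the comma
        if length + extra > limit:
            break
        kept.append(part)
        length += extra
    return ",".join(kept)

print(trim_accept("application/n-quads,application/trig;q=0.95,"
                  "application/ld+json;q=0.9,application/n-triples;q=0.8,"
                  "text/turtle;q=0.6,application/rdf+xml;q=0.5,"
                  "text/n3;q=0.35,application/xml;q=0.3"))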

New media types exacerbate this problem

As such, I believe introducing new media types for each RDF serialization in 1.2 is not the right way forward.
Because this would essentially halve the number of media types that can be requested from a server within a browser environment.

For example, the following (which contains some arbitrary new media types for 1.2) already reaches the limit according to CORS:

Accept: application/n-quads,application/n-quads-12,application/trig;q=0.95,application/trig-12;q=0.95,application/ld+json;q=0.9

And this problem would only get worse for every new RDF version:

Accept: application/n-quads,application/n-quads-12,application/n-quads-13,application/n-quads-14

Towards a solution

My initial thought when reading this issue was that profile-based negotiation could be a good solution,
but this is not very compatible with CORS either (a longer Accept header, or new headers that are not allowed by default under CORS).

From this perspective, my feeling is that new media types or profile-based negotiation are not the way to go, and that in-band solutions such as @version might be better (there is precedent for this in JSON-LD's @version).
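For reference, JSON-LD's in-band marker is set in the context; a 1.0 processor that encounters it is expected to abort with an error rather than silently misinterpret the document:

{
  "@context": {
    "@version": 1.1,
    "name": "http://schema.org/name"
  }
}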


Not only does this problem apply to RDF serialization, it also applies to SPARQL result serializations: SPARQL/JSON, SPARQL/XML, SPARQL/CSV, SPARQL/TSV.

@namedgraph

profile-based negotiation could be a good solution

Except that again, no established frameworks (e.g. JAX-RS implementations) support it.

@lisp

lisp commented Jan 17, 2025

Except that again, no established frameworks (e.g. JAX-RS implementations) support it.

which is why it is better to implement the logic which verifies availability of the required media type on a higher level.
as long as there are legacy applications, a client application framework will have to validate the service endpoint before the request is made.

@kasei
Contributor

kasei commented Jan 18, 2025

In contrast, at least today, text/turtle has been 100% unambiguous since the introduction of the spec.

While that's true for the spec version, I don't think the same can be said for the widespread use of the Team Submission that predates the spec. The same media type was in use for years before Turtle 1.1 was introduced and brought with it changes to the syntax. I'm not sure that's reason to do the same thing again, but this isn't the first time we've been faced with this issue.

@hvdsomp

hvdsomp commented Jan 19, 2025

Towards a solution

My initial thought when reading this issue was that profile-based negotiation could be a good solution, but this is not very compatible with CORS either (a longer Accept header, or new headers that are not allowed by default under CORS).

Just felt like pointing out that there’s also an IETF Internet Draft on profile based negotiation, of which @RubenVerborgh is co-author. It’s been in the works for quite a long time. There’s been renewed interest from the cultural heritage community and even from the W3C where some consider this a topic that falls in the IETF realm. See https://datatracker.ietf.org/doc/draft-svensson-profiled-representations/.

@RubenVerborgh
Member Author

widespread use of the Team Submission that predates the spec. […] I'm not sure that's reason to do the same thing again, but this isn't the first time we've been faced with this issue.

Technologically? Been there, done that.

Reputationally? Not so much.
Fifteen years ago, at least we could say: “All of this mess happens because Turtle isn't yet a standard.”
Without a solution, we'll have to say henceforth: “All of this mess happens because Turtle is a standard. Twice—so far.”

@TallTed
Member

TallTed commented Jan 23, 2025

@RubenVerborgh — You pointed to "Extended discussion at https://ruben.verborgh.org/articles/fine-grained-content-negotiation/"

— which included —

Particularly exciting is that multiple profiles can be combined in a single response, in contrast to the single-dimensional nature of MIME types.

First thing, your writing betrays a limited understanding of your topic, as you refer consistently to "MIME types", which are actually "media types", though they are used in a universe of MIME.

Next, I bear relatively recent scars of a years-long effort to convince IETF to follow their own documentation and work with a number of folks (including me) who wanted to extend media types by defining how to interpret multiple + therein. Part of the scarring came from IETF rejecting their own pre-existing profile extension, especially when the value(s) of profile are URIs, because there's a relatively SMALL character count beyond which those profile values are now to be considered malware(!).

In other words -- your "extended discussion" (which is really an extended monologue) has been overtaken by events, and is no longer (if it ever was) applicable.

@mielvds

mielvds commented Jan 29, 2025

@TallTed questioning Ruben V's credibility with silly 'tomayto, tomahto' comments is not relevant, nor helpful. That said, valid point about profile.

I haven't seen any syntax change on the Web without a change in the media type. While I find @rubensworks' concerns valid, I don't think there's a way around it. Clients will have to be more picky about what content type they request, and eventually either text/turtle will gradually disappear or parsers will have caught up and it won't matter anymore.

@Tpt

Tpt commented Jan 29, 2025

I haven't seen any syntax change on the Web without a change in the media type

I am not sure this is true. For example, XML 1.0 and 1.1 share the same media type. Similarly, HTML, CSS and JavaScript had significant syntax changes (especially the latter two) without a media type change. On the RDF-related side, JSON-LD 1.1 changed the default processing mode without a media type change, and SPARQL got the very large 1.1 update...

@TallTed
Member

TallTed commented Jan 29, 2025

@mielvds

questioning Ruben V's credibility with silly 'tomayto, tomahto' comments is not relevant, nor helpful

I don't think I made any 'tomayto, tomahto' comments, silly or otherwise.

@RubenVerborgh pointed to an 8 year old article he had written, as if it had some greater authority behind it than himself, and called it a "discussion", which would usually mean that it involved multiple participants. Indeed, the article page says "This Linked Research article has been peer-reviewed and accepted for the Workshop on Smart Descriptions & Smarter Vocabularies (SDSVoc) following the Call for Papers."

If you follow the link to that Call for Papers, you can see that (emphasis mine) "Short position papers are required in order to participate in this workshop. These are not academic papers but descriptions of the problem you’d like the workshop to discuss and the presentation you would like to offer. ‘Papers’ can be as simple as a short description of a tool or service to be demonstrated and the technologies used. Each organization or individual wishing to participate must submit a position paper explaining their interest in the workshop by the deadline. The intention is to make sure that participants have an active interest in the area, and that the workshop will benefit from their presence."

So, far from being "peer-reviewed", the only authority behind that article is @RubenVerborgh himself, and the "discussion" consists of only 2 comments (made 7 and 6 years ago) from other people (respectively, @VladimirAlexiev, who wanted to clarify the meaning of 1 sentence, and @nicholascar, who pointed to a document then-in-progress which has since moved to Content Negotiation by Profile; W3C Working Draft, 02 October 2023), of which only @VladimirAlexiev's is really about the content of the article, and that only about one phrase in one sentence, to which @RubenVerborgh made a one sentence reply, which did not result in any clarifying change within the article.

I stand by my assessment of the article, and of profile-based content negotiation.

@RubenVerborgh
Member Author

For example XML 1.0 and 1.1 share the same media type.

See above; HTML, XML etc. have explicitly created in-document mechanisms for versioning, which is why they can do this. Turtle does not, which is why we have the problem. (N3, in contrast, was for this reason explicitly equipped with an @keywords mechanism.)
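For instance, the N3 Team Submission's directive declares up front which bare words the document treats as keywords, so a parser that does not support the mechanism fails immediately on the first line:

@keywords a, is, of.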

your writing betrays a limited understanding

@TallTed Let's keep this thread about tech to save people's inboxes; feel free to send other comments to mine.

@afs
Contributor

afs commented Jan 29, 2025

RDF 1.2 does not change the vast majority of RDF data. Triple terms and base direction will be uncommon.

My concern is that we end up "splitting the world" into "RDF 1.1" and "RDF 1.2", yet all RDF 1.1 is valid RDF 1.2, and very often data from an RDF 1.2 publisher is valid RDF 1.1.

It is a cost-benefit decision.

JSON-LD didn't change the media type - it was 1.0 compatible (nearly) and did introduce optional @version to control the processing.

Profiles do have a role in asking an RDF 1.2 data publisher that does use triple terms to some extent to return RDF 1.1 compatible unstar data.

@Tpt

Tpt commented Jan 29, 2025

+1 to @afs

See above; HTML, XML etc. have explicitly created in-document mechanisms for versioning, which is why they can do this.

Indeed, XML has a versioning mechanism. However, some of the examples I gave, like JavaScript, SPARQL and, I believe, CSS, do not have versioning, and syntax introduced in recent versions produces hard syntax errors in the previous ones. There are definitely syntax changes on the web without media type changes.

@RubenVerborgh
Member Author

There are definitely syntax changes on the web without media type changes

…and never without breakage, which was the initial point: I expect format-based breakage to cause fatal technical and reputational damage to an ecosystem supposedly defined by interoperability.

@RubenVerborgh RubenVerborgh changed the title from "Does re-use of the same MIME types constitute a breaking change?" to "Which parties carry what costs of text/turtle changes, and do those outweigh which benefits for whom?" Jan 29, 2025
@gkellogg
Member

The JSON-LD WG discussed this in the 2024-12-11 meeting and recorded it as discussion on w3c/json-ld-syntax#436. The feeling is that there are many instances where the content served under a given media type has changed over time, and that changing the media type to something else is highly disruptive, to the point of constraining any real adoption of the new versions.

  • text/html has gone through many changes.
  • image/jp2 (for JPEG 2000) is often ignored and image/jpeg is used instead.
  • Microsoft changed .doc to .docx and many other related formats, but these were gross changes from a proprietary media type to something more open (application/msword => application/vnd.openxmlformats-officedocument.wordprocessingml.document).

JSON-LD does have a version announcement feature (@version: 1.1), which is optional but may be mandatory for JSON-LD 1.2.

w3c/rdf-xml#49 (comment) suggests an rdf:version="1.2" attribute which could enable dirLangString and triple terms and could also change the behavior of rdf:ID for reification. But it's not as easy to accomplish for other formats (Turtle/TriG could potentially introduce @version, but this can't reasonably be done for N-Triples or N-Quads).
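The shape of that suggestion (attribute name as per the linked issue; the rest is illustrative):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         rdf:version="1.2">
  <!-- 1.2 features such as dirLangString and triple terms would only
       be enabled when this attribute is present -->
</rdf:RDF>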

@RubenVerborgh
Member Author

RubenVerborgh commented Jan 29, 2025

I don't mean to hog the thread, but we're all increasingly talking past each other here.

Lots of true statements above:

  • Breaking changes have been made with and without changes in media type.
  • Non-breaking changes have been made with and without changes in media type.
  • Profiles have and haven't been used for things.

None of that was ever under question originally, nor does it answer the matter at hand.


So let me make my own question that started this thread more explicit and specific:

  • Overall: is it desirable / indeed the best cost–benefit trade-off to make changes to text/turtle?
    • Precedents in either direction exist, and cherry-picking does not help explain our specific trade-off.
  • In more detail:
    • What exactly are the costs / breakages of a proposed text/turtle change?
    • What exactly are the benefits of changing text/turtle compared to any alternatives (regardless of what those are)?
    • Who carries the costs? Who receives the benefits? Are they the same parties, and if not, why is that fair?
    • What percentage of systems will be affected? What percentage of systems are expected to upgrade? What percentage of documents are expected to contain non-backwards compatible statements? (And if that percentage is really low… why is it even worth breaking things?)
    • Are things going to break again with a future update?
  • Finally: Are any of those costs prohibitively expensive technologically, operationally, or reputationally?

@mielvds

mielvds commented Jan 30, 2025

I haven't seen any syntax change on the Web without a change in the media type

I am not sure this is true. For example, XML 1.0 and 1.1 share the same media type. Similarly, HTML, CSS and JavaScript had significant syntax changes (especially the latter two) without a media type change. On the RDF-related side, JSON-LD 1.1 changed the default processing mode without a media type change, and SPARQL got the very large 1.1 update...

Sorry, I should have been more nuanced (I meant: changes that cause old parsers to break rather than ignore) and you are right. It's what Ruben says: both strategies have been used, and probably the character limit on HTTP headers was a major driver for those not changing. BTW, now that you mention the JSON-LD approach: in that line of reasoning, turtle/trig, or ntriples/nquads, might as well have been one syntax with a versioning construct.

After processing what has been said in this thread, my two cents:

  • What exactly are the costs / breakages of a proposed text/turtle change?

I think the costs would be limited. If we don't change, some (or most) parsers will break...temporarily. If they are still being used and maintained, someone will provide a fix. Else, it's maybe not that big of a problem.

If we do consider this as a huge problem (I don't think it is), you can only do this with a media-type change (this was my previous point).

  • What exactly are the benefits of changing text/turtle compared to any alternatives (regardless of what those are)?

Parsers will be able to anticipate or mitigate turtle 1.2 documents without additional changes in protocol or practice, and will be able to explain to the user why it's not working, but only for online documents. I wonder whether "cannot parse 1.2" provides enough benefit compared to "invalid syntax error".

  • Who carries the costs? Who receives the benefits? Are they the same parties, and if not, why is that fair?

The developers of parsers, consumers of RDF, and publishers of RDF all carry costs and receive benefits. And mostly those operating in a Web context. There's a lot of offline processing too, and that's where a @version would be very welcome.

  • What percentage of systems will be affected? What percentage of systems are expected to upgrade? What percentage of documents are expected to contain non-backwards compatible statements? (And if that percentage is really low… why is it even worth breaking things?)

I don't know, but in my practice, I almost never rely on the RDF mediatypes because I don't process data published on the Web that I don't somehow control.

  • Are things going to break again with a future update?

Probably, but it will be less of a problem because of the 1.2 versioning system that results from this discussion :)

  • Finally: Are any of those costs prohibitively expensive technologically, operationally, or reputationally?

No. I don't think this will cause a reputation problem; it didn't for JavaScript, CSS or HTML, and much can be prevented by thoroughly communicating the spec beforehand. Web standards think (long), specify, formalise, document, and communicate, and therefore allow for much more migration time than any other format.

@pchampin
Contributor

@coolharsh55

If we do version the mime type, existing systems will not break.

They might, though. If DBPedia migrates to an RDF 1.2 system, and there is a new media type for Turtle 1.2, then it is likely that they will now serve their content with this new media type (see @Ostrzyciel's comment above). Then old clients will suddenly become unable to consume DBPedia data – even if the data has not changed a bit, and is still effectively Turtle 1.1.

If we don't version the mime type, existing systems will break.

Not necessarily. Following my DBPedia example above: since most of the data will still remain RDF 1.1 compatible, old clients won't even notice the migration.

Of course, I'm not claiming that keeping the same media-type is free of problems. Actually I'm not sure which option I prefer...

@coolharsh55

@pchampin

They might, though. If DBPedia migrates to an RDF 1.2 system, and there is a new media type for Turtle 1.2, then it is likely that they will now serve their content with this new media type (see @Ostrzyciel's #141 (comment) above). Then old clients will suddenly become unable to consume DBPedia data – even if the data has not changed a bit, and is still effectively Turtle 1.1.

Not necessarily. Following my DBPedia example above: since most of the data will still remain RDF 1.1 compatible, old clients won't even notice the migration.

The assumption that most data will remain 1.1 compatible may not hold if we make heavy use of 1.2 in the future. My argument is that versioned mime types are better for existing 'old clients' that won't be updated any time soon, because they won't be served 1.2 content that they cannot handle. If DBPedia or other systems migrate to 1.2 and stop serving 1.1 content, then hopefully it will be an informed decision. This shouldn't be an argument against providing a way to distinguish between versions. Also, the counter-argument should be stated here: if there is no distinction between mime-types and DBPedia starts serving 1.2 content, the old clients are going to run into errors anyway. At least with the difference in mime-type we have 'control' over the failing conditions, and a way to continue running old clients with the old mime-type.

@TallTed
Member

TallTed commented Jan 30, 2025

They might, though. If DBPedia migrates to an RDF 1.2 system, and there is a new media type for Turtle 1.2, then it is likely that they will now serve their content wit this new media type (see @Ostrzyciel's #141 (comment) above). Then old client will suddenly become unable to consume DBPedia data ­– even if the data has not changed a bit, and is still effectively Turtle 1.1.

Well, Turtle 1.1 consumers that rely on some default server behavior might choke on such new Turtle 1.2 serializations that were previously delivered as Turtle 1.1, if this new Turtle 1.2 is made the default and/or the media type remains text/turtle.

But I would expect that Turtle 1.1 consumers that correctly use ConNeg and the Accept: header to request text/turtle should get Turtle 1.1, and those that request the new text/turtle12 (or text/turtle-01-02 or whatever new version-identifying media type is chosen) should get Turtle 1.2.
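For example (with text/turtle12 standing in for whatever version-identifying name is chosen), a 1.2-capable client could still fall back gracefully when talking to older servers:

Accept: text/turtle12, text/turtle;q=0.9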

(I don't think the WG can retroactively shoehorn version info into Turtle documents, nor some other serializations of RDF 1.2, because those serializations were specified as if there would be no changes against which to future proof. I consider this a major error, but it is where we are.)

I'm confident that this will be how things work, because DBpedia is hosted by Virtuoso, and Virtuoso supports nearly if not all RDF serializations. Turtle 1.2 is not yet available, because the WG hasn't finished specifying it, but I expect Virtuoso to natively handle RDF 1.2 and SPARQL 1.2 (possibly with a new /sparql12 endpoint alongside the existing /sparql endpoint, though this might be an admin-configurable option) once these things are fully specified.

I don't think the WG has yet considered the full matrix of permutations, of what will happen when mixing tools conforming to each version of RDF and of SPARQL, with data conforming to each version of RDF and queries conforming to each version of SPARQL.

I don't think the WG can conclusively state now whether our new specifications will include some breaking or only non-breaking changes, and thus be 2.0 or 1.2. I think the WG can conclusively state the WG is trying to produce non-breaking changes to 1.2 (as the WG is Chartered to do).

@rat10
Contributor

rat10 commented Feb 2, 2025

(I don't think the WG can retroactively shoehorn version info into Turtle documents, nor some other serializations of RDF 1.2, because those serializations were specified as if there would be no changes against which to future proof. I consider this a major error, but it is where we are.)

Naive question: can't we define something - and even if it's only a standardized comment like # turtle 1.2 in the first line - to help in disambiguation? That sure is not perfect, but probably still better than nothing.

@coolharsh55

Naive question: can't we define something - and even if it's only a standardized comment like # turtle 1.2 in the first line - to help in disambiguation? That sure is not perfect, but probably still better than nothing.

@rat10 This is from my notes on how I see it: if we add a magic first line but don't change the mime-type, then 1.2 parsers have a way to identify whether they are looking at a 1.2 or 1.x document, but previous parsers won't understand this and as a result will mark the document as invalid Turtle. Since there is no mechanism in Turtle to indicate version or dictate how parsers should check for version compatibility, any solution which relies purely on the addition of fields/info will break existing parsers. So AFAIK the options are:

  1. define a magic-string like @version 1.2 as first line to avoid changing mime-type - which will certainly break existing parsers as this is not defined in 1.1, where we get to keep the same mime-type; OR
  2. define a new mime-type for 1.2 so that it has to be specifically requested/handled differently - which will not break existing parsers but will also lock them out of future graphs as they cannot be sure whether a 1.x document is also a valid 1.1 document

Whether 1 or 2 is preferred depends on how one does the cost-value analysis. Both will require changing parsers anyway to handle 1.2. Approach 1 is better considering we don't need to make changes to the stack all the way from content-negotiation / mime-handling bits upwards. Approach 2 is better as it doesn't have a 'penalty' for not changing existing codebases as they won't be given 1.2 with the same mime-type (and they will reject the new mime-type).

My preference is for Approach 2, because even if we take Approach 1 right now, 5-10 years down the line we're going to have this same problem again when there will be a version 1.3 which will again break the then current parsers etc. - so I would prefer not breaking any stuff and updating things now rather than break them now and again at every version change. To do this, we would need a new mime-type e.g. turtlex (for expanded), and add a @version X.y to the turtle spec so that future versions can be declared purely in the document.

@rat10
Contributor

rat10 commented Feb 2, 2025

Naive question: can't we define something - and even if it's only a standardized comment like # turtle 1.2 in the first line - to help in disambiguation? That sure is not perfect, but probably still better than nothing.

@rat10 This is from my notes on how I see it: if we add a magic first line but don't change the mime-type, then 1.2 parsers have a way to identify whether they are looking at a 1.2 or 1.x document, but previous parsers won't understand this and as a result will mark the document as invalid Turtle. Since there is no mechanism in Turtle to indicate version or dictate how parsers should check for version compatibility, any solution which relies purely on the addition of fields/info will break existing parsers.

I'm not claiming any authority w.r.t. this issue but let me just add that

  • it would probably be much easier to update a parser to check for that comment as opposed to updating it to handle triple terms
  • it would be relatively easy to check if a failed parse began with such a comment, providing potentially helpful debug hints

I.e., such a comment would at least make failures less opaque. And failures seem to be unavoidable anyway. OTOH, the comment line by itself should not break anything.
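A sketch of both checks, with the "# turtle 1.2" convention itself being the hypothetical under discussion:

import re

VERSION_COMMENT = re.compile(r"^#\s*turtle\s+(\d+)\.(\d+)\s*$")

def sniff_version(first_line):
    # Returns e.g. (1, 2), or None when no version comment is present.
    m = VERSION_COMMENT.match(first_line.strip())
    return (int(m.group(1)), int(m.group(2))) if m else None

# Before parsing: a cheap capability check.
assert sniff_version("# turtle 1.2") == (1, 2)
# After a failed parse: turn an opaque syntax error into a helpful hint.
assert sniff_version("@prefix ex: <http://example.org/> .") is None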

@domel

domel commented Feb 2, 2025

Naive question: can't we define something - and even if it's only a standardized comment like # turtle 1.2 in the first line - to help in disambiguation? That sure is not perfect, but probably still better than nothing.

In parser generators like ANTLR, comments are usually ignored in the abstract syntax tree (AST) by defining them as HIDDEN tokens. This is because comments do not affect the language's syntax; including them would unnecessarily increase AST complexity, and removing them simplifies further code processing.

In ANTLR, comments can be defined as hidden tokens:

COMMENT: '#' ~[\r\n]* -> skip;
