Streamline Turtle-star syntactic sugar and future-proof it for graphs #131

Open
rat10 opened this issue Oct 22, 2024 · 18 comments

Comments

@rat10
Contributor

rat10 commented Oct 22, 2024

The syntax of Turtle-star still needs some work. "Grammar updates for triple terms and occurrences" (#51) did fix a few issues, but left others open. In particular:

  • the annotation syntax uses curly brackets, although those are by convention reserved for graphs - a convention that should be followed to allow for a future extension of RDF-star from statement annotation to graph annotation
  • currently the syntax uses symbols from all over the place: rounded braces for triple terms, a tilde to designate a reifier, pipes inside the annotation syntax - it would be preferable to streamline this to one symbol only (plus different kinds of braces)

Different combinations of symbols have been discussed, see e.g. the threads "Reified triple syntax" and "[syntax] some re-shuffling of braces, pipes, etc" in the mailing list archive of September 2024. The discussion went on in October in the "Re: Reified triple syntax" thread.
My latest take on this (from this mail) is:

  • in the annotation syntax, replace curly brackets with square brackets, which are customary for lists of attributions
  • use the * symbol as the unifying character that is used in all things RDF-star

This leads to the following syntactic elements:

  • abstract triple terms become <<* :s :p :o *>>
  • reified triple terms become <* :s :p :o *>
  • reification identifiers are preceded by a double asterisk **
  • inline annotations are enclosed by [* :a :b , :c *]
  • eventual reified graph terms may become {* :s :p :o . :x :y :z *}
  • eventual abstract graph terms may become {{* :s :p :o . :x :y :z *}}
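
For readers comparing the two notations, the mapping implied by this list can be sketched as a naive token rewrite in Python. The names `PROPOSED` and `to_proposed` are made up for illustration, and the current tokens (`<<( )>>`, `<< >>`, `~`, `{| |}`) are the ones mentioned in this thread. The rewrite is purely textual: it does not parse Turtle and would also rewrite these characters inside IRIs or string literals.

```python
import re

# Hypothetical mapping from the current Turtle-star tokens to the proposed ones.
# Longer tokens are listed first so that "<<(" is tried before "<<".
PROPOSED = {
    "<<(": "<<*", ")>>": "*>>",  # triple terms
    "<<": "<*", ">>": "*>",      # reified triples
    "{|": "[*", "|}": "*]",      # annotation blocks
    "~": "**",                   # reifier marker
}
TOKEN = re.compile("|".join(re.escape(t) for t in PROPOSED))


def to_proposed(turtle: str) -> str:
    """Naively rewrite current tokens to the proposed ones (no real parsing)."""
    return TOKEN.sub(lambda m: PROPOSED[m.group(0)], turtle)


print(to_proposed(':s :p :o ~ :r {| :a :b |} .'))
print(to_proposed('<< :s :p :o >> :q <<( :x :y :z )>> .'))
```

The graph forms `{* ... *}` and `{{* ... *}}` have no counterpart in the current syntax, so the sketch leaves them out.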

@niklasl collected and investigated examples, some of which I converted to this proposal as UCR with stars. The following is a very short excerpt from there to illustrate the central aspects of this proposal:

# <https://github.com/w3c/rdf-ucr/wiki/RDF%E2%80%90star-for-Detailed-Provenance-in-Cooperative-Union-Cataloguing>

<p1> a :Person ;
   :birthDate "1901" [* :source <wikidata/p1> *] .

<* <p1> :birthDate "1902" *> :source <book/x> .

<http://www.wikidata.org/entity/Q102071> a sdo:Person ;
  rdfs:label "Tove Jansson"@fi ;
  sdo:deathDate "+2001-06-27T00:00:00Z"^^xsd:dateTime [*
            :importedFromWikimediaProject_P143 <http://www.wikidata.org/entity/Q328> 
          *] , 
      "+2001-06-27T00:00:00Z"^^xsd:dateTime [* 
                :statedIn_P248 <http://www.wikidata.org/entity/Q36578> ;
                :retrieved_P813 "+2014-04-24T00:00:00Z"^^xsd:dateTime 
            *] .

<introduction-to-physics> a :Text [*
            rdfs:comment "This is an audio book."@en [*
                    dc:created "2023-05-20T09:14:30Z"^^xsd:dateTime 
                *] 
        *] ;
    bf:classification <literature-education-physics> [*
            bf:assigner <annif> ;
            dc:date "2023-05-20T08:44:06Z"
        *] .

# <https://github.com/w3c/rdf-ucr/issues/26>

<charlesdodgson> :says <* _:b1 :name "Alice" *>, 
    <* _:b1 :birthDate "1852" *> .

[] :givenName "Alice" ** _:d1 ;
    :familyName "Liddell" ** _:d1 ;
    :birthDate "1852-05-04" ** _:d1 .

_:d1 dc:source <https://en.wikipedia.org/wiki/Alice_Liddell> ;
  dc:date "2023-10-23" .

<report> bibo:authorList (<a> <b> <c>) [* 
        dc:source <a> ; ex:disputedBy <c> 
    *] .
<* <report> bibo:authorList (<c> <b> <a>) *> dc:source <c> .
@afs
Contributor

afs commented Oct 22, 2024

the annotation syntax uses curly brackets, although those are by convention reserved for graphs

It uses {| and |} as terminals, and that does not preclude { and } for graphs.
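
The point about terminals can be illustrated with a small longest-match tokenizer in Python. This is only a sketch, not the Turtle grammar: the `tokenize` helper and its regular expression are assumptions made for the example, and every non-delimiter run of characters is treated as one opaque token.

```python
import re

# Two-character terminals are tried before the single braces, so "{|" is never
# split into "{" followed by "|". Everything else is an opaque token.
TOKENS = re.compile(r"\{\||\|\}|[{}]|[^\s{}|]+")


def tokenize(text: str) -> list[str]:
    """Split text into brace terminals and whitespace-separated tokens."""
    return TOKENS.findall(text)


print(tokenize("{ :s :p :o . :x :y :z } {| :b :c , :d |} ."))
```

A lexer built this way keeps `{` / `}` available for graphs. Whether the two forms are easy for a human reader to tell apart is a separate question.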

@rat10
Contributor Author

rat10 commented Oct 22, 2024

the annotation syntax uses curly brackets, although those are by convention reserved for graphs

It uses {| and |} as terminals, and that does not preclude { and } for graphs.

But see:

{ :s :p :o .
  :x :y :z } {| :b :c , :d |} .

Is that readable, is it easy to spot the difference in meaning? I don't think so.

@niklasl

niklasl commented Oct 22, 2024

Quick note: <*[]:p[]*> is a valid IRI, so the particular <* *> variant is problematic.

@william-vw

Quick note: <*[]:p[]*> is a valid IRI, so the particular <* *> variant is problematic.

Gosh, it is!

But it's not a problem if whitespace is mandated after <* and before *>, respectively.

@kasei
Contributor

kasei commented Oct 22, 2024

  • the annotation syntax uses curly brackets, although those are by convention reserved for graphs - a convention that should be followed to allow for a future extension of RDF-star from statement annotation to graph annotation

I'm not opposed to considering alternatives, but I disagree with this framing. We're trying to align syntax between both data and query languages, and in SPARQL, braces are not conventionally "reserved for graphs" – they're used to indicate scoping.

Also, I agree with Andy about {| being different from { here.

All of the rest of this is obviously subjective preference, but my 0.02:

currently the syntax uses symbols from all over the place: rounded braces for triple terms, a tilde to designate a reifier, pipes inside the annotation syntax - it would be preferable to streamline this to one symbol only (plus different kinds of braces)

I find the use of a single symbol to make it harder to read, not easier.

  • abstract triple terms become <<* :s :p :o *>>
  • reified triple terms become <* :s :p :o *>

If we're trying to coax people into preferring the reified syntax over the use of triple terms, I find this to be much worse, as both forms use the same characters with the repetition of those characters being the only difference. By contrast, I find it easier to visually differentiate << from <<(.

  • reification identifiers are preceded by a double asterisk **

This is a case where I would much prefer the use of different characters in the token to help visually differentiate from surrounding tokens.

  • inline annotations are enclosed by [* :a :b , :c *]

I find the suggestion of using a token including brackets for the annotations to be the most compelling of these suggestions, as it feels naturally aligned with the existing BlankNodePropertyList syntax in both Turtle and SPARQL.

@kasei
Contributor

kasei commented Oct 22, 2024

Quick note: <*[]:p[]*> is a valid IRI, so the particular <* *> variant is problematic.

Is it? I can't figure out how it would be. Brackets in particular are gen-delims, and so belong to the reserved characters.

@niklasl

niklasl commented Oct 22, 2024

Quick note: <*[]:p[]*> is a valid IRI, so the particular <* *> variant is problematic.

Is it? I can't figure out how it would be. Brackets in particular are gen-delims, and so belong to the reserved characters.

No, you're right, my bad. Square brackets are fairly often allowed by "lax" parsers (and exist in URLs in the wild), but aren't strictly allowed everywhere (reserved for IPv6 literals).

However, <*():b()*> appears to be valid (parens being sub-delims). OTOH, that nil isn't allowed in reified triples; right?

(I do think <* *> looks problematic, comparing e.g. <*:s:p:o*> and <* :s :p :o *>. But that's the bikeshedding aspect of this, beside the technical points.)

@william-vw

william-vw commented Oct 22, 2024 via email

@niklasl

niklasl commented Oct 22, 2024

@william-vw Two reasons come to mind:

  1. Having a regular IRI character as a delimiter and relying on whitespace for differentiation can be problematic for readability (though a topic in itself).
  2. I believe << ... >> stands out more.

(On that note, <<( ... )>> has also grown on me, since it looks both clear to me and "low-level", i.e. not part of the convenience syntax. It feels a bit like using full IRIs as predicates: allowed but not ergonomic. One of my hypothetical worries with the parens in that form is that I've been more supportive of reifiers of "links to full lists"; though I can see the possible costs and tradeoffs considered.)

We should consider how much leeway we have at this point. There's already been a lot of work on the syntax documents and tests.

(I've wanted to debate other options before, e.g. "quoted" predicate-object pairs grouped under a subject. But this can easily become a time sink and shake up settled intuitions. I'm more reluctant to do that now.)

@afs
Contributor

afs commented Oct 22, 2024

Quick note: <*[]:p[]*> is a valid IRI, so the particular <* *> variant is problematic.

The IRIREF terminal production accepts <*[]:p[]*> and <[]:p[]>.
IRI parsing happens after the grammar AST is determined, and there is no backtracking if IRI parsing fails.
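
As a sketch of what the terminal accepts, the IRIREF production of the Turtle grammar, `'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'`, can be transcribed into a Python regular expression. The helper name `is_iriref` is made up for this example. The check is purely lexical and says nothing about whether the content is a valid IRI.

```python
import re

# Transcription of the Turtle IRIREF terminal:
#   '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
)


def is_iriref(token: str) -> bool:
    """True if the whole token matches the IRIREF terminal."""
    return IRIREF.fullmatch(token) is not None


for candidate in ["<*[]:p[]*>", "<[]:p[]>", "<*:s:p:o*>", "<* :s :p :o *>"]:
    print(candidate, is_iriref(candidate))
```

Only the last candidate is rejected, because the space character (#x20) is excluded from IRIREF. That is why mandatory whitespace after `<*` and before `*>` would keep the two token kinds apart at the lexical level.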

@afs
Contributor

afs commented Oct 22, 2024

We should consider how much leeway we have at this point.

Agreed. The WG has been proceeding in the belief that the large majority of uses of the RDR and RDF-star CG <<...>> syntax are occurrences, not triple terms, and agreed on syntax on that basis.

Systems have implemented <<...>> syntax; articles have been written; there are questions about it on StackOverflow.

While the WG is able to make a change, a change to occurrence syntax has a significant impact outside the WG. We don't work in isolation; we work for the community, and the cost should be factored into the decision.

@kasei
Contributor

kasei commented Oct 23, 2024

However, <*():b()*> appears to be valid (parens being sub-delims).

I'm still struggling to figure this one out, though it's harder. I think the parse would go through these productions:

IRI-reference > irelativeref > irelativepart > ipathnoscheme > isegmentnznc

which would make the appearance of the colon a problem. A colon can appear in a relative IRI, but in this case I think you'd need a / before it (as part of ipathnoscheme).

I agree with Andy that this is probably a secondary issue since the SPARQL grammar uses the very lax IRIREF production, which in most systems is probably only validated as an IRI later in the parsing process.
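
The colon rule discussed above can be sketched in Python. This checks only that one rule (a reference without a scheme must not have a colon in its first path segment) and is not a full RFC 3986/3987 validator. The helper name `colon_problem` is an assumption made for the example.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def colon_problem(ref: str) -> bool:
    """True if ref has no scheme but a colon in its first path segment."""
    if SCHEME.match(ref):
        return False  # parses as scheme ":" rest, so the colon is a delimiter
    if ref.startswith("/"):
        return False  # absolute path or authority, the first-segment rule does not apply
    path = re.split(r"[?#]", ref, maxsplit=1)[0]  # drop query and fragment
    first_segment = path.split("/", 1)[0]
    return ":" in first_segment


print(colon_problem("*():b()*"))    # no scheme ("*" is not ALPHA), colon in first segment
print(colon_problem("./*():b()*"))  # colon now sits in the second segment
```

Under this reading, the content of `<*():b()*>` fails because `*()` cannot be a scheme and the first segment of a scheme-less path must be colon-free.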

@afs
Contributor

afs commented Oct 23, 2024

However, <*():b()*> appears to be valid (parens being sub-delims).

IRI-reference > irelativeref > irelativepart > ipathnoscheme > isegmentnznc

which would make the appearance of the colon a problem. A colon can appear in a relative IRI, but in this case I think you'd need a / before it (as part of ipathnoscheme).

Yes, or it fails the scheme production.

There's an online service: https://www.sparql.org/iri-validator.html

(It's the old IRI parsing code)

I agree with Andy that this is probably a secondary issue since the SPARQL grammar uses the very lax IRIREF production, which in most systems is probably only validated as an IRI later in the parsing process.

There is also RDF/XML or JSON-LD, where IRIs can be strings passed up by the underlying format parser.

<http:///>, <urn:x:nss>, <urn:nid:nss?k=v> are legal by the grammar of RFC3986 appendix A but violate a URI scheme rule.

@rat10
Contributor Author

rat10 commented Oct 23, 2024

We should consider how much leeway we have at this point.

Agreed. The WG has been proceeding in the belief that the large majority of uses of the RDR and RDF-star CG <<...>> syntax are occurrences, not triple terms, and agreed on syntax on that basis.

It's funny that you mention those two definitions of the meaning of << ... >>, because they mark pretty different interpretations:

  • in RDR and pre-CG RDF* it meant asserted referentially transparent type (but occurrence by example)
  • in CG RDF-star it meant unasserted referentially opaque type
  • and currently it means unasserted referentially transparent occurrence

While this wasn't driving the design idea behind <* ... *>, it is indeed an argument for ditching the << ... >> syntax altogether. It will be hard to re-establish common ground for the meaning of << ... >> after it has been changed so profoundly so many times. I don't know about implementations, but a cursory look at ISWC 2024 accepted papers shows two mentions of RDF-star, one of them referring to Olaf's pre-CG version of RDF* [0], the other to the CG report version [1].

Systems have implemented <<...>> syntax; articles have been written; there are questions about it on StackOverflow.

While the WG is able to make a change, a change to occurrence syntax has a significant impact outside the WG. We don't work in isolation; we work for the community, and the cost should be factored into the decision.

The questions might well take a lot longer to go away if we keep << ... >>.

[0] Implementing Usage Control Policies Using Reification with RDF-Star and SPARQL-Star
Ines Akaichi, Giorgos Flouris, Irini Fundulaki and Sabrina Kirrane
http://users.ics.forth.gr/~fgeo/files/ISWC24Poster.pdf
[1] eSPARQL: Representing and Reconciling Agnostic and Atheistic Beliefs in RDF-star knowledge graphs
Xinyi Pan, Daniel Hernandez, Philipp Seifer, Ralf Lämmel and Steffen Staab
https://arxiv.org/pdf/2407.21483

@rat10
Contributor Author

rat10 commented Oct 23, 2024

  • the annotation syntax uses curly brackets, although those are by convention reserved for graphs - a convention that should be followed to allow for a future extension of RDF-star from statement annotation to graph annotation

I'm not opposed to considering alternatives, but I disagree with this framing. We're trying to align syntax between both data and query languages, and in SPARQL, braces are not conventionally "reserved for graphs" – they're used to indicate scoping.

I proposed, tentatively, {* ... *} for RDF-star graph terms. At least that wouldn't clash with SPARQL. Also, scoping in SPARQL is on graph patterns if I'm not mistaken. That rather supports my interpretation that usually braces are used to demarcate graphs, doesn't it?

Also, I agree with Andy about {| being different from { here.

All of the rest of this is obviously subjective preference, but my 0.02:

currently the syntax uses symbols from all over the place: rounded braces for triple terms, a tilde to designate a reifier, pipes inside the annotation syntax - it would be preferable to streamline this to one symbol only (plus different kinds of braces)

I find the use of a single symbol to make it harder to read, not easier.

It may make it harder to read in certain combinations, but it helps recognition. RDF-star syntax has become quite a bit bloated, with three term syntaxes and one identifier syntax. That is a lot just for the annotation use case. Of course, given that the tendency of the WG is to not map the annotation syntax to rdfs:states, another acceptable solution IMO would be to scrap all the syntactic sugar, but I wonder if that will find more support.

  • abstract triple terms become <<* :s :p :o *>>
  • reified triple terms become <* :s :p :o *>

If we're trying to coax people into preferring the reified syntax over the use of triple terms, I find this to be much worse, as both forms use the same characters with the repetition of those characters being the only difference. By contrast, I find it easier to visually differentiate << from <<(.

Those two shouldn't occur in the same context. They may, of course, but the design rather distinguishes between Turtle-star (<< ... >> and {| ... |}, or <* ... *> and [* ... *]) and N-Triples-star (<<( ... )>> or <<* ... *>>).

  • reification identifiers are preceded by a double asterisk **

This is a case where I would much prefer the use of different characters in the token to help visually differentiate from surrounding tokens.

One argument against the tilde is that it is visually not very strong. I think the ** is better in that respect.

  • inline annotations are enclosed by [* :a :b , :c *]

I find the suggestion of using a token including brackets for the annotations to be the most compelling of these suggestions, as it feels naturally aligned with the existing BlankNodePropertyList syntax in both Turtle and SPARQL.

That is actually the issue that I'm most concerned about.

@afs
Contributor

afs commented Oct 23, 2024

they mark pretty different interpretations:

The point is that while that was the intention, the reality is that usage is overwhelmingly occurrences.

@rat10
Contributor Author

rat10 commented Oct 23, 2024

they mark pretty different interpretations:

The point is that while that was the intention, the reality is that usage is overwhelmingly occurrences.

Yes, that may be a valid point w.r.t. occurrences, and also w.r.t. referential opacity, which only very few people seemed to care about. But it may still spark questions and uncertainty.
And un/assertedness is still a different case.

@rat10
Contributor Author

rat10 commented Jan 10, 2025

To take this up again: The current design has many options - abstract triple terms, triple term occurrences, annotation syntax with and without explicit reifiers - and they diverge syntactically in many different directions, and none of them uses the *, for which the approach is known, anymore. I’m not trying to diminish what we are designing, but let’s face it: we might be a bit blinkered and not realize that this new annotation mechanism might for many people be just an extra, an addition on the side that may be useful to them in some special case. Those people will possibly wonder about such a plethora of syntactic variations, and maybe be bewildered by it and unwilling to get accustomed to it. That’s why I’d like to shrink not the number of primitives, but the syntactic variations, to improve recognizability and make it easier to mentally map one syntactic variant to another. An overarching design feature, a common element of all syntactic variants, can help usability a lot. Since the type of bracket cannot be that element for multiple reasons, and since the * is what this approach is known for, it makes sense to use it for that purpose.
Also, I’m not very fond of either the tilde (but that might be dismissed as a matter of taste, and ultimately I’d be fine with that) or the use of curly braces (which I take more seriously: they should be reserved for graphs if possible, and it is possible).

I saw that the following comment got a lot of thumbs up:

We should consider how much leeway we have at this point.
[...]

I think this is questionable. The way the current iteration of the syntax came to be was well intentioned but not very inclusive, and it was done at a time when the chairs had declared work on syntax to be off the table. As I said before, I was - and still am - fine with people working on it and making updates, but I'm not fine with now calling it the final state when it was never discussed or even voted on in WG meetings.


6 participants