
Parser generates invalid URLs #379

Closed

LEW21 opened this issue Apr 10, 2018 · 21 comments
Labels: topic: parser; topic: validation (pertaining to the rules for URL writing and validity, as opposed to parsing)

LEW21 commented Apr 10, 2018

https://url.spec.whatwg.org/commit-snapshots/a1b789c6b6c36fcdb16311da5abd177e84151fca/#url-parsing

For each byte in buffer:

If byte is less than 0x21 (!), greater than 0x7E (~), or is 0x22 ("), 0x23 (#), 0x3C (<), or 0x3E (>), append byte, percent encoded, to url’s query.

Otherwise, append a code point whose value is byte to url’s query.

This leads to the creation of invalid URLs: ones that contain [, \, ], ^, `, {, |, or }, which are neither URL code points nor '%', and which therefore trigger validation errors (a short sketch after the suggestions below illustrates the mismatch):

Otherwise:

If c is not a URL code point and not U+0025 (%), validation error.

I think that either:

  • the list of valid query characters should be expanded to include more characters, or
  • these characters should be percent-encoded too.
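
To see the mismatch concretely, here is a minimal sketch (plain JavaScript; the helper names and the ASCII-only URL-code-point check are mine) of the quoted query-state rule next to the validity check:

// Sketch of the quoted query-state rule: which bytes get percent-encoded
const inQueryEncodeSet = c => c < '\x21' || c > '\x7E' || '"#<>'.includes(c)
// ASCII URL code points, per https://url.spec.whatwg.org/#url-code-points
const isURLCodePoint = c => /[A-Za-z0-9!$&'()*+,\-./:;=?@_~]/.test(c)
for (const c of '[\\]^`{|}')
  console.log(c, inQueryEncodeSet(c), isURLCodePoint(c)) // each logs: false false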

Related issues: #378, #17

LEW21 changed the title to: query state: "For each byte in buffer: If byte is less than ..." Apr 10, 2018
LEW21 changed the title to: query state: Parser generates invalid URLs Apr 10, 2018
LEW21 (Author) commented Apr 10, 2018

Actually, there is a similar problem for other URL parts, too.

For example:

  • a generated path may contain [, \, ], ^, or |, which are not valid there (see the example below);
  • a generated fragment may contain all of the above plus #, {, and }, which are not valid there;
  • if a state override is given, a valid path may contain ?, which will then be escaped.
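
For instance (per the specification's path rules; as noted further down in this thread, some browsers deviate and do encode these):

new URL('http://host/a^b|c').href // per spec: "http://host/a^b|c" - '^' and '|' pass through unencoded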

LEW21 (Author) commented Apr 10, 2018

I'd also consider excluding ` from the fragment percent-encode set, as it was excluded from the query set in #17, so that the fragment percent-encode set is contained within the query percent-encode set.

annevk (Member) commented Apr 11, 2018

I don't think we're in a position to change these, unless implementations vary enough for a particular code point that there's some leeway.

What we should maybe do is document more clearly when the parser doesn't "fix up" an invalid URL.

domenic (Member) commented Apr 11, 2018

I think it would definitely be worth attempting to change these if at all possible, or else expanding the definition of valid URL.

annevk (Member) commented Apr 11, 2018

I don't think we can change most of these due to compatibility, but if someone wants to have another attempt at doing the research, I'm willing to assist.

I don't think we should change the definition of what's valid either, though. Validity in part helps signal problems you might face, including with legacy software. (Same as with HTML.)

domenic (Member) commented Apr 11, 2018

It seems extraordinarily strange that we're not only giving developers the tools to create invalid URLs, but we're also encouraging other standards to produce invalid URLs and pass them on to the rest of the ecosystem (e.g., over HTTP). In that case I'd question why we're calling such URLs invalid at all. At that point they're just "URLs produced by all software that follows the URL Standard", and validity doesn't buy us much.

I don't believe you can parse/serialize any string through the HTML parser and get invalid HTML, for example. (Maybe some edge cases exist, but these characters are hardly edge cases.)

annevk (Member) commented Apr 11, 2018

Other standards? I don't see how this is different from #118.

domenic (Member) commented Apr 11, 2018

Sure. Every standard that uses the URL parser on user input is currently producing invalid URLs, right? Including standards that then use the URL serializer and send the result to the network or elsewhere.

#118 stemmed from suggestions that the parser should fail when given an invalid URL. This issue is a concern about the parser producing an invalid URL.

annevk (Member) commented Apr 11, 2018

@domenic they can produce invalid URLs, sure. Just like the HTML parser can produce invalid HTML.

annevk (Member) commented Apr 11, 2018

#118 was precisely about this: that the syntax (valid URLs) conflicts with the parser.

LEW21 (Author) commented Apr 11, 2018

@annevk, I'd agree that this is a case of garbage in, garbage out if it only happened when parsing invalid URLs. However, it also happens when changing components of a URL object:

x = new URL('http://localhost')
x.search = 'a#{}'
x.href // "http://localhost/?a%23{}"

Also, browsers don't follow the spec 100% here; for example, Chrome escapes ^ and | in paths, while the standard says it should not. I'll run more tests to check how browsers behave for all characters in all components.

LEW21 (Author) commented Apr 11, 2018

OK, these are the results for Chrome, Firefox, Edge: https://docs.google.com/spreadsheets/d/1mSl2N2Wrc7ZdKy2ArhLg0t3EHI2DOdHaDyWMZQByHE4/edit?usp=sharing


Note: I'm not sure whether I've set the spec behavior of \ correctly.

I was testing with:

U = () => new URL('http://localhost') // fresh URL object for each test
s = 'x "#%<>?[\\]^`{|}$'              // the characters under test
u = U(); u.pathname = s; u.href
u = U(); u.search = s; u.href
u = U(); u.hash = s; u.href

On Edge, u.pathname = s and u.search = s were throwing errors, so I had to check these character-by-character with loops:

for (let c of s) {try { u = U(); u.pathname = 'x' + c + 'x'; console.log(c, u.href); } catch (e) { console.log(c, e) }}
for (let c of s) {try { u = U(); u.search   = 'x' + c + 'x'; console.log(c, u.href); } catch (e) { console.log(c, e) }}

LEW21 (Author) commented Apr 11, 2018

Looking at the results, I think that:

  • ^ should be added to the path percent-encode set
  • ` should be moved from the fragment percent-encode set to the path percent-encode set
  • path, query and fragment should treat [, ] and | as valid
  • query and fragment should treat ?, \, ^, `, { and } as valid
  • fragment should treat # as valid

The only confusing one is:

  • \ in paths - no idea what to do here; I don't really understand the backslash magic in the spec.

LEW21 changed the title to: Parser generates invalid URLs Apr 11, 2018
domenic added the topic: parser and topic: validation labels Mar 4, 2020
annevk (Member) commented May 4, 2020

@LEW21 I had another look.

  • Encoding ^ for paths is still a reasonable suggestion, as Chrome and Firefox both encode it, but since Safari matches the specification and nobody else made an attempt, it seems somewhat unfair to them to change this now, and I'm not sure it's much better. @achristensen07, thoughts?
  • What is encoded in fragments changed in 7a3c69f and both Chrome and Firefox seem to align with that. Safari does not (yet) encode ` however.
  • \ in paths means / for special URLs, per the specification and in most implementations (see the example below).
  • I don't think we should change what is valid as per earlier statements.
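
For instance, a quick console check of the backslash normalization (illustration mine; the \\ is JavaScript string escaping for a single backslash):

new URL('http://example.com/a\\b').href // "http://example.com/a/b" - the '\' is treated as '/'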

alwinb (Contributor) commented Feb 23, 2021

I've added a number of test cases covering the distinct percent-encode sets. The PR (for wpt) is here.

Some observations:

  • Safari complies in all but the fragment cases. It does not encode additional characters in the fragment.
  • It seems that Firefox and Chrome do not encode additional characters in the username, password, path, query and fragment of non-special URLs.

Firefox and Chrome deviate from the spec in the following ways:

  • Firefox and Chrome also encode ^ in the path of special URLs. Chrome also encodes |.
  • Firefox and Chrome also encode ' and in the username and password (of special URLs) and Firefox also encodes ..
  • However, Firefox does not encode | in the password, in contrast with the spec.

It's easy to make mistakes with this, so additional eyes would be good.

nox (Member) commented Jun 18, 2022

FWIW, as part of my work I've found some requests using unencoded curly braces and double quotes in URIs' paths. I think it's an NVIDIA update tool or something.

See that test in the Rust crate httparse: https://github.com/seanmonstar/httparse/blob/8bd42db76543dee9f7e172d3f14a6666530f220c/tests/uri.rs#L3682-L3693

annevk (Member) commented Dec 2, 2024

Safari now does the correct thing for ` in a fragment, ^ is tracked by #607, and @alwinb added comprehensive tests so I'm going to close this one out.

annevk closed this as completed Dec 2, 2024
alwinb (Contributor) commented Dec 2, 2024

Whilst I have no strong opinion about the exact encode sets (other than for \, which I'll try to expand on in #675 when I have some time), I do want to point out that this issue is not merely about the encode sets themselves, but about the fact that a parse-serialise round trip can produce invalid URL strings.

As such, this issue is not resolved (sorry!).

Some suggestions follow.

I understand very well the reasons for the status quo, but it is nonetheless a messy situation:

We currently have to consider valid URL strings, invalid-but-tolerated URL strings, and invalid-and-rejected URL strings. Some of these are defined explicitly in the standard, others implicitly.

Furthermore, an invalid-but-tolerated URL string may end up, after a parse-serialise round trip, as either a valid URL string or an invalid-but-tolerated URL string, depending on the exact input.

I agree with @domenic that the last case is bound to cause confusion; moreover, it seems fairly easy to resolve.

I found it especially hard to get a full grasp of what the URL standard entails exactly, until I started to identify and name the different subclasses of URL strings for myself, in a similar way as above.

So my advice (which you don't have to take, yet I will follow in my own work) would be to make some editorial changes, possibly including some naming changes or additional definitions, to make all of this clearer.

To back that up with more motivation: the set of code points that do vs. do not end up being encoded is exactly the kind of thing that tends to drift across different implementations and applications. And it's one of the hardest things for implementers to get right, because there are so many different encode sets and special cases:

As I summarised in my own document (not published yet; busy with other things), we now have, per component type (i.e. path segment, query, fragment, and others), five different possible behaviours for very specific subsets of code points. A single code point may be:

  • Valid and not encoded (pass through)
  • Valid but encoded nonetheless
  • Invalid but tolerated and not encoded
  • Invalid and fixed by encoding it
  • Invalid and rejected, causing a parsing failure

Moreover, this comes with slight differences between special and non-special hierarchical URLs, and large differences for opaque-path URLs, for which the path is touched as little as possible, for good reasons.

That doesn't include single- and double-dot segments, which are actually also decoded.

All this is highly convoluted; it is very hard to understand and implement correctly. So anything that can help clarify or clean it up should be welcome, and should help adoption of this standard.
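
To make the five behaviours concrete, here is a hypothetical sketch (the names and the ASCII-only URL-code-point check are mine) that probes how a single code point is treated in the query of a special URL; the fifth behaviour, outright rejection, cannot be triggered through the search setter, so it does not appear:

// Hypothetical sketch: classify how one ASCII code point is treated
// in the query of a special URL
const isURLCodePoint = c => /[A-Za-z0-9!$&'()*+,\-./:;=?@_~]/.test(c)
const classifyInQuery = c => {
  const u = new URL('http://host/')
  u.search = 'a' + c + 'a'
  const roundTripped = u.search.slice(1) // drop the leading '?'
  const encoded = roundTripped !== 'a' + c + 'a'
  if (encoded) return isURLCodePoint(c) ? 'valid but encoded' : 'invalid and fixed by encoding'
  return isURLCodePoint(c) ? 'valid and not encoded' : 'invalid but tolerated'
}
classifyInQuery('a') // 'valid and not encoded'
classifyInQuery("'") // 'valid but encoded' (the spec encodes ' in special-URL queries)
classifyInQuery('{') // 'invalid but tolerated' - the case this issue is about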

annevk (Member) commented Dec 3, 2024

I think that is already documented quite well through some of the examples right at the top of the URLs section (see also #595): https://url.spec.whatwg.org/#example-url-parsing

I also think it's more complicated than individual code points, e.g., validity of % depends on subsequent code points.
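
For instance (illustration mine), the validity of % requires look-ahead, so a purely per-code-point classification can't capture it:

// '%' is valid in a URL unit only when followed by two ASCII hex digits
const percentsValid = s => !/%(?![0-9A-Fa-f]{2})/.test(s)
percentsValid('a%41') // true
percentsValid('a%4z') // false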

We're also getting really close to having three fully interoperable browser engine implementations (Interop 2024 has really been taken to heart) and quite a few server implementations, so I'm quite optimistic on the implementer side of things. I hope we can declare URLs a solved problem in a couple of years.

domenic (Member) commented Dec 3, 2024

I think that is already documented quite well through some of the examples right at the top of the URLs section (see also #595): https://url.spec.whatwg.org/#example-url-parsing

The valid column there explains when the input is valid. It doesn't explain when the output is invalid.


I still think my comment from #379 (comment) applies. Let me try rephrasing it.

  • If the idea of "valid URL string" is to have any meaning at all, separate from "string that is parseable as a URL", I think it means "software and humans should try to produce valid URL strings for exchange with other software and with humans, and not produce invalid URL strings".
  • In particular, I don't think web browsers should send invalid URL strings to servers.
  • Currently, it is very possible for web browsers to send invalid URL strings to servers. An example is when the user clicks on <a href="https://example.com/?}">click me</a>.
  • I also think it's bad that, if we want to encourage people to use the string "https://example.com/?%7D" instead of "https://example.com/?}", we have not given people an easy tool for producing the former string from, e.g., user input, or database entries, or similar. We have only given them the (new URL(input)).href tool, which produces the latter string. (See the sketch after this list.)
    • This is very different than HTML, where in most cases, a parse-serialize roundtrip produces valid HTML. E.g. if you input <i><b>misnested</i></b> (invalid) you'll get back <i><b>misnested</b></i>.
    • Or if you input <i a=1 a=2>x</i> you'll get back <i a="1">x</i>, which I guess is still technically not valid because there's no a="" attribute on the <i> element, but the "obvious" "syntactic" errors have been fixed.
  • Thus, I think it doesn't make sense to call "https://example.com/?}" an invalid URL string. I cannot see what purpose it serves to have tools call out such strings as invalid, if we have lots of them flying around the web all the time between all sorts of software, and we have provided no tools or algorithm for creating valid URLs.
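
As an illustration of what such a tool might look like, here is a userland sketch (names mine; it deliberately leaves the host alone, since [ and ] are meaningful in IPv6 hosts, and it assumes a special URL without an opaque path):

// Userland sketch: percent-encode the ASCII code points the parser leaves
// raw even though they are not URL code points
const escapeInvalid = s =>
  s.replace(/[[\]^`{|}\\]/g, c => '%' + c.charCodeAt(0).toString(16).toUpperCase())
const toValidHref = input => {
  const u = new URL(input)
  u.pathname = escapeInvalid(u.pathname)
  u.search = escapeInvalid(u.search)
  u.hash = escapeInvalid(u.hash)
  return u.href
}
toValidHref('https://example.com/?}') // "https://example.com/?%7D"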

I don't think any of this detracts from the desire to declare URL parsing solved, or the increasing progress on aligning parsers. From my perspective, this is purely about changing the definition of validity.

annevk (Member) commented Dec 3, 2024

This is very different than HTML

No? It's very easy to produce invalid HTML, e.g., <b><div>blah</div></b>. It's also possible at the syntactic level through APIs. HTML is really quite a bit worse at this.

I cannot see what purpose it serves to have tools call out such strings as invalid

It indicates that it might not interoperate to the fullest extent, admitting the reality that RFC 3986/7 and downstream consumers thereof exist. We could give that up and there are issues for tracking that already, e.g., #753, but that would be quite the change.
