Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec and WPT inconsistencies #239

Open
anonrig opened this issue Dec 18, 2024 · 15 comments · May be fixed by web-platform-tests/wpt#49782
Open

Spec and WPT inconsistencies #239

anonrig opened this issue Dec 18, 2024 · 15 comments · May be fixed by web-platform-tests/wpt#49782

Comments

@anonrig
Copy link
Contributor

anonrig commented Dec 18, 2024

What is the issue with the URL Pattern Standard?

There is a web-platform test that is even implemented by Chrome that is not covered with the URLPattern spec.

new URLPattern({ "protocol": "http", "port": "80 " })

This fails on Chromium due to this function validation: https://chromium.googlesource.com/chromium/src/+/main/extensions/common/url_pattern.cc#101

But other than canonicalizePort there is no place that actually validates the validity of the port, and canonicalizePort calls url parser setter which removes leading and trailing spaces which makes "80 " input valid.

Relevant WPT: https://github.com/web-platform-tests/wpt/blob/0c1d19546fd4873bb9f4147f0bbf868e7b4f91b7/urlpattern/resources/urlpatterntestdata.json#L1146

@anonrig
Copy link
Contributor Author

anonrig commented Dec 19, 2024

Another issue @annevk : new URLPattern({ hostname: 'bad#hostname' }); should not throw but there is a WPT that validates that it throws.

{
    "pattern": [{ "hostname": "bad#hostname" }],
    "expected_obj": "error"
  },

This is wrongly implemented on Deno's URLPattern, Cloudflare's workerd and Chromium.

@annevk
Copy link
Member

annevk commented Dec 19, 2024

Hmm, but # is a https://url.spec.whatwg.org/#forbidden-host-code-point so that does seem weird?

@anonrig
Copy link
Contributor Author

anonrig commented Dec 19, 2024

If I understand it correctly: This doesn't fail, hence it shouldn't fail on URLPattern as well:

const url = new URL('fake://dummy.test')
u.hostname = 'bad#hostname';

@anonrig
Copy link
Contributor Author

anonrig commented Dec 19, 2024

Host parser (specifically domain to ASCII with domain and false) strip all trailing values whenever it sees # https://url.spec.whatwg.org/#concept-host-parser

@anonrig
Copy link
Contributor Author

anonrig commented Dec 24, 2024

Another test case is invalid: ada-url/ada@d17f000

If you run the following on Google Chrome, you'll get the following error:

> new URLPattern("https://{sub.}?example{.com/}foo")

VM970:1 Uncaught TypeError: Failed to construct 'URLPattern': Invalid hostname pattern '{sub.}?example{.com/}foo'. Invalid hostname pattern 'example.com/foo'.
    at <anonymous>:1:1

But, canonicalize_hostname() method spec doesn't mention anything about certain edge cases: https://urlpattern.spec.whatwg.org/#canonicalize-a-hostname

Particularly, the following should work and works according to URL spec:

Welcome to Node.js v23.4.0.
Type ".help" for more information.
> let u = new URL('fake://dummy.test')
undefined
> u
URL {
  href: 'fake://dummy.test',
  origin: 'null',
  protocol: 'fake:',
  username: '',
  password: '',
  host: 'dummy.test',
  hostname: 'dummy.test',
  port: '',
  pathname: '',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
> u.hostname = 'example.com/foo'
'example.com/foo'
> u
URL {
  href: 'fake://example.com',
  origin: 'null',
  protocol: 'fake:',
  username: '',
  password: '',
  host: 'example.com',
  hostname: 'example.com',
  port: '',
  pathname: '',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}

Therefore, this test case shouldn't fail.

@jeremyroman
Copy link
Collaborator

Going to see how many of these I can get through today.

For this example:

new URLPattern({ "protocol": "http", "port": "80 " })

that's not the relevant Chromium code, but instead that we use url::CanonicalizePort here and here, but that function does not strip whitespace, which the URL port setter does as part of the basic URL parser, when it does this:

Remove all ASCII tab or newline from input.

I think this change is probably minor enough (especially since it only makes previously invalid patterns valid) that changing the implementation(s) to match the spec is okay.

@anonrig
Copy link
Contributor Author

anonrig commented Dec 27, 2024

@jeremyroman I think there are more invalid cases like this. I've removed and updated the following test cases on urlpatterntestdata.json. Ref: https://github.com/ada-url/ada/pull/785/files#diff-b8289182a638cd2b53ad33fee3a3aeaeb6fcd4d73e2cd9ebb0974a0fc8c2236f

Appreciate if you could take a look. They are mostly around hostname canonicalize method due to Chromium not using the URL parser.

@jeremyroman
Copy link
Collaborator

Yeah, I'm in the process of looking into what you've said.

For the port example, on further inspection the change I mention does address some whitespace (newlines and tabs) but not spaces. In the Chromium implementation, it also winds through ParsePortFromStringPosition which simply ignores any leading zeroes and any junk after the ASCII digit sequence, whose spec counterpart is here.

@jeremyroman
Copy link
Collaborator

For the hostname ones, the cases of /?# seem quite parallel to the port ones.

The case of \ is quite weird -- during pattern parsing we can't tell for sure if the URL is special so treat it as not, but for interpreting the init dictionary to test/exec etc we really could do so. The spec currently doesn't really try to do anything about that, though. Thoughts?

For the other bad characters, we have comments in Chromium linking to https://issues.chromium.org/u/0/issues/40124263. If it's just a matter of a (long-standing?) Chromium-specific bug I suppose we should probably test the standard behavior, though that kinda suggests it might be tough for us to fix which might motivate the spec changing. Not familiar enough with that bug yet to comment off the top of my head.

@jeremyroman
Copy link
Collaborator

I'm beginning to think that port and possibly also hostname canonicalization should be revisited in the spec, rather than in the implementations and tests. I've looked more at port, but there are other quirks of the current specified behavior. For instance, the pattern 8{000}? looks like it should match 8 and 8000 but in fact matches 8 and 80 because 000 is canonicalized to 0, and for hostnames {4.}?google.com becomes {0.0.0.4}?google.com (which matches 0.0.0.4google.com which isn't even the same domain).

A version of that for ports might be:

  1. The algorithm used when compiling a component for port should simply strip TAB/LF/CR (because URL does that here and elsewhere, though I would also be okay with not doing that at all) and error if any non-ASCII-digit character is present after that.

  2. The algorithm used in process port for init (in the non-pattern case) should behave more like the URL port setter, but not silently no-op like it does (it leaves it set to whatever it was before; here there it doesn't seem reasonable to interpret that as ""). That implies that leading junk characters (excl TAB/LF/CR + digits) throw an exception, but everything starting at a trailing one is ignored.

    Optionally, we could also be stricter than do the same as the previous plus the range and default port checks (which are a little stranger on a fixed part). I think that's more predictable ("80invalid" being "80" or "" depending on protocol, but "invalid80" failing is consistent with the URL port setter but weird).

Curious for your opinions as well as those of @sisidovski. But basically I think the spec might be the thing that ought to give, possibly with changes to the implementations and tests if what it changes to doesn't match them.

@ljharb
Copy link

ljharb commented Dec 30, 2024

{000} normalized to {0} seems very weird to me indeed.

@anonrig
Copy link
Contributor Author

anonrig commented Dec 31, 2024

@jeremyroman All canonicalize methods are basically encountering the same exact issues like hostname because Chromium implementation does not call URL. Since URL specification is a living specification, I don't see how URLPattern can be spec compliant while using URL. There are more test cases that are failing for pathname as well.

I recommend having an "example" implementation to ensure that the behavior is same in all implementations. With the current state, it's not possible to complete implementation and can be spec compliant where the only implementation does not follow the spec...

On top of that, the existing web-platform tests are not explanatory and indeed very cryptic. For example, I can't seem to understand how URLPattern spec uses a hacky solution in a spec in canonicalize_pathname. And I haven't found an answer to #240

I think we should have a call to understand what our options are, and proceed accordingly. I'm more than happy to help!

FYI: I'm working on adding URLPattern to Ada (which is used by Node.js and CF Workers for URL) which will power Node.js and Cloudflare workers URLPattern implementations.

@anonrig
Copy link
Contributor Author

anonrig commented Dec 31, 2024

Another inconsistency. There is a test in WPT that takes a pathname that starts with . and pathname property returns /foo/bar

let u = new URLPattern({ pathname: './foo/bar', baseURL: 'https://example.com' })
console.log(u.pathname) // returns /foo/bar

But it should return ./foo/bar.

The inconsistency comes from Chromium not following the spec. It should check for has_opaque_host, but it checks for opaque post with a flag...

if (url::IsUsingStandardCompliantNonSpecialSchemeURLParsing()) {
    return IsStandard() || (IsValid() && !HasOpaquePath());
  }

Ref: https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/url_pattern/url_pattern.cc;l=225;drc=296a0fe8d8bfa678632bf9b6b4a8af38c998a5e6;bpv=1;bpt=1

@jeremyroman
Copy link
Collaborator

url::IsUsingStandardCompliantNonSpecialSchemeURLParsing() is effectively true for the past few months, and in any case doesn't matter when IsStandard() (which is the case for a plain https protocol). While we should clean up that feature check eventually, I don't think it's a factor here.

As mentioned in #240, though, I think the spec may have the sense of the opaque path check inverted -- we should be shortening only non-opaque paths.

@sisidovski
Copy link
Collaborator

Thank you for the discussion all.

@jeremyroman Regarding the port, stripping TAB/LF/CR in compiling a component makes sense to me. For process port for init, I'm not sure if we should introduce something new to the current algorithm as the current algorithm already uses the result of basic URL parser, which is intuitive and consists with URL. So maybe we can simply update the test not to throw an error first then fix the implementation?

And, the issue of the pathname starting with . in #240 will be solved by inverting the opaque path? > @anonrig

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

5 participants