Explain why valid domain needs to run ToUnicode #817

hsivonen · 2024-02-02T13:17:19Z

What is the issue with the URL Standard?

https://url.spec.whatwg.org/#valid-domain could use an informative note that states the implications of the two-step (both ToASCII and ToUnicode) check. Given that both use UTS 46 "Processing" and that "ToASCII" does more after "Processing", it would be helpful to call out what the second run of "Processing" (as part of "ToUnicode") still catches.
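
For readers following along, the two-step shape being asked about can be sketched roughly as below. This is only an illustration of the control flow, not the spec text: `is_valid_domain` is a hypothetical name, and the `to_ascii` and `to_unicode` closures are stand-ins for UTS 46 ToASCII and ToUnicode with the strict flags set, not real library calls.

```rust
/// Hypothetical sketch of a two-step "valid domain" check.
/// `to_ascii` and `to_unicode` stand in for UTS 46 ToASCII and ToUnicode.
fn is_valid_domain<A, U>(domain: &str, to_ascii: A, to_unicode: U) -> bool
where
    A: Fn(&str) -> Result<String, ()>,
    U: Fn(&str) -> Result<String, ()>,
{
    // Step 1: ToASCII runs "Processing" and then performs further checks.
    let ascii = match to_ascii(domain) {
        Ok(ascii) => ascii,
        Err(()) => return false,
    };
    // Step 2: ToUnicode runs "Processing" again on the ASCII result.
    // The open question is whether this can still fail after step 1 succeeded.
    to_unicode(&ascii).is_ok()
}

fn main() {
    // Toy stand-ins for illustration only; real IDNA processing does far more.
    let toy_to_ascii = |s: &str| -> Result<String, ()> {
        if s.is_ascii() && !s.is_empty() {
            Ok(s.to_ascii_lowercase())
        } else {
            Err(())
        }
    };
    let toy_to_unicode = |s: &str| -> Result<String, ()> { Ok(s.to_owned()) };
    println!("{}", is_valid_domain("Example.COM", toy_to_ascii, toy_to_unicode));
}
```

If the second closure can never return an error for any output of the first, the second step is redundant, which is what the note would need to confirm or refute.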

hsivonen · 2024-03-01T09:31:02Z

FWIW, after more progress with writing code, I'm even more puzzled about what the second run of "Processing" is meant to catch here.

annevk · 2024-03-03T08:06:04Z

I wonder if the difference has disappeared over time. It does seem weird that ToUnicode can now fail apparently, but there's no explicit mention of this.

zacknewman · 2024-04-25T19:17:00Z

Glad I saw this, as I too am skeptical about the need to perform the domain-to-unicode algorithm. I tried to generate inputs that fail at step 3 using the Rust code below, which relies on the idna crate, but I was unable to find any:

use core::{ops::ControlFlow, str};
use idna::Config;
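// Phase 1: try every Unicode scalar value as a one-character input and stop at
// the first counterexample. Phase 2 (only if none was found): reuse the buffer
// to try "xn--" labels via `punycode_inputs`.
fn main() {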
    match ('\0'..=char::MAX).try_fold(String::with_capacity(8), |mut input, c| {
        input.clear();
        input.push(c);
        if let Err(val) = idna_transform(input.as_str()) {
            println!("{val}");
            ControlFlow::Break(())
        } else {
            ControlFlow::Continue(input)
        }
    }) {
        ControlFlow::Continue(input) => {
            let mut utf8 = input.into_bytes();
            utf8.clear();
            utf8.extend_from_slice(b"xn--");
            punycode_inputs(&mut utf8, 0);
        }
        ControlFlow::Break(()) => (),
    }
}
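// Recursively appends up to four bytes from the alphabet below to `utf8`,
// testing every candidate. Returns true as soon as a counterexample is printed.
fn punycode_inputs(utf8: &mut Vec<u8>, count: u8) -> bool {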
    if count < 4 {
        for i in [
            b'-', b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8', b'9', b'a', b'b', b'c',
            b'd', b'e', b'f', b'g', b'h', b'i', b'j', b'k', b'l', b'm', b'n', b'o', b'p', b'q',
            b'r', b's', b't', b'u', b'v', b'w', b'x', b'y', b'z',
        ] {
            utf8.push(i);
            if let Err(val) =
                idna_transform(str::from_utf8(utf8.as_slice()).unwrap_or_else(|_| {
                    unreachable!("ASCII is a subset of UTF-8, so this is fine")
                }))
            {
                println!("{val}");
                return true;
            } else if punycode_inputs(utf8, count + 1) {
                return true;
            } else {
                utf8.pop();
            }
        }
    }
    false
}
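// Returns Err(input) only when strict ToASCII succeeds but strict ToUnicode then
// fails on its output. Inputs rejected by ToASCII are uninteresting and yield Ok.
fn idna_transform(input: &str) -> Result<(), &str> {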
    idna::domain_to_ascii_strict(input).map_or_else(
        |_| Ok(()),
        |ascii| {
            Config::default()
                .use_std3_ascii_rules(true)
                .to_unicode(ascii.as_str())
                .1
                .map_err(|_e| input)
        },
    )
}

Consequently I believe steps 3 and 4 can be removed, but I haven't mathematically proven the domain-to-ascii algorithm is sufficient. I've used these examples as well.
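
To give a sense of how much of the input space the search above covers, here is a stdlib-only sketch that reproduces just the enumeration and counts the candidates, without calling the idna crate. `ALPHABET`, `visit_suffixes`, and `count_candidates` are names made up for this illustration.

```rust
/// Bytes appended after the "xn--" prefix: '-', '0'..='9', 'a'..='z'.
const ALPHABET: &[u8] = b"-0123456789abcdefghijklmnopqrstuvwxyz";

/// Depth-first walk over every suffix of 1..=remaining bytes, calling `visit`
/// on each candidate label. Returns the number of candidates visited.
fn visit_suffixes<F: FnMut(&str)>(buf: &mut String, remaining: usize, visit: &mut F) -> u64 {
    if remaining == 0 {
        return 0;
    }
    let mut visited = 0;
    for &b in ALPHABET {
        buf.push(char::from(b));
        visit(buf.as_str());
        visited += 1 + visit_suffixes(buf, remaining - 1, visit);
        buf.pop();
    }
    visited
}

/// Counts the "xn--" labels whose suffix is at most `max_len` bytes long.
fn count_candidates(max_len: usize) -> u64 {
    let mut buf = String::from("xn--");
    visit_suffixes(&mut buf, max_len, &mut |_label: &str| {})
}

fn main() {
    // Phase 1: every Unicode scalar value as a one-character input.
    let single_chars = ('\0'..=char::MAX).count();
    // Phase 2: every "xn--" label with a suffix of up to four bytes.
    let labels = count_candidates(4);
    println!("{single_chars} single-character inputs, {labels} xn-- labels");
}
```

That works out to 1,112,064 single-character inputs and 1,926,220 "xn--" labels (37 + 37^2 + 37^3 + 37^4), so the search is evidence rather than a proof.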

I'm not entirely sure why this redundant check existed: either there was a difference back when this definition was introduced in 3bec3b8, or (more likely) I wasn't sure whether there was one. Fixes #817.

annevk · 2024-11-29T08:22:48Z

I've put up #840 to fix this. I'm somewhat curious why you all implemented the "valid domain" definition. At least within the web platform there's no known caller for it and it's really meant to be more of a syntax explanation as to how to write a domain.

hsivonen · 2024-11-29T09:15:12Z

> I'm somewhat curious why you all implemented the "valid domain" definition.

The idna crate was already configurable on the points that beStrict affects before my time.
Being able to flip options closer to beStrict=true is useful for being able to use the upstream IdnaTestV2.txt test suite fully. (I say "closer", because prior to Unicode 16.0, the test suite assumed verifyDnsLength=true with the quirk of allowing the trailing dot.)
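
As a rough illustration of the trailing-dot quirk mentioned in the parenthetical, here is a minimal sketch of a DNS length check with a switch for tolerating a single trailing dot. It assumes the usual limits of 1 to 253 bytes for the name and 1 to 63 bytes per label. `verify_dns_length` and `allow_trailing_dot` are hypothetical names, and the code is not taken from the idna crate or from the UTS 46 text.

```rust
/// Hypothetical length check on a domain that is already ASCII.
/// Assumed limits: 1..=253 bytes overall (ignoring one trailing dot)
/// and 1..=63 bytes per label.
fn verify_dns_length(ascii_domain: &str, allow_trailing_dot: bool) -> bool {
    // Split off a single trailing dot (the empty root label), if present.
    let (name, has_trailing_dot) = match ascii_domain.strip_suffix('.') {
        Some(rest) => (rest, true),
        None => (ascii_domain, false),
    };
    if has_trailing_dot && !allow_trailing_dot {
        return false;
    }
    let total_ok = (1..=253).contains(&name.len());
    let labels_ok = name.split('.').all(|label| (1..=63).contains(&label.len()));
    total_ok && labels_ok
}

fn main() {
    for domain in ["example.com", "example.com.", "a..b"] {
        println!(
            "{domain:?}: lenient={} strict={}",
            verify_dns_length(domain, true),
            verify_dns_length(domain, false)
        );
    }
}
```

With `allow_trailing_dot` set to true, "example.com." passes; with it set to false, the same input is rejected, while "a..b" fails either way because of its empty label.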

hsivonen added the "topic: idna" and "editorial" (changes that do not affect how the standard is understood) labels on Feb 5, 2024

annevk mentioned this issue Nov 29, 2024

Remove redundant domain to Unicode call from valid domain #840

Merged

annevk closed this as completed in #840 Nov 29, 2024

annevk closed this as completed in da212c9 Nov 29, 2024
