core(charset audit): loosen CHARSET_HTML_REGEX and CHARSET_HTTP_REGEX #10389

mathiasbynens · 2020-02-27T08:29:54Z

Summary

While it would be overkill to implement full-blown HTML/HTTP parsers, by simply making the regular expressions case-insensitive we can reduce the number of false negatives for the charset audit.

This patch also applies some drive-by nits/simplifications.
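As a rough illustration of the false negatives mentioned above, here is a minimal sketch comparing case-sensitive and case-insensitive matching. The patterns and helper names (`STRICT_HTML`, `LOOSE_HTML`, `STRICT_HTTP`, `LOOSE_HTTP`, `hasHtmlCharset`, `hasHttpCharset`) are hypothetical simplifications for this example, not the audit's actual `CHARSET_HTML_REGEX` and `CHARSET_HTTP_REGEX`.

```javascript
// Hypothetical, simplified patterns; the audit's real regexes may differ.
const STRICT_HTML = /<meta[^>]+charset[^<]+>/;
const LOOSE_HTML = /<meta[^>]+charset[^<]+>/i; // same pattern, plus the `i` flag
const STRICT_HTTP = /charset\s*=\s*[a-z0-9\-_:.()]{2,}/;
const LOOSE_HTTP = /charset\s*=\s*[a-z0-9\-_:.()]{2,}/i;

// Returns true if the markup appears to contain a <meta> charset declaration.
function hasHtmlCharset(html, regex = LOOSE_HTML) {
  return regex.test(html);
}

// Returns true if a Content-Type header value appears to declare a charset.
function hasHttpCharset(contentType, regex = LOOSE_HTTP) {
  return regex.test(contentType);
}

// Valid declarations that differ from the lowercase form only in letter case.
const upperMeta = '<META CHARSET="UTF-8">';
const upperHeader = 'text/html; Charset=UTF-8';

console.log(hasHtmlCharset(upperMeta, STRICT_HTML)); // false (a false negative)
console.log(hasHtmlCharset(upperMeta, LOOSE_HTML)); // true
console.log(hasHttpCharset(upperHeader, STRICT_HTTP)); // false (a false negative)
console.log(hasHttpCharset(upperHeader, LOOSE_HTTP)); // true
```

HTML tag and attribute names and the HTTP `charset` parameter are case-insensitive, so the `i` flag accepts these spellings without needing a real parser.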

Related Issues/PRs

Ref. #10023, #10284.


patrickhulce

LGTM, thanks @mathiasbynens!

patrickhulce · 2020-02-27T15:28:54Z

@Beytoven everything looks good to you too? :)

mathiasbynens · 2020-02-27T15:55:47Z

Longer term, it’d be nice to limit the audit to check for UTF-8 specifically. There’s no good reason to use any other encoding nowadays, and Lighthouse could help guide developers towards the well-lit path here. (As a bonus, this would simplify the implementation/these regular expressions as well.)
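To make the simplification concrete, here is a minimal sketch of what a UTF-8-only check could look like. It is an assumption about one possible shape, not a proposed implementation: `UTF8_HTML`, `UTF8_HTTP`, and `declaresUtf8` are hypothetical names, and the patterns only recognise the `utf-8` and `utf8` labels.

```javascript
// Hypothetical UTF-8-only patterns; a sketch, not a full HTML/HTTP parser.
// Covers <meta charset="utf-8"> and the http-equiv form with content="...; charset=utf-8".
const UTF8_HTML = /<meta[^>]+charset\s*=\s*["']?\s*utf-?8\b/i;
// Covers a Content-Type header value such as "text/html; charset=UTF-8".
const UTF8_HTTP = /charset\s*=\s*"?utf-?8\b/i;

// Returns true if either the markup or the Content-Type header declares UTF-8.
function declaresUtf8({html = '', contentType = ''}) {
  return UTF8_HTML.test(html) || UTF8_HTTP.test(contentType);
}

console.log(declaresUtf8({html: '<META CHARSET=UTF-8>'})); // true
console.log(declaresUtf8({contentType: 'text/html; charset=utf-8'})); // true
console.log(declaresUtf8({html: '<meta charset="iso-8859-1">'})); // false
console.log(declaresUtf8({contentType: 'text/html; charset=Shift_JIS'})); // false
```

Because only one encoding is accepted, the patterns no longer need a character class broad enough to cover every possible encoding name.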

paulirish

we always love driveby PRs/issues from @mathiasbynens. thank you!

@Beytoven we'll wait for your +1 before merging

Beytoven · 2020-02-27T19:59:01Z

LGTM!

> Longer term, it’d be nice to limit the audit to check for UTF-8 specifically. There’s no good reason to use any other encoding nowadays, and Lighthouse could help guide developers towards the well-lit path here. (As a bonus, this would simplify the implementation/these regular expressions as well.)

We thought about checking for a specific encoding, but at the time couldn't find any solid source on which ones are allowed/used on the web. Hence our regex covers all known encodings. @paulirish recalled some guy making a fuss about why his site was better off not using utf-8, so we figured it was safest not to enforce a specific encoding. Perhaps this is something to be discussed further.

patrickhulce · 2020-02-27T20:56:46Z

Yes there was someone very pissed about our utf-8 check which is a hilarious read if you have some time :)

#9660

connorjclark · 2020-02-27T21:23:18Z

nit: in the future, @Beytoven, make sure the bit inside core(....) uses the audit's id.

@paulirish is it possible to add this to commit lint?

connorjclark · 2020-02-27T21:25:57Z

@mathiasbynens I am curious how you use Lighthouse. do you sync a local master copy and always use the latest hotness? :) I ask b/c this audit just landed in master and you're so quick with the fixes

mathiasbynens · 2020-03-02T16:16:53Z

> Yes there was someone very pissed about our utf-8 check which is a hilarious read if you have some time :)
>
> #9660

Reading the report, it sounds like that person was misguided. And understandably so; encodings are complicated, and without guidance it can be difficult to make a choice and understand the ramifications. IMHO it'd be great if Lighthouse could guide developers towards UTF-8 here.

> We thought about checking for specific encoding but at the time couldn't find any solid source on which are allowed/used on the web.

FWIW, https://www.w3.org/International/questions/qa-html-encoding-declarations recommends the following:

> You should always use the UTF-8 character encoding.

As to which encodings are allowed on the web, that's a different question. @annevk et al. did lots of research on this and captured it in the Encoding Standard. @annevk, do you have an opinion on whether or not Lighthouse should recommend UTF-8 over other encodings? cc @hsivonen

> @mathiasbynens I am curious how you use Lighthouse. do you sync a local master copy and always use the latest hotness? :) I ask b/c this audit just landed in master and you're so quick with the fixes

I'm keeping track of issues (on GitHub) that are relevant to my interests :)

annevk · 2020-03-02T16:27:16Z

https://hsivonen.fi/label-utf-8/ is a good summary of why you want to always use UTF-8. The HTML and Encoding Standards also mandate it.

hsivonen · 2020-03-03T09:29:28Z

I think it makes sense to recommend UTF-8 at least if the page has a form on it, but like Anne said, the standards require authors to use it without such conditions.

(Also, for more complex sites that might transport user-provided content in JavaScript string literals, multiple legacy CJK encodings are a self-XSS risk: whatwg/encoding#171. And as noted on the page Anne linked to, UTF-16[BE|LE] is a self-XSS risk with user-provided content even just in HTML.)

mathiasbynens requested a review from a team as a code owner February 27, 2020 08:29

vercel bot deployed to Preview February 27, 2020 08:29

googlebot added the cla: yes label Feb 27, 2020

patrickhulce approved these changes Feb 27, 2020


patrickhulce requested a review from Beytoven February 27, 2020 15:28

patrickhulce assigned Beytoven Feb 27, 2020

patrickhulce added the waiting4reviewer label Feb 27, 2020

paulirish reviewed Feb 27, 2020


Beytoven approved these changes Feb 27, 2020


Beytoven merged commit 7b5051e into GoogleChrome:master Feb 27, 2020

mathiasbynens deleted the CHARSET_HTML_REGEX branch March 2, 2020 16:09
