-
Notifications
You must be signed in to change notification settings - Fork 9.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core(charset audit): loosen CHARSET_HTML_REGEX and CHARSET_HTTP_REGEX #10389
core(charset audit): loosen CHARSET_HTML_REGEX and CHARSET_HTTP_REGEX #10389
Conversation
While it would be overkill to implement full-blown HTML/HTTP parsers, by simply making the regular expressions case-insensitive we can reduce the amount of false negatives for the charset audit. This patch also applies some drive-by nits/simplifications. Ref. GoogleChrome#10023, GoogleChrome#10284.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @mathiasbynens!
@Beytoven everything looks good to you too? :) |
Longer term, it’d be nice to limit the audit to check for UTF-8 specifically. There’s no good reason to use any other encoding nowadays, and Lighthouse could help guide developers towards the well-lit path here. (As a bonus, this would simplify the implementation/these regular expressions as well.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we always love driveby PRs/issues from @mathiasbynens. thank you!
@Beytoven we'll wait for your +1 before merging
LGTM!
We thought about checking for specific encoding but at the time couldn't find any solid source on which are allowed/used on the web. Hence, why our regex covers all known encoding. @paulirish recalled some guy making a fuss about why his site was better off not using utf-8 and so we figured it was safest to not enforce a specific encoding. Perhaps this is something to be discussed further. |
Yes there was someone very pissed about our utf-8 check which is a hilarious read if you have some time :) |
nit: in the future @Beytoven make sure the @paulirish is it possible to add this to commit lint? |
@mathiasbynens I am curious how you use Lighthouse. do you sync a local master copy and always use the latest hotness? :) I ask b/c this audit just landed in master and you're so quick with the fixes |
Reading the report, it sounds like that person was misguided. And understandably so; encodings are complicated, and without guidance it can be difficult to make a choice and understand the ramifications. IMHO it'd be great if Lighthouse could guide developers towards UTF-8 here.
FWIW, https://www.w3.org/International/questions/qa-html-encoding-declarations recommends the following:
As to which encodings are allowed on the web, that's a different question. @annevk et al did lots of research on this and captured it in the Encoding Standard. @annevk, do you have an opinion on whether or not Lighthouse should recommend UTF-8 over other encodings? cc @hsivonen
I'm keeping track of issues (on GitHub) that are relevant to my interests :) |
https://hsivonen.fi/label-utf-8/ is a good summary of why you want to always use UTF-8. The HTML and Encoding Standards also mandate it. |
I think it makes sense to recommend UTF-8 at least if the page has a form on it, but like Anne said, the standards require authors to use it without such conditions. (Also, for more complex sites that might transport user-provided content in JavaScript string literals, multiple legacy CJK encodings are a self-XSS risk: whatwg/encoding#171 . And as noted on the page Anne linked to UTF-16[BE|LE] is a self-XSS risk with user-provided content even just in HTML.) |
Summary
While it would be overkill to implement full-blown HTML/HTTP parsers, by simply making the regular expressions case-insensitive we can reduce the amount of false negatives for the charset audit.
This patch also applies some drive-by nits/simplifications.
Related Issues/PRs
Ref. #10023, #10284.