Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core(charset audit): loosen CHARSET_HTML_REGEX and CHARSET_HTTP_REGEX #10389

Merged
merged 1 commit into from
Feb 27, 2020

Conversation

mathiasbynens
Copy link
Member

Summary

While it would be overkill to implement full-blown HTML/HTTP parsers, by simply making the regular expressions case-insensitive we can reduce the amount of false negatives for the charset audit.

This patch also applies some drive-by nits/simplifications.

Related Issues/PRs

Ref. #10023, #10284.

While it would be overkill to implement full-blown HTML/HTTP parsers, by simply making the regular expressions case-insensitive we can reduce the amount of false negatives for the charset audit.

This patch also applies some drive-by nits/simplifications.

Ref. GoogleChrome#10023, GoogleChrome#10284.
Copy link
Collaborator

@patrickhulce patrickhulce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @mathiasbynens!

@patrickhulce
Copy link
Collaborator

@Beytoven everything looks good to you too? :)

@mathiasbynens
Copy link
Member Author

Longer term, it’d be nice to limit the audit to check for UTF-8 specifically. There’s no good reason to use any other encoding nowadays, and Lighthouse could help guide developers towards the well-lit path here. (As a bonus, this would simplify the implementation/these regular expressions as well.)

Copy link
Member

@paulirish paulirish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we always love driveby PRs/issues from @mathiasbynens. thank you!

@Beytoven we'll wait for your +1 before merging

@Beytoven
Copy link
Contributor

LGTM!

Longer term, it’d be nice to limit the audit to check for UTF-8 specifically. There’s no good reason to use any other encoding nowadays, and Lighthouse could help guide developers towards the well-lit path here. (As a bonus, this would simplify the implementation/these regular expressions as well.)

We thought about checking for specific encoding but at the time couldn't find any solid source on which are allowed/used on the web. Hence, why our regex covers all known encoding. @paulirish recalled some guy making a fuss about why his site was better off not using utf-8 and so we figured it was safest to not enforce a specific encoding. Perhaps this is something to be discussed further.

@Beytoven Beytoven merged commit 7b5051e into GoogleChrome:master Feb 27, 2020
@patrickhulce
Copy link
Collaborator

Yes there was someone very pissed about our utf-8 check which is a hilarious read if you have some time :)

#9660

@connorjclark
Copy link
Collaborator

connorjclark commented Feb 27, 2020

nit: in the future @Beytoven make sure the core(....) inside bit uses the audit's id.

@paulirish is it possible to add this to commit lint?

@connorjclark
Copy link
Collaborator

@mathiasbynens I am curious how you use Lighthouse. do you sync a local master copy and always use the latest hotness? :) I ask b/c this audit just landed in master and you're so quick with the fixes

@mathiasbynens mathiasbynens deleted the CHARSET_HTML_REGEX branch March 2, 2020 16:09
@mathiasbynens
Copy link
Member Author

Yes there was someone very pissed about our utf-8 check which is a hilarious read if you have some time :)

#9660

Reading the report, it sounds like that person was misguided. And understandably so; encodings are complicated, and without guidance it can be difficult to make a choice and understand the ramifications. IMHO it'd be great if Lighthouse could guide developers towards UTF-8 here.

We thought about checking for specific encoding but at the time couldn't find any solid source on which are allowed/used on the web.

FWIW, https://www.w3.org/International/questions/qa-html-encoding-declarations recommends the following:

You should always use the UTF-8 character encoding.

As to which encodings are allowed on the web, that's a different question. @annevk et al did lots of research on this and captured it in the Encoding Standard. @annevk, do you have an opinion on whether or not Lighthouse should recommend UTF-8 over other encodings? cc @hsivonen

@mathiasbynens I am curious how you use Lighthouse. do you sync a local master copy and always use the latest hotness? :) I ask b/c this audit just landed in master and you're so quick with the fixes

I'm keeping track of issues (on GitHub) that are relevant to my interests :)

@annevk
Copy link

annevk commented Mar 2, 2020

https://hsivonen.fi/label-utf-8/ is a good summary of why you want to always use UTF-8. The HTML and Encoding Standards also mandate it.

@hsivonen
Copy link

hsivonen commented Mar 3, 2020

I think it makes sense to recommend UTF-8 at least if the page has a form on it, but like Anne said, the standards require authors to use it without such conditions.

(Also, for more complex sites that might transport user-provided content in JavaScript string literals, multiple legacy CJK encodings are a self-XSS risk: whatwg/encoding#171 . And as noted on the page Anne linked to UTF-16[BE|LE] is a self-XSS risk with user-provided content even just in HTML.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants