-
Notifications
You must be signed in to change notification settings - Fork 342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More CORB-protected MIME types - safelist-based approach #721
Comments
No good reason for not covering other MIME types - covering only HTML/XML/JSON was an outcome of trying to protect the most resources likely to contain sensitive data while at the same time minimizing the risk of breaking existing websites. I very much agree that CORB should be extended to cover more resources, although I think we'd like to wait with this until HTML/XML/JSON implementation from Chromium reaches Chrome stable channel and doesn't uncover any worrisome issues. I think we have the following options for extending CORB to other MIME types:
PS. I am not sure I understand the |
@anforowicz |
I think the safelist approach is the way forward. That way, new APIs that wanted to use |
I think we should forbid no-cors for new APIs (we already held the line for module scripts). Safelist would be great, but it's a little unclear to me how feasible that is. |
TIL :-) In that case I don't have concerns with including text/webvtt on the blacklist, next to HTML/XML/JSON/etc.
I agree. From Chrome side we can probably use https://crbug.com/802836 for tracking. Before we proceed I think we need to:
Also - if (for whatever reason) we decide to stick with the blacklist approach, then how can we decide what MIME types to cover in addition to HTML/XML/JSON? So far I think I've heard:
|
A good start on the safelist approach would be to list all the APIs that can consume opaque responses. |
I think fonts are CORS-bound and therefore do not need to be considered, but perhaps not in Safari still. However, all of the above are pretty notorious for being served with the wrong MIME type. Perhaps it's slightly better if require |
So the safelist approach would be:
In cases where the destination doesn't filter the sniffing algorithms (eg
Yeah they're CORS requests according to the spec. If Safari's still violating the spec there, they'll probably have to violate/extend the spec here too, to add font mime type checking and content sniffing. |
"destination" might not always be available (not sure if the "network service" in Chromium will be aware of it - I think it tries to shy away from any content::ResourceType knowledge) or trustworthy (the renderer can just lie). Therefore, I'd rather avoid using the "destination" for a decision here.
Currently CORB only sniffs for HTML/XML/JSON to confirm that blocking is okay (sniffing for blocked resources). I think this is a fine approach going forward as well (we can discuss whether similar confirmation sniffing is needed for some other types like PDF or ZIP - I think the answer is "no - not needed" because mislabeling of an image or script as PDF should be hopefully rare). I don't understand why we'd switch this approach to sniffing for allowed resources:
Good to know. This possibly makes things slightly easier.
Yes :-( The proposed restriction of safelist-based blocking to |
I think it's fine to block future (image) formats. CORS can be used for those. |
That step's more of an early-exit. It'll also be handled in the later steps in case of
Isn't it better to further limit the amount of no-cors data that can end up in the process? It just seems like no-cors is a source of so many security issues, so trying to restrict it as much as possible seems like a good thing. |
I thought that I'd mention that @arturjanc and @mikewest published https://www.arturjanc.com/cross-origin-infoleaks.pdf where they list "doesn't protect all resources" as one of major cons of the current CORB algorithm. |
I agree that it is better to limit the about of no-cors data that can end up in a cross-origin process. I don't understand why above you are proposing to use "image sniffing algorithms" rather than continue using the current CORB approach of only doing confirmation sniffing. In other words, I don't understand what is wrong with the following sniffing scheme for CORB:
|
CORB is an additional filter for responses of cross-origin "no-cors" fetches. It aims to provide defense-in-depth protection for JSON, HTML, XML (though not image/svg+xml), and (sometimes) text/plain resources against cross-process CPU exploits. It also makes it harder to use incorrectly labeled resources as scripts, images, fonts, etc. Discussion and further work is tracked by #681 and #721. Tests are in web-platform-tests's fetch/corb directory.
@anforowicz if you want to switch to a safelist-based approach, you wouldn't want to confirm something is HTML (which is not allowed anyway), but rather you'd want to confirm something is an image (which is allowed). I was thinking of splitting up this issue earlier. Create a new issue for how to go about switching to a safelist. And keep this issue for extending the current approach. That might help. |
I guess there might have been a misunderstanding of what each of us means by a "safelist-based approach" - possibly I didn't look carefully enough at @jakearchibald's earlier comment with a more detailed proposal of what is meant by "safelist-based approach". I think there are 2 options of introducing a "safelist-based approach":
I think @annevk and @jakearchibald are proposing Option #2. I think that Option #1 might be easier to implement (with a well-defined, small set of formats to sniff for).
I am not opposed, although I do note that we've accumulated quite a bit of safelist discussion here already... |
Yeah, I'm suggesting option 2. |
I agree that a "safelist"-based approach is correct for the long-term, but it may take some time to implement (considering the non-negligible risk of breaking existing websites). So, I've opened #860 to discuss if in the short-term it might make sense to add extra individual MIME types to the list of types protected by CORB. @annevk - I wonder if you could maybe change the subject of the current issue (#721) to something like "More CORB-protected MIME types - 'safelist'-based approach"? I couldn't figure out how to change the subject myself (maybe I cannot do it as I am not the reporter?)... :-/ |
@csreis and me discussed this yesterday and I'd like to share some more thoughts about the "safelist"-based approach (having CORB block everything other than known safe types like images/audio/video/javascript/stylesheets). These are just more notes about options 1 and 2 from #721 (comment):
|
See also #870. There's many Content-Type values that'll be happily parsed as script. For option 2:
|
For option 2 (sniffing) @acolwell also points out that sniffing is challenging for range requests (which have to be supported for media content). FWIW, the current CORB implementation treats responses to range requests as if the And I would also like to point out that sniffing should ideally be based on the first X bytes of the response (right not Chromium sniffing only looks at most at the first 1024 bytes to avoid increasing the latency). |
What effect does |
I chatted with @jakearchibald and I'm no longer convinced range requests are much of an issue. It seems we first always determine the type of a media resource and then any range requests will go to a fixed URL, meaning it'll only potentially "leak" content of a resource identified to be a media resource. See also #144 (comment). The main thing remaining here is an exact list of file signatures we would want to safelist. And not allow extensions of that list (use CORS). @lukewagner also considers a basic JavaScript-validating function to be doable (if you don't want the performance hit, use a valid Which leaves CSS, getting telemetry on that would be great. |
The URL can change as long as all data is visible. Some research on this #145 (comment). But yeah, response URLs cannot be mixed if data is opaque. And opaque data cannot be mixed with visible data. That's what casued https://jakearchibald.com/2018/i-discovered-a-browser-bug/. |
So, is your proposal to add some state to CORB, so that range requests are handled like this:
Did I get that right? |
Yeah, due to range responses it's not a clean filter, but a filter and some state about prior actions. |
In order to 'unlock' opaque responses that start at a range other than 0, Fetch could require the caller to provide the response for the first chunk so it can be validated. Kinda similar to https://wicg.github.io/background-fetch/#validate-partial-response-algorithm. |
A safelisted MIME type in a |
One thing that might need some clarification is how the state's lifetime is managed. If all previous sniff results need to be stored, then there will be unbounded memory growth. OTOH, if we discard previous sniff results, then we risk that a range request issued after waking up from a long sleep will be incorrectly blocked by CORB. |
Actually, I guess that we can have separate CORB state for each execution context (frame, service worker, etc.) and discard CORB state when the frame goes away. So maybe the memory growth concerns are not a big isssue. And some other heuristics can also help in a non-malicious case (e.g. only storing state for range-requestable resources - for audio and video, but maybe not for javascript, stylesheets and images). I am not sure if there is something that can be done to prevent unbounded memory growth in a malicious case - maybe we can keep at most X entries in the cache and purge least recently used ones when overflowing the threshold. |
Range requests are only a problem for audio/video at the moment I think, but there is some model where we could tie things to the element doing the requests that would minimize unbounded memory growth. I suspect tying it to the lifetime of the fetch group is more reasonable though. I sketched out the algorithm and necessary state below: A user agent has a no-CORS-safelisted requesters set. A request has an associated no-CORS media identifier (null or an opaque identifier). Null unless explicitly stated otherwise. [The idea here is that the no-CORS media identifier is owned by the media element (audio/video only; I'm assuming we won't do range for images without at least requiring MIME types at this point). As part of the element being GC'd, it would send a message to get all the relevant entries from the user agent's no-CORS-safelisted requesters set removed. There might be better strategies available here and it's not clear to me to what extent we need to specify this, but it's probably good to have a model that does not leak memory forever so the set needs to be keyed to something. The fetch group might also be reasonable.] A no-CORS-safelisted media MIME type is ... (audio, video). A no-CORS-safelisted non-media MIME type is ... (image, text/css, JavaScript MIME types). To determine whether to allow response response to a request request, run these steps:
Note: user agents are encouraged to optimize the above algorithm, by taking into account the context where the request originated from. |
I'm not sure I completely understand all the issues here so I apologize if I am missing something. I just wanted to point out that range requests are not only limited to requests initiated by a audio/video tag. Web applications that use Media Source Extensions (MSE) typically use XHR and fetch to get media data and will use range requests to fetch media segments from the middle of a file. I think in most sane scenarios @jakearchibald suggestion of referencing a request that accesses the beginning of the file could work because these applications usually access the beginning of the file to get the "initialization segment" for initializing codecs. There are some "advanced" scenarios where this initialization data might be passed down embedded in the HTML/JS to "save a round trip" and in that case it is likely that an XHR or fetch with a Content-Range would occur on a URL w/o accessing the beginning of the resource. I understand we probably can't accommodate all scenarios but I just wanted to point this out as a thing applications might do. Another thing I noticed while trying to follow this discussion is that the "audio or video type pattern matching algorithm" might not properly detect Basic Media Segments(http://wiki.webmproject.org/adaptive-streaming/webm-dash-specification) as media because they don't contain the EBML header that the sniffing logic is looking for. These types of segment are used to avoid the overhead of resending the initialization segment in each segment request. IIUC though this sniffing only occurs if the Content-Type isn't set correctly so I guess it is ok if this fails. It might encourage sites to fix their Content-Type. |
I think this is the right approach. In the case of opaque data, all responses need to come from the same place. The only thing that can assert this is the API that processes the data (eg the Since |
@acolwell in the case of MSE via |
whatwg/fetch#721 (comment) with some lessons learned.
whatwg/fetch#721 (comment) with some lessons learned. Co-Authored-By: Jake Archibald <[email protected]> Co-Authored-By: Nathan Froyd <[email protected]>
I created https://github.com/annevk/orb to flush out the idea above more concretely. The challenges to me seem:
Thoughts on those more than appreciated! |
There is now a PR for this in #1442. |
@anforowicz @csreis did you do an exhaustive evaluation of MIME types? E.g., any reason
text/webvtt
isn't there? We also blockedtext/csv
for scripts (standardized and in Firefox at least, not sure if Chrome has fixed that yet), can we block that MIME type for CORB?The text was updated successfully, but these errors were encountered: