Support IPNI+Carpark indexing from the client #24

Closed
Tracked by #54
hannahhoward opened this issue Mar 14, 2024 · 16 comments

@hannahhoward
Member

hannahhoward commented Mar 14, 2024

What

We should support an ipni/offer action that triggers CARPark indexing + IPNI publishing. For now, this would not include the work described in #10, and we would continue to use the CARPark database; we would just trigger new indexing actions from the client rather than from a bucket event.

@hannahhoward hannahhoward changed the title Trigger IPNI+Carpark indexing from client Support IPNI+Carpark indexing from the client Mar 14, 2024
@hannahhoward hannahhoward moved this from Inbox to In Progress in Storacha Project Planning Mar 25, 2024
@hannahhoward hannahhoward moved this from In Progress to Sprint Backlog in Storacha Project Planning Mar 25, 2024
@hannahhoward
Member Author

hannahhoward commented Mar 27, 2024

Questions about feasibility:

  1. The idea of this ticket is NOT to implement any new content claims work or new IPNI pathways.
  2. The idea is simply to create an invocation that triggers the server code for e-IPFS indexing and subsequent IPNI publishing from the client.
  3. There's a strong argument to separate from Completing content claims integration work with IPNI #10 and @gammazero 's work because that work is complex and may take a while. More importantly @gammazero's work affects the read pipeline, and may have performance implications for that pipeline. Whereas if we can just ship this ticket for Write to R2, that scopes the effects to only the write pipeline. (and shipping just write to R2 has major cost savings)
  4. BUT, this doesn't make sense to do separately from Completing content claims integration work with IPNI #10 if it's extremely hard to do. Moreover, it means we end up with a second separate upgrade step, so that adds some overhead.

Strawman proposal: this invocation should simply be called eipfs/offer to distinguish it as a temporary pathway. When the IPNI work completes, eipfs/offer could continue to function for legacy clients until we deprecate it.

In my opinion, what is needed to finalize this decision is an estimate of how hard triggering the e-ipfs indexing process from a client invocation will be, from someone who understands this code well. If it's a lot, we can cancel this ticket and push it later.

I'd like to review this in sprint planning tomorrow and come to final agreement.

Tagging relevant folks who may be able to estimate this well -- @alanshaw @vasco-santos @Gozala

Also tagging @reidlw since this would affect the iteration if it turns out we really can't separate the IPNI work from write to R2.

@vasco-santos

In my opinion, what is needed to finalize this decision is an estimate of how hard triggering the e-ipfs indexing process from a client invocation will be, from someone who understands this code well.

Today the bucket event puts the info in a queue that the E-IPFS indexing handlers consume from. @alanshaw knows way better than me here, but while the code described here is quite easy to put together, I think it won't work out of the box. IIRC E-IPFS expects things to go to specific configured S3 buckets, which would not happen in this case, since writes go to R2.

I would also call out that calling this something different from ipni/offer commits us to supporting old paths and deprecations, which I would prefer to avoid.

My understanding is that ipni/offer is sent by the client with the block information, and the w3s handler would do the publishing to IPNI without relying on any of the old E-IPFS things. I have been out of the loop on @gammazero's line of work. I first expected that this would be what was happening, but then I heard it was some next-step thing. However, I think we need to have some sort of ipni/offer from the client and hook the handler up with IPNI.

@hannahhoward
Member Author

hannahhoward commented Mar 27, 2024

@vasco-santos got it. so I'm hearing that the client has to do the block indexing for any of this to work?

@vasco-santos

@hannahhoward that is my understanding. Alternatively, we could put together a small service in Cloudflare Workers that we call to index the data (kind of analogous to what the E-IPFS lambdas do), but that seems a distraction from shipping the client thing.

@Gozala

Gozala commented Mar 27, 2024

Here is the plan I wanted to propose:

  1. We implement the client-side part of https://github.com/web3-storage/RFC/blob/main/rfc/ipni-w3c.md, that is, the client can create a set of inclusion proofs, combine them with the location claim received, and then invoke ipni/offer, passing both as a claims bundle.
  2. We implement an ipni/offer capability handler that decodes the claims bundle and feeds it into https://github.com/elastic-ipfs/publisher-lambda to incorporate them into IPNI advertisements.

This way we do all of the client side work that would be necessary for shipping https://github.com/web3-storage/RFC/blob/main/rfc/ipni-w3c.md and in the future can rewrite ipni/offer capability handler so that it publishes to IPNI in a different format.
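
As a rough illustration of the client side of this plan, here is a minimal TypeScript sketch of what a claims bundle and the resulting ipni/offer invocation could look like. Every type and field name below (ClaimsBundle, InclusionClaim, IndexedBlock, offset, length, buildOffer, and the string encoding of multihashes) is an assumption made for illustration; the actual shapes are defined by the RFC linked above, not by this sketch.

```typescript
// All names below are hypothetical; they are not taken from the RFC or any client library.
type Multihash = string; // string-encoded multihash, kept simple for the sketch

interface LocationClaim {
  content: Multihash; // multihash of the uploaded blob (for example a CAR)
  location: string; // URL the blob can be fetched from
}

interface InclusionClaim {
  content: Multihash; // multihash of the containing blob
  includes: Multihash; // multihash of a block inside that blob
  offset: number; // assumed: byte offset of the block within the blob
  length: number; // assumed: byte length of the block
}

interface ClaimsBundle {
  location: LocationClaim;
  inclusions: InclusionClaim[];
}

interface OfferInvocation {
  can: "ipni/offer";
  with: string; // assumed: DID of the space the blob was uploaded to
  nb: { bundle: ClaimsBundle };
}

// What the client learns about each block while indexing the blob locally.
interface IndexedBlock {
  multihash: Multihash;
  offset: number;
  length: number;
}

// Combine the location claim with one inclusion claim per indexed block.
function buildOffer(
  space: string,
  location: LocationClaim,
  blocks: IndexedBlock[]
): OfferInvocation {
  const inclusions = blocks.map((block): InclusionClaim => ({
    content: location.content,
    includes: block.multihash,
    offset: block.offset,
    length: block.length,
  }));
  return { can: "ipni/offer", with: space, nb: { bundle: { location, inclusions } } };
}
```

The handler side could later change how it publishes without touching this structure, which is the point made above about being able to rewrite the ipni/offer handler in the future.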

At the moment there are a couple of unknowns I'd like to flag:

  1. The currently deployed system builds an index, so we can trust that claims are correct. That is not going to be the case if we let the client submit claims, which leaves us with the following choices:
    i. Trust that the client reports them correctly, as they are incentivized to do so, but be exposed to attacks where a malicious actor may make invalid claims, potentially misleading claims consumers.
    ii. Verify that client-reported claims are valid (which is not going to be free) and only accept them if they are valid. And make that validation runnable next to the content (that is, in CF if the content is in R2, or in AWS if it is in S3).
    iii. Trust but verify, which is what I think we want to do long term. Specifically, we can publish claims as is, but then lazily verify them when they are consumed and republish with our DID to avoid double verification.
      • This is the best option, because computation costs are only incurred if claims are used, as opposed to incurring them regardless.
  2. I think there are two options to feed things into https://github.com/elastic-ipfs/publisher-lambda
    i. Push every multihash from the claims bundle into the SQS queue it pulls from. That way they will get aggregated with all the others and be published together.
    ii. Push the CID of the bundle (or a CAR derived from it) directly into the advertisement queue, which will (as I understand it) produce separate advertisements per bundle as opposed to aggregating it with others. This might affect the performance profile, and per @hannahhoward's point we're probably better off avoiding it now so we can make such a change after measuring the impact.
  3. I believe @vasco-santos mentioned today in the call that hoverboard uses some records from dynamo when serving blocks. I suspect we need to create those records from the ipni/offer capability handler, but at the moment I'm not sure what the records are or where they live; some investigation is necessary here.
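
The "trust but verify" choice could be sketched roughly as follows. This is a minimal in-memory illustration, not the deployed system: the Claim shape, the LazyClaimStore class, and the injected verify function are all hypothetical, and a real verifier would need to read the content to check the claim.

```typescript
// Hypothetical claim shape for illustration only.
interface Claim {
  issuer: string; // DID of whoever signed the claim
  content: string; // multihash of the containing blob
  includes: string; // multihash of the block claimed to be inside it
}

type Verify = (claim: Claim) => boolean;

class LazyClaimStore {
  private readonly claims = new Map<string, Claim>();
  private readonly serviceDid: string;
  private readonly verify: Verify;

  constructor(serviceDid: string, verify: Verify) {
    this.serviceDid = serviceDid;
    this.verify = verify;
  }

  // Publish as is: no verification cost is paid at write time.
  publish(claim: Claim): void {
    this.claims.set(claim.includes, claim);
  }

  // Verify on first use, then re-issue under the service DID so that
  // later reads of the same claim skip verification.
  consume(includes: string): Claim | undefined {
    const claim = this.claims.get(includes);
    if (claim === undefined) return undefined;
    if (claim.issuer === this.serviceDid) return claim; // already verified by us
    if (!this.verify(claim)) {
      this.claims.delete(includes); // drop a claim that failed verification
      return undefined;
    }
    const verified: Claim = { ...claim, issuer: this.serviceDid };
    this.claims.set(includes, verified);
    return verified;
  }
}
```

With this shape, a claim that is never consumed never costs any verification work, and a claim consumed many times is verified only once.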

@Gozala

Gozala commented Mar 27, 2024

The currently deployed system builds an index, so we can trust that claims are correct. That is not going to be the case if we let the client submit claims, which leaves us with the following choices:

  i. Trust that the client reports them correctly, as they are incentivized to do so, but be exposed to attacks where a malicious actor may make invalid claims, potentially misleading claims consumers.
  ii. Verify that client-reported claims are valid (which is not going to be free) and only accept them if they are valid. And make that validation runnable next to the content (that is, in CF if the content is in R2, or in AWS if it is in S3).
  iii. Trust but verify, which is what I think we want to do long term. Specifically, we can publish claims as is, but then lazily verify them when they are consumed and republish with our DID to avoid double verification.
  • This is the best option, because computation costs are only incurred if claims are used, as opposed to incurring them regardless.

While writing this it occurred to me that we could reduce the verification burden if we were to embrace blake3 & inclusion proofs, or just the BAO structure as described in github.com/storacha/RFC/pull/8. Specifically, clients could produce claims with inclusion proofs that could be verified without having to fetch the content and run compute over it, and consequently would not require running it next to where the data is. I personally would be very interested in exploring this approach even if it may imply more engineering effort to ship this, because overall it would be a bigger win.

@Gozala

Gozala commented Mar 27, 2024

While writing this it occurred to me that we could reduce the verification burden if we were to embrace blake3 & inclusion proofs, or just the BAO structure as described in github.com/web3-storage/RFC/pull/8. Specifically, clients could produce claims with inclusion proofs that could be verified without having to fetch the content and run compute over it, and consequently would not require running it next to where the data is. I personally would be very interested in exploring this approach even if it may imply more engineering effort to ship this, because overall it would be a bigger win.

I'm realizing now that we probably don't have blake3 support in pre-signed URLs, so we may not be able to do this as easily as I have claimed. We would still need to verify that the sha-256 hash matches blake3 and then do all the inclusion claims with blake3. Or we make w3/blob/verify perform the actual verification.

@hannahhoward
Member Author

hannahhoward commented Mar 27, 2024

Responses to unknowns:

  1. The currently deployed system builds an index, so we can trust that claims are correct. That is not going to be the case if we let the client submit claims, which leaves us with the following choices:

    i. Trust that the client reports them correctly, as they are incentivized to do so, but be exposed to attacks where a malicious actor may make invalid claims, potentially misleading claims consumers.
    ii. Verify that client-reported claims are valid (which is not going to be free) and only accept them if they are valid. And make that validation runnable next to the content (that is, in CF if the content is in R2, or in AWS if it is in S3).
    iii. Trust but verify, which is what I think we want to do long term. Specifically, we can publish claims as is, but then lazily verify them when they are consumed and republish with our DID to avoid double verification.

My inclination for now is to do option i, and then do iii as a follow-on to shipping. Again, we are at max scope on this. I think the odds of a third-party actor figuring out the claims system and maliciously breaking it in a 1-2 month timespan are low. Again, a rational user is highly incentivized not to behave badly; only a truly malicious user would want to do so.

  2. I think there are two options to feed things into https://github.com/elastic-ipfs/publisher-lambda

    i. Push every multihash from the claims bundle into the SQS queue it pulls from. That way they will get aggregated with all the others and be published together.
    ii. Push the CID of the bundle (or a CAR derived from it) directly into the advertisement queue, which will (as I understand it) produce separate advertisements per bundle as opposed to aggregating it with others. This might affect the performance profile, and per @hannahhoward's point we're probably better off avoiding it now so we can make such a change after measuring the impact.

Yes, again let's do i. Cause we're trying to ship and iterate.

  3. I believe @vasco-santos mentioned today in the call that hoverboard uses some records from dynamo when serving blocks. I suspect we need to create those records from the ipni/offer capability handler, but at the moment I'm not sure what the records are or where they live; some investigation is necessary here.

Agree this needs coverage. @alanshaw can you provide any insight on how this is done and what we might need to put in this handler?

@hannahhoward
Member Author

I would request we leave anything Blake3 or Merkle Reference related out of this. Absolutely committing to Blake3 as soon as it makes sense in the stack, but not within this set of tickets.

@hannahhoward
Member Author

Request, if possible, to have @gammazero work on the client side of this, since it's part of the IPNI RFC he built. Or at least pair/help with it. I would like @gammazero to understand this client code for the future.

@Gozala

Gozala commented Mar 27, 2024

I'm starting to think we may have another unaccounted-for problem at hand https://filecoinproject.slack.com/archives/C06EDB1NADU/p1711579957117159. Specifically, I'm under the impression that sha256 checksum verification does not appear to be available in R2, which is a major setback, as a lot in our system assumes that uploaded content will correspond to the multihash; even concurrency (of same-content uploads) is managed by it.

@gammazero

At the moment there are a couple of unknowns I'd like to flag:

  1. The currently deployed system builds an index, so we can trust that claims are correct. That is not going to be the case if we let the client submit claims, which leaves us with the following choices:

    i. Trust that the client reports them correctly, as they are incentivized to do so, but be exposed to attacks where a malicious actor may make invalid claims, potentially misleading claims consumers.
    ii. Verify that client-reported claims are valid (which is not going to be free) and only accept them if they are valid. And make that validation runnable next to the content (that is, in CF if the content is in R2, or in AWS if it is in S3).
    iii. Trust but verify, which is what I think we want to do long term. Specifically, we can publish claims as is, but then lazily verify them when they are consumed and republish with our DID to avoid double verification.

I agree with @hannahhoward here. Let's do i and extend that to iii later.

  2. I think there are two options to feed things into https://github.com/elastic-ipfs/publisher-lambda

    i. Push every multihash from the claims bundle into the SQS queue it pulls from. That way they will get aggregated with all the others and be published together.
    ii. Push the CID of the bundle (or a CAR derived from it) directly into the advertisement queue, which will (as I understand it) produce separate advertisements per bundle as opposed to aggregating it with others. This might affect the performance profile, and per @hannahhoward's point we're probably better off avoiding it now so we can make such a change after measuring the impact.

I think 1 is where we want to start. If these get aggregated with everything else, there will not be any way to delete them separately.

For 2, there would need to be new service logic that would:

  • Read claims bundle
  • Extract multihashes from inclusion claims
  • Create IPNI advertisement with multihashes and bundle CID

We will want this, but I am not sure if we need this to begin with.
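
The three steps listed above might look something like the following sketch. The ClaimsBundle and AdvertisementDraft shapes and the advertisementForBundle function are hypothetical and much simpler than a real IPNI advertisement; using the bundle CID as a context identifier is an assumption about how a bundle would be tied to its entries.

```typescript
// Hypothetical shapes; a real IPNI advertisement carries more fields than this.
interface InclusionClaim {
  content: string; // multihash of the containing blob
  includes: string; // multihash of a block inside it
}

interface ClaimsBundle {
  cid: string; // CID of the bundle itself
  inclusions: InclusionClaim[];
}

interface AdvertisementDraft {
  contextId: string; // assumed: the bundle CID identifies this set of entries
  multihashes: string[];
}

// Read the bundle, extract the multihashes from its inclusion claims,
// and pair them with the bundle CID.
function advertisementForBundle(bundle: ClaimsBundle): AdvertisementDraft {
  const seen = new Set<string>();
  for (const claim of bundle.inclusions) {
    seen.add(claim.includes); // a Set drops duplicates and keeps insertion order
  }
  return { contextId: bundle.cid, multihashes: Array.from(seen) };
}
```

Keeping one draft per bundle is what would make it possible to remove a bundle's multihashes separately later, which is the limitation of aggregation noted above.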

  3. I believe @vasco-santos mentioned today in the call that hoverboard uses some records from dynamo when serving blocks. I suspect we need to create those records from the ipni/offer capability handler, but at the moment I'm not sure what the records are or where they live; some investigation is necessary here.

We need to catalog all the consumers of Dynamo DB (CARPark) so that they can be converted to do IPNI queries.

@gammazero

Request, if possible, to have @gammazero work on the client side of this,

I need some help setting up a dev environment to test client changes.

@vasco-santos

I am also up for what folks here are suggesting:

  1. ...

Start with option i, and then go to iii. Please note that this is important, and iii should not be something that gets deprioritised. @Gozala and I talked earlier today about at least keeping in the state a "cause" property that links to where a given thing came from, so that we can check in the future whether or not it was us who computed it.

  2. ...

option i

  3. ...

I shared with @Gozala how hoverboard uses the mentioned table. A few hints for folks following:

We could not find where the schema is defined 🙄 but we can see it in the AWS dashboard. The important things are:

  • only hoverboard reads from this data today, since E-IPFS got sunset
  • hoverboard derives from the information in that table what R2 keys may exist. It tries them first, and then falls back to S3. Here we should write R2 paths and teach hoverboard to use those
  • all writes to this table should happen in https://github.com/elastic-ipfs/indexer-lambda

A note that I am not an expert on the E-IPFS + Hoverboard part, so I would love for someone to double-check these assumptions.
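
The lookup order described above (candidate R2 keys first, then S3) could be sketched like this. The BlockStore interface, the readBlock function, and the key formats are hypothetical, and the stores are modelled as synchronous in-memory maps for brevity; this is not hoverboard's actual code.

```typescript
// Hypothetical store interface; a Map<string, Uint8Array> satisfies it.
interface BlockStore {
  get(key: string): Uint8Array | undefined;
}

interface BlockHit {
  source: "r2" | "s3";
  key: string;
  bytes: Uint8Array;
}

// Try every candidate R2 key derived from the table first,
// and only fall back to S3 when none of them exists.
function readBlock(
  r2Keys: string[],
  s3Key: string,
  r2: BlockStore,
  s3: BlockStore
): BlockHit | undefined {
  for (const key of r2Keys) {
    const bytes = r2.get(key);
    if (bytes !== undefined) return { source: "r2", key, bytes };
  }
  const fallback = s3.get(s3Key);
  return fallback === undefined ? undefined : { source: "s3", key: s3Key, bytes: fallback };
}
```

If the ipni/offer handler writes R2 paths into the table, a lookup of this shape would find them without needing the S3 fallback.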

@hannahhoward
Member Author

@gammazero working on encoding the final structure in CBOR

@alanshaw
Member

@github-project-automation github-project-automation bot moved this from In Progress to Done in Storacha Project Planning May 20, 2024