Commit 24ee112: Add ADR about server side filtering
leplatrem committed Jan 3, 2025. File added: docs/adr/adr_004_server_side_filtering.md (143 additions)
# Server Side Filtering

* Status: draft
* Deciders: mleplatre, acottner, <>
* Date: Jan 3, 2025

## Context and Problem Statement

In some situations, clients only need a subset of the collection records.

Since the signature is computed on the server side over the whole dataset, clients are obliged to download the whole dataset in order to verify data integrity.

In this document, we explore possible solutions to overcome this limitation.

## Decision Drivers

In order to choose our solution, we considered the following criteria:

- **Complexity**: Low → High: how complex the solution is
- **User experience**: Bad → Good: how well the solution fits the users' needs
- **Cost of implementation**: Low → High: how much effort it represents
- **Cost of operation**: Low → High: how much the solution costs to run


## Considered Options

1. [Option 0 - Do nothing](#option-0---do-nothing)
1. [Option 1 - Split into several collections](#option-1---split-into-several-collections)
1. [Option 2 - Read-only mirrors](#option-2---read-only-mirrors)
1. [Option 3 - Implement notion of datasets](#option-3---implement-notion-of-datasets)
1. [Option 4 - Implement dynamic signatures](#option-4---implement-dynamic-signatures)

## Decision Outcome

Chosen option: Option 2, because we already have all the pieces in place; it is very low tech and low cost, while still providing a good user experience for editors and for the publication of data.

## Pros and Cons of the Options

### Option 0 - Do nothing

No change is made. All clients download everything and filter locally using JEXL targeting.

- **Complexity**: N/A
- **User experience**: Good, users edit a single collection.
- **Cost of implementation**: N/A
- **Cost of operation**: High. All clients download everything, resulting in higher bandwidth costs.
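
For illustration, a minimal sketch of such local filtering with the `pyjexl` library, assuming records carry their targeting rule in a `filter_expression` field and that the client builds its own evaluation context (the field name and the context keys are only an example here, not prescribed by this ADR):

```
# Option 0 sketch: download everything, keep only records whose JEXL
# targeting expression matches the local context. Field names and context
# keys are illustrative.
from pyjexl import JEXL

jexl = JEXL()

records = [
    {"id": "shop-A", "filter_expression": 'env.region == "europe"'},
    {"id": "shop-B", "filter_expression": 'env.region == "africa"'},
]

context = {"env": {"region": "europe"}}

# Records without a targeting expression apply to everyone.
matching = [
    r for r in records
    if not r.get("filter_expression") or jexl.evaluate(r["filter_expression"], context)
]
print([r["id"] for r in matching])  # ['shop-A']
```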


### Option 1 - Split into several collections

The single collection is split into several ones, and the client is able to determine which collection it should pull from (e.g. one per region).

If a record is required in different subsets, it is duplicated in each collection.

- **Complexity**: N/A
- **User experience**: Bad. Even if a script could automate the publication of records into several collections (see the sketch below), editors would not have a single source of truth for the collection, leading to duplicated actions and data.
- **Cost of implementation**: Low. Creating collections is cheap.
- **Cost of operation**: Low. Clients download only the subset of data, saving bandwidth.
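
As a hedged illustration of the automation mentioned above, a script could fan a record out into per-region collections with the Python `kinto_http` client. The server URL, credentials and bucket/collection names below are made up for the example:

```
# Option 1 sketch: duplicate a record into one collection per region.
# Server URL, credentials and bucket/collection names are illustrative.
from kinto_http import Client

client = Client(server_url="https://settings.example.com/v1", auth=("editor", "s3cr3t"))

record = {"id": "shop-A", "regions": ["europe", "asia", "north-america"]}

for region in record["regions"]:
    # The same record is written (duplicated) into each regional collection.
    client.update_record(data=record, bucket="main-workspace", collection=f"shops-{region}")
```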


### Option 2 - Read-only mirrors

With this solution, the main collection is maintained, and several "*side collections*" are created. As with *Option 1*, the clients are able to pick which collection to pull from.

Unlike *Option 1*, the additional collections are read-only. Editors publish data in the main collection, and a scheduled job will copy the records at regular intervals to the side collections using filters.

The [`backport_records` cronjob]() copies records from one collection to another and signs the destination. It can take querystring filters in order to copy only a subset.

For example, if records have a `regions` field:

```
[
  {
    "id": "shop-A",
    "regions": ["europe", "asia", "north-america"]
  },
  {
    "id": "shop-B",
    "regions": ["europe", "africa"]
  }
]
```

Then different collections can be populated using this field in [a querystring filter](https://docs.kinto-storage.org/en/latest/api/1.x/filtering.html#comparison):

* `shops-africa`: `?contains_regions=["africa"]`
* `shops-europe`: `?contains_regions=["europe"]`
* `shops-asia`: `?contains_regions=["asia"]`
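
As a rough, non-authoritative sketch of what such a filtered copy could look like with the Python `kinto_http` client (the actual `backport_records` job may differ; server URL, credentials and bucket names are illustrative):

```
# Option 2 sketch: copy a filtered subset of the main collection into a
# read-only mirror and request a new signature. The real `backport_records`
# job may differ; names and credentials are illustrative.
from kinto_http import Client

client = Client(server_url="https://settings.example.com/v1", auth=("scripts", "s3cr3t"))

# Same querystring filter as above: only records whose `regions` contains "europe".
subset = client.get_records(bucket="main", collection="shops", contains_regions='["europe"]')

for record in subset:
    data = {k: v for k, v in record.items() if k != "last_modified"}
    client.update_record(data=data, bucket="main-workspace", collection="shops-europe")

# Ask the signer to sign the destination collection (kinto-signer workflow).
client.patch_collection(data={"status": "to-sign"}, bucket="main-workspace", collection="shops-europe")
```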

- **Complexity**: Low. It is based on existing pieces of the current architecture.
- **User experience**: Good, because editors only manipulate the main collection to assign datasets and to publish data.
- **Cost of implementation**: Low. Creating collections is cheap, and so is onboarding new instances of the `backport_records` job.
- **Cost of operation**: Low. Clients download only the subset of data, saving bandwidth, and running cronjobs is cheap.


### Option 3 - Implement notion of datasets

With this solution, we compute several signatures for a single collection.

On the server side, we could introduce the notion of *datasets* for a collection. Datasets could be declared in the server configuration:

```
# config.ini
kinto.signer.datasets.main.shops.shops-africa = ?contains_regions=["africa"]
kinto.signer.datasets.main.shops.shops-europe = ?contains_regions=["europe"]
kinto.signer.datasets.main.shops.shops-asia = ?contains_regions=["asia"]
```

When a change is published in the collection, instead of computing a single signature for the whole collection, we compute a signature for each dataset and put them in the collection metadata:

```
{
  "metadata": {
    "signatures": {
      "shops-africa": {
        "filter": "?contains_regions=[\"africa\"]",
        "signature": "P2OAvlvj1uhIAafIVgtpuAo3lF4pZWERc...C4rC7c9-EC4cl77R35Qo3hRYg2lKU"
      },
      "shops-europe": {
        "filter": "?contains_regions=[\"europe\"]",
        "signature": "MHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEab4HfqYGRLW..."
      }
    }
  }
}
```
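
A rough sketch of the per-dataset signing step the server could run on publication. The injected `list_records` and `sign` callables stand in for the storage backend and the Autograph signer; they are illustrative placeholders, not the actual kinto-signer API:

```
# Option 3 sketch: compute one signature per configured dataset instead of a
# single signature for the whole collection. `list_records(filters)` and
# `sign(payload)` are placeholders for the storage backend and Autograph.
import json
from urllib.parse import parse_qsl


def canonical_serialize(records):
    # Simplified stand-in for the canonical JSON serialization used for signing.
    ordered = sorted(records, key=lambda r: r["id"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def sign_datasets(list_records, sign, datasets):
    """Return the `signatures` mapping to store in the collection metadata."""
    signatures = {}
    for name, querystring in datasets.items():
        # e.g. '?contains_regions=["africa"]' -> {"contains_regions": '["africa"]'}
        filters = dict(parse_qsl(querystring.lstrip("?")))
        records = list_records(filters)
        signatures[name] = {
            "filter": querystring,
            "signature": sign(canonical_serialize(records)),
        }
    return signatures
```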

The clients would pull the subset of data and use the corresponding dataset signature to verify its integrity.

- **Complexity**: Medium-High. We introduce some coupling between server configuration and client behaviour.
- **User experience**: Good, users edit a single collection.
- **Cost of implementation**: Medium-High. Server side code is relatively straightforward, but the different client implementations have to be modified to support this new type of metadata.
- **Cost of operation**: Low. Clients download only the subset of data, saving bandwidth, and signing datasets on each publication is relatively cheap.


### Option 4 - Implement dynamic signatures

With this solution, we sign every server response dynamically.
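
As a hedged sketch of where such dynamic signing could hook in: Kinto is a Pyramid application, so a tween around request handling is one plausible place. The `sign_with_autograph()` helper and the response header are purely illustrative, not an existing API:

```
# Option 4 sketch: sign every records response on the fly in a Pyramid tween.
# `sign_with_autograph()` and the header name are placeholders, not actual
# Kinto or Remote Settings code.
def dynamic_signature_tween_factory(handler, registry):
    def tween(request):
        response = handler(request)
        if request.path.endswith("/records") and response.status_code == 200:
            # Each (possibly filtered) response gets its own signature, so the
            # signer is called once per request served by the origin.
            response.headers["X-Content-Signature"] = sign_with_autograph(response.body)
        return response
    return tween


def sign_with_autograph(payload: bytes) -> str:
    # Placeholder: in reality this would send the payload to the Autograph
    # service and return the content signature.
    raise NotImplementedError
```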

- **Complexity**: Medium. In terms of architecture, this would not introduce any real complexity.
- **User experience**: Good, users edit a single collection.
- **Cost of implementation**: Medium-High. This would represent a big change of approach, and thus of code.
- **Cost of operation**: High. Autograph would have to scale to the traffic received on the origin servers of Remote Settings.
