Problem
Currently all our integration tests are executed on the same 3-node cluster.
That means we are not testing topology changes at all.
Side note: I think that we are also not adequately testing schema changes - there are 5 occurrences of ALTER statements and 6 occurrences of DROP statements in tests. We should add more integration tests for this too.
There is an issue for using SCT to test Rust Driver, but it doesn't fully address the problem:
Any SCT run takes a lot of time and resources and is not easy to set up. That means it can't be used locally by developers - tests to run locally must finish quickly and be easy to run.
It uses cql-stress: it won't thoroughly test the driver's API correctness like in-repo tests can.
Being out of repo means it is harder to edit / fix: you need to make changes here, in cql-stress, and in SCT. That's a lot of friction.
For those reasons I mostly view SCT tests as benchmarks (which is something we severely miss!) and as a guard against some regressions.
What I propose is to have in-repo topology tests, using ccm for cluster management.
As a prerequisite for that we should finish moving tests that need Scylla from unit tests to integration tests. This step may require exposing more private stuff in test_utils module. To avoid having it compiled in user code (and to prevent users from accidentally using it) we should move it behind a feature - or even better, a cfg, like Tokio does: https://docs.rs/tokio/latest/tokio/#unstable-features
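A rough sketch of the cfg-based gating (the flag name `scylla_internal_testing` and the helper are hypothetical; Tokio gates its unstable API behind `--cfg tokio_unstable` in the same way):

```rust
// Hypothetical cfg flag; contributors would opt in with:
//   RUSTFLAGS="--cfg scylla_internal_testing" cargo test

#[cfg(scylla_internal_testing)]
pub mod test_utils {
    // Example of a helper that would otherwise stay private (hypothetical).
    pub fn unique_keyspace_name(prefix: &str) -> String {
        use std::sync::atomic::{AtomicU32, Ordering};
        static COUNTER: AtomicU32 = AtomicU32::new(0);
        format!("{}_{}", prefix, COUNTER.fetch_add(1, Ordering::Relaxed))
    }
}

// Code compiled without the flag does not even see `test_utils`,
// so users cannot depend on it accidentally.
pub fn internal_helpers_enabled() -> bool {
    cfg!(scylla_internal_testing)
}
```

Unlike a Cargo feature, a cfg flag cannot be enabled from a dependent crate's Cargo.toml, which is exactly the point.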
Naive approach
The most obvious solution is to have some "Cluster Manager". A test would call this manager with the configuration of the cluster it needs, and the manager would set up the cluster and return it.
There are significant issues with this approach which in my opinion disqualify it.
The cluster setup time would be included in the test time - which is a problem for timeouts.
There is no good way to reuse clusters, because the manager has no knowledge of which clusters will be needed by tests that have not yet run.
For the second issue, the best the manager could do is to cache clusters: if a test finished, don't destroy its cluster until some other test needs a different cluster and there is no spare capacity to keep both.
If some test requests a cluster and the requested configuration fits a cached cluster, then we can reuse it.
This heuristic is obviously bad, and fails even in simple cases, increasing the number of cluster setups required. This in turn increases the runtime of the tests. We need to keep the tests quick to not discourage contributors from running them locally. Currently it takes ~3 minutes on my machine to run all the tests, so I have no problem doing that even for each commit. The Python driver is the opposite: running its tests is hard and takes, IIRC, over 1 hour (!) on my machine. It hinders productivity significantly.
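To make the failure mode concrete, here is a minimal sketch (hypothetical types) of the single-slot caching heuristic:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct ClusterSpec {
    pub nodes: u8,
    pub tablets: bool,
}

// The naive manager only knows about the cluster it currently keeps alive.
pub struct NaiveManager {
    cached: Option<ClusterSpec>,
}

impl NaiveManager {
    pub fn new() -> Self {
        NaiveManager { cached: None }
    }

    // Returns true if a fresh cluster setup was required.
    pub fn acquire(&mut self, wanted: ClusterSpec) -> bool {
        if self.cached == Some(wanted) {
            false // cache hit: reuse the running cluster
        } else {
            // tear down the cached cluster (if any) and set up a new one
            self.cached = Some(wanted);
            true
        }
    }
}
```

With requests interleaved as A, B, A, B this performs four setups, while a scheduler that saw all four requests up front could reorder them to A, A, B, B and perform only two.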
Note: AFAIK rstest, a crate providing fixture capabilities for Rust tests, could only help us implement this naive approach, and is thus insufficient here.
Better approach
What we need is to separate the cluster configuration required by a test from the test function itself. Then the test runner will know all the required configurations in advance and will be able to schedule cluster setups more efficiently, minimizing their number.
The exact algorithm for determining best order of tests and deciding when to run each of them is purposefully not discussed here. We can start with something simple and improve later if needed.
How to do this? We will need a custom test harness utilizing the inventory crate. Here's a nice article about writing such a harness: https://www.infinyon.com/blog/2021/04/rust-custom-test-harness/ . Tests would describe the config they need; the harness would then collect those and decide on the execution order. As a nice addition, a test could also describe the keyspaces / tables it needs. Separating this information would significantly de-clutter test code (currently most of it is ks / table / session setup).
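A sketch of what the harness would work with (descriptor and field names are hypothetical; in the real harness the inventory crate would gather the descriptors from each test's registration - here a static slice stands in):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ClusterSpec {
    pub nodes: u8,
    pub tablets: bool,
}

pub struct TestDescriptor {
    pub name: &'static str,
    pub cluster: ClusterSpec,
}

// In the real harness these would come from `inventory::iter::<TestDescriptor>`.
pub static TESTS: &[TestDescriptor] = &[
    TestDescriptor { name: "topology_add_node", cluster: ClusterSpec { nodes: 3, tablets: false } },
    TestDescriptor { name: "simple_query", cluster: ClusterSpec { nodes: 1, tablets: false } },
    TestDescriptor { name: "prepared_statement", cluster: ClusterSpec { nodes: 1, tablets: false } },
];

// Because every required config is known up front, the harness can group
// tests by config and set up each distinct cluster only once.
pub fn required_setups() -> usize {
    use std::collections::HashSet;
    TESTS.iter().map(|t| t.cluster).collect::<HashSet<_>>().len()
}
```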
How would the test describe the cluster / ks etc? The specifics of that are not really important here, but I see two potential options, both using a proc macro.
Describing them inline. It would look a bit like this:
#[scylla::integration_test(
    cluster = Cluster(nodes = 3, tablets = off, exclusive = true),
    ks = (parameters for created keyspace),
    table = (parameters for created table),
    session = (parameters for created session)
)]
fn test_fn(cluster, session, ks_name, table_name) {}
Using a function for this (which is very similar to how typical fixtures work)
The first approach alone would probably quickly become ugly because of long macro incantations.
The second approach is IMO better, but even better would be having both - it would allow one-off configs without separate functions while still providing the cleanliness of the second solution where needed.
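A sketch of the second, fixture-like approach (the attribute and type names are hypothetical):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ClusterConfig {
    pub nodes: u8,
    pub tablets: bool,
    pub exclusive: bool,
}

// Reusable fixture function: many tests can point at the same config,
// which is what lets the harness share one cluster between them.
pub fn three_node_no_tablets() -> ClusterConfig {
    ClusterConfig { nodes: 3, tablets: false, exclusive: false }
}

// A test would then reference the fixture by name, roughly:
//
// #[scylla::integration_test(cluster = three_node_no_tablets)]
// async fn topology_test(cluster: Cluster, session: Session) { /* ... */ }
```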
Config items
How should a cluster / ks / session configuration look?
It should allow "don't care" values. Many tests don't really care about some parameters of the cluster / keyspace (for example, they don't care whether tablets are enabled, what the cluster size is, or what the RF of the keyspace is). This should be possible to express in the config, to allow re-using more resources. Apart from "don't care" we could allow ranges (e.g. "cluster with at least 3 nodes").
All the important properties must be present (like the number of nodes / DCs or tablets enablement for clusters, or the replication config for a keyspace).
For clusters, "exclusive" and "destructible" properties should be present. The former tells whether the test can use the cluster concurrently with other tests; the latter tells whether the cluster will still be usable after the test finishes.
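The "don't care" and range semantics could look roughly like this (all names are hypothetical):

```rust
#[derive(Clone, Copy)]
pub enum NodeCount {
    Any,         // don't care
    Exactly(u8),
    AtLeast(u8), // range: "at least N nodes"
}

impl NodeCount {
    pub fn matches(&self, actual: u8) -> bool {
        match *self {
            NodeCount::Any => true,
            NodeCount::Exactly(n) => actual == n,
            NodeCount::AtLeast(n) => actual >= n,
        }
    }
}

pub struct ClusterRequest {
    pub nodes: NodeCount,
    pub tablets: Option<bool>, // None = don't care
    pub exclusive: bool,       // may other tests use the cluster concurrently?
    pub destructible: bool,    // will the cluster still be usable afterwards?
}

impl ClusterRequest {
    // A running cluster can be reused if every constraint accepts it.
    pub fn fits(&self, nodes: u8, tablets: bool) -> bool {
        self.nodes.matches(nodes) && self.tablets.map_or(true, |t| t == tablets)
    }
}
```

The more "don't care" values a request contains, the more cached clusters it fits, so the scheduler has more reuse opportunities.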
Relevant links
https://www.infinyon.com/blog/2021/04/rust-custom-test-harness/
https://crates.io/crates/inventory
https://github.com/infinyon/fluvio/tree/master/crates/fluvio-test/src
https://github.com/infinyon/fluvio/blob/6635b8aae20df796604d60d2706289984bcc8da4/tests/runner/src/fluvio-integration-derive/src/lib.rs