Skip to content

Latest commit

 

History

History
351 lines (261 loc) · 26.8 KB

reconfigurator.adoc

File metadata and controls

351 lines (261 loc) · 26.8 KB

Reconfigurator

1. Introduction

Reconfigurator is a system within Nexus for carrying out all kinds of system changes in a controlled way. Examples of what Reconfigurator can do today or that we plan to extend it to do in the future:

  • add a new sled to a system (deploying any components it needs to be able to run customer workloads or participate in the control plane)

  • expunge a sled from a system (i.e., respond to the permanent physical removal of a sled)

  • deploy a new instance of a component (e.g., Nexus) to scale it out or replace one that’s failed

  • remove an instance of a component (e.g., Nexus) to scale it down or as part of moving it to another sled

  • recover a sled whose host software is arbitrarily broken (similar to mupdate)

  • update the host OS, service processor, control plane components, and other software on the sled

Let’s walk through how someone might add a new sled to an existing system using Reconfigurator.

  1. The user makes an API request to add the sled.[1] This request succeeds immediately but has not done most of the work yet.

  2. The user makes an API request to invoke the planner (one of two main parts of Reconfigurator). The planner generates a new blueprint.

  3. Optional: the user has a chance to compare the new blueprint against the existing one. This lets them see what the system is going to do before committing to it.

  4. The user makes an API request to make the new blueprint the current system target blueprint.

  5. Reconfigurator does the rest: it starts executing the blueprint. This means trying to make reality match the blueprint. This process continues in a loop basically forever.[2]

Note
Currently, these "user"-driven steps are carried out using omdb, which means that in production systems they can really only be done by an Oxide employee with access via the technician port. The long-term goal is to automate all of this so that planning and execution both happen continuously and automatically. In that world, the user only has to do step 1.

You might notice that the only step that has anything to do with adding a sled is the first one. This is where a user communicated a change to the intended state of the world. Everything Reconfigurator does follows a pattern just like this:

  1. In the background, Reconfigurator is constantly updating its view of the actual state of the world.

  2. A user changes the intended state of the world.

  3. The user invokes the planner to generate a new blueprint.

  4. The user makes the new blueprint the current target.

  5. Reconfigurator executes the target blueprint.

Reconfigurator makes heavy use of well-established patterns, most notably:

  • The "Reconciler" pattern (which we’ve elsewhere called the "Reliable Persistent Workflow" or "RPW" pattern — see [RFD 373]). In this pattern, a system is constantly comparing an intended state of the world against the actual state, determining what to do to bring them in line, and then doing it. This is very different from distributed sagas, where the set of steps to be taken is fixed up front and cannot change if either the intended state or the actual state changes. Reconcilers are not necessarily better than distributed sagas. Each is suited to different use cases.

  • The [Plan-Execute Pattern], in which complex operations are separated into a planning phase and an execution phase. This separation is fundamental to Reconfigurator.

2. Reconfigurator overview

Important data:

  • Policy describes intended system state in terms that a user or Oxide technician might want to control: what release the system is running, what sleds are part of the system, how many CockroachDB nodes there should be, etc.

  • Inventory collections seek to describe the actual state of the system, particularly what hardware exists, what versions of what software components are running everywhere, the software’s configuration, etc. Note that collecting this information is necessarily non-atomic (since it’s distributed) and individual steps can fail. So when using inventory collections, Reconfigurator is always looking at a potentially incomplete and outdated view of reality.

  • Blueprints are very detailed descriptions of the intended state of the system. Blueprints are stored in CockroachDB. They are described in detail below.

  • At any given time, there is one target blueprint that the system is continuously trying to make real. This is stored in CockroachDB.

Reconfigurator consists mainly of:

  • a planner that takes as input the latest policy and the latest state of the system (including inventory collections and other database state) and produces a blueprint. Generating a blueprint does not make it the new target! The new blueprint is a possible next state of the world. Only when it’s made the current target does the system attempt to make that blueprint real. (It’s also possible that a blueprint never becomes the target. So the planning process must not actually change any state, only propose changes.)

  • an executor that attempts to execute a blueprint — that is, attempts to make reality match that blueprint.

It’s important that users be able to change the intended state (change policy, generate a new blueprint, and make that blueprint the target) even when the previous target has not been fully executed yet.[3]

The process of executing a blueprint involves a bunch of database queries and API calls to Sled Agent, Service Processors (via MGS), Dendrite, etc. As a result:

  • Executing a blueprint is not atomic.

  • Executing a blueprint can take a while.

  • Any of the individual actions required to execute a blueprint can fail.

  • The component executing the blueprint (Nexus) can itself fail, either transiently or permanently.

  • To deal with that last problem, we have multiple Nexus instances. But that means it’s conceivable that multiple Nexus instances attempt to execute different blueprints concurrently or attempt to take conflicting actions to execute the same blueprint.

2.1. The primacy of planning

To make all this work:

  • Blueprints are immutable. (If Nexus wants to change one, it creates a new one based on that one instead.)

  • Blueprints must be specific enough that there are no meaningful choices for Nexus to make at execution time. For example, when provisioning a new zone, the blueprint specifies the new zone’s id, what sled it’s on, its IP address, and every other detail needed to provision it.

  • CockroachDB is the source of truth for which blueprint is the current target.

  • Every blueprint has a parent blueprint. A blueprint can only be made the target if its parent is the current target.

The planner is synchronous, mostly deterministic, relatively simple, and highly testable. This approach essentially moves all coordination among Nexus instances into the planning step. Put differently: Nexus instances can generate blueprints independently, but only one will become the target, and that one is always an incremental step from the previous target. Many tasks that would be hard to do in a distributed way (like allocating IPs or enforcing constraints like "no more than one CockroachDB node may be down for update at once") can be reduced to pretty straightforward, highly testable planner logic.

As a consequence of all this:

  • At any given time, any Nexus instance may generate a new blueprint and make it the new system target (subject to the constraints above). Multiple Nexus instances can generate blueprints concurrently. They can also attempt to set the target concurrently. CockroachDB’s strong consistency ensures that only one blueprint can be the target at any time.

  • At any given time, any Nexus instance may be attempting to execute (realize) a blueprint that it believes is the latest target. It may no longer be the current target, though. Details are discussed below.

  • Nexus instances do not directly coordinate with each other at all.

2.2. Execution

While our approach moves a lot of tricky allocation / assignment problems to the planning step, execution brings its own complexity for two main reasons: Nexus instances can execute blueprints concurrently and any Nexus instance may be executing an old blueprint (i.e., one that was the target, but is not any more).[4]

Even when these things happen, we want that the system:

  • never moves backwards (i.e., towards a previous target)

  • converges towards the current target

This is easier than it sounds. Take the example of managing Omicron zones.

Example: managing Omicron zones

Reconfigurator manages the set of Omicron zones running on each sled. How can we ensure that when changes are made, the system only moves forward even when there are multiple Nexus instances executing blueprints concurrently and some might be executing older versions?

First, we apply a generation number to "the set of Omicron zones" on each sled. Blueprints store for each sled (1) the set of zones on that sled (and their configuration) and (2) the generation number. Any time we want to change the set of zones on a sled, we make a new blueprint with the updated set of zones and the next generation number. Execution is really simple: we make an API call to sled agent specifying the new set of zones and the new generation number. Sled Agent keeps track of the last generation number that it saw and rejects requests with an older one. Now, if multiple Nexus instances execute the latest target, all will succeed and the first one that reaches each Sled Agent will actually update the zones on that sled. If there’s also a Nexus executing an older blueprint, it will be rejected.

This approach can be applied to many other areas like DNS configuration, too. Other areas (e.g., the process of updating database state to reflect internal IP allocations) sometimes require different, ad hoc mechanisms. In all cases, though, the goals are what we said above: attempting to execute a stale blueprint must never move the system backwards and as long as something is executing the newer blueprint, the system should eventually get to the new target.

3. Reconfigurator control: autonomous vs. manual

3.1. Long-term goal: autonomous operation

The long-term goal is to enable autonomous operation of both the planner and executor:

The Planner

    fleet policy  (database state, inventory)   (latest blueprint)
             \               |               /
              \              |              /
               +----------+  |  +----------/
                          |  |  |
                          v  v  v

                         "planner"
              (eventually a background task)
                             |
                             v                      no
                    is a new blueprint necessary? ------> done
                             |
                             | yes
                             v
                    generate a new blueprint
                             |
                             |
                             v
                    commit blueprint to database
                             |
                             |
                             v
                    make blueprint the target
                             |
                             |
                             v
                           done
The Executor

           target blueprint  latest inventory
                     |             |
                     |             |
                     +----+   +----+
                          |   |
                          v   v

                        "executor"
                     (background task)
                            |
                            v
                    determine actions needed
                    take actions

This planner will evaluate whether the current (target) blueprint is consistent with the current policy. If not, the task generates a new blueprint that is consistent with the current policy and attempts to make that the new target. (Multiple Nexus instances could try to do this concurrently. CockroachDB’s strong consistency ensures that only one can win. The other Nexus instances must go back to evaluating the winning blueprint before trying to change it again — otherwise two Nexus instances might fight over two equivalent blueprints.)

The execution task will evaluate whether the state reflected in the latest inventory collection is consistent with the current target blueprint. If not, it executes operations to bring reality into line with the blueprint. This means provisioning new zones, removing old zones, adding instances to DNS, removing instances from DNS, carrying out firmware updates, etc.

3.2. Currently: plan on-demand, execute continuously

We’re being cautious about rolling out that kind of automation. Instead, today, omdb can be used to:

  • invoke the planner explicitly to generate a new blueprint

  • set a blueprint to be the current target

  • enable or disable execution of the current target blueprint. If execution is enabled, all Nexus instances will concurrently attempt to execute the blueprint.

omdb uses the Nexus internal API to do these things. Since this can only be done using omdb, Reconfigurator can really only be used by Oxide engineering and support, not customers.

To get to the long term vision where the system is doing all this on its own in response to operator input, we’ll need to get confidence that continually executing the planner will have no ill effects on working systems. This might involve more operational experience with it, more safeties, and tools for pausing execution, previewing what it would do, etc.

4. Design patterns

4.1. Incorporating existing configuration into Reconfigurator

Something we’ve done several times now is taking some existing piece of configuration that was managed outside the control plane (i.e., not known to Nexus or CockroachDB) and brought it under the ownership of the control plane. Examples:

  • Control plane zones: when we initially built the system, RSS deployed control plane zones to sleds and Nexus/CockroachDB was largely unaware of them. Nexus/CockroachDB did know about them, but did not have enough information to reconstruct the configuration on each sled. But of course the control plane needs to be able to manage these components for upgrade, fault management, scale-out, etc. Migrating to a system that can do these things required that the control plane learn what zones were deployed already, where, and with what configuration.

  • ZFS datasets: in releases prior to R12, Sled Agent automatically created ZFS datasets for zones that were requested by the control plane. Concretely, PUT /omicron-zones would create ZFS datasets for zones that needed a persistent dataset. Work is ongoing to make this more explicit so that the control plane manages datasets separately and then specifies with each zone which dataset it should use. Migrating to this approach on deployed systems requires that the control plane learn what datasets exist in the first place and what zones they’re associated with.

  • In the medium term, for online upgrade, we will be incorporating an image id or artifact id into the Omicron zone configuration. Currently, the artifact id is implied: sled agent uses whatever artifact was delivered by the last MUPdate. For online upgrade, the control plane will need to be able to specify a particular artifact.

In all of these cases:

  • There’s a piece of configuration managed outside of CockroachDB/Nexus (e.g., by Sled Agent, RSS, and/or MUPdate).

  • We want to transition to a world where the configuration is owned by CockroachDB/Nexus.

  • We need to bootstrap the initial control-plane-managed copy based on what’s currently deployed.

The general pattern here is:

  • Use the inventory system to collect the current configuration.

  • Incorporate that configuration into the next blueprint generated by Reconfigurator.

  • Incorporate the configuration from the blueprint into the next request to Sled Agent.

sequenceDiagram
    participant SledAgent as Sled Agent
    participant Nexus

    Note over SledAgent: owns some piece<br/> of configuration
    SledAgent ->> Nexus: reports current configuration<br/>via inventory
    Note over Nexus: incorporates latest inventory<br/>into next blueprint
    Note over Nexus: now owns the configuration

    loop
	Note over Nexus: changes configuration as needed
        Nexus ->> SledAgent: sends new configuration<br/>
    end
Loading

Below is a proposed pattern for doing this over two releases. We’ll call the new piece of config my_config, represented with type MyConfig. This could be arbitrarily complicated. In our examples above, this could be a list of zones and their detailed configurations (IP addresses, ports, and all the properties they need to start), a list of ZFS dataset structs, a dataset_id property on an existing struct, an artifact_id property on an existing struct, etc. It may hang directly off the Sled Agent Inventory or it might be embedded in some existing struct.

Note
This is a work in progress. We hope this closely enough matches what we’ve done in the past that it should work. We should update this documentation as we discover better ways to do this.
Caution
This isn’t the only way to do it. However, many other ways to do it come with non-obvious problems. When we diverge from this, we should first try to understand why this procedure looks the way it does.

In the first release (we’ll call it "release 1"): the configuration is totally managed in Sled Agent and unknown to Nexus and CockroachDB.

In the next release (we’ll call it "release 2"):

  1. Add my_config: MyConfig to appropriate spot in Sled Agent inventory.

    • In the inventory API that Sled Agent exposes, this field can be non-optional. In this use case, it’s assumed that Sled Agent can know what the current value is. That is, the code in this release must be aware that this value, which may previously have been hardcoded or even absent altogether, is now a variable to be reported in inventory (and eventually controlled by Nexus — see below).

    • In the Nexus inventory structures and database inventory structures, the field still needs to be optional (my_config: Option<MyConfig> or equivalent) because Nexus generally needs to be able to read inventory structures written by the previous release.

  2. Add my_config: Option<MyConfig> to the blueprint structures (both in-memory and in the database). This field has to be optional so that when updating to this release, the system can still read the current target blueprint (that was written in the previous release that didn’t have this field).

  3. In the Reconfigurator planner, when generating a blueprint based on a parent blueprint where my_config is None, fill in my_config (using a Some value) based on the contents in inventory.

  4. Add my_config to the Sled Agent request that will be used by Reconfigurator to configure this on each sled.

    • If a request already exists (e.g., if this will be part of OmicronZoneConfig that already gets sent by Reconfigurator to Sled Agent, as in the case of ZFS dataset id or artifact id): the new field should be optional: my_config: Option<MyConfig>. This is required for the system to be able to execute the last target blueprint that was written in the previous release. This is typically also necessary because it’s usually the same struct that Sled Agent records persistently. See the next item.

    • If no request already exists for this purpose, then you’ll be adding a whole new one (e.g., when we added a new PUT /datasets). The body of this request will generally be type MyConfig (not optional). During execution, Reconfigurator can avoid making this request altogether if the blueprint does not specify it.

  5. Add my_config to the Sled Agent ledger that will store this information persistently. This will almost always be the same as the previous step. The structure that Sled Agent stores is generally the same one it accepts from Nexus.

    This explains another reason why my_config should be optional in this structure: Sled Agent must be able to read ledgers written by a previous release and those won’t have this field.

During the upgrade to this release:

  1. Wait for at least one inventory cycle to complete successfully and verify that it contains the expected my_config field.

  2. Generate a new blueprint, make it the current target, and ensure that it executes successfully. It should make no actual changes to the system, but it will propagate the current values for my_config to the blueprint system and to sled ledgers.

  3. Verify that:

    • the new blueprint has my_config filled in

    • all Sled Agent ledgers have my_config filled in (value Some)

In the next release (we’ll call it "release 3"): all the optional fields can be made non-optional:

  • Blueprints' in-memory structure can go from my_config: Option<MyConfig> to my_config: MyConfig.

  • Nexus’s in-memory structure for inventory can go from my_config: Option<MyConfig> to my_config: MyConfig.

  • Blueprints' and inventory collections' database representations can go from NULL-able columns to non-NULL-able ones, though only if we can populate the value or drop old blueprints and collections. More work is needed here (see below).

  • The Sled Agent API input types and ledgers that refer to my_config can go from my_config: Option<MyConfig> to my_config: MyConfig. No on-disk changes are needed for this.

During the upgrade to the next release: Blueprints and inventory collections that do not have my_config set will need to be deleted from the database prior to the upgrade. See omicron#7278 for more on operationalizing this.

Visually:

flowchart TD
    subgraph R1 [Release 1]
        Initial["**Config owned by Sled Agent**"]
    end

    subgraph R2 [Release 2]
        Inventory["Sled Agent: reports current config in inventory"]
        Blueprint["Reconfigurator Planner: incorporates latest inventory into blueprint"]
        SledAgent["Reconfigurator Executor: sends blueprint config (unchanged) as configuration to Sled Agent"]
        Handoff["**Config owned by Nexus**"]
        Change21["Nexus wants to change the config"]
        Change22["Reconfigurator Planner: uses new value in blueprint"]
        Change23["Reconfigurator Executor: sends new value as new configuration to Sled Agent"]

        Inventory --> Blueprint
        Blueprint --> SledAgent
        SledAgent --> Handoff
        Handoff --> Change21
        Change21 --> Change22
        Change22 --> Change23
        Change23 --> Change21
    end

    subgraph R3 [Release 3]
        Owned["**Config owned by Nexus**"]
        Cleanup["**Blueprint field, Sled Agent field are now required**"]
        Change31["Nexus wants to change the config"]
        Change32["Reconfigurator Planner: uses new value in blueprint"]
        Change33["Reconfigurator Executor: sends new value as new configuration to Sled Agent"]
        Owned --> Cleanup
        Cleanup --> Change31
        Change31 --> Change32
        Change32 --> Change33
        Change33 --> Change31
    end

    R1 --> R2
    R2 --> R3
Loading

During release 1 and during release 2 before Sled Agent has reported the configuration in inventory, things look like this:

sequenceDiagram
    box Nexus
        participant Planner as Reconfigurator Planner
        participant Executor as Reconfigurator Executor
    end
    participant SledAgent as Sled Agent
    participant Database


    loop while config is not part of inventory
        Database ->> Planner: load latest inventory: config NOT present
        Planner ->> Executor: generate blueprint:<br />config NOT present
        Executor ->> SledAgent: write config:<br />config NOT present
        Note over SledAgent: missing config<br/>treated as<br />"no change"
    end
Loading

Shortly after the system comes up in release 2, Sled Agent starts reporting the config in inventory. After that point, things look like this:

sequenceDiagram
    box Nexus
        participant Planner as Reconfigurator Planner
        participant Executor as Reconfigurator Executor
    end
    participant SledAgent as Sled Agent
    participant Database

    loop
        SledAgent ->> Database: report config<br />in inventory
    end

    loop
        Database ->> Planner: load latest inventory: config IS present
        Planner ->> Executor: generate blueprint:<br />config IS present
        Executor ->> SledAgent: write config:<br />config IS present
        Note over SledAgent: config is present<br/>and honored
    end
Loading

1. There is a bit more to this flow. There’s an API request to list sleds that are physically present but not part of the system. The user is expected to compare that list against what they expect and then make an API request to add the specific sled they expect to be there (by serial number).
2. The process does not stop once reality matches the blueprint because reality can change after that point and the system may need to take action again.
3. It’s tempting to try to simplify things by disallowing users from changing the intended state while the previous blueprint is being executed. But there are many cases where this behavior is necessary. Imagine the operator requests that the system gracefully remove one sled. So the system starts live-migrating customer instances on that sled to other sleds. Then the sled suddenly fails permanently (i.e., catches fire). If we couldn’t change the intended next state to say "this sled is gone", then the system would be stuck forever waiting for those instances to successfully live-migrate, which they never will. This is just one example. Besides that, if there were any kind of bug causes Reconfigurator to get stuck, fixing it or working around it requires that the operator or Oxide support be able to change the intended state even though the system hasn’t reached the current intended state (which it never will because it’s stuck).
4. These are unavoidable consequences of not doing leader election to choose one Nexus to carry out execution. Why not do that? Because that creates harder problems like monitoring that Nexus, determining when it seems to have failed or become stuck, failover — including in cases where that Nexus is not stuck or failed, but merely partitioned — etc. This is all possible, but hard, and these code paths are not often tested in production systems. With our approach, there is one main code path and it’s frequently tested. (Admittedly, it can still do different things depending on what’s executing concurrently.)