update K8s helm chart and docs #59

Merged
merged 1 commit into from
Jan 27, 2025
8 changes: 4 additions & 4 deletions .github/workflows/docker.yml
@@ -44,7 +44,7 @@ jobs:
# Login against a Docker registry except on PR
# https://github.com/docker/login-action
- name: Log into registry ${{ env.REGISTRY }}
- uses: docker/login-action@343f7c4344506bcbf9b4de18042ae17996df046d # v3.0.0
+ uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
@@ -54,21 +54,21 @@ jobs:
# https://github.com/docker/metadata-action
- name: Extract Docker metadata
id: meta
- uses: docker/metadata-action@96383f45573cb7f253c731d3b3ab81c87ef81934 # v5.0.0
+ uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
type=semver,pattern={{major}}
type=ref,event=branch
type=sha,priority=100,prefix=,suffix=,format=short


# Build and push Docker image with Buildx (don't push on PR)
# https://github.com/docker/build-push-action
- name: Build and push Docker image
id: build-and-push
- uses: docker/build-push-action@0565240e2d4ab88bba5387d719585280857ece09 # v5.0.0
+ uses: docker/build-push-action@v5
with:
context: .
push: ${{ github.event_name != 'pull_request' }}
2 changes: 1 addition & 1 deletion charts/node-observer/values.yaml
@@ -8,7 +8,7 @@ image:
repository: ghcr.io/nvidia/topograph
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
- tag: "main"
+ tag: main

imagePullSecrets: []
nameOverride: ""
2 changes: 1 addition & 1 deletion charts/topograph/values.yaml
@@ -8,7 +8,7 @@ image:
repository: ghcr.io/nvidia/topograph
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
- tag: "main"
+ tag: main

imagePullSecrets: []
nameOverride: ""
6 changes: 3 additions & 3 deletions cmd/topograph/main.go
@@ -33,9 +33,9 @@ import (
var GitTag string

func main() {
- var c string
+ var cfg string
var version bool
- flag.StringVar(&c, "c", "/etc/topograph/topograph-config.yaml", "config file")
+ flag.StringVar(&cfg, "c", "/etc/topograph/topograph-config.yaml", "config file")
flag.BoolVar(&version, "version", false, "show the version")

klog.InitFlags(nil)
@@ -47,7 +47,7 @@ func main() {
os.Exit(0)
}

- if err := mainInternal(c); err != nil {
+ if err := mainInternal(cfg); err != nil {
klog.Error(err.Error())
os.Exit(1)
}
27 changes: 17 additions & 10 deletions docs/k8s.md
@@ -4,7 +4,7 @@ Topograph is a tool designed to enhance scheduling decisions in Kubernetes clust

### Overview

- Topograph's primary objective is to assist the Kubernetes scheduler in making intelligent pod placement decisions based on the cluster's network topology. It achieves this by:
+ Topograph's primary objective is to assist the Kubernetes scheduler in making intelligent pod placement decisions based on the cluster network topology. It achieves this by:

1. Interacting with Cloud Service Providers (CSPs)
2. Extracting cluster topology information
@@ -16,12 +16,19 @@ Topograph performs the following key actions:

1. **ConfigMap Creation**: Generates a ConfigMap containing topology information. This ConfigMap is not currently utilized but serves as an example for potential future integration with the scheduler or other systems.

- 2. **Node Labeling**: Applies labels to nodes that define their position within the cloud topology. For example, if a node connects to switch S1, which connects to switch S2, and then to switch S3, Topograph will apply the following labels to the node:
+ 2. **Node Labeling**: Applies labels to nodes that define their position within the cloud network topology:
+ - `accelerator`: Network interconnect for direct accelerator communication (e.g., Multi-node NVLink interconnect between NVIDIA GPUs)
+ - `block`: Rack-level switches connecting hosts in one or more racks as a block.
+ - `spine`: Spine-level switches connecting multiple blocks inside a datacenter.
+ - `datacenter`: Zonal switches connecting multiple datacenters inside an availability zone.
+
+ For example, if a node belongs to NVLink domain `nvl1` and connects to switch `s1`, which connects to switch `s2`, and then to switch `s3`, Topograph will apply the following labels to the node:

```
- topology.kubernetes.io/network-level-1: S1
- topology.kubernetes.io/network-level-2: S2
- topology.kubernetes.io/network-level-3: S3
+ network.topology.kubernetes.io/accelerator: nvl1
+ network.topology.kubernetes.io/block: s1
+ network.topology.kubernetes.io/spine: s2
+ network.topology.kubernetes.io/datacenter: s3
```
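
As a rough illustration of the labeling scheme above, the following Go sketch builds the label set for a node from its position in the topology. It is not taken from Topograph's source: `NodePosition`, `TopologyLabels`, and `labelPrefix` are hypothetical names introduced here, and only the label keys and example values come from the documentation above. The sketch also assumes that a level with no value is simply left unlabeled.

```go
package main

import "fmt"

// labelPrefix is the label namespace used in the documentation example.
const labelPrefix = "network.topology.kubernetes.io/"

// NodePosition is a hypothetical type describing where a node sits in the
// network topology. An empty field means the level is unknown or absent.
type NodePosition struct {
	Accelerator string // e.g. NVLink domain
	Block       string // rack-level switch
	Spine       string // spine-level switch
	Datacenter  string // zonal switch
}

// TopologyLabels returns the node labels for every non-empty level.
// Skipping empty levels is an assumption of this sketch.
func TopologyLabels(p NodePosition) map[string]string {
	labels := make(map[string]string)
	add := func(level, value string) {
		if value != "" {
			labels[labelPrefix+level] = value
		}
	}
	add("accelerator", p.Accelerator)
	add("block", p.Block)
	add("spine", p.Spine)
	add("datacenter", p.Datacenter)
	return labels
}

func main() {
	labels := TopologyLabels(NodePosition{
		Accelerator: "nvl1",
		Block:       "s1",
		Spine:       "s2",
		Datacenter:  "s3",
	})
	// Print in a fixed order, since Go map iteration order is random.
	for _, level := range []string{"accelerator", "block", "spine", "datacenter"} {
		key := labelPrefix + level
		fmt.Printf("%s: %s\n", key, labels[key])
	}
}
```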

### Use of Topograph
@@ -46,7 +53,7 @@ closer network proximity.
operator: In
values:
- myapp
- topologyKey: topology.kubernetes.io/network-level-2
+ topologyKey: network.topology.kubernetes.io/spine
- weight: 90
podAffinityTerm:
labelSelector:
@@ -55,15 +62,15 @@ closer network proximity.
operator: In
values:
- myapp
- topologyKey: topology.kubernetes.io/network-level-1
+ topologyKey: network.topology.kubernetes.io/block
```
- Pods are prioritized to be placed on nodes sharing the label `topology.kubernetes.io/network-level-1`.
+ Pods are prioritized to be placed on nodes sharing the label `network.topology.kubernetes.io/block`.
These nodes are connected to the same network switch, ensuring the lowest latency for communication.

- Nodes with the label `topology.kubernetes.io/network-level-2` are next in priority.
+ Nodes with the label `network.topology.kubernetes.io/spine` are next in priority.
Pods on these nodes will still be relatively close, but with slightly higher latency.

- In the three-tier network, all nodes will share the same `topology.kubernetes.io/network-level-3` label,
+ In the three-tier network, all nodes will share the same `network.topology.kubernetes.io/datacenter` label,
so it doesn’t need to be included in pod affinity settings.
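
The priority order described above (block first, then spine, with the datacenter label shared by every node) can be sketched in Go as a lookup of the narrowest topology level that two nodes have in common. This only illustrates the proximity reasoning; it is not how the Kubernetes scheduler evaluates affinity terms, and `ClosestSharedLevel` and `proximityLevels` are hypothetical names introduced for the example.

```go
package main

import "fmt"

// proximityLevels lists the label keys from the narrowest level (lowest
// latency) to the widest, following the order described in the text.
var proximityLevels = []string{
	"network.topology.kubernetes.io/block",
	"network.topology.kubernetes.io/spine",
	"network.topology.kubernetes.io/datacenter",
}

// ClosestSharedLevel returns the narrowest label key for which both nodes
// carry the same non-empty value, or "" if they share no level at all.
func ClosestSharedLevel(a, b map[string]string) string {
	for _, key := range proximityLevels {
		if v, ok := a[key]; ok && v != "" && v == b[key] {
			return key
		}
	}
	return ""
}

func main() {
	// Hypothetical nodes: same spine and datacenter, different blocks.
	node1 := map[string]string{
		"network.topology.kubernetes.io/block":      "s1",
		"network.topology.kubernetes.io/spine":      "s2",
		"network.topology.kubernetes.io/datacenter": "s3",
	}
	node2 := map[string]string{
		"network.topology.kubernetes.io/block":      "s4",
		"network.topology.kubernetes.io/spine":      "s2",
		"network.topology.kubernetes.io/datacenter": "s3",
	}
	fmt.Println(ClosestSharedLevel(node1, node2))
}
```

A pod affinity term with a higher weight on the block key than on the spine key expresses the same preference declaratively: the scheduler favors the narrowest shared level when candidates exist there.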

Since the default Kubernetes scheduler places one pod at a time, the placement may vary depending on where