chore(docs): Add docs about architecture and fix structure #1229
Open: kamilprz wants to merge 17 commits into microsoft:main from kamilprz:architecture-docs
+5,834
−93
17 commits
- 062027d change favicon to use logo with square proportions (kamilprz)
- 8dc5eaf refactor contributing page (kamilprz)
- d88af74 refactor introduction (kamilprz)
- 08ff11e restructure captures pages (kamilprz)
- 0d5c14c change intro diagram (kamilprz)
- f2a7082 what is hubble (kamilprz)
- 83ed6b8 data plane section (kamilprz)
- 4d92cff default control plane table (kamilprz)
- 9f04004 hubble control plane (kamilprz)
- b2cbc6f update data plane diagram (kamilprz)
- 18f3fa6 legacy control plane (kamilprz)
- 38444ec fix file links (kamilprz)
- ac5b229 fix typo (kamilprz)
- 4f47006 fix bad file links (kamilprz)
- 0cfea08 pr feedback v1 (kamilprz)
- cb835a6 pr feedback v2 (kamilprz)
- bfa83cc rename legacy, add updating docs section (kamilprz)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
# What is Retina? | ||
|
||
## Introduction | ||
|
||
Retina is a cloud-agnostic, open-source **Kubernetes Network Observability platform** which enables the use of Hubble as a control plane regardless of the underlying OS or CNI. | ||
|
||
Retina can help with DevOps, SecOps and compliance use cases. | ||
|
||
It provides a **centralized hub for monitoring application and network health and security** (do we provide security?), catering to Cluster Network/Security Administrators and DevOps Engineers. | ||
|
||
Retina **collects customizable telemetry**, which can be exported to **multiple storage options** (such as Prometheus, Azure Monitor, etc.) and **visualized in a variety of ways** (like Grafana, Azure Log Analytics, etc.). | ||
|
||
![High Level Architecture](./img/Retina%20Arch.png "High Level Architecture") | ||
|
||
## Features | ||
|
||
- **[eBPF](https://ebpf.io/what-is-ebpf#what-is-ebpf) based** - Leverages eBPF technologies to collect and provide insights into your Kubernetes cluster with minimal overhead. | ||
- **Platform Agnostic** - Works with any Cloud or On-Prem Kubernetes distribution and supports multiple operating systems such as Linux, Windows, Azure Linux, etc. | ||
- **CNI Agnostic** - Works with any Container Network Interface (CNI), such as Azure CNI, AWS VPC, etc. | ||
- **Actionable Metrics** - Provides industry-standard Prometheus metrics. | ||
- **Hubble Integration** - Integrates with Cilium's Hubble for additional network insights such as flow logs, DNS, etc. | ||
- **Packet Capture** - Distributed packet captures for deep-dive troubleshooting. | ||
|
||
## Why Retina? | ||
|
||
Retina lets you **investigate network issues on-demand** and **continuously monitor your clusters**. Here are a couple of scenarios where Retina shines, minimizing pain points and investigation time. | ||
|
||
### Use Case - Debugging Network Connectivity | ||
|
||
*Why can't my Pods connect to each other any more?* | ||
|
||
**Typical investigation is time-intensive** and involves manually performing packet captures, where one must first identify the Nodes involved, gain access to each Node, run `tcpdump` commands, and export the results off of each Node. | ||
|
||
With Retina, you can **automate this process** with a **single CLI command** or CRD/YAML that can: | ||
|
||
- Run captures on all Nodes hosting the Pods of interest. | ||
- Upload each Node's results to a storage blob. | ||
|
||
To begin using the CLI, see [Quick Start Installation](../02-Installation/02-CLI.md). | ||
|
||
### Use Case - Monitoring Network Health | ||
|
||
Retina supports actionable insights through **Prometheus** alerting, **Grafana** dashboards, and more. For instance, you can: | ||
|
||
- Monitor dropped traffic in a namespace. | ||
- Alert on a spike in production DNS errors. | ||
- Watch changes in API Server latency while testing your application's scale. | ||
- Notify your Security team if a Pod starts sending too much traffic. | ||
|
||
## Telemetry | ||
|
||
Retina uses two types of telemetry: metrics and captures. | ||
|
||
### Metrics | ||
|
||
Retina metrics provide **continuous observability** into: | ||
|
||
- Incoming/outgoing traffic | ||
- Dropped packets | ||
- TCP/UDP | ||
- DNS | ||
- API Server latency | ||
- Node/interface statistics | ||
|
||
Retina provides both: | ||
|
||
- **Basic metrics** - Node-Level (default) | ||
- **Advanced metrics** - Pod-Level (if enabled) | ||
|
||
For more info and a list of metrics, see [Metrics](../03-Metrics/modes/modes.md). | ||
|
||
The same set of metrics is generated regardless of the underlying OS or CNI. | ||
|
||
### Captures | ||
|
||
A Retina capture **logs network traffic** and metadata **for the specified Nodes/Pods**. | ||
|
||
Captures are **on-demand** and can be output to multiple destinations. For more info, see [Captures](../04-Captures/01-overview.md). | ||
|
||
## What is Hubble? | ||
|
||
Hubble is a fully distributed networking and security observability platform designed for cloud-native workloads. It’s built on top of [Cilium](https://cilium.io/get-started/) and [eBPF](https://ebpf.io/what-is-ebpf/), which allows it to provide deep visibility into the communication and behavior of services and the networking infrastructure. | ||
|
||
You can read the official documentation here - [What is Hubble?](https://docs.cilium.io/en/stable/overview/intro/#what-is-hubble) | ||
|
||
Both Hubble and Retina are listed as emerging [eBPF Applications](https://ebpf.io/applications/)! | ||
|
||
Hubble has historically been quite tightly coupled with Cilium. This led to challenges if you wanted to use another CNI, or perhaps go beyond Linux. Retina bridges this gap, and enables the use of a Hubble control plane on any CNI and across both Linux and Windows. | ||
|
||
Check out our talk from KubeCon 2024 which goes into this topic even further - [Hubble Beyond Cilium - Anubhab Majumdar & Mathew Merrick, Microsoft](https://www.youtube.com/watch?v=cnNUfQKhYiM) | ||
|
||
## Minimum System Requirements | ||
|
||
The following are known system requirements for installing Retina: | ||
|
||
- Minimum Linux Kernel Version: v5.4.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# Architecture | ||
|
||
## Overview | ||
|
||
In very simple terms, Retina collects metrics from the machine it's running on and hands them over to be processed and visualized elsewhere (in tools such as Prometheus, Hubble UI or Grafana). | ||
|
||
To collect this data, Retina observes and hooks into system events within the kernel through the use of custom eBPF plugins. The data gathered by the plugins is then transformed into `flow` objects ([defined by Cilium](https://github.com/cilium/cilium/tree/main/api/v1/flow)) and enriched with Kubernetes context, before being converted to metrics and exported. | ||
|
||
## Data Plane | ||
|
||
This section discusses how Retina collects its raw data. More specifically, it discusses how the eBPF programs and plugins are used. | ||
|
||
The plugins have a very specific scope by design, and Retina is designed to be extendable, meaning it is easy to add in additional plugins if necessary. If there is a plugin missing for your use case, you can create your own! See our [Development page](../07-Contributing/02-development.md) for details on how to get started. | ||
|
||
The plugins are responsible for installing the eBPF programs into the host kernel during startup. These eBPF programs collect metrics from events at the kernel level, which are then passed to user space, where they are parsed and converted into a `flow` data structure. Depending on the Control Plane being used, the data is either sent to a Retina Enricher or written to an external channel which is consumed by a Hubble observer; more on this in the [Control Plane](#control-plane) section below. A plugin is not required to use eBPF; it can also use syscalls or other API calls. In either case, the plugins implement the same [interface](https://github.com/microsoft/retina/blob/main/pkg/plugin/registry/registry.go). | ||
|
||
Some examples of existing Retina plugins: | ||
|
||
- Drop Reason - measures the number of packets/bytes dropped, along with the reason and direction of each drop. | ||
- DNS - counts DNS requests/responses by query, including error codes, response IPs, and other metadata. | ||
- Packet Forward - measures packets and bytes passing through the eth0 interface of each node, along with the direction of the packets. | ||
|
||
You can check out the rest on the [Plugins](../03-Metrics/plugins/readme.md) page. | ||
|
||
!["Retina Data Plane"](./img/data-plane.png "Retina Data Plane") | ||
|
||
### Plugin Lifecycle | ||
|
||
The [Plugin Manager](https://github.com/microsoft/retina/tree/main/pkg/managers/pluginmanager) is in charge of starting up all of the plugins. It can also reconcile plugins, which will regenerate the eBPF code and the BPF object. | ||
|
||
The lifecycle of the plugins themselves can be summarized as follows: | ||
|
||
- Initialize - Initialize eBPF maps. Create sockets / qdiscs / filters etc. Load eBPF programs. | ||
- Start - Read data from eBPF maps and arrays. Send it to the appropriate location depending on the Control Plane. | ||
- Stop - Clean up any resources created and stop any threads. | ||
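
The three lifecycle stages above can be sketched in Go. This is only an illustration under assumptions: the `Plugin` interface, `recordingPlugin`, `runLifecycle` and `demoCalls` are hypothetical names invented for this example and are not Retina's actual plugin interface (see the registry link above for that).

```go
package main

import (
	"context"
	"fmt"
)

// Plugin is a hypothetical, simplified lifecycle contract used only for
// illustration. It is not the interface defined in Retina's registry package.
type Plugin interface {
	Name() string
	Init() error                     // set up maps, sockets, filters, load programs
	Start(ctx context.Context) error // read data until the context is cancelled
	Stop() error                     // release everything created in Init
}

// recordingPlugin is a stand-in plugin that records which stages ran.
type recordingPlugin struct {
	name  string
	calls []string
}

func (p *recordingPlugin) Name() string { return p.name }

func (p *recordingPlugin) Init() error {
	p.calls = append(p.calls, "init")
	return nil
}

func (p *recordingPlugin) Start(ctx context.Context) error {
	p.calls = append(p.calls, "start")
	<-ctx.Done() // a real plugin would read kernel data in a loop here
	return nil
}

func (p *recordingPlugin) Stop() error {
	p.calls = append(p.calls, "stop")
	return nil
}

// runLifecycle drives one plugin through Init, Start and Stop in order.
// Stop always runs once Init has succeeded, so resources are cleaned up.
func runLifecycle(ctx context.Context, p Plugin) error {
	if err := p.Init(); err != nil {
		return fmt.Errorf("init %s: %w", p.Name(), err)
	}
	startErr := p.Start(ctx)
	stopErr := p.Stop()
	if startErr != nil {
		return fmt.Errorf("start %s: %w", p.Name(), startErr)
	}
	if stopErr != nil {
		return fmt.Errorf("stop %s: %w", p.Name(), stopErr)
	}
	return nil
}

// demoCalls runs the stand-in plugin with an already-cancelled context,
// so Start returns immediately, and reports the recorded stages.
func demoCalls() []string {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	p := &recordingPlugin{name: "demo"}
	if err := runLifecycle(ctx, p); err != nil {
		return nil
	}
	return p.calls
}

func main() {
	fmt.Println(demoCalls()) // prints [init start stop]
}
```

Running Stop unconditionally after a successful Init is a design choice of this sketch; it keeps cleanup in one place even when Start fails.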
|
||
The Plugin Manager also starts up the [Watcher Manager](https://github.com/microsoft/retina/tree/main/pkg/managers/watchermanager) - which in turn starts the watchers. | ||
|
||
The Endpoint Watcher periodically dumps out a list of veth interfaces corresponding to the pods, and then publishes an `EndpointCreated` or `EndpointDeleted` event depending on the list's current state compared to the last recorded state. These events are consumed by the Packet Parser and converted into flows. | ||
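
The comparison the Endpoint Watcher performs can be pictured as a set difference between two snapshots of interface names. The sketch below is a minimal illustration, assuming the snapshots are plain string slices; `EndpointEvent` and `diffEndpoints` are hypothetical names, and only the event names `EndpointCreated` and `EndpointDeleted` come from the text above.

```go
package main

import (
	"fmt"
	"sort"
)

// EndpointEvent is a hypothetical event shape used only for this sketch.
type EndpointEvent struct {
	Type  string // "EndpointCreated" or "EndpointDeleted"
	Iface string // veth interface name
}

// diffEndpoints compares the last recorded snapshot with the current one.
// Interfaces only in current yield EndpointCreated; interfaces only in
// previous yield EndpointDeleted. Output is sorted to be deterministic.
func diffEndpoints(previous, current []string) []EndpointEvent {
	prev := make(map[string]bool, len(previous))
	for _, name := range previous {
		prev[name] = true
	}
	curr := make(map[string]bool, len(current))
	for _, name := range current {
		curr[name] = true
	}

	var created, deleted []string
	for name := range curr {
		if !prev[name] {
			created = append(created, name)
		}
	}
	for name := range prev {
		if !curr[name] {
			deleted = append(deleted, name)
		}
	}
	sort.Strings(created)
	sort.Strings(deleted)

	events := make([]EndpointEvent, 0, len(created)+len(deleted))
	for _, name := range created {
		events = append(events, EndpointEvent{Type: "EndpointCreated", Iface: name})
	}
	for _, name := range deleted {
		events = append(events, EndpointEvent{Type: "EndpointDeleted", Iface: name})
	}
	return events
}

func main() {
	events := diffEndpoints(
		[]string{"veth1", "veth2"}, // last recorded state
		[]string{"veth2", "veth3"}, // current state
	)
	for _, e := range events {
		fmt.Printf("%s %s\n", e.Type, e.Iface)
	}
}
```

With the inputs in `main`, `veth3` is reported as created and `veth1` as deleted, while `veth2` produces no event because it appears in both snapshots.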
|
||
The API Server Watcher resolves the hostname of the API server it is monitoring to a list of IP addresses. It then compares these addresses against a cache of IP addresses which it maintains and publishes a `NewAPIServerObject` event containing the new IPs if necessary. This information is added to the IP cache and used to enrich the flows. | ||
|
||
## Control Plane | ||
|
||
This section describes how the collected data from the Data Plane is processed, transformed and used. | ||
|
||
Retina currently has two options for the Control Plane: | ||
|
||
- [Hubble Control Plane](#hubble-control-plane) | ||
- [Standard Control Plane](#standard-control-plane) | ||
|
||
| Platform | Supported Control Plane | | ||
|----------|----------------------------| | ||
| Windows | Standard | | ||
| Linux | Standard, Hubble | | ||
|
||
Both Control Planes integrate with the same Data Plane and share the same contract, which is the `flow` data structure. Both Control Planes also generate metrics and traces, although each supports a different set of metrics. See our [Metrics page](../03-Metrics/01-metrics-intro.md) for more information. | ||
|
||
Please refer to the [Installation](../02-Installation/01-Setup.md) page for further setup instructions. | ||
|
||
### Hubble Control Plane | ||
|
||
When the Hubble Control Plane is being used, the data from the plugins is written to an `external channel`. A component called the [Monitor Agent](https://github.com/microsoft/retina/tree/main/pkg/monitoragent) monitors this channel, and keeps track of a list of listeners and consumers. One such consumer is the [Hubble Observer](https://github.com/microsoft/retina/blob/main/pkg/hubble/hubble_linux.go). This means that when the Monitor Agent detects an update in the channel, it will forward the data to the Hubble Observer. | ||
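
The forwarding pattern described above, one channel fanned out to registered consumers, can be sketched as follows. All names here (`Flow`, `Consumer`, `monitorAgent`, `collectingConsumer`, `forwardAll`) are hypothetical and simplified for illustration; they do not mirror the real Monitor Agent or Hubble Observer types.

```go
package main

import "fmt"

// Flow is a hypothetical, reduced stand-in for the Cilium flow object.
type Flow struct {
	Source      string
	Destination string
}

// Consumer is anything that wants to be notified about flows.
// In the text above, the Hubble Observer plays this role.
type Consumer interface {
	Notify(f Flow)
}

// collectingConsumer stores every flow it receives.
type collectingConsumer struct {
	seen []Flow
}

func (c *collectingConsumer) Notify(f Flow) {
	c.seen = append(c.seen, f)
}

// monitorAgent keeps a list of consumers and forwards each flow to all of them.
type monitorAgent struct {
	consumers []Consumer
}

func (m *monitorAgent) Register(c Consumer) {
	m.consumers = append(m.consumers, c)
}

// Run drains the external channel until it is closed.
func (m *monitorAgent) Run(external <-chan Flow) {
	for f := range external {
		for _, c := range m.consumers {
			c.Notify(f)
		}
	}
}

// forwardAll writes the flows to a buffered channel, as a plugin would,
// then lets the agent deliver them to every registered consumer.
func forwardAll(flows []Flow, consumers ...Consumer) {
	external := make(chan Flow, len(flows))
	for _, f := range flows {
		external <- f
	}
	close(external)

	agent := &monitorAgent{}
	for _, c := range consumers {
		agent.Register(c)
	}
	agent.Run(external)
}

func main() {
	observer := &collectingConsumer{}
	forwardAll([]Flow{
		{Source: "10.0.0.1", Destination: "10.0.0.2"},
		{Source: "10.0.0.3", Destination: "10.0.0.4"},
	}, observer)
	fmt.Println(len(observer.seen), "flows forwarded")
}
```

The plugins in this sketch never reference a consumer directly; they only write to the channel, which is what lets consumers be added or removed without touching plugin code.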
|
||
The Hubble Observer is configured with a list of parsers capable of interpreting different types of `flow` objects (L4, DNS, Drop). These are then enriched with Kubernetes-specific context through the use of Cilium libraries (red blocks in the diagram). This includes mapping IP addresses to Kubernetes objects such as Nodes, Pods, Namespaces or Labels. This data comes from a cache that Retina maintains of Kubernetes metadata keyed to IPs. | ||
|
||
Hubble uses the enriched flows to generate `hubble_*` metrics and flow logs, which are then served as follows: | ||
|
||
- Server 9965 - Hubble metrics (Prometheus) | ||
- Remote Server 4244 - Hubble Relay connects to this address to glean flow logs for that node. | ||
- Local Unix Socket `unix:///var/run/cilium/hubble.sock` - serves node-specific data | ||
|
||
!["Hubble Control Plane"](./img/hubble-control-plane.png "Hubble Control Plane") | ||
|
||
### Standard Control Plane | ||
|
||
When the Standard Control Plane is being used, the data from the plugins is written to a custom [Enricher](https://github.com/microsoft/retina/tree/main/pkg/enricher) component. Because this component is not initialized when the Hubble Control Plane is in use, the plugins know where to write the data. | ||
|
||
Retina maintains a cache of Kubernetes objects. The Enricher makes use of this cache to enrich the `flow` objects with this information. After enrichment, the `flow` objects are exported to an output ring. | ||
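
As a rough picture of the enrichment step, the sketch below looks up the source and destination IPs of a flow in an IP-keyed cache and attaches Pod metadata when a match is found. `PodInfo`, `Flow` and `enrich` are hypothetical names for this example, and a plain map stands in for Retina's cache; the real Enricher and `flow` type carry far more detail.

```go
package main

import "fmt"

// PodInfo is a hypothetical subset of the Kubernetes metadata kept in the cache.
type PodInfo struct {
	Namespace string
	Name      string
}

// Flow is a reduced stand-in for the flow object. The Pod fields are nil
// until enrichment finds a cache entry for the corresponding IP.
type Flow struct {
	SrcIP  string
	DstIP  string
	SrcPod *PodInfo
	DstPod *PodInfo
}

// enrich returns a copy of the flow with Pod metadata attached for every
// IP found in the cache. Unknown IPs leave the field nil.
func enrich(cache map[string]PodInfo, f Flow) Flow {
	if pod, ok := cache[f.SrcIP]; ok {
		f.SrcPod = &pod
	}
	if pod, ok := cache[f.DstIP]; ok {
		f.DstPod = &pod
	}
	return f
}

func main() {
	cache := map[string]PodInfo{
		"10.0.0.1": {Namespace: "default", Name: "web"},
	}
	out := enrich(cache, Flow{SrcIP: "10.0.0.1", DstIP: "8.8.8.8"})
	fmt.Println(out.SrcPod.Namespace+"/"+out.SrcPod.Name, out.DstPod == nil)
}
```

Leaving unknown IPs unenriched rather than failing is an assumption of this sketch; traffic to addresses outside the cluster would otherwise have no cache entry to match.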
|
||
The [Metrics Module](https://github.com/microsoft/retina/blob/main/pkg/module/metrics/metrics_module.go) reads the data exported by the Enricher and constructs metrics out of it, which it then exports itself. | ||
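
To illustrate how enriched flows might be turned into metrics, the sketch below aggregates dropped flows into per-reason counters. It is an assumption-laden simplification: `Flow`, `countDropsByReason` and the metric name `example_dropped_flows_total` are hypothetical, and the real Metrics Module exports through Prometheus rather than printing text.

```go
package main

import (
	"fmt"
	"sort"
)

// Flow is a reduced stand-in for an enriched flow. An empty DropReason
// means the flow was forwarded rather than dropped.
type Flow struct {
	Namespace  string
	DropReason string
}

// countDropsByReason aggregates dropped flows into one counter per reason.
func countDropsByReason(flows []Flow) map[string]uint64 {
	counts := make(map[string]uint64)
	for _, f := range flows {
		if f.DropReason == "" {
			continue // forwarded flows do not contribute to drop counters
		}
		counts[f.DropReason]++
	}
	return counts
}

func main() {
	counts := countDropsByReason([]Flow{
		{Namespace: "default", DropReason: "policy"},
		{Namespace: "default", DropReason: ""},
		{Namespace: "prod", DropReason: "policy"},
		{Namespace: "prod", DropReason: "conntrack"},
	})

	// Sort the reasons so the output order is stable between runs.
	reasons := make([]string, 0, len(counts))
	for r := range counts {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons)
	for _, r := range reasons {
		// Hypothetical metric name, shown in a Prometheus-like text form.
		fmt.Printf("example_dropped_flows_total{reason=%q} %d\n", r, counts[r])
	}
}
```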
|
||
!["Standard Control Plane"](./img/control-plane.png "Standard Control Plane") |
This image doesn't load correctly. It might be because the space in the image's name is messing up the path, or it's a GitHub issue.