-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement CI for RHOBS #418
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome ❤️ for doing this long-standing work! I really appreciated all the effort put into this. I have a couple of improvement comments to make it sustainable and easy to extend for our other teams (e.g. my folks and myself from the log storage team and the tracing folks).
- Let's try to squeeze as much as possible out of existing GitHub actions for setting up microshift, docker, oc binary. Using actions of the shelf improves caching and in turn build times.
- Much of the ci-test.sh can be simplified with existing commands as
oc rollout
or the olderoc wait
. Consider to re-use as much as possible from: observatorium e2e.sh - Since we start a huge system into a cluster, it would super helpful to have a must-gather in case of erroneous build runs, e.g. have a look here observatorium must-gather func
- name: Print CWD | ||
run: echo "$PWD" | ||
- name: Configure env | ||
run: | | ||
sudo apt-get update | ||
sudo apt-get install \ | ||
ca-certificates \ | ||
curl \ | ||
wget \ | ||
gnupg \ | ||
lsb-release | ||
sudo mkdir -m 0755 -p /etc/apt/keyrings | ||
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg | ||
echo \ | ||
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \ | ||
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null | ||
sudo chmod a+r /etc/apt/keyrings/docker.gpg | ||
sudo apt-get update | ||
- name: Install Docker | ||
run: | | ||
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin | ||
- name: Check docker is running | ||
run: | | ||
sudo systemctl status docker | ||
- name: Get Docker info | ||
run: | | ||
docker info | ||
- name: Get OC CLI Binary | ||
run: | | ||
wget https://github.com/okd-project/okd/releases/download/4.12.0-0.okd-2023-02-18-033438/openshift-client-linux-4.12.0-0.okd-2023-02-18-033438.tar.gz | ||
tar xzvf openshift-client-linux-4.12.0-0.okd-2023-02-18-033438.tar.gz | ||
sudo mv oc kubectl /usr/local/bin | ||
- name: Run OC Binary | ||
run: oc --help | ||
- name: Spin up microshift container | ||
run: | | ||
docker run -d --name microshift --privileged -v microshift-data:/var/lib -p 6443:6443 -p 80:80 -p 443:443 quay.io/microshift/microshift-aio:latest | ||
sleep 60s | ||
- name: Export the kubeconfig | ||
run: docker exec -i microshift cat /var/lib/microshift/resources/kubeadmin/kubeconfig > tests/kubeconfig | ||
- name: List the kubeconfig file | ||
run: ls tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate if these steps can/cannot simply replaced with using microshifts github action? https://github.com/marketplace/actions/microshift-openshift-cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using github actions to deploy Microshift but seems like there are some issue with Ubuntu and as per this open issue I thought of going with docker deployment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it mandatory to use Ubuntu then and not simply the default OS assumed by the microshift action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently Microshift only works on RHEL family of OS and they don't have multi-os support. Also github actions supports ubuntu as only linux distribution as of now😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a short read-up on microshift and I believe it is not worth the effort to use it as the base platform for our RHOBS CI. It is far away from having support of non RHEL-OS's as you say but it will stay too limited from an resource perspective (by design AFAIU).
For our CI I suggest to use kind instead, it is robust and its main purpose is supporting the development workflow. It is super easy to spin-up with a github action (See https://github.com/marketplace?type=actions&query=kind+)
For example we use is building our upstream operators (see Loki Operator quickstart.sh or Loki Operator GH Action).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some exploration around using kind as well. The manifests which are generated currently are OpenShift based templates and some relies on OpenShift-specific resources/features which are not available on standard Kubernetes distribution. So deploying those templates straight forward would not be easy. Some modifications will be there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you list the manifests please? The only show-stopper using kind is oc process
but even this is can used as oc process --local -f
without a need of the OpenShift-APIServer. If I assume right you mean manifests like ServiceMonitor
right? If yes we can still use kind by simply applying the CRDs from the prometheus-operator (See observatorium e2e.sh)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CRD's are no problem. The manifests after applying required CRD's deploy fine. I am not able to recall exact issue here but there was one instance in case of deploying compact-tls
secret which is a service serving certificate secret that we use when deploying thanos-compact. AFAIK for deploying on Kubernetes there is no straight forward way to achieve this.
My only Intention here to use Microshift was to play less around cluster deployment and deploying the generated manifests in a OCP kind of environment as ultimately these manifests will be used to deploy RHOBS on stag/prod env and it can give us more or less actual env like confidence.
run: docker exec -i microshift cat /var/lib/microshift/resources/kubeadmin/kubeconfig > tests/kubeconfig | ||
- name: List the kubeconfig file | ||
run: ls tests | ||
- name: Deploy pre-requisites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any way to offload these inline YAML into static files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes definitely will move them into static files
check_pod_status(){ | ||
podname=$1 | ||
namespace=$2 | ||
retry=3 | ||
while [ $retry -ne 0 ] | ||
do | ||
containerStatuses=$(oc get $podname -n $namespace -o jsonpath='{.status.containerStatuses[*].state}') | ||
if [[ $containerStatuses == *'waiting'* || -z $containerStatuses ]]; | ||
then | ||
|
||
log_error "Output: $containerStatuses" | ||
log_error "Retrying again..." | ||
else | ||
log_info "Status of $podname is healthy inside $namespace namespace" | ||
return | ||
fi | ||
sleep 30 | ||
((retry--)) | ||
done | ||
log_error "Retry exhausted!!!" | ||
teardown | ||
exit 1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking rollouts is better served using the oc rollout status deployment...
and oc rollout status statefulset
commands. This will eliminate maintaining many bash script lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats the excellent point. I never tried checking status using oc rollout status deployment
or oc rollout status statefulset
. Will definitely try these and make changes accordingly in script
destroy(){ | ||
depname=$1 | ||
namespace=$2 | ||
log_info "Destroying $depname resources inside $namespace namespace" | ||
if [ $depname == 'memcached' ]; | ||
then | ||
return | ||
fi | ||
oc delete statefulsets -n $namespace -l app.kubernetes.io/name=$depname 1> /dev/null | ||
oc delete deployment -n $namespace -l app.kubernetes.io/name=$depname 1> /dev/null | ||
oc delete pvc -n $namespace --all=true 1> /dev/null | ||
sleep 30 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to destroy each statefulset/deployment individually, you can reuse the tempalte for this:
oc process -f observatorium-metrics.template.yam | oc delete -f -
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my thought behind destroying statefulset/deployment is because of resource limit on GitHub hosted runners i.e 2-core CPU 7 GB of RAM and 14 GB of SSD. Due to this we cannot deploy all components of RHOBS at once. That's why I had to deploy each component one by one verify it and then destroy it so that sufficient room is available for next component to deploy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree insufficient resources are always PITA. However spinning up partially the components does not account for the fact that the whole system can be used for a test. The current approach only tests that they start which is valuable as such but very limited. IMHO we should aim using a CI environment that allows us to spin up the whole system (with super limited CPU/Mem maybe as we do in observatorium e2e on CirceCI) and use obervatorium/up to do at minimum blackbox testing (send metrics/logs in, read metrics/logs out).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition if we cannot get more power from GitHub Actions, then maybe piggy-packing on observatorium's e2e test setup on CircleCI is maybe a good pivot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely the current approach only tests that pods spin up in healthy state on the manifests changes and provide a quick validation against those changes so that we don't get any surprises once it is deployed in stage env.
Initially I also had the same thought that there should be some kind of e2e test which touch every component but due to GitHub actions limitations I scrapped that.
I think the runtest inside post deploy job does that what you are suggesting of using obervatorium/up to send and read metrics/logs.
If we wish to integrate runtest like checks in CI then it can also be done but for that as you mentioned correctly we may need to move away from GitHub Actions due to its limitations and explore around CircleCI.
Let's see what are team's thoughts on moving away from GitHub Actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO we should not limit our CI to what the post-deploy jobs do and don't. The post-deploy jobs currently do tests that should have been part of the CI first. The actual purpose of the post-deploy jobs on app-interface is to guard auto-promotion of changes from stage to production and a simply blackbox test (as currently implemented) is not sufficient. Thus it is not a limit to have this blackbox test in our CI here in Github or Circle CI.
To pave the road here, I suggest:
- Pull simple blackbox tests (like the one with up) here into our repo's CI. This ensures that we have some confidence in automated testing our Pull Requests before they land. Of course this means under the limits of our CI (e.g. microshift is not a fully-capable AppSRE-managed cluster). However it will give some notion of it won't break in essential steps (e.g.
oc process
missing params,oc apply
missing resources, wrong configuration breaking up's in/out testing). - In a follow-up step we need to define what type of tests really guard the promotion from stage into production and implement them in the post-deploy jobs (i would keep the up -test there too to ensure that in/out testing works on the target cluster platform too), e.g.
a. Canary-testing that requests do not break our SLO's (e.g. availability, latency) using avalance or loki-canary.
b. Canary-testing all tenants have still access to their metrics/logs
Thanks @periklis 😊 for taking time to go through this. Your suggestions are insightful and top notch 🙇🏼♂️. I didn't realised the thought of extending it for other teams. I agree with you it will make CI super easy to use. |
@vprashar2929 is this still valid now that #421 is merged? If not, I think we should close it. |
Yes we can close this PR |
This PR contains some changes which are related to implementing CI for RHOBS
The workflow
test.yaml
performs the following tasks:tests/ci_test.sh
The
ci_test.sh
primarily deploys each resources of RHOBS one at a time on microshift cluster and then do a pod health check just to verify that pod is in running state after applying the required manifests. This is related to RHOBS-552. Currently it covers only basic validation. The in-depth validation can be covered under post-deploy jobThis is still in progress so any comments/suggestions or contributions are welcome😄
If you want to test with any breakable commit then just change the following syntax inside
.github/workflows/test.yaml
and push with your breakable commit/changes to verifyNote: Currently the test pipeline takes approx 30 min to complete. Will try to reduce this number if possible