
dunedaq and pocket in the post v2.10.0 era


A prototype of dunedaq running in a kubernetes (kind) cluster

BEWARE:

  1. At least one step of these instructions may require sudo rights, specifically the firewall configuration.
  2. The following instructions rely on Docker 20 or greater, which is not available natively on CentOS 7 but is provided as a third-party package set under the name docker-ce. Installation instructions are available here.

HOWEVER, it is currently recommended to run these instructions on one of several computers in the np04daq cluster (namely np04-srv-001, -010, -019, -021, -022, or -024); on those computers, membership in the docker user group is used to enable communication with the docker daemon. Please avoid using the readout hosts (-26, -28, -29, -30) except when testing with readout hardware. (Note that almost all of these nodes run CentOS Stream 8.) The use of the docker user group reduces the need for special privileges, as does the fact that a firewall is not typically run on these computers. If/when you are ready to try these instructions on one of these np04daq computers, please get in touch with the system administrators to be added to the docker group.
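As a quick sanity check on one of those hosts (a minimal sketch, assuming the standard docker group setup described above), you can confirm that your account is in the docker group and that the daemon is reachable without sudo:

# Check that your account is in the `docker` group
id -nG | grep -w docker

# Check that the docker daemon answers without sudo
docker info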

Getting started

cd <MyTopDir>
mkdir pocket-daq
cd pocket-daq
POCKDAQDIR=$PWD

Prepare a daq workarea with modified restcmd

This version of restcmd allows sending the command notification to a different host from the sender.

source ~np04daq/bin/web_proxy.sh
cd $POCKDAQDIR
source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt dunedaq-v2.10.0-cs8
dbt-create.py dunedaq-v2.10.0-cs8 dunedaq-workdir
cd dunedaq-workdir
cd sourcecode
git clone https://github.com/DUNE-DAQ/restcmd.git
cd ..
dbt-workarea-env
dbt-build.py
Steps for re-setting up this `dunedaq-workdir` software area when you come back to it later

...for example, after you have logged out and logged back in...

cd <MyTopDir>/pocket-daq
POCKDAQDIR=$PWD
cd $POCKDAQDIR/dunedaq-workdir
source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt dunedaq-v2.10.0-cs8
dbt-workarea-env

Install nanorc

(For later use.) This version of nanorc is modified to interface with the kind control plane to manage the daq processes.

source ~np04daq/bin/web_proxy.sh  # if not already done
git clone git@github.com:DUNE-DAQ/nanorc.git -b plasorak/k8s
cd nanorc
pip install -e .
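A quick check that the editable install worked (assuming pip completed without errors; the nanorc command itself is used later in these instructions):

# The nanorc entry point should now be on your PATH
which nanorc
nanorc --help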

Create the daq_application docker image

source ~np04daq/bin/web_proxy.sh  # if not already done

# Back to `pocket-daq`
cd $POCKDAQDIR
git clone https://github.com/DUNE-DAQ/pocket.git -b thea/kind-1.20.0
cd pocket/images/daq_application/daq_area_cvmfs

# Builds a docker image importing the `dunedaq-k8s` dbt work area. For TRACE, see note below.
./build.sh ../../../../dunedaq-workdir
Accessing the high-speed TRACE memory buffer file: assuming TRACE_FILE=/dunedaq/pocket/trace_buffer is set in rebuild_work_area.sh before running ./build.sh above, then on the host run `export TRACE_FILE=$POCKDAQDIR/pocket/share/trace_buffer`. This gives you access from the host to the TRACE_FILE and lets you control TRACEing that happens in the containers.

Now the pocket-daq-area-cvmfs:v2.10.0 image should have been created; you can check that by doing:

docker images
REPOSITORY            TAG
...
pocket-daq-area-cvmfs v2.10.0
...

Note: If you use a different release, the v2.10.0 tag will be different, and you will need to change nanorc! Check src/nanorc/k8spm.py around line 141, daq_app_image.
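A quick way to locate that setting (run from inside the nanorc clone made in the "Install nanorc" step; the attribute name daq_app_image is taken from the note above):

# Locate the hard-coded daq application image tag inside the nanorc checkout
grep -n daq_app_image src/nanorc/k8spm.py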

Time to start the cluster...

source ~np04daq/bin/web_proxy.sh  # if not already done

# Start the cluster
cd $POCKDAQDIR/pocket
SERVICES=cvmfs,opmon,ers make setup.local

## Make your shell use binaries (`kubectl`, ...) that pocket ships with
eval $(make env)

# Load the new docker image into the cluster registry
kind load docker-image pocket-daq-area-cvmfs:v2.10.0 --name pocketdune

NOTE 1: The pocket-daq image needs to be reloaded every time the cluster is restarted.
NOTE 2: Kafka takes a while to come up. No messages will appear on daqerrordisplay or on the Grafana dashboard until the broker is operational. The pod status can be monitored on the dashboard or by running kubectl get pods -A. Pod logs can be examined by running kubectl logs <podname> --namespace=<namespace_name>. More detail on the status of a pod can be obtained using kubectl describe pod/<podname> --namespace=<namespace_name>.
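For example, to watch the pod status across all namespaces until everything reports Running (the -w flag streams updates; press Ctrl-C to stop):

# Watch pods in all namespaces; Kafka and the other services should eventually reach Running
kubectl get pods -A -w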

Steps to re-start the cluster when you come back to this 'pocket' software area later

... for example, after you have shut down the cluster, logged out, and logged back in...

cd <MyTopDir>/pocket-daq
POCKDAQDIR=$PWD
source ~np04daq/bin/web_proxy.sh  # if not already done
cd $POCKDAQDIR/pocket
SERVICES=cvmfs,opmon,ers make setup.local
eval $(make env)
kind load docker-image pocket-daq-area-cvmfs:v2.10.0 --name pocketdune

# (OR, the following if the cluster is already running)

cd $POCKDAQDIR/pocket
eval $(make env)

Please note that starting the cluster a second time, or at least the successful loading of the web pages described below, can take a long time (up to 10 minutes).

Networking

If a firewall is running, it needs to be tweaked to allow the daq apps to report status changes back to nanorc. You can check whether a firewall is running by using the following command: ps -ef | grep -i firewalld.
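Alternatively, on a systemd-managed host (which the CentOS nodes mentioned above are), you can ask systemd directly:

# Prints "active" if firewalld is running, "inactive" or "unknown" otherwise
systemctl is-active firewalld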

Use docker network ls to find the network named kind

--(~)--> docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
71246ea5c5d7        bridge              bridge              local
24580441c323        host                host                local
7fa13261749f        kind                bridge              local
3e5da2d00bba        none                null                local

The ID should match a bridge network interface on the host named something like br-<id> in the ifconfig output. Find it and use the following commands to put it in the trusted zone.

sudo firewall-cmd --permanent --zone=trusted --change-interface=br-7fa13261749f
sudo firewall-cmd --reload

If firewalld is not running, the above command is not needed.
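To confirm that the interface was added (requires the same sudo rights as above):

# The bridge interface (e.g. br-7fa13261749f) should now appear in the trusted zone
sudo firewall-cmd --zone=trusted --list-interfaces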

Opening web pages for the various services that are started

The make setup.local and kind load commands above start the Kubernetes cluster for you. Once it is running, you can point your browser to several different pages to check on the status of the cluster and see the graphical displays that are available.

ToDo: provide instructions for setting up a tunnel so that we can visit web pages on np04daq computers from browsers running on computers outside of CERN. In the meantime, I'll just include a link to the part of Marco's 'graphical viewer' page that talks about the tunnel. That is not a perfect reference for what is needed here, but it's a good start.

The definitive list of available services and their ports is printed to the console when you run the make setup.local command above. That list includes the 'in-cluster' addresses, the 'out-cluster' addresses, and the username and password, if those are needed. All of that is very useful, so you should take a look.

In the meantime, here is a non-definitive list, for reference:

  • Kubernetes dashboard: http://<host>:31001
  • Grafana: http://<host>:31003
  • ERS: http://<host>:30080
  • Kafka: http://<host>:30092
  • InfluxDB: http://<host>:31002

where <host> is something like np04-srv-024.
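As a hedged sketch of the tunnel mentioned in the ToDo above (the jump host and its ability to reach the np04daq network are assumptions, not an official recipe), an SSH local port forward can expose one of these services on your own machine:

# Generic form: forward <local_port> on your machine to <host>:<service_port> via a CERN jump host
# ssh -N -L <local_port>:<host>:<service_port> <cern-username>@<jump-host>
# Example (hypothetical hosts): forward the Grafana port of np04-srv-024 to localhost:31003
ssh -N -L 31003:np04-srv-024:31003 myuser@lxplus.cern.ch

After this, http://localhost:31003 on your own machine should reach the Grafana instance, assuming the jump host can actually reach the np04daq network; a second hop may be needed otherwise.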

Generate a daqconf configuration

NOTE: This needs to be modified so that daq_application accesses the correct path for ERS/Grafana/frames.bin...

cd $POCKDAQDIR
mkdir runarea
cd runarea
daqconf_multiru_gen test -d /dunedaq/pocket/frames.bin -o /dunedaq/pocket

Then download frames.bin into pocket/share:

curl -o ${POCKDAQDIR}/pocket/share/frames.bin https://cernbox.cern.ch/index.php/s/7qNnuxD8igDOVJT/download

pocket/share is mounted in the daq_application containers as /dunedaq/pocket, which is where the ruemu application is instructed to load the raw data file from.
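A quick check that the file landed where the containers expect to find it (path per the mount described above):

# Visible here on the host, and as /dunedaq/pocket/frames.bin inside the daq_application containers
ls -lh ${POCKDAQDIR}/pocket/share/frames.bin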

FINALLY

source ~np04daq/bin/web_proxy.sh -u
nanorc --k8s test
user@rc> boot 
[18:15:35] INFO     test_k8s_2 received command 'boot'
           INFO     test_k8s_2 propagating to children nodes (['test_k8s_2']) simultaneously
           INFO     Subsystem test_k8s_2 is booting
[18:15:35] Creating a namespace 'user-dunedaq' in kubernetes to hold your DAQ applications
           INFO     Resolving the kind gateway
           INFO     kind network gateway: 172.18.0.1
           INFO     Creating user-dunedaq namespace
           INFO     Creating user-dunedaq:dataflow0 daq application (port: 3333)
           INFO     Creating user-dunedaq:dfo daq application (port: 3333)
           INFO     Creating user-dunedaq:hsi daq application (port: 3333)
           INFO     Creating user-dunedaq:ruemu0 daq application (port: 3333)
           INFO     Creating user-dunedaq:trigger daq application (port: 3333)
...
[18:15:39] INFO     Application dataflow0 booted
           INFO     Application dfo booted
           INFO     Application hsi booted
[18:15:40] INFO     Application ruemu0 booted
           INFO     Application trigger booted
                                          test_k8s_2 apps
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ name              ┃ state          ┃ host                   ┃ pings ┃ last cmd ┃ last succ. cmd ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ test_k8s_2        │ booted         │                        │       │          │                │
│ └── test_k8s_2    │ booted         │                        │       │          │                │
│     ├── dataflow0 │ booted - alive │ dataflow0.user-dunedaq │ True  │ None     │ None           │
│     ├── dfo       │ booted - alive │ dfo.user-dunedaq       │ True  │ None     │ None           │
│     ├── hsi       │ booted - alive │ hsi.user-dunedaq       │ True  │ None     │ None           │
│     ├── ruemu0    │ booted - alive │ ruemu0.user-dunedaq    │ True  │ None     │ None           │
│     └── trigger   │ booted - alive │ trigger.user-dunedaq   │ True  │ None     │ None           │
└───────────────────┴────────────────┴────────────────────────┴───────┴──────────┴────────────────┘

The pods will take some time to come up. nanorc queries the control plane and waits for each application to open its command port (3333). The scheduling of the applications can be followed on the pocket dashboard at http://localhost:31001. (On the dashboard, it is helpful to change the namespace selection at the top of the page from "default" to "All namespaces"; this makes it easier to find the pods associated with your partition.)
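The same information is available from the command line (assuming eval $(make env) has been run in that shell, and that the default partition/namespace user-dunedaq is in use):

# List the daq application pods in the partition namespace
kubectl get pods -n user-dunedaq

# Show recent scheduling and image-pull events for that namespace
kubectl get events -n user-dunedaq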

At that point, the init, conf, and start commands can be issued.

Note: don't forget to choose the correct partition name in the dunedaq Grafana dashboard in order to see the ERS issues flowing in.

Logs

We all love application logs; here is how to get them. First, open a new terminal window on the same host, then go to the pocket directory and do:

eval $(make env)

Now you have kubectl in your PATH, so you can do: kubectl get pods -n <partition_name>

Note: partition_name can be given as a parameter to the nanorc boot command (boot --partition <partition_name>); the default partition name currently seems to be "user-dunedaq". The partition_name is used to create a namespace during the boot process.

For example, when partition_name is "user-dunedaq":

kubectl get pods -n user-dunedaq
NAME                         READY   STATUS    RESTARTS   AGE
dataflow0-84d77d48c9-mcbq9   1/1     Running   0          66s
...

And you can use the pod name with the kubectl logs command:

kubectl logs dataflow0-84d77d48c9-mcbq9 -n user-dunedaq
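To stream the log while the application is running, add the standard -f flag:

# Follow the dataflow0 log; press Ctrl-C to stop
kubectl logs -f dataflow0-84d77d48c9-mcbq9 -n user-dunedaq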

Destroying the cluster

To destroy the cluster when you are finished, run:

cd $POCKDAQDIR/pocket
make destroy.local

This will delete the cluster entirely, along with any state.

Please note that if you get logged out of your shell window on the np04daq cluster before you run these steps to destroy the cluster, you can log back in and run them without having to re-do any of the steps from earlier sections of these instructions.