Monitoring Docker Swarm with cAdvisor, InfluxDB, and Grafana for the Keep-Network Project

Ninafurs
Oct 31, 2020


To know the status of your running applications, you have to monitor them constantly. And if your applications run in a highly scalable environment such as Docker Swarm, you also need a highly scalable monitoring tool. This article walks through setting up exactly such a tool.

In the process, we will install cAdvisor agents on each node to collect host and container metrics. The metrics will be stored in InfluxDB, and Grafana will be used to build graphs on top of them. All of these tools are open source and can be deployed as containers.

To build a cluster, we will use Docker Swarm mode and deploy the necessary services as a stack. This lets us organize a dynamic monitoring system that automatically starts monitoring new nodes as they are added to the swarm. The project files can be found here. (link)

Overview of tools

The choice of monitoring systems is quite large. To build our stack, we will use open source services that work well in containers. Next, I will describe the composition of the stack.

cAdvisor

cAdvisor will collect host and container metrics. It runs from a Docker image with the Docker socket and the host's root file system mounted as volumes. cAdvisor can write the collected metrics to several kinds of time-series databases, including InfluxDB and Prometheus. It even has a web interface that builds graphs from the collected data.

InfluxDB

InfluxDB is an open source time-series database that stores numeric metrics and lets you assign tags to them. It implements an SQL-like query language for working with the stored data. We will filter events using tags such as the host or even an individual container.

Grafana

Grafana is a popular visualization tool that can build dashboards from data in Graphite, Elasticsearch, OpenTSDB, Prometheus, and, of course, InfluxDB. Starting with version 4, you can also configure alerts based on query results. We will create a dashboard that can display data for a specific host and service.

Docker Swarm Mode

Swarm Mode was introduced in Docker 1.12.0. It lets you easily create a swarm from multiple hosts and manage it. To make the built-in service discovery and orchestration mechanisms work, Swarm Mode implements a key-value store. Hosts can act as manager or worker nodes. In general, managers are responsible for orchestration, while containers run on the worker nodes. Since this is a demo installation, we will host InfluxDB and Grafana on the manager.

Swarm Mode has an interesting feature called the routing mesh, which acts as a virtual load balancer. Let's say we have 10 containers listening on port 80 that are running on 5 nodes. If you try to access port 80 of one of these nodes, the request can be sent to any of the containers, even one running on a different host. Thus, by publishing the IP address of any node, you automatically get request balancing across the ten containers.
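As a quick illustration (not part of this walkthrough; the web service name and the nginx image are just placeholders), you can see the routing mesh at work on any existing swarm with a service that publishes a port:

docker service create --name web --replicas 10 --publish 80:80 nginx
# port 80 on any node now reaches one of the ten containers,
# regardless of which node it actually runs on:
curl http://<ip-of-any-node>
# clean up when done:
docker service rm web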

The swarm will consist of three local VMs, which we will create with Docker Machine using the VirtualBox driver. For this you need VirtualBox installed. With other drivers you can deploy the VMs to cloud providers; the steps after the machines are created are the same for all drivers.

When creating the VMs, we will keep the default options. The host that acts as the swarm manager is called manager, and the worker nodes are keep_ecdsa and keep_beacon. You can create as many nodes as you want; just repeat the commands below with different host names. To create the VMs, run:

docker-machine create manager
docker-machine create keep_ecdsa
docker-machine create keep_beacon

These commands may take some time to complete.
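As an optional sanity check, docker-machine ls should show all three VMs in the Running state, along with the IP address assigned to each:

docker-machine ls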

To use the Docker engine on the manager host, you must switch the context. From now on, we will run commands against the Docker engine installed on the manager host, NOT on the local system. To do this, run:

eval `docker-machine env manager`

Now that the Docker CLI points at the manager, we initialize this host as the swarm manager. We will need its IP address, which will be advertised to the other nodes that join; the docker-machine ip manager command returns it. So, to create the swarm, run:

docker swarm init --advertise-addr `docker-machine ip manager`

Now we need two worker nodes. To join them, pass the join token and the IP address advertised when the swarm was created. To get the token, run docker swarm join-token -q worker. The docker-machine ip manager command, as before, gives the manager's IP, and the standard port is 2377. We could add the new machines to the swarm by switching to the context of each worker node in turn, but it is much easier to run these commands over SSH. To attach the worker nodes to the swarm, run:

docker-machine ssh keep_ecdsa docker swarm join --token `docker swarm join-token -q worker` `docker-machine ip manager`:2377

docker-machine ssh keep_beacon docker swarm join --token `docker swarm join-token -q worker` `docker-machine ip manager`:2377

The list of nodes in the swarm can be printed with docker node ls. After adding the worker nodes, the output should look like this:

ID                            HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
3j231njh03spl0j8h67z069cy *   manager       Ready   Active        Leader
muxpteij6aldkixnl31f0asar     keep_ecdsa    Ready   Active
y2gstaqpqix1exz09nyjn8z41     keep_beacon   Ready   Active

Docker Stack

With version 3 of the docker-compose file format, you can define the entire service stack in a single file, including the deployment strategy, and deploy it with a single command. The main difference between version 3 and version 2 is the deploy key in each service description, which determines how the service's containers are deployed. The docker-compose file for the test monitoring system is shown below:

version: '3'

services:
  influx:
    image: influxdb
    volumes:
      - influx:/var/lib/influxdb
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager

  grafana:
    image: grafana/grafana
    ports:
      - 0.0.0.0:80:3000
    volumes:
      - grafana:/var/lib/grafana
    depends_on:
      - influx
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager

  cadvisor:
    image: google/cadvisor
    hostname: '{{.Node.ID}}'
    command: -logtostderr -docker_only -storage_driver=influxdb -storage_driver_db=cadvisor -storage_driver_host=influx:8086
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    depends_on:
      - influx
    deploy:
      mode: global

volumes:
  influx:
    driver: local
  grafana:
    driver: local

There are 3 services in our stack, which are described below.

influx

Here we use the influxdb image. For persistent storage, we create an influx volume that is mounted at /var/lib/influxdb inside the container. We only need one instance of InfluxDB, and it is constrained to the manager host. The Docker engine we are working with runs on the same host, so commands can be executed in this container directly. Since both of the other services need InfluxDB, we add a depends_on key with the value influx to their descriptions.

Grafana

We use the grafana/grafana image and publish the container's port 3000 on the host's port 80. The routing mesh lets you reach Grafana via port 80 of any host in the swarm. For persistent storage, we create another volume called grafana, mounted at /var/lib/grafana inside the container. Grafana is also deployed on the manager host.

cAdvisor

Setting up cAdvisor takes a little more work than the previous services. Choosing the hostname value here is not trivial. We are going to run an agent on each node, and that container will collect metrics for the node and the containers running on it. When cAdvisor sends metrics to InfluxDB, it sets a machine tag containing the hostname of the cAdvisor container, and we want its value to match the ID of the node it is running on. Docker stacks allow templates in these values, so by setting each container's hostname to the ID of the node it runs on, we can determine where each metric came from. This is achieved with the expression '{{.Node.ID}}'.

We also pass several command-line parameters to cAdvisor. The -logtostderr flag redirects cAdvisor's logs to stderr, which simplifies debugging. The -docker_only flag says we are only interested in Docker containers. The next three parameters define where the collected metrics should be stored: we ask cAdvisor to put them in the cadvisor database on the InfluxDB server listening at influx:8086, that is, our stack's influx service. Inside the stack, services can reach each other's ports directly, so you don't need to publish them separately.

The volumes listed for the service are needed by cAdvisor to collect metrics from the host and from Docker. We deploy cadvisor in global mode, which ensures that exactly one instance of the service runs on every node in the swarm.

At the end of the file is the top-level volumes key, which declares the influx and grafana volumes. Since both volumes will live on the manager host, we use the local driver for them.

To deploy the stack, save the file above as docker-stack.yml and run:

docker stack deploy -c docker-stack.yml monitor

This starts the monitor stack's services. The first run of the command may take some time, since the nodes must pull the container images. You also need to create a database named cadvisor in InfluxDB for storing the metrics:

docker exec `docker ps | grep -i influx | awk '{print $1}'` influx -execute 'CREATE DATABASE cadvisor'

The command may fail with a message that the influx container does not exist; that just means the container is not ready yet. Wait a bit and run the command again. We can run commands in the influx service because it runs on the manager host and we are using the Docker engine installed there. To find the ID of the InfluxDB container, we use docker ps | grep -i influx | awk '{print $1}', and to create the database named cadvisor we run influx -execute 'CREATE DATABASE cadvisor' inside it.
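A couple of optional checks, using the same pattern as the command above, confirm that the database exists and that cAdvisor has started writing metrics into it (the exact measurement names depend on the cAdvisor version, and the first series may take a minute to appear):

docker exec `docker ps | grep -i influx | awk '{print $1}'` influx -execute 'SHOW DATABASES'
docker exec `docker ps | grep -i influx | awk '{print $1}'` influx -database cadvisor -execute 'SHOW MEASUREMENTS'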

To list the stack services, run docker stack services monitor. The output of the command will look something like this:

ID            NAME              MODE        REPLICAS  IMAGE
0fru8w12pqdx  monitor_influx    replicated  1/1       influxdb:latest
m4r34h5ho984  monitor_grafana   replicated  1/1       grafana/grafana:latest
s1yeap330m7e  monitor_cadvisor  global      3/3       google/cadvisor:latest

Setting Up Grafana

After all the services are deployed, you can open Grafana. The IP address of any swarm node will do; we will use the manager's IP address by running:

open http://`docker-machine ip manager`
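Thanks to the routing mesh, the same UI answers on port 80 of the worker nodes as well; an optional check from the local shell:

curl -I http://`docker-machine ip keep_ecdsa`
curl -I http://`docker-machine ip keep_beacon`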

By default, the username admin and password admin are used to log in to Grafana. In Grafana, add InfluxDB as a data source. The home page should contain a Create your first data source link; click it. If the link is not there, select Add data source from the Data Sources menu, which opens the form for adding a new data source.

You can give the data source any name. Check the Default checkbox so that you don't have to select it in other forms later. Set Type to InfluxDB, URL to http://influx:8086, and Access to proxy; this points Grafana at our InfluxDB container. In the Database field, enter cadvisor, click Save & Test, and you should see the message "Data source is working".
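If you prefer the command line, the same data source can be created through Grafana's HTTP API instead of the form. This is an optional sketch that assumes the default admin/admin credentials and the data source name influx:

curl -s -X POST http://admin:admin@`docker-machine ip manager`/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"influx","type":"influxdb","access":"proxy","url":"http://influx:8086","database":"cadvisor","isDefault":true}'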

The project's GitHub repository {Link} contains a dashboard.json file ready for import into Grafana. It describes a dashboard for monitoring the hosts and containers running in the swarm. We will simply import this dashboard now and discuss it in the next section. Hover over the Dashboards menu item and select Import. Click Upload .json file, select dashboard.json, then choose the data source and click Import.

Grafana Dashboard

The dashboard imported into Grafana is designed for monitoring swarm hosts and containers. You can drill down to the level of a single host and the containers running on it. For this we need two template variables, which rely on Grafana's templating feature: host for node selection and container for container selection. To view these variables, open the dashboard settings and click Templating.

The first variable, host, allows you to select a node and its metrics. When cAdvisor sends metrics to InfluxDB, it attaches several tags that can be used for filtering. We have a tag called machine, which contains the hostname of the cAdvisor instance; in our case it matches the node ID in the swarm. To get the tag values, we use the query SHOW TAG VALUES WITH KEY = "machine".

The second variable, container, allows you to drill down to the container level. We have a tag named container_name, which, predictably, contains the name of the container. We also need to filter by the value of the host variable, so the query looks like this: SHOW TAG VALUES WITH KEY = "container_name" WHERE machine =~ /^$host$/. It returns the list of containers on the host selected in the host variable.
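For reference, here are the two template queries side by side, as they are entered in each variable's Query field:

host:      SHOW TAG VALUES WITH KEY = "machine"
container: SHOW TAG VALUES WITH KEY = "container_name" WHERE machine =~ /^$host$/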

The container name will look something like this:

monitor_cadvisor.y78ac29r904m8uy6hxffb7uvn.3j231njh03spl0j8h67z069cy

However, we are only interested in the monitor_cadvisor part, up to the first dot. If multiple instances of the same service are running, their data still needs to be plotted as separate series. To extract the substring up to the first dot, use the regular expression /([^.]+)/ in the variable's Regex field.

The variables are configured, and now we can use them in the graphs. Let's look at the Memory graph; the rest work on the same principle. Memory data is stored in InfluxDB in the memory_usage measurement, so the query starts with SELECT "value" FROM "memory_usage".

Now we add filters in the WHERE clause. The first condition is that machine equals the value of the host variable: "machine" =~ /^$host$/. The next condition is that container_name starts with the value of the container variable; we use a "starts with" match because we trimmed the container variable at the first dot: "container_name" =~ /^$container$*/. The last condition limits the results to the time range selected in the Grafana dashboard via $timeFilter. The query now looks like this:

SELECT "value" FROM "memory_usage" WHERE "container_name" =~ /^$container$*/ AND "machine" =~ /^$host$/ AND $timeFilter

Since we need separate rows for different hosts and containers, we need to group the data based on the values of the machine and container_name tags:

SELECT "value" FROM "memory_usage" WHERE "container_name" =~ /^$container$*/ AND "machine" =~ /^$host$/ AND $timeFilter GROUP BY "machine", "container_name"

We also set an alias for this query: Memory {host: $tag_machine, container: $tag_container_name}. Here, $tag_machine is replaced with the value of the machine tag and $tag_container_name with the value of the container_name tag. The other graphs are configured in a similar way; only the measurement names change. You can also create Grafana alerts for these metrics.
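If you want to sanity-check the underlying data outside Grafana, the template variables can be replaced with concrete values and the query run through the influx CLI. The container-name prefix below is just the stack's cAdvisor service and the time range is arbitrary:

docker exec `docker ps | grep -i influx | awk '{print $1}'` influx -database cadvisor -execute 'SELECT "value" FROM "memory_usage" WHERE "container_name" =~ /^monitor_cadvisor/ AND time > now() - 5m LIMIT 5'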

Conclusion

In this article, we created a scalable monitoring system for Docker Swarm, which automatically collects metrics from all hosts and containers included in the swarm. In the process, we got acquainted with popular open source tools: Grafana, InfluxDB, and cAdvisor.
