Metrics & Monitoring with collectd, InfluxDB, and Grafana

The following is a lightly updated and expanded version of a talk given at the North Bay Linux Users’ Group on July 11th, 2017.

Introduction

Metrics & Monitoring
with collectd,
InfluxDB, and Grafana

Hi, I’m Tom, and I like graphing things.

I’m going to talk about how I set up monitoring at home using a few open source tools.

I started down this road a few months ago when I started trying to diagnose performance issues with my home NAS. It’s metastasized from there — now I monitor my Internet bandwidth use, disk use, and even the temperature and humidity of plants.

collectd → InfluxDB → Grafana
  gets      stores     shows

I’m going to give an overview of a monitoring system with three components:

  1. collectd — a daemon that collects metrics from the system.
  2. InfluxDB — a specialized timeseries database for storing metrics.
  3. Grafana — a web-based dashboard UI that lets us build dashboards that display metrics in graphs.

Demo

(Live demo of a few Grafana dashboards, including a NodeMCU equipped with a temperature and humidity sensor.)

A dual-scale graph of temperature and humidity, which fluctuate diurnally over the course of a week.
Example 1: Temperature and humidity reported by a DHT11 sensor via a NodeMCU.
• Configuring collectd to get useful metrics
• Storing metrics in InfluxDB
• Graphing with Grafana

Now, we’ll take a look at how to set up a system like that.

Since we just saw Grafana, let’s start there.

Grafana

Grafana
Open Source
Apache 2.0 license

Grafana is an open source dashboarding and visualization tool.

It’s licensed under the Apache 2.0 license.

[Update: It was relicensed as AGPLv3 in 2021.]

Install from .deb
I demoed version 4.4

The project offers packages for all major distributions.

I install it from the Debian/Ubuntu repository.

Backends
InfluxDB, Graphite, Elasticsearch,
Prometheus, OpenTSDB

Grafana is a flexible tool: it can talk to many time-series databases, and even regular SQL databases like PostgreSQL.

You can mix and match data sources, too.

collectd

collectd

Plugin-based metric collection system.

It’s open source — GPLv2 — and cross-platform to various UNIX-like systems.

Implemented in C, so it has a very low resource footprint.

Gathering Metrics with collectd

By default, collectd stores the metrics it collects in RRDtool database files.

It can also send them over the network to another collectd instance or a database (like InfluxDB).

I disable the RRDtool storage plugin because it produces many small write operations. Definitely disable it if you run collectd on a Raspberry Pi, or it’ll eat up your SD card.

collectd is cross-platform
...but I’m talking about Linux

Caveat: I’m only familiar with collectd on Linux.

The specific metrics it gathers do vary by OS.

Configuring collectd

# $EDITOR /etc/collectd/collectd.conf

collectd is configured in /etc/collectd/collectd.conf

Hostname "mymachine"
FQDNLookup false

The first thing you should configure is the name of the host.

I set this to the machine’s hostname and tell collectd not to convert it into a fully-qualified name, since hosts on my LAN don’t have FQDNs.

This name becomes part of every metric identifier collectd generates, so it must be unique across your hosts.
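For example, the hostname ends up as the first component of each metric identifier collectd emits (the plugin and type instances shown here are typical ones, for illustration):

```text
mymachine/cpu-0/cpu-user
mymachine/df-root/df_complex-free
mymachine/load/load
```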

$ sudo systemctl restart collectd
# may take a few seconds...
$ sudo systemctl status collectd
 collectd.service - Statistics collection and monitoring daemon
   Loaded: loaded (/lib/systemd/system/collectd.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2017-07-08 02:23:59 PDT; 4s ago
     Docs: man:collectd(1)
           man:collectd.conf(5)
           https://collectd.org
  Process: 9433 ExecStartPre=/usr/sbin/collectd -t (code=exited, status=0/SUCCESS)
 Main PID: 9444 (collectd)
    Tasks: 13
   Memory: 17.6M
      CPU: 2.197s
   CGroup: /system.slice/collectd.service
           ├─9444 /usr/sbin/collectd
           └─9470 /usr/bin/python3 /usr/local/bin/uverse-collect

Jul 08 02:23:59 aorist systemd[1]: Starting Statistics collection and monitoring daemon...
Jul 08 02:23:59 aorist collectd[9444]: supervised by systemd, will signal readyness
Jul 08 02:23:59 aorist collectd[9444]: cpufreq plugin: Found 4 CPUs
Jul 08 02:23:59 aorist systemd[1]: Started Statistics collection and monitoring daemon.
Jul 08 02:23:59 aorist collectd[9444]: Initialization complete, entering read-loop.

To apply changes, restart collectd with systemctl restart collectd

collectd Plugins

Loading Plugins

collectd includes many plugins in the core executable, and can dynamically load others.

You can even write scripts to do custom collection.

LoadPlugin foo

<Plugin foo>
    Bar "true"
    Baz "1234"
</Plugin>

Each plugin needs a LoadPlugin directive (only one per plugin!).

Some plugins require further configuration in a <Plugin> block.

Basic Monitoring

LoadPlugin load
LoadPlugin cpu
LoadPlugin cpufreq

There are a bunch of plugins you’ll always want:

  1. load gets you the load average (like what top reports)
  2. cpu samples CPU utilization
  3. cpufreq captures the CPU frequency, which is useful for interpreting CPU utilization
A graph of CPU frequency, which varies between 1 GHz and 1.499 GHz over the course of an hour.
Example 2: CPU frequency captured by the cpufreq plugin.
A graph of system 1-, 5-, and 15-minute load average, which rises and falls over the course of an hour.
Example 3: Load average captured by the load plugin over the same period.
LoadPlugin memory
LoadPlugin swap
LoadPlugin uptime
LoadPlugin users
  1. memory provides memory use
  2. swap captures swap use
  3. uptime does what you’d expect
  4. users samples the number of active user sessions (i.e. SSH logins)
LoadPlugin processes

Loading the processes plugin gets you counts of processes in various states (running, sleeping, blocked, zombie, and so on), like ps reports.

<Plugin processes>
    Process "name"
    ProcessMatch "name" "regex"
</Plugin>
  • You can get really detailed stats on a specific process, too.
  • This includes process and thread count, CPU time, memory, page faults, and I/O.
  • This is mostly useful for daemons and such.

Monitoring Disks

LoadPlugin df    # capacity
LoadPlugin disk  # I/O operations

df records available space, while disk tracks I/O operation counters.

I suggest that you select specific block devices or filesystems, as there’s little point monitoring the temporary filesystems created by Docker or Snap. For example:

<Plugin df>
    MountPoint "/"
    IgnoreSelected false
</Plugin>
# S.M.A.R.T. via smartmontools
LoadPlugin smart

<Plugin smart>
    Disk "/^[hs]d[a-f][0-9]?$/"
    IgnoreSelected false
</Plugin>

Likewise, you can gather disk health data with the smart plugin.

Sensors

Sensors
Temperature, voltage,
& fan speed

collectd integrates with lm_sensors, though this requires some additional configuration.

LoadPlugin sensors

Load the sensors plugin.

You must configure lm_sensors:

# sensors-detect

collectd reads /etc/sensors3.conf

Run sensors-detect as root to generate /etc/sensors3.conf.

Network I/O

Network I/O
LoadPlugin interface

<Plugin interface>
    Interface "/^en/"
    IgnoreSelected false
</Plugin>

Monitor network I/O with the interface plugin.

By default it monitors all interfaces. I suggest limiting it to specific physical interfaces.

interface_rx and interface_tx measurements

{host: ..., instance: enp0s31f6, type: if_octets}
{host: ..., instance: enp0s31f6, type: if_packets}
{host: ..., instance: enp0s31f6, type: if_errors}

Counters, so take the non-negative derivative.

This captures three types of metrics for each interface (transmit and receive):

  1. if_octets — bytes transmitted
  2. if_packets — the number of packets transmitted
  3. if_errors — count of errors like checksum mismatches

These are all counters: they only ever increase, and they occasionally reset (on reboot) or wrap. When graphing, take the non-negative derivative to get a rate.
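In InfluxQL, graphing received bytes per second looks something like the query below. The measurement and tag names are the ones the collectd input produces (as shown above); "mymachine", the one-hour window, and the one-minute grouping are placeholders to adjust to taste:

```sql
SELECT non_negative_derivative(mean("value"), 1s)
  FROM "interface_rx"
  WHERE "type" = 'if_octets' AND "host" = 'mymachine'
    AND time > now() - 1h
  GROUP BY time(1m), "instance"
```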

SNMP

SNMP Devices

Many network devices can be monitored via the “Simple Network Management Protocol”, SNMP.

I’m not going to go deep into SNMP now, but provide a few pointers.

# apt install snmp snmp-mibs-downloader
# $EDITOR /etc/snmp/snmp.conf

First, note that the default collectd.conf includes examples that reference MIB files that aren’t included in the Debian/Ubuntu snmp package due to their licensing.

Obtain them by running snmp-mibs-downloader.

Then, edit snmp.conf to enable loading them (its comments tell you to comment out the "mibs :" line).

LoadPlugin snmp

<Plugin snmp>
    <Data "std_traffic">
        Type "if_octets"
        Table true
        InstancePrefix "traffic"
        Instance "IF-MIB::ifDescr"
        Values "IF-MIB::ifInOctets" "IF-MIB::ifOutOctets"
    </Data>
    <Host "router">
        Address "192.168.1.1"
        Version 2
        Community "mycommunity"
        Collect "std_traffic"
    </Host>
</Plugin>

But home routers lack SNMP...

Unfortunately, few consumer-grade routers support SNMP. How do we monitor them?

Custom Plugins

Scraping!

The exec plugin lets you run a custom script as a subprocess.

The script periodically prints metrics.

I use this script to monitor my AT&T cable modem:

<Plugin exec>
    Exec nobody "/usr/local/bin/uverse-collect.py"
</Plugin>

Every 30 seconds, it pulls the interface metrics from the HTML admin page the router serves.
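For reference, here’s the shape of such a script. Everything router-specific is stubbed out (the metric names are made up), but the PUTVAL lines on stdout are the protocol collectd’s exec plugin expects, and collectd really does pass COLLECTD_HOSTNAME and COLLECTD_INTERVAL to its exec children:

```python
#!/usr/bin/env python3
"""Sketch of a collectd exec-plugin script (router details stubbed out)."""
import os

# collectd exports these variables to exec-plugin children.
HOST = os.environ.get("COLLECTD_HOSTNAME", "mymachine")
INTERVAL = int(float(os.environ.get("COLLECTD_INTERVAL", "30")))

def putval(plugin_instance, type_instance, value, when="N"):
    """Format one PUTVAL line for collectd ("N" means "now")."""
    ident = f"{HOST}/exec-{plugin_instance}/gauge-{type_instance}"
    return f'PUTVAL "{ident}" interval={INTERVAL} {when}:{value}'

def collect():
    """Placeholder: a real script would scrape the router's admin page."""
    return {"rx_bytes": 0, "tx_bytes": 0}

def main():
    # The real script loops forever, sleeping INTERVAL seconds
    # between reports; one pass is shown here.
    for name, value in collect().items():
        print(putval("router", name, value), flush=True)

if __name__ == "__main__":
    main()
```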

InfluxDB

InfluxDB
Open Core
MIT license

I am presently storing metrics in InfluxDB, a database specialized to handle time-series data. It’s “open core”, meaning there is a functional open source version of the product, but some features are gated behind “enterprise” licensing.

InfluxDB natively supports the collectd network protocol, so setup is easy:

<Plugin network>
    <Server "ip-of-influxdb" "25826">
        SecurityLevel None  # as appropriate
    </Server>
</Plugin>
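On the InfluxDB side, the collectd listener is enabled in influxdb.conf. This is the 1.x-era [[collectd]] section as I used it; adjust the typesdb path to wherever your distribution installs collectd’s types.db:

```toml
[[collectd]]
  enabled = true
  bind-address = ":25826"
  database = "collectd"
  typesdb = "/usr/share/collectd/types.db"
```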
Nice SQL-like query language
Easy to install (.deb)

InfluxDB includes a nice SQL-like query language and is easy to install.

It’s kinder on the disk than RRDtool, which was attractive when I was prototyping all of this on a Raspberry Pi.

Eats disk and memory

However, it can also eat disk and memory.

The load from it spikes when it periodically compacts its on-disk data.

This is fine on a PC, but don’t try to run it on a Raspberry Pi — you’ll run out of memory if you have many metrics.

Alternative Solutions

Was all this too complicated?

I’ve been talking about how to assemble a monitoring system from parts. However, there are some simpler options.

Munin
5 minute resolution
Great out-of-the-box defaults
Easy to script collection
Doesn’t scale up well

Munin is a simple system with good defaults. It collects most of what you want with nothing more than an apt install. Simply serve up the HTML and images it generates with any web server.

You can run the graph generation on a central server and the munin-node agent on each machine you want to monitor. Gathering custom metrics is simple too: write a script (in any language) and Munin will run it every five minutes.
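A Munin plugin is just an executable that prints name.value pairs, plus graph metadata when invoked with "config". A minimal sketch (the "things" metric is made up):

```python
#!/usr/bin/env python3
"""Minimal sketch of a Munin plugin; the "things" metric is made up."""
import sys

def plugin_output(args):
    """Return the lines Munin expects for the given arguments."""
    if args and args[0] == "config":
        # "config" asks the plugin to describe its graph.
        return [
            "graph_title Example things",
            "graph_vlabel things",
            "things.label things",
        ]
    # Normal run: report the current value.
    return ["things.value 42"]

if __name__ == "__main__":
    print("\n".join(plugin_output(sys.argv[1:])))
```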

Munin’s main disadvantages are that five-minute polling interval and that it (by default) generates all graphs up-front. The RRDtool database it uses also generates many random writes. A decent SSD won’t mind, but it’ll kill an SD card if you try it on a Raspberry Pi.

See also: Cacti (PHP instead of Perl!)

Is your life too simple?

Do you need a Web Scale™ metrics system?

Prometheus
“Cloud native”
Pulls metrics via HTTP
Adapter services!
Scales up

Prometheus is the trendy new hotness. I haven’t tried it out yet.

It’s supposed to deal better with dynamic metrics, like you get when monitoring containers.

Questions

© 2017, 2018, 2024 Tom Most