The following is a lightly updated and expanded version of a talk given at the North Bay Linux Users’ Group July 11th, 2017.
Hi, I’m Tom, and I like graphing things.
I’m going to talk about how I set up monitoring at home using a few open source tools.
I started down this road a few months ago when I started trying to diagnose performance issues with my home NAS. It’s metastasized from there — now I monitor my Internet bandwidth use, disk use, and even the temperature and humidity of plants.
I’m going to give an overview of a monitoring system with three components: collectd to gather metrics, InfluxDB to store them, and Grafana to graph them.
(Live demo of a few Grafana dashboards, including a NodeMCU equipped with a temperature and humidity sensor.)
Now, we’ll take a look at how to set up a system like that.
Since we just saw Grafana, let’s start there.
The project offers packages for all major distributions.
I install it from the Debian/Ubuntu repository.
Grafana is a flexible tool: it can talk to many time-series databases, and even regular SQL databases like PostgreSQL.
You can mix and match data sources, too.
collectd is a plugin-based metric collection system.
It’s open source — GPLv2 — and runs on various UNIX-like systems.
Implemented in C, so it has a very low resource footprint.
By default, collectd stores the metrics it collects in RRDtool database files.
It can also send them over the network to another collectd instance or a database (like InfluxDB).
I disable the RRDtool storage plugin because it produces many small write operations. Definitely disable RRDtool if you run it on a Raspberry Pi or it’ll eat up your SD card.
Caveat: I’m only familiar with collectd on Linux.
The specific metrics it gathers do vary by OS.
collectd is configured in /etc/collectd/collectd.conf.
The first thing you should configure is the name of the host.
I set this to the machine’s hostname and tell collectd not to convert it into a fully-qualified name, since hosts on my LAN don’t have FQDNs.
This name affects the name of metrics generated and must be unique.
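In collectd.conf that looks something like this (the hostname here is just an example — use your machine’s name):

```
Hostname "nas"
FQDNLookup false
```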
To apply changes, restart collectd with systemctl restart collectd
collectd includes many plugins in the core executable, and can dynamically load others.
You can even write scripts to do custom collection.
For each plugin you need a LoadPlugin directive (only once!).
Some plugins require further configuration in a <Plugin> block.
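For example, here’s the bundled syslog log-output plugin, loaded once and then configured in its own block:

```
LoadPlugin syslog

<Plugin syslog>
  LogLevel info
</Plugin>
```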
There are a bunch of plugins you’ll always want:

- load gets you load averages (like what top reports)
- cpu samples your CPU utilization
- cpufreq captures CPU frequency, which is useful for interpreting CPU utilization
- memory provides memory use
- swap captures swap use
- uptime does what you’d expect
- users samples the number of active user sessions (i.e. SSH logins)
- processes gets you counts of processes in various states (running, sleeping, blocked, zombie, etc.), like ps reports
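None of these needs a <Plugin> block for basic use, so enabling them is just a run of LoadPlugin lines in collectd.conf:

```
LoadPlugin load
LoadPlugin cpu
LoadPlugin cpufreq
LoadPlugin memory
LoadPlugin swap
LoadPlugin uptime
LoadPlugin users
LoadPlugin processes
```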
The df plugin records available space, while disk tracks I/O operation counters.
I suggest that you select specific block devices or filesystems, as there’s little point monitoring the temporary filesystems created by Docker or Snap. For example:
<Plugin df>
  MountPoint "/"
  IgnoreSelected false
</Plugin>
Likewise, you can gather disk health data with the smart plugin.
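A minimal sketch, selecting a single physical disk (the device name is an example — adjust it for your system):

```
LoadPlugin smart

<Plugin smart>
  Disk "sda"
  IgnoreSelected false
</Plugin>
```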
collectd integrates with lm_sensors, though this requires some additional configuration.
Load the sensors plugin.
Run sensors-detect as root to generate /etc/sensors3.conf.
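With that file in place, loading the plugin with no further configuration records every sensor it finds; you can optionally select specific ones. The sensor name below is an example — run the sensors command to see what your hardware reports:

```
LoadPlugin sensors

<Plugin sensors>
  Sensor "coretemp-isa-0000/temperature-temp1"
  IgnoreSelected false
</Plugin>
```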
Monitor network I/O with the interface plugin.
By default it monitors all interfaces. I suggest limiting it to specific physical interfaces.
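For example, to watch only the wired interface (substitute your interface name):

```
LoadPlugin interface

<Plugin interface>
  Interface "eth0"
  IgnoreSelected false
</Plugin>
```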
This captures three types of metrics for each interface (transmit and receive):
- if_octets — bytes transmitted
- if_packets — the number of datagrams transmitted
- if_errors — count of errors like checksum mismatches

These are all counters, so they may overflow. When graphing, take the non-negative derivative to get a rate.
Many network devices can be monitored via the “Simple Network Management Protocol”, SNMP.
I’m not going to go deep into SNMP now, but I’ll offer a few pointers.
First, note that the default collectd.conf includes examples that reference MIB files that aren’t included in the Debian/Ubuntu snmp package due to their licensing.
Obtain them by installing the snmp-mibs-downloader package.
Then, edit snmp.conf to enable loading them (a comment there tells you which line to comment out).
Unfortunately, few consumer-grade routers support SNMP. How do we monitor them?
The exec plugin lets you run a custom script as a subprocess.
The script periodically prints metrics.
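An exec script just writes PUTVAL lines to stdout, which collectd parses. Here’s a minimal Python sketch — the plugin instance and metric names are made up, and a real script would loop forever reading an actual data source:

```python
import os

# collectd exports these environment variables to exec scripts; fall
# back to defaults when running the script by hand.
HOSTNAME = os.environ.get("COLLECTD_HOSTNAME", "localhost")
INTERVAL = int(float(os.environ.get("COLLECTD_INTERVAL", "30")))

def putval(plugin_instance, type_instance, value):
    # One metric sample in the exec plugin's line protocol; the "N"
    # timestamp means "now".
    return 'PUTVAL "{}/exec-{}/gauge-{}" interval={} N:{}'.format(
        HOSTNAME, plugin_instance, type_instance, INTERVAL, value)

# A real collector would do this in a loop, sleeping INTERVAL seconds
# between samples:
print(putval("router", "rx_bytes", 12345))
```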
I use this script to monitor my AT&T cable modem:
<Plugin exec>
Exec nobody "/usr/local/bin/uverse-collect.py"
</Plugin>
Every 30 seconds, it pulls the interface metrics from the HTML admin page the router provides.
I am presently storing metrics in InfluxDB, a database specialized to handle time-series data. It’s “open core”, meaning there is a functional open source version of the product, but some features are gated behind “enterprise” licensing.
InfluxDB natively supports the collectd network protocol, so setup is easy:
<Plugin network>
  <Server "ip-of-influxdb" "25826">
    SecurityLevel None # as appropriate
  </Server>
</Plugin>
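On the InfluxDB side (the 1.x series, current when this talk was given), the matching listener is enabled in influxdb.conf. The database name and types.db path below are common defaults — adjust them to match your setup:

```
[[collectd]]
  enabled = true
  bind-address = ":25826"
  database = "collectd"
  typesdb = "/usr/share/collectd/types.db"
```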
InfluxDB includes a nice SQL-like query language and is easy to install.
It’s kinder on the disk than RRDtool, which was attractive when I was prototyping all of this on a Raspberry Pi.
However, it can also eat disk and memory.
The load from it spikes when it periodically compacts its logs.
This is fine on a PC, but don’t try to run it on a Raspberry Pi — you’ll run out of memory if you have many metrics.
I’ve been talking about how to assemble a monitoring system from parts. However, there are some simpler options.
Munin is a simple system with good defaults.
It collects most of what you want with nothing more than an apt install.
Simply serve up the HTML and images it generates with any web server.
You can run the graph generation on a central server and the munin-node agent on each machine you want to monitor.
Gathering custom metrics is simple too: write a script (in any language) and Munin will run it every five minutes.
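A Munin plugin is just an executable that prints field values, plus graph metadata when invoked with the config argument. A minimal Python sketch — the graph and field names here are made up:

```python
import os
import sys

def config():
    # Munin runs the plugin with "config" once to learn how to draw
    # the graph.
    print("graph_title Load average")
    print("graph_vlabel load")
    print("load.label 1-minute load")

def fetch():
    # On a normal run, Munin expects "<field>.value <number>" lines.
    print("load.value {:.2f}".format(os.getloadavg()[0]))

if __name__ == "__main__":
    if "config" in sys.argv[1:]:
        config()
    else:
        fetch()
```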
Munin’s main disadvantages are that five-minute polling interval and that it (by default) generates all graphs up-front. The RRDtool database it uses also generates many random writes. A decent SSD won’t mind, but it’ll kill an SD card if you try it on a Raspberry Pi.
See also: Cacti (PHP instead of Perl!)
Do you need a Web Scale™ metrics system?
Prometheus is the trendy new hotness. I haven’t tried it out yet.
It’s supposed to deal better with dynamic metrics, like you get when monitoring containers.
© 2017, 2018, 2024 Tom Most