It certainly means different things to different people:
As sysadmins, we want to know our systems are OK, down to the slightest detail, and if not, what is wrong: preferably before it happens, or at least while it happens. As (enlightened) developers, we want to be able to follow our applications’ behaviour in production. As service managers, we want to know whether we’re delivering the service as agreed: what’s up, what’s down, for how long, how slow, and who’s to blame. As managers, we might just want the bottom line (how many downloads, sales, ...) in a pretty widget. You can probably think of a few more.
And you’re probably not happy with what you have (or you wouldn’t be at #monitoringsux), which is most likely several different systems running beside each other, with different paradigms, platform-dependent APIs (if any), and different web GUIs with linear lists and state colours that you force to refresh every few minutes, and that even then show old information. Worse, these systems hold credentials to authenticate to the systems they monitor, making them dangerous points of failure in terms of security.
Depending on what you’re trying to monitor, you may be OK with all of these.
But if you’re like us, you’ll end up in a multi-everything (platform, application, networks, silos, sites, policies) environment with no end of interdependencies, where at least some applications are interactive and time-critical, and the sysadmins and developers are colleagues, horizontal team members, or all the same people. This is the type of environment that we’re growing Extremon for.
Taking three important headlines from the Extreme Monitoring Manifesto:
Most of the data you’re gathering will be required for different purposes. The service response time and validity you’re testing in a functional test tells the sysadmins in the data center that it’s fast enough, tells the developer that his caching strategy works in production, tells the service manager that you’re OK with the SLA, tells the 1st and 2nd line support *at one glance* that the problem isn’t the server, etc. It makes sense to gather the data only once, which gives you breathing room to gather it more intensively. I propose starting at one probe per second, which is peanuts for most modern systems, but which will give you data points at the highest resolution you’ll ever want (you can always average over longer periods for different uses). I find that services that have issues with one-second probes are in deep trouble anyway, and should be rethought. Of course you shouldn’t come up with the heaviest possible data or query set on purpose. But for normal use, 1/sec is really nothing.
Agent Push really is the only option for system metrics, at that speed, and that’s fine: Provisioning agents is a near zero cost game given DevOps practice, agent push solves the monitoring security issue in one fell swoop, requiring no connections from the monitoring hosts to the agent, hence, no authentication, no technical users, no flaws to exploit, and no endless login/measure/logout sequences wasting CPU slices and network traffic.
I currently favour collectd because it’s fast, light, pluggable, and has a very efficient network protocol. We provision our collectd’s using puppet, and have them push their metrics to multiple monitoring hosts on the Internet every second. Yes, you read that right, collectd uses UDP, so we’re pushing UDP over the Internet. I hear you cry in horror that “Packets may get lost”. Yes they may, and yes they do. But that’s OK, the data will come in a few seconds later. It’s no big deal. We’ve chosen to use the signing and encryption options, because we’re paranoid and proud of it. We have our collectd’s gather all the usual system data, but also application-specific metrics, from applications that support this, and e.g. JVM memory metrics.
The monitoring hosts have collectd instances in listening mode, so they get (most of) the collected data, which gives us the view from *inside* the hosts. The monitoring hosts also run all kinds of custom service tests, exercising the Internet-published services from the outside. This is the external view: what the end-user will experience. These tests push their results into the same collectd instances, meaning these now hold all the relevant metrics.
As much as we love collectd’s efficient binary UDP protocol, we want the simplest possible protocol, and that isn’t.
Using a small collectd write plugin, we write whatever collectd gathers to a multicast group, UDP again, in the simplest format we could find: label-value pairs. The protocol is this:
Metrics are grouped in “shuttles”. Each shuttle consists of a number of lines, followed by a blank line.
Each line consists of a label, an equals sign, a value, and a carriage return.
A label consists of the reverse FQDN of your internet domain, followed by whatever hierarchical representation you see fit. Here are some lines from a shuttle:
be.apsu.prod.eridu.df.var.df_complex.reserved.percentage=5.16165872485
be.apsu.prod.eridu.df.opt.df_complex.reserved.percentage.state=0
be.apsu.prod.eridu.df.opt.df_complex.free.percentage.state.comment=More Than 60% Free Space
be.apsu.prod.eridu.df.home.df_complex.reserved.percentage.state=0
be.apsu.prod.eridu.df.tmp.df_complex.reserved.percentage=5.1617400345
be.apsu.prod.eridu.apsu_be.https.httpprobe.responsetime=127.850000
be.apsu.prod.eridu.apsu_be.http.httpprobe.responsetime=49.020000
The plugin adds a timestamp in milliseconds, so every shuttle has one (not shown above).
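The shuttle format above is simple enough to parse in a few lines. Here is a minimal sketch of a parser, assuming newline-separated `label=value` pairs terminated by a blank line; the function name and the sample labels (taken from the example above) are illustrative, not part of ExtreMon itself:

```python
def parse_shuttle(text):
    """Parse one shuttle: 'label=value' pairs, one per line, ended by a blank line."""
    metrics = {}
    for line in text.splitlines():
        if not line.strip():
            break  # blank line marks the end of the shuttle
        label, _, value = line.partition("=")
        metrics[label] = value
    return metrics

# Example shuttle, using labels from the sample above
shuttle = (
    "be.apsu.prod.eridu.df.tmp.df_complex.reserved.percentage=5.1617400345\n"
    "be.apsu.prod.eridu.apsu_be.http.httpprobe.responsetime=49.020000\n"
    "\n"
)
metrics = parse_shuttle(shuttle)
```

Values are kept as strings here; a consumer decides whether a given metric is numeric, a state, or a comment.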
Since these are multicast (with a TTL of zero), any process on the same monitoring host can join that multicast group and read all the metrics from all the collectd and custom agents. Here’s where filters can clean up the namespace where necessary, contributors can translate values into states and trends, trends into states, and states into alerts, and aggregators can contribute calculated values. Contributions just go back into the cauldron. For example, the “percentage” metrics in the example above are contributed by a “df” aggregator which takes the reserved, free, and in-use metrics and calculates their equivalent percentages. “percentage.state” and “percentage.state.comment” are contributed by a “df.state” contributor that decides which percentage values are OK, for which disks.
We call this multicast group “The cauldron”, since this is where all the ingredients are added and transformed. The nice thing about the multicast group, is that it’s easy to plug into, live, easy to read and write from, by any process, in any language, without interrupting anything else, and we get an extraordinarily robust and proven implementation of it with any GNU/Linux we install.
In the cauldron, any metric (and all its derived values, such as states, aggregates, etc.) appears each time the metric is received, and all metrics appear, for the entire namespace, so the cauldron may “boil” intensely if you add many metrics. For example, the cauldron on each of the 2 monitoring hosts we’re working on today “boils” at about 5000 metrics per second. It only looks intense when you look at it; to the machine, that’s only 64 Kbyte/sec, even without compression.
To add more hosts, for scaling, we would simply connect them using Ethernet, and set the ttl to 1 instead of 0, to allow the multicast out of the host. But we’re far far away from needing that kind of scaling, at this point.
One type of process in the cauldron allows multiple TCP connections, reads a simple HTTP URL, consisting of the /-separated namespace, and serves shuttles conforming to that URL on a TCP connection, starting off with a complete set of all the current values, followed by updates. This allows any application to subscribe to the metrics it needs, and update a local cache. (Or not. If you were writing that Widget, you might not even keep any cache, just update the widget as the data evolved) We serve this with an apache webserver in front, to handle security and encryption.
Let’s see how idle our CPU’s are for these 2 systems (app1 and app2):
$ wget https://<hidden>/*/cpu/*/cpu/idle/value --user..
hidden.app2.cpu.0.cpu.idle.value=89.2023
hidden.app1.cpu.0.cpu.idle.value=88.32
hidden.app2.cpu.1.cpu.idle.value=99.1911
hidden.app2.cpu.0.cpu.idle.value=91.8071
hidden.app2.cpu.1.cpu.idle.value=99.8242
hidden.app1.cpu.0.cpu.idle.value=93.0782
hidden.app1.cpu.1.cpu.idle.value=93.5785
hidden.app2.cpu.1.cpu.idle.value=97.9927
hidden.app1.cpu.0.cpu.idle.value=86.2266
hidden.app1.cpu.1.cpu.idle.value=96.7542
The first 4 lines are the values at connection time; the rest of the lines are updates. Since we’re measuring at 1 Hz, and CPU values tend to change all the time, we get updates every second.
Let’s look inside an application (we’ve set up collectd to take these SNMP measurements on the server in question):
$ wget https://<hidden>/app1/snmp/counter --user=..
hidden.app1.snmp.counter.validations.value=1.99592
hidden.app1.snmp.counter.cache_misses.value=0
hidden.app1.snmp.counter.cache_hits.value=2.49491
hidden.app1.snmp.counter.cache_refreshes.value=0
hidden.app1.snmp.counter.validations.value=3.48378
hidden.app1.snmp.counter.cache_hits.value=5.47449
hidden.app1.snmp.counter.validations.value=3.51587
hidden.app1.snmp.counter.cache_hits.value=4.52042
All production disk usage, live:
$ wget https://<hidden>/prod/**/df/**/free/percentage --user=..
A Python client for this is about 23 lines and uses only standard classes.
Of course we have, and will maintain, a few reference clients in various languages.
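A sketch of such a client, in roughly that many lines, using only the standard library: it opens the streaming URL with HTTP basic auth (the Apache front handles security and encryption), folds each `label=value` line into a local cache, and yields every update. The URL and credentials are placeholders; parsing is split out so it works on any line source:

```python
import base64
import urllib.request

def update_cache(lines, cache):
    """Fold 'label=value' lines (as bytes) into cache, yielding each update."""
    for raw in lines:
        line = raw.decode().strip()
        if not line:
            continue  # blank line: end of a shuttle
        label, _, value = line.partition("=")
        cache[label] = value
        yield label, value

def subscribe(url, user, password, cache):
    """Stream a metric namespace over HTTPS and keep a local cache up to date."""
    req = urllib.request.Request(url)
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as stream:
        yield from update_cache(stream, cache)
```

If you were writing that widget, you could ignore the cache entirely and just consume the `(label, value)` pairs as they arrive.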
Drawing a full top-level schematic of one’s systems is something I have found both extremely useful and relatively rare. Many systems have grown organically with their organisations, and nobody has ever even thought of drawing one. This makes it very hard for anyone to get a good idea of how the whole functions, and encourages silo-type thinking, with everyone just looking at their little part of the world. While there’s nothing wrong with drawing a partial diagram, I find the minimum should be the “big picture”.
Once you have the “big picture”, why not use it to project monitoring data? That way, you can immediately tell where the data you’re looking at fits into the whole, and, from the states of the connected systems, make deductions about what is going on and what the impact is. This allows for faster triage, and for experts in different domains to gather around the same view of the whole system and look at their own details, without losing sight of the connections.
For example: a web service goes “unusable” in the remote functional test, and becomes red and glowy on the display. A sysadmin looks at it (and right-clicks it to indicate she’s looking into it; see below), zooms in to find the exact measurements, and finds that the connection times out. The same functional test from the inside, right next to it, remains OK, responding within a few ms. Triage indicates that this is a network issue, and that the applications and backends are fine (and they do show as green), so she gets the network expert to look, and he zooms in on other monitored parts of the system. If the same service had been merely slow, she would have zoomed in on the warnings in the application server, where she might have found many cache misses, for example due to some backend problem. If not, she might have asked the developer or application specialist to zoom in on the application metrics. All on the same screen if required, from anywhere in the world with a reasonable TCP connection, if need be.
The display we’re developing uses SVG to display schematics, with the home view being the largest supersystem that we monitor. Here we see systems and their boundaries, and the response times of the most important externally offered services, if any. The response times are live: if we see a bar graph shoot outwards and grow red, we know that service is at least slow. If a host goes yellow, it may have a disk space or CPU usage issue; if an application goes red, it may have fatal errors in its log file, JMX, SNMP, or other metrics. The point is that SVG is vector graphics, so we can have any amount of detail hidden in our larger schematic, and zoom into any part to find that detail. Host disk space need not be represented large enough to be readable from the home view; as long as the host state shows a problem, we can zoom in on it to make more detail visible, while mentally retaining the link between that host, its applications, and the whole system. This is a far cry from seeing a red icon next to APPSRVWEB001_VAR_TMP_FREE, and having to make that link mentally.
Also, we can easily give anyone who might want it a read-only view of our systems. It makes a great deal of difference to your customer service experience if a service desk agent, having the same overview capability, can tell a calling customer with confidence where the problem is (or is not) located, that someone is working on it (and perhaps who), and whom to contact. Without that capability, they have to “get back” to the customer, leaving the latter with the impression that we’re not monitoring at all.
When we were provisioning machines manually, it followed that we provisioned monitoring manually as well: if you don’t have the information in a machine-readable format, you cannot parse it. I know organisations that have 2-3 FTE just working on provisioning monitoring solutions (with web interfaces: click, click, click all day, doing the same drudge work).
Now that we’ve moved to automated provisioning, it makes a lot of sense to handle monitoring as much as possible from that same angle. Ideally, we want monitoring to set itself up from the same machine-readable descriptor that will set up the actual infrastructure to monitor, before that happens. We call this “test-driven infrastructure”, by analogy with test-driven development in the XP methodology: you write a test (though the information is already largely in your Puppet or other description), monitoring starts, and the infrastructure and all it needs to support appears in the namespace and on the screens, with all states in ALERT, because nothing is working yet, obviously. Then, as the VMs, OS, and services appear, states go to OK. At the end, you know your system is OK because all is green, just as with TDD you know your code is OK because all your tests are green.
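The core of that idea is mechanical: walk the provisioning descriptor and emit one probe per declared service, before any of it exists. A minimal sketch, assuming a hypothetical JSON descriptor; the descriptor shape, field names, and domain are all invented for illustration and are not Extremon’s actual format:

```python
import json

# Hypothetical machine-readable infrastructure descriptor
DESCRIPTOR = json.loads("""
{
  "hosts": [
    {"name": "app1", "services": [{"proto": "https", "port": 443}]},
    {"name": "app2", "services": [{"proto": "http", "port": 80}]}
  ]
}
""")

def probes_from_descriptor(descriptor, domain="be.example.prod"):
    """Generate one probe definition per declared service. Every probe starts
    in ALERT and goes OK as the real infrastructure appears."""
    probes = []
    for host in descriptor["hosts"]:
        for svc in host["services"]:
            probes.append({
                "label": "%s.%s.%s.httpprobe" % (domain, host["name"], svc["proto"]),
                "target": "%s://%s:%d/" % (svc["proto"], host["name"], svc["port"]),
                "interval": 1.0,  # probe at 1 Hz, as advocated above
            })
    return probes
```

The point is that no human re-enters this information into a monitoring GUI: the descriptor that builds the machines is the same one that builds the tests.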
We haven’t done much on this side of the equation. Much of the Extremon configuration is still in hard-coded object graphs (designed to be instantiated from textual descriptors, but this is not yet implemented), so there is still a lot of manual provisioning in there.
We don’t have anything of our own, and we don’t want to reinvent the wheel. We’ve connected a carbon engine to the cauldron, and it happily keeps track of and graphs our 5K metrics/sec (though we had to do some tuning). Ideally, we should get our display code to display those graphs.
I’m consolidating our 4 private github repos to create new, public ones, by or at the #monitoringsux hackathon.