Rebooting ExtreMon

Aaah, Spring!

After all the initial excitement a few years back, and despite seeing active duty, I have to admit that ExtreMon wasn’t evolving as it should have these past few years.  Suffice it to say I didn’t have the time to work on it, despite the many requests.

https://github.com/m4rienf/ExtreMon

https://github.com/m4rienf/ExtreMon-Display

Both say "m4rienf authored ... a year ago", which is really sad.

So, it being spring, I decided to do something about it. I'm arranging my replacement in other projects so I can focus on ExtreMon, to bring IT monitoring into the 21st century, and on a few other, smaller but related real-time projects, mostly from my Wetteren, Belgium office.

If you don’t remember what ExtreMon is about, allow me to quote myself:

Designers and operators of critical systems like (nuclear) power plants, automated factories, and power distribution grid management installations have always [...]  taken for granted, [...], that information about a system is visualised on an exact schematic representation of that system, and that the information thus visualized is the current, up-to-date state of the system at all times, regardless of the technology used, and regardless of the distance between the system and the operator. [...] What is sought is called Situational Awareness, which we consider requires a monitoring system to be Real-time, Representative, Comprehensive, and to respect the cognitive strengths and limitations of its human operators. The IT sector, despite having successfully supported these other industries for decades and being one of the most impacted by the evolution towards supersystems, remains decades behind in terms of supporting Situation Awareness for its operators, remaining in a dispersed, non-interoperable, out-of-date mode with representations requiring the extensive cognitive efforts much like the dreaded “stove-pipe” accumulation of technologies that plague combat technologies and that military designers have been actively trying to resolve for many years.

Marien, F. (2012). 'Mending The Cobbler's Shoes: Supporting Situation Awareness in Distributed IT Supersystems'. Master's Thesis submitted to the University of Liverpool. Available on request.
 

Immediate focus is on installability, converting the Display from Java WebStart to 100% JavaScript, rewriting one or two core components in C or C++ for maximum performance (smaller hardware, less power usage, ...), and hardware sensor integration (air conditioning, UPS, heating systems).

I'm therefore looking for IT-driven organisations that need schematic, live, Internet-wide monitoring, power plant / chemical factory / NORAD style, and are willing to engage in an ideally Agile-type process with my company to achieve this. The idea is that ExtreMon is FOSS: you will not be paying for the software, only for my time in getting you set up, optionally in managing or hosting your monitoring infrastructure, and for any custom development you need ("custom" meaning something that no other ExtreMon user would be able to use).

If you would be interested or know of organisations that might be, I would very much like to hear about it.

There are a few pilot sites, but I would need several more to make this viable (and to avoid it becoming too focused on a particular organisation’s Modus Operandi). All sorts of cooperation modes are possible.

I need the Field Exposure and, “Frankly”,  you probably need the Situation Awareness!

WKR,

-f

ExtreMon Unveiled

Ah, “Monitoring”

It certainly means different things to different people:

As sysadmins, we want to know our systems are OK, down to the slightest detail, and if not, what is wrong; preferably before it happens, or at least while it happens. As (enlightened) developers, we want to be able to follow our applications' behaviour in production. As service managers, we want to know whether we're delivering the service as agreed: what's up, what's down, for how long, how slow, who's to blame. As managers, we might want to know the bottom line (how many downloads, sales, ...) in a pretty widget. You can probably think of a few more.

And you're probably not happy with what you have (or you wouldn't have monitoringsux), which is most likely several different systems running beside each other, with different paradigms, platform-dependent APIs (if any), and different web GUIs with linear lists and state colours, which you force to refresh every few minutes and which even then show old information, and which hold credentials to authenticate to the systems they monitor, making them dangerous points of failure in terms of security.

Depending on what you’re trying to monitor, you may be OK with all of these.

But if you're like us, you'll end up in a multi-everything (platform, application, networks, silos, sites, policies) environment with no end of interdependencies, where at least some applications are interactive and time-critical, and the sysadmins and developers are colleagues, horizontal team members, or all the same people. This is the type of environment that we're growing ExtreMon for.

Taking three important headlines from the Extreme Monitoring Manifesto:

Live, with Subsecond temporal resolution

Most of the data you're gathering will be required for different purposes. The service response time and validity you're testing in a functional test tells the sysadmins in the data center that it's fast enough, tells the developer that his caching strategy works in production, tells the service manager that you're within the SLA, and tells first- and second-line support *at one glance* that the problem isn't the server, etc. It makes sense to gather the data only once, which gives you breathing room to gather it more intensively. I propose starting at one probe per second, which is peanuts for most modern systems, but which gives you data points at the highest resolution you'll ever want (you can always average over longer periods for different uses). I find that services that have issues with one-second probes are in deep trouble anyway, and should be rethought. Of course you shouldn't deliberately pick the heaviest possible data or query set. But for normal use, one probe per second is really nothing.
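For concreteness, a one-probe-per-second HTTP check amounts to little more than the following sketch: one GET, one timing, and a sleep for the rest of the second. The URL is made up, and a real probe would feed its result into the pipeline described below rather than print it:

#!/usr/bin/env python3
"""Minimal sketch of a 1 Hz HTTP probe; the URL is a placeholder."""
import time
import urllib.request

URL = "https://www.example.com/health"

while True:
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=5) as response:
            response.read()
        print(f"responsetime_ms={(time.time() - start) * 1000.0:.3f}")
    except OSError:
        print("responsetime_ms=U")   # unreachable or timed out
    # sleep whatever is left of this second, aiming for one probe per second
    time.sleep(max(0.0, 1.0 - (time.time() - start)))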

Agent push really is the only option for system metrics at that speed, and that's fine: provisioning agents is a near-zero-cost game given DevOps practice, and agent push solves the monitoring security issue in one fell swoop, requiring no connections from the monitoring hosts to the agent, hence no authentication, no technical users, no flaws to exploit, and no endless login/measure/logout sequences wasting CPU slices and network traffic.

I currently favour collectd because it's fast, light, pluggable, and has a very efficient network protocol. We provision our collectd instances using Puppet, and have them push their metrics to multiple monitoring hosts on the Internet every second. Yes, you read that right: collectd uses UDP, so we're pushing UDP over the Internet. I hear you cry in horror that "packets may get lost". Yes they may, and yes they do. But that's OK: the data will come in a few seconds later. It's no big deal. We've chosen to use the signing and encryption options, because we're paranoid and proud of it. We have our collectd instances gather all the usual system data, but also application-specific metrics from applications that support this, and e.g. JVM memory metrics.

The monitoring hosts run collectd instances in listening mode, so they receive (most of) the collected data, which gives us the view from *inside* the hosts. The monitoring hosts also run any and all kinds of custom service tests, exercising the Internet-published services from the outside. This is the external view: what the end user will experience. These tests push their results into the same collectd instances, which therefore now hold all the relevant metrics.
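How a custom test gets its result into collectd is up to you; one route collectd offers out of the box is the exec plugin, which runs a long-lived script and reads PUTVAL lines from its stdout. Whether you take that route or the network protocol, the idea is no more complicated than this sketch (the identifier and the measure() stand-in are placeholders, not how our tests are actually written):

#!/usr/bin/env python3
"""Sketch: feeding an external test result into collectd via the exec plugin."""
import random
import sys
import time

IDENTIFIER = "monitor1.example.com/httpprobe-myservice/response_time"

def measure() -> float:
    """Stand-in for a real functional test; returns a response time in ms."""
    return random.uniform(40.0, 130.0)

while True:
    # PUTVAL "<host>/<plugin>-<instance>/<type>" interval=<seconds> N:<value>  ("N" = now)
    print(f'PUTVAL "{IDENTIFIER}" interval=1 N:{measure():.3f}')
    sys.stdout.flush()   # the exec plugin reads our stdout line by line
    time.sleep(1)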

Hot-pluggable components

As much as we love collectd's efficient binary UDP protocol, we want the simplest possible protocol, and that isn't it.

Using a small collectd write plugin, we write whatever collectd gathers to a multicast group, UDP again, in the simplest format we could find: label-value pairs. The protocol is this:

Metrics are grouped in “shuttles”. Each shuttle consists of a number of lines, followed by a blank line.
Each line consists of a label, an equals sign, a value, and a carriage return.

A label starts with the reverse FQDN of your Internet domain, followed by whatever hierarchical representation you see fit. Here are some lines from a shuttle:

be.apsu.prod.eridu.df.var.df_complex.reserved.percentage=5.16165872485 
be.apsu.prod.eridu.df.opt.df_complex.reserved.percentage.state=0 
be.apsu.prod.eridu.df.opt.df_complex.free.percentage.state.comment=More Than 60% Free Space 
be.apsu.prod.eridu.df.home.df_complex.reserved.percentage.state=0 
be.apsu.prod.eridu.df.tmp.df_complex.reserved.percentage=5.1617400345 
be.apsu.prod.eridu.apsu_be.https.httpprobe.responsetime=127.850000 
be.apsu.prod.eridu.apsu_be.http.httpprobe.responsetime=49.020000

The plugin adds a timestamp in milliseconds, so every shuttle carries one (not shown above).
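There is deliberately nothing more to the format than that. A parser and serialiser fit in a few lines; this sketch reuses labels from the example above, everything else is illustrative:

def parse_shuttle(text):
    """Parse one shuttle (label=value lines up to a blank line) into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            break                              # a blank line ends the shuttle
        label, _, value = line.partition("=")
        metrics[label] = value
    return metrics

def serialise_shuttle(metrics):
    """Turn a dict of label: value pairs back into a shuttle, blank line included."""
    return "\n".join(f"{label}={value}" for label, value in metrics.items()) + "\n\n"

shuttle = ("be.apsu.prod.eridu.apsu_be.https.httpprobe.responsetime=127.850000\n"
           "be.apsu.prod.eridu.df.var.df_complex.reserved.percentage=5.16165872485\n"
           "\n")
print(parse_shuttle(shuttle))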

Since these are multicast (with a TTL of zero), any process on the same monitoring host can join that multicast group and read all the metrics from all the collectd and custom agents. This is where filters can clean up the namespace where necessary, contributors can translate values into states and trends, trends into states, and states into alerts, and aggregators can contribute calculated values. Contributions just go back into the cauldron. For example, the "percentage" metrics in the example above are contributed by a "df" aggregator that takes the reserved, free, and in-use metrics and calculates their equivalent percentages. "percentage.state" and "percentage.state.comment" are contributed by a "df.state" contributor that decides which percentage values are OK, for which disks.

We call this multicast group "the cauldron", since this is where all the ingredients are added and transformed. The nice thing about the multicast group is that it's easy to plug into, live; easy to read from and write to, by any process, in any language, without interrupting anything else; and we get an extraordinarily robust and proven implementation of it with any GNU/Linux we install.
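To show just how easy "plug into" is, here's a rough sketch of a contributor that joins the group, watches free-space percentages, and contributes a state back into the cauldron. The group address, port, and the 10% threshold are made up for the example; 0 meaning OK follows the df.state example above, while the alert value is an assumption:

"""Sketch of a cauldron contributor; group address, port, and thresholds are placeholders."""
import socket
import struct

GROUP, PORT = "224.0.0.117", 1249

# Join the multicast group to read every shuttle boiling on this host.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# A second socket to write our contributions back into the cauldron.
out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
out.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 0)    # stay on this host
out.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)   # but let local readers see it

while True:
    shuttle, _ = sock.recvfrom(65536)
    contributions = []
    for line in shuttle.decode("utf-8", "replace").splitlines():
        label, _, value = line.partition("=")
        if label.endswith(".df_complex.free.percentage"):
            state = 0 if float(value) > 10.0 else 2              # 0 = OK; 2 = alert (assumed)
            contributions.append(f"{label}.state={state}")
    if contributions:
        out.sendto(("\n".join(contributions) + "\n\n").encode(), (GROUP, PORT))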

In the cauldron, any metric (and all of its derived values, such as states, aggregates, etc.) appears each time the metric is received, and all metrics appear, for the entire namespace, so the cauldron may "boil" intensely if you add many metrics. For example, the cauldron on each of the two monitoring hosts we're working on today "boils" at about 5000 metrics per second. It only looks intense when you look at it; to the machine, that's only 64 Kbyte/sec, even without compression.

To add more hosts, for scaling, we would simply connect them over Ethernet and set the TTL to 1 instead of 0, to let the multicast traffic out of the host. But we're far, far away from needing that kind of scaling at this point.

Simple Text-based Internet-Friendly Subscription Push API

One type of process in the cauldron accepts multiple TCP connections, reads a simple HTTP URL consisting of the /-separated namespace, and serves the shuttles matching that URL on the TCP connection, starting off with a complete set of all the current values, followed by updates. This allows any application to subscribe to just the metrics it needs and keep a local cache up to date. (Or not: if you were writing that widget, you might not keep any cache at all, and just update the widget as the data evolves.) We serve this with an Apache web server in front, to handle security and encryption.

Let's see how idle our CPUs are on these two systems (app1 and app2):

$ wget https://<hidden>/*/cpu/*/cpu/idle/value --user.. 

hidden.app2.cpu.0.cpu.idle.value=89.2023 
hidden.app1.cpu.0.cpu.idle.value=88.32 
hidden.app2.cpu.1.cpu.idle.value=99.1911 
hidden.app2.cpu.0.cpu.idle.value=91.8071 
hidden.app2.cpu.1.cpu.idle.value=99.8242 
hidden.app1.cpu.0.cpu.idle.value=93.0782 
hidden.app1.cpu.1.cpu.idle.value=93.5785 
hidden.app2.cpu.1.cpu.idle.value=97.9927 
hidden.app1.cpu.0.cpu.idle.value=86.2266 
hidden.app1.cpu.1.cpu.idle.value=96.7542

The first four lines are the values at connection time; the rest of the lines are updates. Since we're measuring at 1 Hz, and CPU values tend to change all the time, we get updates every second.

Let's look inside an application (we've set up collectd to take these SNMP measurements on the server in question):

$ wget https://<hidden>/app1/snmp/counter  --user=.. 
hidden.app1.snmp.counter.validations.value=1.99592 
hidden.app1.snmp.counter.cache_misses.value=0 
hidden.app1.snmp.counter.cache_hits.value=2.49491 
hidden.app1.snmp.counter.cache_refreshes.value=0 
hidden.app1.snmp.counter.validations.value=3.48378 
hidden.app1.snmp.counter.cache_hits.value=5.47449 
hidden.app1.snmp.counter.validations.value=3.51587 
hidden.app1.snmp.counter.cache_hits.value=4.52042

All production disk usage, live:

$ wget https://<hidden>/prod/**/df/**/free/percentage --user=..

etc.. etc..

A Python client for this is about 23 lines and uses only standard classes.
Of course we have, and will maintain, a few reference clients in various languages.
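To give an idea of what those 23 lines do, here's a rough sketch of such a client, keeping a local cache up to date from the stream. The URL, credentials and path pattern are placeholders, and this is not the reference client itself:

"""Sketch of a subscription client; URL, credentials and path pattern are placeholders."""
import base64
import urllib.request

URL = "https://monitoring.example.com/prod/**/df/**/free/percentage"
AUTH = base64.b64encode(b"user:password").decode()

request = urllib.request.Request(URL, headers={"Authorization": "Basic " + AUTH})
cache = {}

with urllib.request.urlopen(request) as stream:
    for raw in stream:                              # the server keeps the connection open
        line = raw.decode("utf-8", "replace").strip()
        if not line:
            continue                                # blank line: end of a shuttle
        label, _, value = line.partition("=")
        cache[label] = value
        print(label, "->", value)                   # or update a widget, a gauge, ...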

Display on a meaningful representation, and in real time

Web pages were intended to convey static documents with links between them. Stretching that metaphor only goes so far, and I don't think web pages are an appropriate medium for conveying real-time data (but that's my opinion (fr4nkm); koendc has different ideas and is working on JavaScript-based clients which, I must say, look pretty impressive). Also, we want a "meaningful representation", which implies that for anything more complex than a single server we want to get away from HTML-driven lists and status colours, and move to a full schematic of the systems we're monitoring, and their connections.

Drawing a full top-level schematic of one's systems is something I have found both extremely useful and relatively rare. Many systems have grown organically with their organisations, and nobody has ever even thought of drawing one. This makes it very hard for anyone to get a good idea of how the whole functions, and encourages silo-type thinking, with everyone just looking at their own little part of the world. While there's nothing wrong with drawing a partial diagram, I find the minimum should be the "big picture".

Once you have the "big picture", why not use it to project monitoring data onto? That way, you can immediately tell where the data you're looking at fits into the whole and, from the states of the connected systems, make deductions about what is going on and what the impact is. This allows for faster triage, and lets experts in different domains gather around the same view of the whole system and look at their own details without losing sight of the connections.

For example: a web service goes "unusable" in the remote functional test and becomes red and glowy on the display. A sysadmin looks at it (and right-clicks it to indicate she's looking into it; see below), zooms in to find the exact measurements, and finds that the connection times out. The same functional test from the inside, shown right next to it, remains OK, responding within a few milliseconds. Triage indicates that this is a network issue and that the applications and backends are fine (and they do show as green), so she gets the network expert to look, and he zooms in on other monitored parts of the system. If the same service had been merely slow, she would have zoomed in on the warnings in the application server, where she might have found many cache misses, for example due to some backend problem. If not, she might have asked the developer or application specialist to zoom in on the application metrics. All on the same screen, if required, and from anywhere in the world with a reasonable TCP connection, if need be.

The display we're developing uses SVG to show the schematics, with the home view being the largest supersystem we monitor. Here we see systems and their boundaries, plus the response times of the most important externally offered services, if any. The response times are live: if we see a bar graph shoot outwards and turn red, we know that service is at least slow. If a host goes yellow, it may have a disk space or CPU usage issue; if an application goes red, it may have fatal errors in its log file, JMX, SNMP or other metrics. The point is that SVG is a vector format, so we can have any amount of detail hidden in the larger schematic and zoom into any part to find that detail. Host disk space need not be drawn large enough to be readable from the home view: as long as the host state shows a problem, we can zoom in on it to make more detail visible, while mentally retaining the link between that host, its applications, and the whole system. This is a far cry from seeing a red icon next to APPSRVWEB001_VAR_TMP_FREE and having to make that link mentally.

Also, we can easily give anyone who might want it a read-only view of our systems. It makes a great deal of difference to your customer service experience if a service desk agent, having the same overview capability, can tell a calling customer with confidence where the problem is (not) located, that someone is working on it (and perhaps who), and who to contact, whereas without that capability they have to "get back" to the customer, leaving the latter with the impression that we're not monitoring at all.

home view overview, with some functional tests

zoomed in on a host, CPU and disks

Detail of 3 functional tests

Implicit Provisioning (Test-driven infrastructure)

When we were provisioning machines manually, it followed that we provisioned monitoring manually as well: if you don't have the information in a machine-readable format, you cannot parse it. I know organisations that have two or three FTEs just working on provisioning monitoring solutions (with web interfaces: click, click, click all day, doing the same drudge work).

Now that we've moved to automated provisioning, it makes a lot of sense to handle monitoring as much as possible from that same angle. Ideally, we want monitoring to set itself up from the same machine-readable descriptor that will set up the actual infrastructure to monitor, before that infrastructure exists. We call this "test-driven infrastructure", by analogy with test-driven development in the XP methodology: you write a test (though the information is already largely in your Puppet or other description), monitoring starts, and the infrastructure with everything it needs to support appears in the namespace and on the screens, with all states in ALERT because nothing is working yet, obviously. Then, as the VMs, OS, and services appear, states go to OK. At the end, you know your system is OK because everything is green, just as with TDD you know your code is OK because all your tests are green.

We haven't done much on this side of the equation. Much of the ExtreMon configuration is still in hard-coded object graphs (designed to be instantiated from textual descriptors, but this is not yet implemented), so there is still a lot of manual provisioning in there.

Graphing

We don't have anything of our own here, and we don't want to reinvent the wheel. We've connected a carbon engine to the cauldron, and it happily keeps track of, and graphs, our 5K metrics/sec (though we had to do some tuning). Ideally, we should get our display code to show those graphs.

 

Schematic Overview

Rough Overview

The Code So Far

I'm consolidating our four private GitHub repos into new, public ones, by or at the #monitoringsux hackathon.

Done: https://github.com/m4rienf/ExtreMon-Display
Done: https://github.com/m4rienf/ExtreMon
ToDo: Koen's JavaScript clients, the Java namespace-browser applet