Using Sensu to Monitor, Alert, and Remediate a Heterogeneous Infrastructure

By Ben Parli

At Nitro Engineering we have adopted an All-Ops, Immutable Systems ethos. Adopting the right tools is critical to this culture, and products such as Docker and Ansible have served us well in engineering new systems and subsequently pushing features ever faster. However, a challenge remains in how to maintain the legacy systems we’ve inherited, which don’t necessarily lend themselves to this culture. We run a lean Engineering team here at Nitro, and this maintenance adds pressure to the already broad All-Ops set of responsibilities. Even amid our furious efforts to refactor these legacy platforms, we must maintain our SLA in the meantime, no matter what.

Tool Considerations

With the daily difficulties in keeping the legacy sites up and healthy, we identified the need for a host monitoring and alerting tool. We already leverage third-party tools for instrumentation and time-series application monitoring; our latest pain point is, in many cases, cleaning up the hosts themselves. The tool would have to be flexible and adaptable enough to support both our current Ubuntu Docker hosts and the legacy Windows environments. It should be open source (as that is also a big part of our ethos), cost effective, Ansible compatible, and easy enough for any engineer in the company to use.

Enter Sensu

We evaluated a few products but ultimately settled on Sensu, the modern infrastructure monitoring framework. Sensu checked all the boxes noted above: it’s open source with a great community, has an abundance of documentation, and is even backwards compatible with Nagios plugins, so we could start using it right away with minimal effort. Additionally, it follows a pub/sub pattern, something our engineers can all fundamentally understand out of the box.

I won’t go into Sensu much since there is already an abundance of great documentation on their site and on the web. The Sensu architecture gif below is a good high-level description.
[Sensu architecture diagram]

The only drawback to Sensu for us is its dashboarding. For our situation, however, this would only be a luxury anyway. Uchiwa, the open source Sensu dashboard, gives us a view of the infrastructure state, which is more than enough given we already use New Relic too.

Implementation Decisions

As mentioned above, our engineering teams already inherently understand the pub/sub approach. Sensu also offers standalone checks which can be scheduled and run from each client instead of scheduled by the server. We decided to keep all Sensu checks scheduled through the server, however. In this way the approach is consistent and all the logic is kept in the Sensu servers instead of the monitored clients. Our Sensu Server directory:

/etc/sensu/
├── conf.d
│   ├── handler-notification.json
│   ├── notification.json
│   ├── checks/
│   ├── client.json
│   ├── redis.json
│   ├── remediator.json
│   └── transport.json
├── dashboard.d
├── extensions
├── handlers
│   ├── handler-pagerduty.rb
│   └── handler-sensu.rb
├── mutators
├── plugins
└── uchiwa.json
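
To make the server-scheduled, pub/sub approach concrete, here is a small Python model of how a published check reaches clients. This is only an illustration of the routing idea, not Sensu's implementation: the function name, the client names, and the `docker` subscription are hypothetical, while `check-disk-dpe`, `dpe`, and `windows` are taken from our configuration.

```python
# Illustrative model only: this mimics the pub/sub routing idea and is not Sensu code.

def clients_for_check(check, clients):
    """Return the names of clients subscribed to any of the check's subscribers."""
    wanted = set(check["subscribers"])
    return sorted(
        client["name"]
        for client in clients
        if wanted & set(client["subscriptions"])
    )


# Hypothetical clients; only the subscription lists matter for routing.
clients = [
    {"name": "win-01", "subscriptions": ["dpe", "windows"]},
    {"name": "docker-01", "subscriptions": ["docker"]},
]

# The server publishes the check to the "dpe" subscription.
check = {"name": "check-disk-dpe", "subscribers": ["dpe"]}

print(clients_for_check(check, clients))
```

Because the check definition lives only on the server, changing who runs it is a matter of editing the `subscribers` list or a client's `subscriptions`, with no check logic to touch on the monitored hosts.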

Automated Remediation

This is where we really became Sensu fans: keeping the legacy (Windows) environments upright without human intervention. Our primary use case was clearing disk space on a Windows host when the legacy application stopped cleaning up after itself. Leveraging a handler from the sensu-plugins project, we can auto-remediate by scheduling a PowerShell cleanup script whenever the disk becomes too full.

To set this up we need only a few more JSON files and the handler-sensu.rb (from the sensu-plugins project). Our remediator.json simply tells the Sensu server where the handler is and the name of the handler itself (remediator in this case), and limits it to critical events.

{
  "handlers": {
    "remediator": {
      "command": "/etc/sensu/handlers/handler-sensu.rb",
      "type": "pipe",
      "severities": ["critical"]
    }
  }
}
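
As a rough illustration of what the `severities` list does, the Python sketch below models the decision of whether an event is passed to the handler. It is a simplification under stated assumptions: the helper names and the sample event are hypothetical, resolve events are not modelled, and the real filtering happens inside the Sensu server. As far as we understand it, a `pipe` handler then receives the event as JSON on its standard input.

```python
import json

# Check exit statuses mapped to the names used in a handler's "severities" list.
SEVERITY_NAMES = {0: "ok", 1: "warning", 2: "critical"}


def severity_name(status):
    """Any status outside 0-2 is treated as "unknown"."""
    return SEVERITY_NAMES.get(status, "unknown")


def handler_accepts(handler, event):
    """Return True if the handler's severity list allows this event.

    Simplification: resolve events are not modelled here.
    """
    allowed = handler.get("severities")
    if allowed is None:
        return True  # no severity filter configured
    return severity_name(event["check"]["status"]) in allowed


# The remediator handler definition from remediator.json.
remediator = json.loads("""
{
  "command": "/etc/sensu/handlers/handler-sensu.rb",
  "type": "pipe",
  "severities": ["critical"]
}
""")

# Hypothetical event, trimmed to the fields used in this sketch.
event = {
    "client": {"name": "win-01"},
    "check": {"name": "check-disk-dpe", "status": 2},
    "occurrences": 2,
}

print(handler_accepts(remediator, event))
```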

And our check now incorporates the remediator handler. We’ve set this remediation to execute our PowerShell graceful cleanup script (wrapped in clean.rb on the client Windows host). The remediation executes only once the check has returned a critical status (severity 2) two or more times.

{
  "checks": {
    "check-disk-dpe": {
      "command": "/opt/sensu/embedded/bin/ruby /opt/sensu/embedded/bin/check-windows-disk.rb",
      "subscribers": [
        "dpe"
      ],
      "interval": 60,
      "handlers":["remediator"],
      "remediation": {
        "clear_disk": {
          "occurrences": ["2+"],
          "severities": [2]
        }
      }
    },
    "clear_disk": {
      "command": "/opt/sensu/embedded/bin/ruby /opt/sensu/embedded/bin/clean.rb",
      "subscribers": [],
      "handlers": [],
      "publish": false
    }
  }
}
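
The trigger conditions in the `remediation` block can be read as a small matching rule. The Python sketch below is our own model of that rule, not the code in handler-sensu.rb: the function names are hypothetical, and only the `"2+"` form appears in our config. Plain integers and `N-M` ranges are included as assumed additional forms.

```python
import re


def occurrence_matches(rule, occurrences):
    """Match an occurrence count against a rule: 3, "3", "2-5" or "2+"."""
    if isinstance(rule, int):
        return occurrences == rule
    text = str(rule).strip()
    exact = re.fullmatch(r"(\d+)", text)
    if exact:
        return occurrences == int(exact.group(1))
    span = re.fullmatch(r"(\d+)-(\d+)", text)
    if span:
        return int(span.group(1)) <= occurrences <= int(span.group(2))
    at_least = re.fullmatch(r"(\d+)\+", text)
    if at_least:
        return occurrences >= int(at_least.group(1))
    return False  # unrecognised rule: never trigger


def remediations_to_trigger(check, status, occurrences):
    """Return the remediation check names whose conditions match the event."""
    triggered = []
    for name, conditions in check.get("remediation", {}).items():
        if status not in conditions.get("severities", []):
            continue
        rules = conditions.get("occurrences", [])
        if any(occurrence_matches(rule, occurrences) for rule in rules):
            triggered.append(name)
    return triggered


# The remediation section of check-disk-dpe from the config above.
check_disk = {
    "remediation": {
        "clear_disk": {"occurrences": ["2+"], "severities": [2]},
    }
}

# First critical result: nothing yet. Second and later: clear_disk is requested.
for count in (1, 2, 3):
    print(count, remediations_to_trigger(check_disk, 2, count))
```

Requiring two or more occurrences gives the disk check one interval to recover on its own before the cleanup script is scheduled.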

There is one final key to tying all this together. As noted already, we decided to stick with the pub/sub model, so the Sensu server schedules the remediation on the client. For this to work, though, the client needs to be subscribed not only to its regular channels but also to a subscription named after itself. We do this by hostname; in Ansible the subscriptions are set in the client.json template as below:

{
  "client": {
    "name": "",
    "address": "",
    "environment": "",
    "subscriptions": [
      "",
      "windows",
      ""
    ],
    "socket": {
      "bind": "127.0.0.1",
      "port": 3030
    }
  },
  "api": {
    "host": "",
    "bind": "0.0.0.0",
    "port": 4567
  }
}

In this way, when the Sensu Server is scheduling a specific remediation, the appropriate client is the only one listening.
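
To show why the self-subscription matters, here is a sketch of the kind of request the remediation handler could send to the Sensu API to publish the `clear_disk` check to a single host. The payload shape (a `check` name plus a `subscribers` list) reflects our understanding of the API's check request endpoint and should be treated as an assumption; the function name and the client name `win-01` are hypothetical.

```python
import json


def remediation_request(event, remediation_check):
    """Build a check request aimed only at the client that raised the event.

    Assumption: the API accepts a check name and a list of subscribers.
    The client's own name is used as the single subscriber.
    """
    return {
        "check": remediation_check,
        "subscribers": [event["client"]["name"]],
    }


# Hypothetical event raised by the disk check on one Windows host.
event = {
    "client": {"name": "win-01"},
    "check": {"name": "check-disk-dpe", "status": 2},
    "occurrences": 2,
}

body = remediation_request(event, "clear_disk")
print(json.dumps(body))
```

If the client were not subscribed to its own name, nothing would be listening on that subscription and the cleanup would never run.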

In the end, Sensu was the right tool for our immediate pain points and closed our gap in host monitoring and alerting. We now have a host infrastructure monitoring solution in place and, when necessary, automated remediation at our disposal.