Linux – I keep my silent watch and ward

Basic shell commands to keep an eye on the health of an HP Red Hat Enterprise Linux server

Charge by the hour I may, but it does neither me nor the client any good to be rebuilding a live server while the users have nothing better to do than look for someone else to spend their money with. There are loads of tools and frameworks out there to keep a watchful eye on server health, but for monitoring one or two live servers for a small business they’re a bit of a sledgehammer to crack a nut. And who monitors the monitoring system?

These days most such servers are hosted in data centres that will be monitoring their health anyway. And increasingly they’re virtualised too, so hardware monitoring isn’t something you need to worry about (or can do much about). But for physical servers, relying entirely on the hosting provider’s diligence puts the stability and availability of your live systems outside your direct control. Anyone who has been round the block a few times in this business knows that this is going to go wrong sooner or later. After all, everything does!

So I always like to set up a simple script to “keep a silent watch and ward”, to slightly misquote Gilbert and Sullivan’s “The Yeomen of the Guard”. Porting a client’s Red Hat Enterprise Linux server from Dell to HP hardware recently had me digging around for a new set of commands to use.

Keeping it simple

On the old Dell server I had a simple hourly cron job which used the rather nifty omreport command to check for any critical events in the Dell alert log. This command generates a series of semicolon-separated lines like the following:

Grepping for lines starting Critical and emailing the matches provides a quick and simple heads-up if anything is going awry:
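A minimal sketch of that grep-and-email step is below. The function names are my own, the email address is a placeholder, and the omreport invocation in the usage comment is an assumption to check against your own installation rather than a definitive recipe.

```shell
# Keep only the lines that start with "Critical" (the severity column).
critical_lines() {
    grep '^Critical'
}

# Email whatever arrives on stdin, but stay silent if there is nothing to say.
# Uses the classic "mail -s subject address" interface; $1 is the subject and
# $2 the recipient (both supplied by the caller).
mail_if_any() {
    body=$(cat)
    [ -n "$body" ] || return 0
    printf '%s\n' "$body" | mail -s "$1" "$2"
}

# Hourly cron usage (assumed omreport invocation, placeholder address):
#   omreport system alertlog -fmt ssv | critical_lines \
#       | mail_if_any "Dell alerts on $(hostname)" you@example.com
```

Checking for an empty body first means a healthy server sends no email at all, which keeps the alerts worth reading.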

Further grepping for only those messages generated in the last couple of hours or the current day is enough to keep these alerts focussed without jumping through too many scripting hoops, though creating a timestamp file and only emailing alerts with a more recent timestamp would be an easy way of avoiding duplicates.
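The timestamp-file idea could be sketched roughly as follows. This assumes GNU date (as found on Red Hat) for parsing the timestamps, and it assumes you tell the function which semicolon-separated field holds the date, since that depends on your log format. The function name and the stamp file are hypothetical.

```shell
# Print only the lines from stdin whose timestamp is later than the one stored
# in the stamp file, then move the stamp forward to the newest one seen.
#   $1 = path to the stamp file (holds seconds since the epoch)
#   $2 = number of the semicolon-separated field containing the date
# Assumption: "date -d" (GNU date) can parse whatever is in that field.
alerts_since_stamp() {
    stamp_file=$1
    date_field=$2
    last=$(cat "$stamp_file" 2>/dev/null)
    last=${last:-0}
    newest=$last
    while IFS= read -r line; do
        when=$(printf '%s\n' "$line" | cut -d';' -f"$date_field")
        # Skip lines whose date cannot be parsed (headers and the like).
        epoch=$(date -d "$when" +%s 2>/dev/null) || continue
        if [ "$epoch" -gt "$last" ]; then
            printf '%s\n' "$line"
            if [ "$epoch" -gt "$newest" ]; then
                newest=$epoch
            fi
        fi
    done
    echo "$newest" > "$stamp_file"
}

# Usage, with a hypothetical stamp file and the date assumed to be in field 3:
#   ... | grep '^Critical' | alerts_since_stamp /var/tmp/alert.stamp 3
```

Because the stamp only moves forward when a newer alert is seen, running the job twice over the same log sends each alert once.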

The HP way

A bit of light Googling found an HP equivalent, the hplog command, though the output is slightly less grep-friendly:

As I’ve less history with this new server, I decided to capture all Caution and Critical messages and use awk’s custom record separators to handle the multi-line output:
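One way to do this with awk is sketched below. It assumes each log entry is a block of lines separated from the next by a blank line, which is how I recall hplog laying things out; the function name is my own.

```shell
# Print every log entry that mentions Caution or Critical.
# Setting RS to the empty string puts awk into "paragraph mode": each run of
# lines up to a blank line is one record, so a multi-line entry is matched and
# printed as a whole. ORS restores the blank line between entries on output.
caution_or_critical() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" } /Caution|Critical/'
}

# Usage:
#   hplog -v | caution_or_critical
```

If your version separates entries some other way, RS can be set to that separator instead.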

Grepping for a date doesn’t work here because of the multi-line output, but passing in a partial date for awk to search for does the same thing:
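A sketch of passing the partial date in as an awk variable follows. As before it assumes blank-line-separated entries, and the MM/DD/YYYY date format in the usage line is an assumption about how the log prints dates, so adjust it to match what you actually see.

```shell
# Print Caution/Critical entries that also contain the given date text.
#   $1 = the (partial) date to look for, exactly as it appears in the log
# index() does a plain substring search, so the slashes in a date need no
# escaping the way they would inside a regular expression.
entries_matching_date() {
    awk -v when="$1" '
        BEGIN { RS = ""; ORS = "\n\n" }
        /Caution|Critical/ && index($0, when)
    '
}

# Usage (date format is an assumption):
#   hplog -v | entries_matching_date "$(date +%m/%d/%Y)"
```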

Grepping for a specific day’s date, or a specific hour, always leaves a hole between the cron running and the end of the period you’re grepping for, so it’s prudent to search for the previous hour or previous day too.
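Covering the previous day as well might look something like this. It assumes GNU date for the "yesterday" arithmetic, blank-line-separated entries, and an MM/DD/YYYY date format in the log; the function name is hypothetical.

```shell
# Print Caution/Critical entries that contain either of two date strings,
# typically today's date and yesterday's, so nothing logged between the last
# cron run and midnight slips through the gap.
#   $1, $2 = the two dates to look for, as they appear in the log
entries_for_days() {
    awk -v a="$1" -v b="$2" '
        BEGIN { RS = ""; ORS = "\n\n" }
        /Caution|Critical/ && (index($0, a) || index($0, b))
    '
}

# Usage (GNU date and the date format are assumptions):
#   hplog -v | entries_for_days "$(date +%m/%d/%Y)" "$(date -d yesterday +%m/%d/%Y)"
```

Widening the window like this does mean repeats, so it pairs naturally with some way of suppressing alerts you have already sent.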

One nice feature of hplog is that it lets you write your own messages to the alert log:

So I can test my scripts on a previously untroubled system:
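A small wrapper for writing such a test entry is sketched below. The -s (severity) and -l (message) options are as I remember them from the hplog man page, so treat them as an assumption and check your own version; the function name and the DRY_RUN switch are my own inventions.

```shell
# Write a test message to the alert log so the monitoring script has
# something to find.
#   $1 = severity (for example CAUTION or CRITICAL)
#   $2 = message text
# With DRY_RUN set to anything non-empty, the command is printed rather than
# run, which is handy on a machine without hplog installed.
log_test_alert() {
    severity=$1
    message=$2
    if [ -n "$DRY_RUN" ]; then
        printf 'hplog -s %s -l "%s"\n' "$severity" "$message"
        return 0
    fi
    # Assumed option names; verify against "man hplog" before relying on them.
    hplog -s "$severity" -l "$message"
}

# Usage:
#   log_test_alert CAUTION "Test entry for the monitoring cron"
```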

Belt and braces

I don’t have historical logs on this system to prove that disk health problems will be recorded in this log file. While I presume they will be, I decided to hunt down a command which would specifically check the health of the disks. The somewhat obscurely named hpacucli seems to give me what I need:

A simple grep of the output for physicaldrive or logicaldrive lines which don’t contain OK to include in my email seems to do the trick:
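That filter can be sketched as a pair of greps. The function name is my own, and the hpacucli invocation in the usage comment is the one I would expect to list every controller's drives; check it against your own setup.

```shell
# From stdin, keep the physicaldrive and logicaldrive lines, then drop any
# that report OK. Whatever is left is a drive that needs looking at.
unhealthy_drives() {
    grep -E 'physicaldrive|logicaldrive' | grep -v 'OK'
}

# Usage (assumed invocation):
#   hpacucli ctrl all show config | unhealthy_drives
```

No output means every drive reported OK, so the result can go straight into the same email as the log alerts.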

Powerful monitoring tools are great, but they take time to configure and manage, and the more complex they are, the more points of failure they themselves introduce. A few lines of shell script running on a regular cron is a comforting cross-check to have in place, and sometimes really as much as you need.

 
