Busy CPU Test - One possibility/example

timn
Posts: 184
Joined: Thu Nov 20, 2003 9:57 am
Location: United States


Post by timn »

I'm just beginning to play with alert profiles and have been impressed with how flexible AHM is. I wanted to share a recent success: alerting on sustained 100% CPU utilization while ignoring brief spikes.

Many of you have already gone far beyond this -- you should skip the remainder of this message if you long ago passed AHM 101.

Here's the scenario:
  1. Normally, I want to test a host's CPU utilization rate about once per minute.
  2. It is common on some of our hosts for CPU utilization to go to 100% for brief periods (typically 10 seconds or less), so I don't want to be alerted when this happens. i.e. I don't want to panic the 1st time I see 100% utilization.
  3. But if I see 100% utilization twice in a row, I'd like to focus in on that test and run it more frequently, say once every 5 seconds, until utilization stabilizes at something under 100%.
  4. If the CPU stays at 100% for more than a minute (12 checks at 5-second intervals), I'd like to get an email notification.
  5. I only want to go back to my original test interval if I get 4 'good' test results in a row -- indicating CPU utilization has dropped below 100% and maintained that value for 20 seconds.
This is surprisingly easy to achieve in AHM by building an Alert Profile.
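For anyone who prefers to read the scenario as logic, here is a rough Python sketch of the five rules above. It is only a model of the behaviour I wanted, not how AHM works internally; the class name, field names, and the assumption that a good result resets the bad streak (and vice versa) are all mine.

```python
from dataclasses import dataclass

NORMAL_INTERVAL = 60  # seconds: the test's original interval (rule 1)
FAST_INTERVAL = 5     # seconds: interval while the CPU looks busy (rule 3)


@dataclass
class BusyCpuWatch:
    """Hypothetical model of the alert profile, not HostMonitor code."""
    interval: int = NORMAL_INTERVAL
    bad_streak: int = 0
    good_streak: int = 0
    emails_sent: int = 0

    def record(self, cpu_percent: float) -> int:
        """Feed one test result; return the interval before the next test."""
        if cpu_percent > 99:            # reply above 99 counts as "bad"
            self.bad_streak += 1
            self.good_streak = 0        # assumption: streaks reset each other
            if self.bad_streak == 2:    # rule 3: two bad results in a row
                self.interval = FAST_INTERVAL
            if self.bad_streak == 12:   # rule 4: still bad, send one email
                self.emails_sent += 1
        else:
            self.good_streak += 1
            self.bad_streak = 0
            if self.good_streak == 4:   # rule 5: four good results in a row
                self.interval = NORMAL_INTERVAL
        return self.interval
```

Feeding it twelve readings of 100 followed by four readings of 50 walks through every rule: the interval drops to 5 on the second bad reading, one email is counted on the twelfth, and the interval returns to 60 on the fourth good reading.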

First, create a CPU utilization test that runs once per minute and considers reply values greater than 99 to be bad.

Next, in the Test Properties dialog, under Alerts, click on Configure.

In the Action Profiles dialog, click on New and name your profile anything you want (I used "Busy CPU Watch")

Next, we are going to add 2 "Bad" status actions and 1 "Good" status action.

Under "Bad" status actions, click on "Add" and select "Change Test Interval".

In the Action Properties dialog, under "Condition to start action", set
"Start when" 2 "consecutive bad results occurs". Also, under Action Parameters, check the "Set to" line and enter 00:00:05. Then click 'OK'.

This action says that when 2 'bad' results occur (at the original interval of once per minute), change the test interval to once every 5 seconds.

Now we want to state that if 12 total 'bad' results are received, send an email notification. (Note: this is 2 'bad' results at the original interval and 10 more at the increased interval.)
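To see how long that takes in wall-clock terms, here is a small back-of-the-envelope calculation in Python. It assumes tests fire exactly on schedule and that the interval change takes effect right after the 2nd bad result; the function name and parameters are made up for illustration.

```python
def seconds_until_nth_bad(n, normal=60, fast=5, speedup_after=2):
    """Seconds from the 1st bad result to the n-th consecutive bad result."""
    elapsed = 0
    interval = normal
    for count in range(1, n):       # count = bad results seen so far
        if count >= speedup_after:  # profile has switched to the fast interval
            interval = fast
        elapsed += interval         # wait for the next test
    return elapsed


print(seconds_until_nth_bad(2))   # 60: the second bad result, one minute later
print(seconds_until_nth_bad(12))  # 110: 60 s + 10 checks x 5 s
```

So under these assumptions the email goes out roughly 110 seconds after the first 100% reading, 50 seconds of which are spent at the 5-second interval.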

Under "Bad" status actions, click on "Add" and select "Send e-mail (SMTP)". In the Action Properties dialog, under "Condition to start action", set "Start when" to 12 consecutive bad results, then configure all the parameters required for an email message. (I chose to make my message body template a bit generic using macro variables so that I could re-use the template for other hosts when their CPUs are in distress.)

Finally, we need to tell AHM that when 4 consecutive 'good' results are seen, reset the test interval to its original value.

Under "Good" status actions, click on "Add" and select "Change Test Interval".

In the Action Properties dialog, under "Condition to start action", set
"Start when" 4 "consecutive Good results occurs". Also, under Action Parameters, check the "Restore original value" box. Then click 'OK'.

That's all. You can easily 'test' your alert by checking the "Reverse alert" check box in the Test Properties dialog. (Remember to uncheck it when you are done.)

Side notes:

I am running RMA on the remote host.

I decided to treat unknown statuses as 'bad'. This CPU test is dependent upon a Master ping test of the host. If the host cannot be pinged, the CPU test will not run.

But if the host can be pinged, it might be so busy that the RMA never gets a chance to respond. Treating unknown statuses as "bad" is therefore probably the right thing to do here. Your mileage may vary.
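Expressed as logic, that choice looks something like the Python sketch below. This is purely illustrative: `is_bad` and its `unknown_is_bad` flag are hypothetical names, and a missing reply is represented as `None`.

```python
from typing import Optional


def is_bad(cpu_percent: Optional[float], unknown_is_bad: bool = True) -> bool:
    """Classify one CPU test reply; None means no reply came back (unknown)."""
    if cpu_percent is None:
        # The host answers ping but the agent did not reply: assume overload.
        return unknown_is_bad
    return cpu_percent > 99  # same threshold as the test itself


print(is_bad(100))                         # True
print(is_bad(42))                          # False
print(is_bad(None))                        # True
print(is_bad(None, unknown_is_bad=False))  # False
```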
FLynch
Posts: 75
Joined: Tue Jun 18, 2002 6:00 pm
Location: London UK

Interesting

Post by FLynch »

Hi,

Looks a good methodology to stop alert blizzards.

Also a good idea to post this sort of usage on the Forum.....encourages more innovative use of AHM.

Cheers