Traffic Monitor Reports Incorrect Values?

All questions related to installations, configurations and maintenance of Advanced Host Monitor (including additional tools such as RMA for Windows, RMA Manager, Web Service, RCC).
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

OK, now I understand. Need sleep.

The sysUpTime test was set for every 5 seconds.

Is it me or should these samples be sequential?
3/13/2011 12:51:47 AM Ok 95690018
3/13/2011 12:51:50 AM Ok 333165314
3/13/2011 12:51:52 AM Ok 95690525
3/13/2011 12:51:57 AM Ok 95691033
3/13/2011 12:51:58 AM Unknown RMA: 301 - Timeout
3/13/2011 12:52:02 AM Ok 95691540
3/13/2011 12:52:04 AM Ok 333166719
3/13/2011 12:52:07 AM Ok 95692047
3/13/2011 12:52:10 AM Ok 333167332

3/13/2011 1:51:52 AM Ok 333524860
3/13/2011 1:51:57 AM Ok 96050443
3/13/2011 1:51:58 AM Ok 333525465
3/13/2011 1:52:02 AM Ok 96050950
3/13/2011 1:52:06 AM Unknown RMA: 301 - Timeout
3/13/2011 1:52:07 AM Ok 96051457
3/13/2011 1:52:12 AM Ok 96051963
3/13/2011 1:52:12 AM Ok 333526879
3/13/2011 1:52:17 AM Ok 96052472

3/13/2011 9:52:10 AM Ok 98571898
3/13/2011 9:52:13 AM Ok 336046572
3/13/2011 9:52:15 AM Ok 98572407
3/13/2011 9:52:20 AM Ok 98572913
3/13/2011 9:52:21 AM Unknown RMA: 301 - Timeout
3/13/2011 9:52:26 AM Ok 98573422
3/13/2011 9:52:27 AM Ok 336047987
3/13/2011 9:52:31 AM Ok 98573928
3/13/2011 9:52:33 AM Ok 336048615

3/13/2011 11:52:11 AM Ok 99292028
3/13/2011 11:52:16 AM Ok 99292535
3/13/2011 11:52:16 AM Ok 336766870
3/13/2011 11:52:21 AM Ok 99293043
3/13/2011 11:52:25 AM Unknown RMA: 301 - Timeout
3/13/2011 11:52:26 AM Ok 99293550
3/13/2011 11:52:31 AM Ok 336768295
3/13/2011 11:52:31 AM Ok 99294057
3/13/2011 11:52:36 AM Ok 99294563

3/13/2011 2:52:17 PM Ok 100372635
3/13/2011 2:52:22 PM Ok 100373147
3/13/2011 2:52:22 PM Ok 337847285
3/13/2011 2:52:27 PM Ok 100373653
3/13/2011 2:52:31 PM Unknown RMA: 301 - Timeout
3/13/2011 2:52:33 PM Ok 100374240
3/13/2011 2:52:37 PM Ok 337848787
3/13/2011 2:52:38 PM Ok 100374748
3/13/2011 2:52:43 PM Ok 100375258

3/13/2011 4:52:25 PM Ok 101093497
3/13/2011 4:52:27 PM Ok 338567702
3/13/2011 4:52:30 PM Ok 101094003
3/13/2011 4:52:35 PM Ok 101094510
3/13/2011 4:52:35 PM Unknown RMA: 301 - Timeout
3/13/2011 4:52:40 PM Ok 101095018
3/13/2011 4:52:41 PM Ok 338569127
3/13/2011 4:52:45 PM Ok 101095525
3/13/2011 4:52:47 PM Ok 338569737

3/13/2011 7:52:27 PM Ok 102173783
3/13/2011 7:52:32 PM Ok 102174290
3/13/2011 7:52:32 PM Ok 339648104
3/13/2011 7:52:37 PM Ok 102174800
3/13/2011 7:52:41 PM Unknown RMA: 301 - Timeout
3/13/2011 7:52:42 PM Ok 102175307
3/13/2011 7:52:47 PM Ok 339649530
3/13/2011 7:52:47 PM Ok 102175813
3/13/2011 7:52:53 PM Ok 102176320

3/13/2011 9:52:31 PM Ok 340367845
3/13/2011 9:52:36 PM Ok 102894673
3/13/2011 9:52:37 PM Ok 340368459
3/13/2011 9:52:41 PM Ok 102895178
3/13/2011 9:52:45 PM Unknown RMA: 301 - Timeout
3/13/2011 9:52:46 PM Ok 102895685
3/13/2011 9:52:51 PM Ok 102896192
3/13/2011 9:52:51 PM Ok 340369879
3/13/2011 9:52:56 PM Ok 102896700
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

Yes, this counter should be incremented by 100 every second. So it looks like there is some problem with the software installed on this device (hardware firewall SonicWALL TZ100?). I would suggest contacting their support team; maybe they have an update...
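The "100 per second" rule follows from sysUpTime being reported in hundredths of a second. As a rough sketch (not part of HostMonitor; the function name is made up for illustration), the three consecutive samples from the first post that belong to the same series can be checked like this:

```python
from datetime import datetime

TICKS_PER_SECOND = 100  # sysUpTime counts hundredths of a second

def tick_rates(samples):
    """Return ticks per second between consecutive (timestamp, sysUpTime) samples."""
    fmt = "%m/%d/%Y %I:%M:%S %p"  # timestamp layout used in the posted log
    parsed = [(datetime.strptime(ts, fmt), value) for ts, value in samples]
    rates = []
    for (t0, v0), (t1, v1) in zip(parsed, parsed[1:]):
        elapsed = (t1 - t0).total_seconds()
        rates.append((v1 - v0) / elapsed)
    return rates

# Three samples taken from the log above (same series, 5 seconds apart)
samples = [
    ("3/13/2011 12:51:47 AM", 95690018),
    ("3/13/2011 12:51:52 AM", 95690525),
    ("3/13/2011 12:51:57 AM", 95691033),
]

rates = tick_rates(samples)
for rate in rates:
    print(f"{rate:.1f} ticks/s (expected about {TICKS_PER_SECOND})")
```

Log timestamps only have one-second resolution, so a result slightly above or below 100 is normal.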

Regards
Alex
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

On the contrary I found out something interesting.

I had two separate SNMP Get tests for sysUpTime, one for each of two different firewalls. Both tests had the same name.

The sysUpTime values I just posted came from the common log and were displayed in Log Analyzer. What I discovered is that Log Analyzer displays the two separate sysUpTime tests as one.

Is this because the 2 test names were the same perhaps?
Should Log Analyzer use the test IDs instead when displaying data?

This means that what is shown in my previous post is an aggregate log alternating between the two different sysUpTime tests.
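For anyone facing the same merged view, here is a minimal sketch of pulling two interleaved counter series apart by value alone. It assumes each device's counter only grows and that the two devices' values are far apart; `split_series` and `max_gap` are hypothetical names, and using separate private logs or distinct test names is the more reliable fix.

```python
def split_series(values, max_gap=100_000):
    """Assign each value to the series whose last value is closest below it.

    A value starts a new series when no existing series ends within
    max_gap ticks below it.
    """
    series = []
    for value in values:
        best = None
        for candidate in series:
            gap = value - candidate[-1]
            if 0 <= gap <= max_gap:
                if best is None or gap < value - best[-1]:
                    best = candidate
        if best is None:
            series.append([value])
        else:
            best.append(value)
    return series

# "Ok" values from the first block of the merged log, in posted order
mixed = [95690018, 333165314, 95690525, 95691033,
         95691540, 333166719, 95692047, 333167332]

for one_device in split_series(mixed):
    print(one_device)
```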

I did not remember this when posting previously, but I had created separate "private" logs for both tests. The correct values for the specified times are shown below.

3/13/2011 12:51:32 AM Ok 333163490
3/13/2011 12:51:38 AM Ok 333164095
3/13/2011 12:51:44 AM Ok 333164707
3/13/2011 12:51:50 AM Ok 333165314
3/13/2011 12:51:58 AM Unknown RMA: 301 - Timeout
3/13/2011 12:52:04 AM Ok 333166719
3/13/2011 12:52:10 AM Ok 333167332
3/13/2011 12:52:16 AM Ok 333167937
3/13/2011 12:52:23 AM Ok 333168545

3/13/2011 1:51:40 AM Ok 333523637
3/13/2011 1:51:46 AM Ok 333524249
3/13/2011 1:51:52 AM Ok 333524860
3/13/2011 1:51:58 AM Ok 333525465
3/13/2011 1:52:06 AM Unknown RMA: 301 - Timeout
3/13/2011 1:52:12 AM Ok 333526879
3/13/2011 1:52:18 AM Ok 333527484
3/13/2011 1:52:25 AM Ok 333528095
3/13/2011 1:52:31 AM Ok 333528740

3/13/2011 9:51:54 AM Ok 336044740
3/13/2011 9:52:00 AM Ok 336045352
3/13/2011 9:52:06 AM Ok 336045960
3/13/2011 9:52:13 AM Ok 336046572
3/13/2011 9:52:21 AM Unknown RMA: 301 - Timeout
3/13/2011 9:52:27 AM Ok 336047987
3/13/2011 9:52:33 AM Ok 336048615
3/13/2011 9:52:39 AM Ok 336049232
3/13/2011 9:52:45 AM Ok 336049814

3/13/2011 11:51:58 AM Ok 336765049
3/13/2011 11:52:04 AM Ok 336765654
3/13/2011 11:52:10 AM Ok 336766264
3/13/2011 11:52:16 AM Ok 336766870
3/13/2011 11:52:25 AM Unknown RMA: 301 - Timeout
3/13/2011 11:52:31 AM Ok 336768295
3/13/2011 11:52:37 AM Ok 336768897
3/13/2011 11:52:43 AM Ok 336769510
3/13/2011 11:52:49 AM Ok 336770115

3/13/2011 2:52:02 PM Ok 337845335
3/13/2011 2:52:10 PM Ok 337846060
3/13/2011 2:52:16 PM Ok 337846687
3/13/2011 2:52:22 PM Ok 337847285
3/13/2011 2:52:31 PM Unknown RMA: 301 - Timeout
3/13/2011 2:52:37 PM Ok 337848787
3/13/2011 2:52:43 PM Ok 337849404
3/13/2011 2:52:49 PM Ok 337850015
3/13/2011 2:52:55 PM Ok 337850622

3/13/2011 4:52:09 PM Ok 338565879
3/13/2011 4:52:15 PM Ok 338566489
3/13/2011 4:52:21 PM Ok 338567095
3/13/2011 4:52:27 PM Ok 338567702
3/13/2011 4:52:35 PM Unknown RMA: 301 - Timeout
3/13/2011 4:52:41 PM Ok 338569127
3/13/2011 4:52:47 PM Ok 338569737
3/13/2011 4:52:53 PM Ok 338570340
3/13/2011 4:53:00 PM Ok 338570957

3/13/2011 7:52:14 PM Ok 339646277
3/13/2011 7:52:20 PM Ok 339646884
3/13/2011 7:52:26 PM Ok 339647500
3/13/2011 7:52:32 PM Ok 339648104
3/13/2011 7:52:41 PM Unknown RMA: 301 - Timeout
3/13/2011 7:52:47 PM Ok 339649530
3/13/2011 7:52:53 PM Ok 339650137
3/13/2011 7:52:59 PM Ok 339650742
3/13/2011 7:53:05 PM Ok 339651347

3/13/2011 9:52:19 PM Ok 340366634
3/13/2011 9:52:25 PM Ok 340367250
3/13/2011 9:52:31 PM Ok 340367845
3/13/2011 9:52:37 PM Ok 340368459
3/13/2011 9:52:45 PM Unknown RMA: 301 - Timeout
3/13/2011 9:52:51 PM Ok 340369879
3/13/2011 9:52:57 PM Ok 340370484
3/13/2011 9:53:03 PM Ok 340371090
3/13/2011 9:53:09 PM Ok 340371700
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

Possible Resolution

As a test I increased the polling interval on my traffic tests from 5 seconds to 1 minute. I learned that every device's SNMP agent updates its counters differently: some agents do not update until a certain amount of time has elapsed, for example 15 to 20 seconds. I realize that monitoring every 5 seconds gives a more accurate representation, but it is not practical.

This made a big difference, as there are now no timeouts on the one firewall in question. However, on the other I noticed the following.

The values below were recorded after I changed the interval to 1 minute, which should give the SNMP agent enough time to fully update.

So that I understand correctly: how is the "Bad" value calculated after a timeout?
And would it make sense to cache the last value after a timeout so it doesn't record erroneous data like the value below of 1627 Mbit? Man, I wish I could have that bandwidth, don't get me wrong. :D

3/15/2011 8:49:04 PM Ok 0.04 Mbit
3/15/2011 8:50:05 PM Ok 0.05 Mbit
3/15/2011 8:51:06 PM Ok 0.04 Mbit
3/15/2011 8:52:07 PM Ok 0.04 Mbit
3/15/2011 8:53:12 PM Unknown Timeout
3/15/2011 8:54:16 PM Host is alive
3/15/2011 8:55:16 PM Bad 1627.06 Mbit
3/15/2011 8:56:17 PM Ok 0.04 Mbit
3/15/2011 8:57:18 PM Ok 0.04 Mbit
3/15/2011 8:58:19 PM Ok 0.04 Mbit
3/15/2011 8:59:20 PM Ok 0.04 Mbit
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

And would it make sense to cache the last value after a timeout
We checked the code and ran some tests... As far as we can see, HostMonitor works correctly. When HostMonitor or the agent cannot retrieve counters, it just keeps the previous counter values so the test will be able to calculate traffic once the problem is fixed. You may restart agents or replace Passive RMA with Active RMA and the test will continue to work fine without losing statistics.
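To make that behaviour concrete, here is a small sketch of a rate calculation that keeps the previous sample across a failed poll. It illustrates the idea only and is not HostMonitor's actual code; the `TrafficMeter` class and its fields are invented for this example.

```python
class TrafficMeter:
    """Turns successive octet-counter readings into Mbit/s, skipping failed polls."""

    def __init__(self):
        self.last_time = None    # seconds, time of the last successful poll
        self.last_octets = None  # counter value at that time

    def update(self, now, octets):
        """Record a poll; octets is None on timeout. Returns Mbit/s or None."""
        if octets is None:
            return None  # timeout: keep the previous sample untouched
        rate = None
        if self.last_octets is not None and octets >= self.last_octets:
            elapsed = now - self.last_time
            rate = (octets - self.last_octets) * 8 / elapsed / 1_000_000
        # A lower counter looks like a restart, so just start over from here.
        self.last_time, self.last_octets = now, octets
        return rate

meter = TrafficMeter()
readings = [(0, 1_000_000), (60, 1_300_000), (120, None), (180, 1_900_000)]
rates = [meter.update(t, octets) for t, octets in readings]
print(rates)
```

Because the sample from before the timeout is kept, the rate after the gap is averaged over the full 120 seconds rather than being lost or inflated.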

Why does this not work with your device? I don't know :(
I do not see any problem with the sysUpTime counter (after you split the logs); I see normal increments every 5-6 sec.
So there must be some problem with the network interface related counters...

I assume you are using a single test to check all interfaces on this device?
Then I have a theory. If something happens and the device "loses" an interface (it does not provide statistics for the interface, or provides 0 data for it), then the "total traffic" value provided to HostMonitor can be less than the value returned by the previous check. HostMonitor takes this as a device reboot or device failure and sets "Host is alive" status. Then, if the device fixes itself and begins to provide data on all interfaces, including the interface that was missing 5 min ago, and the device does not reset counters for this "alive->dead/missed->alive" interface, we will see what you see.
If there are not too many interfaces on this device, I think you may set up a separate test for each interface and see what happens.
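A short simulation of this theory, with made-up counter values and interface names, shows how a briefly missing interface could produce first "Host is alive" and then one huge reading. It is only a sketch of the reasoning for a test that sums all interfaces, not a reproduction of HostMonitor internals.

```python
def total_octets(per_interface):
    """Sum the counters of the interfaces that answered (None means missing)."""
    return sum(v for v in per_interface.values() if v is not None)

def classify(polls, interval=60):
    """Return one status string per poll, using the total across interfaces."""
    results = []
    previous = None
    for poll in polls:
        total = total_octets(poll)
        if previous is None:
            results.append("first sample")
        elif total < previous:
            # Total went down: treated as a device restart
            results.append("Host is alive")
        else:
            mbit = (total - previous) * 8 / interval / 1_000_000
            results.append(f"{mbit:.2f} Mbit")
        previous = total
    return results

# Hypothetical polls, one per minute
polls = [
    {"wan": 5_000_000_000, "lan": 1_000_000},
    {"wan": 5_000_300_000, "lan": 1_300_000},
    {"wan": None,          "lan": 1_600_000},  # wan briefly missing
    {"wan": 5_000_900_000, "lan": 1_900_000},  # wan back, counter not reset
]

results = classify(polls)
for status in results:
    print(status)
```

The last poll is compared against a total that lacked the large interface counter, so the whole counter value shows up as one minute of traffic.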

BTW
What happens at 9:52, 11:52, 2:52, 4:52? Why does some network problem (delays?) happen on such a regular basis?

Regards
Alex
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

I assume you are using a single test to check all interfaces on this device?
No, a single test against a single interface on the device, in this case the WAN interface.
Then I have a theory. If something happens and the device "loses" an interface (it does not provide statistics for the interface, or provides 0 data for it), then the "total traffic" value provided to HostMonitor can be less than the value returned by the previous check. HostMonitor takes this as a device reboot or device failure and sets "Host is alive" status. Then, if the device fixes itself and begins to provide data on all interfaces, including the interface that was missing 5 min ago, and the device does not reset counters for this "alive->dead/missed->alive" interface, we will see what you see.
This is a good theory if monitoring all interfaces. Unfortunately, I monitor the interfaces separately.
If there are not too many interfaces on this device, I think you may set up a separate test for each interface and see what happens.
This is how it is currently set up.
BTW
What happens at 9:52, 11:52, 2:52, 4:52? Why does some network problem (delays?) happen on such a regular basis?
I wish I knew. But as I mentioned in my previous post, the problem has disappeared since increasing the interval from 5 seconds to 1 minute.

Theory: The SNMP agent on the device must not be able to handle such a short interval (5 seconds). Maybe it took a couple of hours to come to a full cycle before timing out.

This still doesn't explain why the other firewall still has a timeout every now and then. Still looking into this.

I do appreciate your insight. Thank you very much.
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

Theory: The SNMP agent on the device must not be able to handle such a short interval (5 seconds). Maybe it took a couple of hours to come to a full cycle before timing out.
Then the agent would keep the old counters and HostMonitor would show 0 traffic...
This is a good theory if monitoring all interfaces. Unfortunately, I monitor the interfaces separately.
Hmm, then I have no idea what is going on :roll:

Regards
Alex
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

I'm sure the issue will eventually be discovered. I will have to revisit it with a fresh perspective later. For now, I'm good to go. I can deal with an infrequent timeout once in a while. I will update when I find out more.
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

To rule out our firewall devices or brand of device, I set up a traffic test on our switch, which is directly connected to the HM server, and so far today we have received one bad bandwidth value.

The port on the switch monitored is connected to the LAN interface of the firewall. So this is the gateway port to the internet.

Question.
Can we assume the issue may be within the server itself if not within HM?

Traffic In from port 1 on switch
3/18/2011 1:49:37 AM Ok 0.05 Mbit
3/18/2011 1:50:38 AM Ok 0.04 Mbit
3/18/2011 1:51:38 AM Ok 0.05 Mbit
3/18/2011 1:52:39 AM Ok 0.05 Mbit
3/18/2011 1:54:25 AM Host is alive
3/18/2011 1:55:26 AM Ok 0.05 Mbit
3/18/2011 1:56:32 AM Host is alive
3/18/2011 1:57:35 AM Host is alive
3/18/2011 1:58:35 AM Ok 546.24 Mbit
3/18/2011 1:59:36 AM Ok 0.15 Mbit
3/18/2011 2:00:36 AM Ok 0.15 Mbit
3/18/2011 2:01:37 AM Ok 0.25 Mbit
3/18/2011 2:02:38 AM Ok 0.07 Mbit
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

So I increased the timeout from 2000 to 2500 and set the retry from 1 to 2, and now my traffic monitor reports correctly. The statistics now match between the switch port and the WAN interface port on the firewall.

Checked the logs and still see one or two "Host is alive" entries, but no more crazy values like 1654 Mbits after them. So it is at least reliable now. I think the problem is in the server. I see the "Host is alive" at the exact same time in the log comparing 2 completely different pieces of hardware (switch and firewall). So it is not the firewall or the switch.

I'm going to try and move the install to another server and see what happens.
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

I see the "Host is alive" at the exact same time in the log comparing 2 completely different pieces of hardware (switch and firewall). So it is not the firewall or the switch.
:roll:
I think you may try to set up an SNMP Get test to retrieve the raw counters without any processing on the HostMonitor side.
1.3.6.1.2.1.31.1.1.1.6.index = In Octets
1.3.6.1.2.1.31.1.1.1.10.index = Out Octets
The index to use for the interface can be checked by using the "Choose network interface" dialog.
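These two OIDs are the 64-bit IF-MIB counters ifHCInOctets and ifHCOutOctets. For cross-checking the raw values outside HostMonitor, one option is the Net-SNMP `snmpget` tool. The sketch below only builds the command line and parses a numeric reply line; the host, community string and interface index are placeholders, and the sample reply line is an assumed example of `snmpget -On` output rather than data from this thread.

```python
IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"    # In Octets (64-bit)
IF_HC_OUT_OCTETS = "1.3.6.1.2.1.31.1.1.1.10"  # Out Octets (64-bit)

def counter_oids(if_index):
    """Return the full In/Out OIDs for one interface index."""
    return {
        "in": f"{IF_HC_IN_OCTETS}.{if_index}",
        "out": f"{IF_HC_OUT_OCTETS}.{if_index}",
    }

def snmpget_command(host, community, if_index):
    """Build (but do not run) an snmpget command for both counters."""
    oids = counter_oids(if_index)
    return ["snmpget", "-v", "2c", "-c", community, "-On",
            host, oids["in"], oids["out"]]

def parse_counter(line):
    """Extract the integer from a reply line like '<oid> = Counter64: 12345'."""
    _, _, value = line.rpartition(":")
    return int(value.strip())

# Placeholder host, community and index; replace with real values
command = snmpget_command("192.0.2.1", "public", 1)
print(" ".join(command))
print(parse_counter(".1.3.6.1.2.1.31.1.1.1.6.1 = Counter64: 12345"))
```

The list returned by `snmpget_command` can be passed to `subprocess.run` on a machine where Net-SNMP is installed.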

Regards
Alex
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

Tests created. Will report back in a day or two.
xcentric
Posts: 176
Joined: Sat Oct 23, 2010 4:30 pm

Post by xcentric »

Well, after what seemed like an eternity, I now know the cause of this mess.

I switched all of my WAN traffic tests from SNMP v2c to v1.
Monitoring server interfaces does not seem to be affected by this problem. I have not had a single bad reading since.

Even though the routers used (SonicWALL) support 64-bit counters, there is something in their implementation that misbehaves with HostMonitor. I have no other routers to test. Cisco is probably unaffected. :roll:

I did read that if your bandwidth consumption is below a certain threshold you should ONLY use 32-bit counters, and likewise above a certain threshold 64-bit counters should be used. Obviously 64-bit counters would keep your counters from resetting frequently.

I just assumed that if your device is capable of 64-bit counters you should just use them by default. It seems that this is not the case.
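Some background that may explain why the switch to v1 changed the behaviour: as far as I know, SNMP v1 has no Counter64 type, so a v1 test reads the 32-bit interface counters instead of the 64-bit ones. The sketch below, with function names invented for illustration, estimates how quickly a 32-bit octet counter wraps at a given traffic rate and shows a wrap-tolerant difference between two readings.

```python
def wrap_seconds(bits_per_second, counter_bits=32):
    """Seconds until an octet counter of this width wraps at a constant rate."""
    return (2 ** counter_bits) * 8 / bits_per_second

def counter_delta(previous, current, counter_bits=32):
    """Difference between two readings, allowing for at most one wrap."""
    return (current - previous) % (2 ** counter_bits)

# A 32-bit octet counter at a steady 1 Mbit/s and 100 Mbit/s
print(f"{wrap_seconds(1_000_000) / 3600:.1f} hours at 1 Mbit/s")
print(f"{wrap_seconds(100_000_000):.0f} seconds at 100 Mbit/s")

# Reading just before and just after a wrap
print(counter_delta(4_294_967_000, 500))
```

With a 1 minute polling interval and traffic well below 100 Mbit/s, a 32-bit counter wraps at most once between polls, so a wrap-tolerant difference like the one above is enough.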

I am not an SNMP guru by any means. I am just happy I can supply accurate reports on traffic.

Regards