RMA False Positive

All questions related to installations, configurations and maintenance of Advanced Host Monitor (including additional tools such as RMA for Windows, RMA Manager, Web Service, RCC).
Wraithlabs.Net
Posts: 6
Joined: Fri Jan 16, 2015 10:28 am
Location: Dallas, Texas
Contact:

RMA False Positive

Post by Wraithlabs.Net »

Over the past couple of weeks we have noticed that a couple of tests are failing quite frequently. To work around this we spun up another server and installed an Active RMA on it. After switching the failing tests to the new server, they are stable and no longer failing. This is not a long-term solution, because we can't just spin up a new server every time tests start failing.

We are running HM ver. 9.9 with RMA ver. 4.88. The HM server and both RMA servers run Windows Server 2008 R2 SP1. We have no special Winsock installed and we are not using ODBC logging. The auditing tool says that we are running 9 tests/sec and that the box should be able to handle the workload with no issue. I am wondering why we had to spin up another server just to share the workload and stop the tests from failing.

Can anyone help me determine the cause of the test failures? The tests we are running are URL request tests for HTTPS sites. Before we switched the tests to the new RMA they were failing every couple of minutes, but the site is stable and could be accessed with no issue while the tests were failing. The tests are now stable, as they should have been originally. Can someone help me understand why the tests failed on one RMA and are now stable on another, when the boxes are identical?
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

I am wondering why we had to spin up another server just to share the work load and stop the tests from failing.
New server?? I think it's easier to just restart the agent. Does that not help? Server restart? Does that not help either?
Could you please check resource usage on system that does not work properly?

You may use the standard Windows Task Manager to check Handles, GDI and USER objects. What is the total resource usage on the system? How many handles/threads/GDI objects are used by RMA?
the tests we are running are URL request tests for https sites
Just URL tests? No other methods?
Before switching the test to the new RMA they were failing every couple minutes but the site is stable and was able to be accessed with no issue when the tests failed.
What exactly does "failing" mean? What is the test status?
Bad? Unknown? No answer? Unknown host? Is there any error in the Reply field?

Regards
Alex

Post by Wraithlabs.Net »

We tried restarting both the server and the RMA itself, with no luck. The server itself is not overwhelmed. CPU usage is 1% and memory is at 42% used. 712 handles, 15 threads, 0 USER objects and 0 GDI objects. We have 767 tests in total, but the tests that were failing were strictly URL request tests checking HTTPS. The tests displayed No answer when they were down. Since Sunday each of the 6 tests has failed 953 times, 5718 failures in total, which is killing the SLA stats that our customers see. The new server's CPU usage is at 0% with 15% memory usage, and the RMA process has 355 handles, 14 threads, 0 USER objects and 0 GDI objects. The systems don't appear to be overwhelmed at all. I would think that if the RMA were hung it would fail other tests and not just these 6, since the RMA itself runs about 80 tests. What info can I send you to help narrow down the cause of the failures?

Post by KS-Soft »

We tried restarting both the server and the RMA itself, with no luck.
Then it's not an RMA problem. Maybe the network card/driver does not work properly?
What timeout do you use for the tests?
Is there some 3rd party software, like antivirus?
Is this a virtual or a physical server?
The CPU usage is 1% memory is at 42% used. 712 Handles, 15 threads, 0 user objects and 0 GDI objects
Looks normal for RMA.
What about other processes and total usage (total handles, threads)?
I would think that if the RMA were hung it would fail other tests and not just these 6, since the RMA itself runs about 80 tests.
If RMA had hung, you would see Unknown test status and a reply like "RMA: Connection error".
No answer means RMA works just fine but does not receive an answer from the target server.

You check HTTPS servers... maybe there is some problem with a client certificate? If these servers require a client certificate and it was removed, this may lead to such a problem.
Does the test always return "No answer" status, or sometimes "Host is alive"? In the second case it looks like a timeout issue; maybe the network is too slow.
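The distinction drawn here (always "No answer" versus flipping between "No answer" and "Host is alive") can be checked against an exported status history. The following is only a sketch: the status strings are the ones quoted in this thread, while the `history` list and the function names are made up for illustration and are not part of HostMonitor.

```python
from collections import Counter


def summarize_statuses(statuses):
    """Return per-status counts and the number of flips between consecutive results."""
    counts = Counter(statuses)
    flips = sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)
    return counts, flips


def describe(statuses):
    """Label a history as 'no failures', 'always failing' or 'intermittent'."""
    counts, _ = summarize_statuses(statuses)
    if counts.get("No answer", 0) == 0:
        return "no failures"
    if counts.get("Host is alive", 0) == 0:
        return "always failing"
    return "intermittent"


# Hypothetical sample history, oldest result first.
history = ["Host is alive", "No answer", "Host is alive", "Host is alive", "No answer"]
counts, flips = summarize_statuses(history)
print(dict(counts), "flips:", flips, "->", describe(history))
```

An "intermittent" result would point toward the timeout or network explanation rather than a permanent configuration problem such as a missing certificate.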
What info can I send you to help narrow down the cause of the failures?
Try checking the network traffic between RMA and the web server. You will see whether there is any response, and whether the response arrives within the timeout...
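Alongside a packet capture, an independent timed request from the RMA host can show whether a reply arrives within the timeout. This is a minimal sketch using only the Python standard library, not anything HostMonitor provides: `probe` is a hypothetical helper, the URL in the usage note is a placeholder, and disabling certificate verification is an assumption that should be dropped if the real tests verify certificates.

```python
import ssl
import time
import urllib.request


def probe(url, timeout_s=30.0, opener=urllib.request.urlopen):
    """Request url once and return (ok, elapsed_seconds, detail).

    timeout_s applies to each blocking socket operation, not to the whole
    request, so a slowly trickling reply can take longer than timeout_s.
    """
    # Assumption: skip certificate checks for this diagnostic only.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s, context=ctx) as resp:
            resp.read()
            ok, detail = True, f"HTTP {resp.status}"
    except OSError as exc:
        # URLError, HTTPError (4xx/5xx replies) and socket timeouts are all OSError subclasses.
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    return ok, time.monotonic() - start, detail
```

Calling `probe("https://example.com/")` in a loop once a minute and logging the three returned values gives a record that can be lined up against the times at which the RMA reported "No answer".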

Regards
Alex

Post by Wraithlabs.Net »

We have no antivirus on this server. The timeout is set to 30 seconds. The two RMA servers are virtual servers and the HostMonitor box is a physical server. We have the tests set to allow redirects and to accept certificates with an invalid host and invalid dates, so I wouldn't think the certificate is causing the issue. The same tests are being used with the new RMA server, so if the tests were checking the wrong site they should have shown No answer with both RMAs. Both RMA servers are on the same host and have the same networks and the same NICs configured; they are identical except that the new server only has 6 tests running on it. The latency for both RMAs is around 2100 ms (the reply time for the test, not for the RMA itself), so well within the 30-second timeout. The tests were flipping between Host is alive and No answer every couple of minutes. So the RMA could contact the site; it just appeared that the RMA was getting overloaded, even though we only run 9 tests/sec in total. Since we transferred the tests to the new RMA they have not experienced an outage, but as I said in the original post this isn't a long-term solution.

What is the maximum number of tests that can be performed with an RMA, in terms of number of tests and frequency? Because the new box is a clone of the other server, the fact that switching to the new box fixed the problem leads me to believe that the RMA is being overwhelmed: there is nothing different between the boxes except the name and the number of tests run by each RMA. I am in charge of the monitoring systems at my company and my manager is asking me what is causing the outages, so I am looking for some clarity as to why an identical box would solve the issue if nothing differs between the boxes except load. This leads me to believe there is an upper limit on the number of tests that can be run per RMA, even though I can't find any documentation regarding the limits of the software.

Please let me know what else we can do to try to determine the root cause of this issue. We currently have 105 tests running on the original RMA and 6 tests on the new RMA. Thank you for your quick response to this issue.

Post by KS-Soft »

What is the maximum amount of tests that can be performed with a RMA? Number of tests being run and frequency.
Depends on your system, test method and test settings. Usually 50 tests per second is not a problem for RMA.
But if you start some heavy tests (e.g. 5 tests that check for some string within 500GB log files, or 5 tests that count files within a folder with 1000 subfolders), this will lead to much higher load than 50 ping tests.
So, which 105 tests exactly are running through this agent?

btw: is this Passive RMA or Active RMA?
The tests were flipping between Host is Alive and no answer every couple minutes. So it could contact the site, it just appeared that the RMA was getting over loaded
Have you used a network traffic analyzer to check whether the system sends the request and gets a reply from the web server?

Regards
Alex

Post by KS-Soft »

Over the past couple weeks we have noticed that a couple tests are failing quite frequently
Was RMA installed a couple of weeks ago? Or was it installed a year ago and everything worked fine before?

Regards
Alex

Post by Wraithlabs.Net »

The 105 tests include 100 URL request tests, 3 Ping tests, 1 SNMP test and 1 ODBC test. The original RMA was installed more than a year ago and the tests only started failing about 3 weeks ago. Before that, everything was acting normally. Both of these RMAs are Active RMAs. What network traffic analyzer should we use? I have used Wireshark in previous situations. What type of traffic should I be looking for? Just the traffic between the RMA and HostMonitor?

Post by KS-Soft »

The 105 tests include, 100 URL request, 3 Ping, 1 SNMP test and 1 ODBC test.
Which ODBC driver do you use? Some drivers have a lot of bugs (especially Oracle).
But usually bugs in ODBC drivers lead to problems after some time (resource leakage), so after a restart everything works fine for a while.
The original RMA was installed more than a year ago and just started having the tests fail about 3 weeks ago. But before 3 weeks ago everything was acting normally.
Does this mean something was changed 3 weeks ago? System updates? Driver updates?
What network traffic analyzer should we use? I have used Wireshark in previous situations. What type of traffic should I be looking for?
We like Wireshark as well.
Just the traffic between the RMA and HostMonitor?
As I said, there is no problem with HostMonitor <-> RMA communication. Otherwise you would see a different status and error message.
"No answer" test status means RMA does not get an answer from the target system.
So, if the URL (HTTPS) test does not work, you should check the HTTPS traffic (RMA <-> web site).

Regards
Alex

Post by Wraithlabs.Net »

We are using "SQL Server" 6.01.7601.17514 Microsoft Corporation sqlsrv32.dll and SQL Server Native Client 10.0 2009.100.1600.01 Microsoft Corporation sqlncli10.dll. There have been no driver updates, and Windows updates have not been applied for at least a year on the two boxes. I will use Wireshark and see what I can see as far as traffic goes. What ODBC drivers should be used on the servers? Where can I locate drivers to install if they need to be updated?
KS-Soft Europe
Posts: 2832
Joined: Tue May 16, 2006 4:41 am
Contact:

Post by KS-Soft Europe »

Microsoft SQL Server ODBC drivers should not cause problems.

Post by KS-Soft Europe »

Where can I locate drivers to install if they need to be updated?
Drivers should be available on the developer's website.

Post by Wraithlabs.Net »

After further testing we have determined that the issue is load on the RMA itself. We transferred the tests that were running on one of our other RMAs to the new RMA, and after the transfer the tests began to fail just as they had on the first RMA. The only difference between the two RMAs is the test load placed on them. We have not received any hard documentation on what an RMA can handle, but after these tests we see no other option than load being the cause, as we are only performing URL tests on the RMAs. I am wondering, though, why HostMonitor says load should not be an issue when it clearly is. We were performing 165 tests on a single RMA and have now split them into 82 and 83 tests per RMA, with no issue. We want to know why this isn't addressed in the website documentation. The resource demands on each server are very low and, as stated in earlier posts by KS-Soft, the strain shouldn't be too much for the RMAs, but there is an issue with load that needs to be addressed by the developers so that the RMAs better handle the number of tests being performed. The RMAs perform each test once per minute at most. We are performing 8 tests/sec. If I could get some feedback on this it would be appreciated.

Post by KS-Soft »

We want to know why this isn't addressed in the website documentation
Because
a) nobody else experiences such a problem.
b) you did not have any problems for a year. As you said: "The original RMA was installed more than a year ago and just started having the tests fail about 3 weeks ago. But before 3 weeks ago everything was acting normally."
This means RMA works fine. Something happened to your network 3 weeks ago (OK, now it's 5 weeks ago).

Maybe you updated some firewall in your network, and now it considers too many similar requests from a single IP to be an attack and blocks connections from this host?
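One way to test the firewall idea is to look at how many requests leave the RMA host within a short window. The following is a sketch only: it assumes request timestamps (in seconds) have been exported from a packet capture, the sample values are invented, and the actual threshold of any firewall on the network is unknown.

```python
def max_requests_in_window(timestamps, window_s):
    """Largest number of requests that fall inside any window of window_s seconds."""
    times = sorted(timestamps)
    best = 0
    left = 0
    for right, t in enumerate(times):
        # Shrink the window from the left until it spans less than window_s.
        while t - times[left] >= window_s:
            left += 1
        best = max(best, right - left + 1)
    return best


# Hypothetical send times of outgoing HTTPS requests, in seconds.
sent = [0.0, 0.2, 0.4, 0.5, 10.0, 10.1]
print(max_requests_in_window(sent, 1.0))
```

Comparing this peak for the 165-test setup with the peak after the 82/83 split would show whether the split reduced bursts from a single IP, which is what a rate-limiting firewall would react to.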
Have you used a network traffic analyzer to check whether the system sends the request and gets a reply from the web server?

If you see traffic (reply from the server) while RMA shows "No answer" status, then could you please send your test list to support@ks-soft.net? If there is some mistake in our code (or Windows WinHTTP module), we should be able to reproduce the problem.
Can we access target servers?
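The comparison suggested above (captured replies versus "No answer" results) can be sketched as a simple cross-reference. Everything here is an assumption for illustration: both lists are timestamps in seconds on the same clock, the reply times have already been filtered to the target server, and a reply counts if it was captured within the 30-second test timeout before the "No answer" result was recorded.

```python
from bisect import bisect_left, bisect_right


def replies_seen(no_answer_times, reply_times, timeout_s=30.0):
    """For each 'No answer' time, count captured replies in the preceding timeout window."""
    replies = sorted(reply_times)
    result = []
    for t in no_answer_times:
        lo = bisect_left(replies, t - timeout_s)
        hi = bisect_right(replies, t)
        result.append((t, hi - lo))
    return result


# Hypothetical data: two 'No answer' results and three captured replies.
failures = [100.0, 200.0]
captured = [75.0, 98.5, 150.0]
for when, count in replies_seen(failures, captured):
    print(f"No answer at {when}: {count} reply packet(s) captured in the window")
```

A non-zero count next to a "No answer" result is the case described above, where the server replied but the test still failed; a zero count suggests the reply never reached the RMA host.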

Regards
Alex