OpenNMS ships with thresholds for some events already defined. For example, there is a memory threshold defined as:
<group name="netsnmp-memory-nonlinux" rrdRepository="/opt/opennms/share/rrd/snmp/">
  <expression type="low" expression="memAvailReal / memTotalReal * 100.0" ds-type="node" ds-label="" value="5.0" rearm="10.0" trigger="2"/>
</group>
i.e. if free memory drops below 5% (for two consecutive collections, per trigger="2"), an event will be created. The alert will be cancelled automatically if free memory subsequently rises above 10%.
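To make the trigger/rearm semantics concrete, here's a minimal sketch (in Python; the function and variable names are my own, not OpenNMS code) of how a "low" threshold with hysteresis behaves: the event fires only after the value has been below value for trigger consecutive collections, and the threshold re-arms once the value rises back above rearm.

```python
def evaluate(samples, value=5.0, rearm=10.0, trigger=2):
    """Simulate an OpenNMS-style 'low' threshold with hysteresis.

    Returns a list of (index, event) tuples, where event is
    'exceeded' when the threshold fires and 'rearmed' when it resets.
    """
    events = []
    below_count = 0
    armed = True  # True: watching for the value to drop below `value`
    for i, sample in enumerate(samples):
        if armed:
            if sample < value:
                below_count += 1
                if below_count >= trigger:
                    events.append((i, "exceeded"))
                    armed = False
            else:
                below_count = 0  # consecutive count resets on any good sample
        elif sample > rearm:
            events.append((i, "rearmed"))
            armed = True
            below_count = 0
    return events

# Free-memory percentages from successive collections:
print(evaluate([12.0, 4.0, 9.0, 4.5, 3.0, 7.0, 11.0]))
```

Note that the single dip to 4.0 never fires (the 9.0 sample resets the consecutive count), and 7.0 is not enough to re-arm because it is below the 10% rearm value.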
I wanted to configure some specific nodes with a different threshold, e.g. generate an event when free memory drops below 2.5%.
Here's what I did.
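In outline, the approach is: define a second threshold group in thresholds.xml with the tighter expression, then bind that group to just the chosen nodes via a dedicated package in threshd-configuration.xml. A sketch of the two fragments (the group name, addresses, and numbers below are placeholders of mine; check the element names against the config files shipped with your OpenNMS version):

```xml
<!-- thresholds.xml: a copy of the stock group with a tighter threshold.
     The group name "netsnmp-memory-tight" is my own invention. -->
<group name="netsnmp-memory-tight" rrdRepository="/opt/opennms/share/rrd/snmp/">
  <expression type="low" expression="memAvailReal / memTotalReal * 100.0"
              ds-type="node" ds-label="" value="2.5" rearm="5.0" trigger="2"/>
</group>

<!-- threshd-configuration.xml: a package that applies the new group
     only to specific nodes (the addresses below are examples). -->
<package name="tight-memory-nodes">
  <filter>IPADDR != '0.0.0.0'</filter>
  <specific>192.168.1.10</specific>
  <specific>192.168.1.11</specific>
  <service name="SNMP" interval="300000" user-defined="false" status="on">
    <parameter key="thresholding-group" value="netsnmp-memory-tight"/>
  </service>
</package>
```

Package order matters in threshd-configuration.xml, so make sure the more specific package wins for the nodes you care about.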
We’ve had a bunch of new servers in place for around 3 months now. They seem to be working well and are performing just fine.
Then, out of the blue, our monitoring started throwing alerts on seemingly random servers. Our queues were building up – basically, database performance had dropped dramatically and our processing scripts couldn’t stuff data into the DBs fast enough.
What could be causing it?
I'm using the net-snmp-lvs module to interface LVS statistics to SNMP so I can graph them (I'm using OpenNMS).
I have a virtual HTTP service that is balanced across eight real servers. In testing, everything seemed to work just fine and I got some nice graphs that show the Connection Rate, Packet Rate, and Byte Rate for the virtual service and each of the real servers.
This morning, we attempted a cutover, i.e. we redirected real traffic to the new service. Sadly, our perimeter firewall hit > 90% CPU so we had to revert. But in the time we were live, I noticed that the Connection Rate statistics were missing for both the virtual service and the real servers for the period in which the service was under high load:
Notice the gap in the Connection Rate graph during the period when the Packet Rate and Byte Rate graphs show high values.
I am currently investigating the cause of this issue.
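While I dig in, it's worth remembering how a rate graph like this is produced: the poller samples a monotonically increasing counter and divides the delta by the polling interval, so a missed sample (e.g. an SNMP timeout while the box is under load) or a counter reset gets stored as "unknown", which renders as a gap. A minimal sketch of that delta logic, with 32-bit wrap handling (the names are mine, and this is a simplification of what OpenNMS/RRD actually do):

```python
COUNTER32_MAX = 2**32

def rate_from_samples(prev, curr, interval_s, wrap=COUNTER32_MAX):
    """Turn two counter samples into a per-second rate.

    Returns None (an 'unknown' sample, i.e. a gap in the graph) when
    either sample is missing -- e.g. the SNMP agent timed out.
    A smaller current value is treated as a 32-bit counter wrap.
    """
    if prev is None or curr is None:
        return None
    delta = curr - prev
    if delta < 0:  # counter wrapped around 2**32
        delta += wrap
    return delta / interval_s

# 300-second polling interval; None models a timed-out poll.
samples = [1000, 16000, None, 46000]
rates = [rate_from_samples(a, b, 300) for a, b in zip(samples, samples[1:])]
print(rates)
```

Note that a single missed poll costs two rate samples (the intervals on either side of it), which is consistent with a gap that looks wider than one polling period.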