Monitoring 101: What To Monitor?

One of the most basic requirements of a healthy infrastructure is adequate monitoring. Monitoring approaches change with the scale of the infrastructure and the kind of applications it runs. However, we have often observed that companies stick with the kind of monitoring they started with when they were quite small and don't upgrade it along with the rest of their infrastructure. In this post we are going to look at some of the most common strategies for monitoring infrastructures. But before we begin, let us define some terms:

  • Monitoring: Checking the status or value of anything of interest like website availability or CPU load.
  • Alerting: Sending an alert upon reaching a threshold. Important alerts should be sent by an active method like a text message, push notification or automated voice call.
  • Graphing: We should graph certain values in order to identify trends and correlate anomalies. This is very helpful in debugging and capacity planning.
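
As a rough illustration of how monitoring and alerting fit together, the sketch below classifies one measured value against two thresholds and then picks a notification method. The severity levels, the threshold values and the channel names are assumptions made for this example and do not come from any particular tool.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify(value: float, warn_at: float, crit_at: float) -> Severity:
    """Map a measured value (for example CPU load) to a severity level."""
    if value >= crit_at:
        return Severity.CRITICAL
    if value >= warn_at:
        return Severity.WARNING
    return Severity.OK


def pick_channel(severity: Severity) -> str:
    """Important alerts use an active method; less urgent ones can wait."""
    # Channel names are hypothetical placeholders.
    if severity is Severity.CRITICAL:
        return "phone-text"
    if severity is Severity.WARNING:
        return "email"
    return "none"
```

With thresholds of 0.7 and 0.9, a reading of 0.95 would go out as a phone text, while 0.8 would only produce an email.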

Small infrastructure (fewer than 10 servers)

This is for companies that are starting up. Monitoring is important even for small organizations. We don't want our clients to lose customers because one of the features was down and there was no way to know which one and when. Here is what to monitor:

  • Overall uptime of the application. This should be checked at an endpoint that customers use, like the home page or login page. Use a third-party tool like Pingdom for this. Often we are tempted to run a cron job or similar tool within our infrastructure to check this, but there are cases when things work within the infrastructure but not from outside, for example because of a firewall or a bad network uplink.
  • For small infrastructures, usually each server is important, so monitor all of them. Monitor all the usual parameters like CPU load, RAM utilization and disk space.
  • For databases, monitor backup size and duration, in addition to CPU load, RAM and disk. If replication is enabled, monitor replication lag as well.
  • Monitor internal HTTP and TCP endpoints.
  • Monitor the SSL certificate expiry date.
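
The endpoint and certificate checks in the list above can be sketched with the Python standard library alone. This is a minimal sketch, not a replacement for a monitoring tool: the function names and the timeouts are assumptions chosen for the example.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and a certificate's notAfter.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def tcp_endpoint_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A scheduled job could call days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc)) for each public hostname and alert when the result drops below a chosen number of days. Because the default context verifies the certificate, an already expired certificate makes fetch_not_after raise an error, which should itself be treated as an alert.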

Medium-sized infrastructure (from tens to a couple of thousand servers)

These are usually companies that have a stable or growing user base. At this point, the first order of business is to ensure that the applications are horizontally scalable and fault tolerant, so that the loss of a single server does not have a significant impact. Once that is done, here is what to monitor:

  • Overall uptime of applications using a third-party tool like Pingdom.
  • Overall response time of the application. Since individual instances are not as important, one good way to measure the health of the application is to measure its latency. This should be done using a third-party tool. However, be sure to pick a reliable one that does not suffer from internet speed issues itself.
  • At this scale, graphing for trend analysis is very important. Make sure that parameters like response time, error rates, CPU load and memory utilization are graphed. This way we can see if a bad deployment or an underlying infrastructure issue is making things worse for our users.
  • Make sure that databases have replicas and that failover is well tested. Monitor the usual parameters like replication delay, backup size, CPU load and memory. At this scale, start monitoring the number of database connections as well.
  • If the application is hosted in the cloud, monitor the quotas. This can be done periodically or before any significant event.
  • Monitor internal HTTP and TCP endpoints.
  • Monitor the SSL certificate expiry date.
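
To make the response time and error rate items above concrete, here is a small sketch that reduces one time window of request samples to the numbers one would typically graph. It assumes that samples arrive as (latency in milliseconds, HTTP status) pairs and that only 5xx responses count as errors; it uses the nearest-rank definition of a percentile, which is one of several in common use.

```python
import math
from typing import Dict, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes: List[int]) -> float:
    """Fraction of responses with a 5xx status code (an assumed definition of 'error')."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def summarize(samples: List[Tuple[float, int]]) -> Dict[str, float]:
    """Reduce one window of (latency_ms, status) samples to graphable values."""
    latencies = [latency for latency, _ in samples]
    statuses = [status for _, status in samples]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": error_rate(statuses),
    }
```

Plotting these values per window, with deployments marked on the same graph, makes it easier to see whether a release or an infrastructure issue changed what users experience.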

Large scale infrastructure (thousands of servers and more)

At this scale, the user base is large and so is the infrastructure. Monitoring individual servers no longer makes sense. Instead, we monitor the overall health of the application and the infrastructure. Here is what to monitor:

  • Overall uptime of applications using a third-party tool like Pingdom.
  • Overall response time of applications and other endpoints.
  • Overall error rates of the applications and other endpoints.
  • Graph the parameters of each cluster to see how many resources it is using and when it is a good time to scale. This also helps in ensuring that there is sufficient capacity to scale up, if required.
  • Ideally, the infrastructure at this point should be self-healing. In cloud environments, a damaged server should be replaced automatically. If that is not the case, then we should monitor individual servers and send out a low-priority alert so that bad servers can be fixed and put back manually.
  • Monitor the SSL certificate expiry date.
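
Watching clusters rather than individual servers can be sketched as a simple aggregation. The example below assumes that per-server CPU utilization is already collected as a fraction between 0.0 and 1.0; the ClusterReport type, the function name and the 70% scale-up threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict


@dataclass
class ClusterReport:
    name: str
    server_count: int
    avg_cpu: float          # mean CPU utilization across the cluster, 0.0 to 1.0
    should_scale_up: bool   # True once avg_cpu reaches the scale-up threshold


def summarize_cluster(name: str, cpu_by_server: Dict[str, float],
                      scale_up_at: float = 0.70) -> ClusterReport:
    """Collapse per-server CPU readings into one cluster-level data point."""
    if not cpu_by_server:
        raise ValueError(f"cluster {name!r} has no reporting servers")
    avg_cpu = mean(list(cpu_by_server.values()))
    return ClusterReport(name, len(cpu_by_server), avg_cpu, avg_cpu >= scale_up_at)
```

Graphing avg_cpu and server_count per cluster over time shows both how much each cluster consumes and how close it is to needing more capacity.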

At all times, we need to ensure that we monitor the monitoring infrastructure itself using a third-party tool.
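
One common way to watch the monitoring system is a heartbeat, sometimes called a dead man's switch: the monitoring system reports in on a schedule and an outside party raises an alert when the reports stop. The sketch below shows only the staleness check that the outside party would run. It is an assumed approach for illustration, and the function name and the five-minute default are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_is_stale(last_seen: Optional[datetime], now: datetime,
                       max_age: timedelta = timedelta(minutes=5)) -> bool:
    """True if the monitoring system has not reported in within max_age."""
    if last_seen is None:
        # No heartbeat was ever received, so treat the monitor as down.
        return True
    return now - last_seen > max_age
```

The important design point is that this check runs outside our own infrastructure, so that an outage that takes down the monitoring system cannot also silence the alert about it.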

Need help for setting up your monitoring infrastructure? We do all of the above and more. Contact us for more information.