Monitoring (a Rails app) in a production environment
Running production apps and infrastructure requires someone (or something) to keep an eye on everything and make sure it all works well. In this article we show you how we do that with our ForwardMX Rails application.
Note: Every third party monitoring product mentioned here offers a free plan!
We have several different parts that we want to monitor: the website (the application), the two MX servers, and all the different cron jobs. But first, let's see what we are protecting against.
Things that go wrong:
- Application Errors & Unexpected user inputs
- Servers run out of RAM or disk space
- Servers have networking / hardware issues
- Cron jobs can fail, especially those that depend on external services
- Any part of the infrastructure is offline for some reason
Never be too optimistic here. If your IP is public, you will be attacked. If your server runs unmonitored, it will run out of disk space at some point; hardware does break, and so on. It is also nice to look at the warnings log, see nothing, and know that this means everything is actually working.
There are dozens of ways to receive these notifications. We use email and Slack, which nearly all of the external tools described below support out of the box.
Both Slack and email are more or less instant. An email client like Gmail or Inbox will learn that you care about those messages and prioritize them accordingly. In Slack every message is treated the same; just make sure not to disable notifications on the channel.
Application Errors & Unexpected user input
We have zero tolerance for errors in our application, and we use Rollbar to track them. Every error that is found gets fixed as soon as it is seen. Every error also opens a new Trello card in our "todo" section and is mirrored into our Slack.
Integrating Rollbar with Rails (or any modern web framework) is super easy. Essentially you just include the rollbar gem and set your API key. From then on, errors are sent to Rollbar automatically. Make sure to check your Rollbar notification settings; you might also want to hide the development environment.
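As a rough illustration of the two steps above, an initializer could look like the sketch below. It assumes the gem is already listed in the Gemfile (gem "rollbar") and that the access token lives in an environment variable we call ROLLBAR_ACCESS_TOKEN here; that variable name is our own choice, not something Rollbar mandates. Disabling reporting in development and test is one possible way to keep local noise out of the dashboard, not necessarily the only one.

```ruby
# config/initializers/rollbar.rb
# Minimal sketch; meant to be loaded by Rails, not run on its own.
Rollbar.configure do |config|
  # Assumption: the token is provided through an environment variable
  # (the name ROLLBAR_ACCESS_TOKEN is hypothetical).
  config.access_token = ENV["ROLLBAR_ACCESS_TOKEN"]

  # Tag every report with the Rails environment it came from.
  config.environment = Rails.env

  # Do not report errors from local development or the test suite.
  config.enabled = false if Rails.env.development? || Rails.env.test?
end
```

The gem also ships a generator (rails generate rollbar) that writes a more complete version of this file.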
Another very popular alternative is Sentry, which also comes with a generous free plan and offers otherwise similar functionality.
Server & Application Health
The biggest player in this market is New Relic, which is also what we use. They also offer a generous free plan. A popular alternative would be Datadog, but I'll focus on New Relic here as this is what we use and know.
We use APM (Application Performance Monitoring) and classic Server Monitoring. APM is installed directly in our Rails application (it also supports every other major framework/language) and tracks data like transaction time, error rate, and even details like Ruby VM heap size and RAM usage. Server monitoring is especially helpful for keeping track of CPU, RAM, and disk usage.
Make sure to check the alert policies for each of these so that the right events trigger alerts. This is also where you can hook up Slack, Campfire or HipChat to receive your events.
It's worth mentioning that the following is more of a gimmick: if cron jobs fail with an exception, they will end up in your error logging solution (Rollbar in our case) anyway.
There are plenty of solutions on the market. One we found particularly interesting is Healthchecks.io. The idea is to ping a specific private URL every time the cron job runs and to set, on the site's dashboard, the time range in which you expect the pings to arrive. It is simple but very effective.
If a cron job times out, you will get an email notification about it. We also use the Healthchecks.io API to trigger additional alerts ourselves.
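The ping idea can be sketched in a few lines of plain Ruby. This is not our production code, only an illustration under some assumptions: the helper names with_healthcheck and safe_ping are made up for this example, the check URL is whatever private ping URL the dashboard shows for your check, and appending /fail to that URL to signal a failed run follows our reading of the Healthchecks.io ping API, so verify it against their documentation. The transport is injectable so the helper can be tried without touching the network.

```ruby
require "net/http"
require "uri"

# Default transport: a plain GET with short timeouts, so that a slow
# monitoring service never stalls the job itself.
DEFAULT_PING = lambda do |url|
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: 5, read_timeout: 5) do |http|
    http.get(uri.request_uri)
  end
end

# A failing ping must never break the job, so errors raised while
# pinging are logged to stderr and swallowed.
def safe_ping(ping, url)
  ping.call(url)
rescue StandardError => e
  warn "healthcheck ping failed: #{e.message}"
end

# Runs the block and reports the outcome to the check URL.
# Success pings the bare URL; an exception pings "<url>/fail" and is
# re-raised so the error tracker still receives it.
def with_healthcheck(check_url, ping: DEFAULT_PING)
  result = yield
  safe_ping(ping, check_url)
  result
rescue StandardError
  safe_ping(ping, "#{check_url}/fail")
  raise
end
```

Inside a rake task or runner script, usage might look like with_healthcheck(ENV.fetch("HC_CLEANUP_URL")) { CleanupJob.run }, where both HC_CLEANUP_URL and CleanupJob are placeholders for your own names.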
Ever opened one of your websites just to see that it is not even online anymore? This is what you get without uptime monitoring, and it is why we are big fans of UptimeRobot. Their free package includes 50 monitors checked every 5 minutes, and their Pro plan starts at $5.
We set up monitors for every single service and server. We ping all the servers for availability and check services like the email servers, the website, the internal API, and so on. UptimeRobot also includes several notification options, including Twitter and Slack.
UptimeRobot also reports page load time, which can be relevant for user satisfaction and SEO. It's always a good idea to keep an eye on it.
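To make it concrete what a single HTTP monitor does, here is a minimal sketch of one check: request a URL, note the status code, and time the response. The names check_endpoint and CheckResult are invented for this example, treating any 2xx or 3xx status as "up" is our assumption, and the measured time covers only the HTTP response, not a full page load in a browser. A hosted service adds what a script on your own server cannot: checks from outside your network, on a schedule, with alerting.

```ruby
require "net/http"
require "uri"

# Outcome of one check: reachable or not, HTTP status, elapsed seconds.
CheckResult = Struct.new(:up, :status, :seconds)

# Default fetcher: a single GET request; redirects are not followed.
DEFAULT_FETCH = ->(url) { Net::HTTP.get_response(URI.parse(url)) }

# Performs one check against the URL. Any network error counts as "down".
def check_endpoint(url, fetch: DEFAULT_FETCH)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = fetch.call(url)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  code = response.code.to_i
  # Assumption: 2xx and 3xx responses mean the service is up.
  CheckResult.new(code.between?(200, 399), code, elapsed)
rescue StandardError
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  CheckResult.new(false, nil, elapsed)
end
```

Calling check_endpoint("https://example.com") returns a CheckResult whose up, status, and seconds fields can be logged or compared against a threshold.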
That is essentially it. Now make sure you actually receive everything and are able to react. If you like to sleep well, hire someone in a different timezone to idle in your Slack and react when something important happens.
The upcoming update will include a health site that publicly presents a lot of these monitoring metrics!