Have you ever watched Coda Hale's Metrics, Metrics Everywhere talk? A few choice graphs can show incredible insight into your entire application, touching on many areas. And best of all, they're often easy to implement.
Two simple super critical charts
Towards the end of summer I decided to create a little side project. The idea was that I would build a swarm of web crawlers that would crawl a boutique camera store and record the price and inventory levels of products in a database. I'd then wrap a simple web application around that database, where users could view price history or sign up for price-drop alerts.
Right from the beginning I realized I could make two charts to get an idea of how my web crawlers were behaving. All I did was create a few simple SQL queries (think select count(*) from table t where t.created_on = today) that ran on a schedule and appended their results to a .txt file that D3.js read to populate some charts. Here's what they look like:
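The schedule-and-append pattern can be sketched in a few lines. This is only an illustration, not the post's actual code: the table name (price_checks), column name (created_on), and SQLite backend are all assumptions, since the real schema isn't given.

```python
import sqlite3
from datetime import date

# Hypothetical schema: one row per price check, stamped with its date.
conn = sqlite3.connect(":memory:")
conn.execute("create table price_checks (product_id integer, created_on text)")
conn.executemany(
    "insert into price_checks values (?, ?)",
    [(1, date.today().isoformat()), (2, date.today().isoformat())],
)

# The scheduled query: how many price checks landed today?
(count,) = conn.execute(
    "select count(*) from price_checks where created_on = ?",
    (date.today().isoformat(),),
).fetchone()

# Build the "date,count" row that would be appended to the .txt
# file D3.js reads to draw the chart.
line = f"{date.today().isoformat()},{count}\n"
```

A cron job running this once a day and appending `line` to a file is all the "metrics pipeline" the charts need.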
What do these two charts and stats cover? A lot!
- How many products I'm monitoring
- The rate at which new products are being added
- How my web crawlers are doing right now, in the last 6 hours, and in the last 24 hours (think of top's load averages)
- How my web crawlers were doing yesterday, or the day before, or any day in the last 6 months.
- Any trends or slowdowns. For example, there were some dips in the bottom chart which corresponded to days I performed data migrations.
- How performant my web crawlers are after 3 rewrites in 3 different languages/frameworks. I can point to this chart and be certain my rewrites are working better than before. Look at the first month where I was getting 5k to 20k price checks per day, compared to the 35k I get these days.
Based on historical data, I know what to expect tomorrow. If the numbers aren't where I'd expect, I immediately know something is up. It turned out to be trivial to wire up an email alert if I don't get >6k price checks in the last 6 hours, allowing me to be proactive in hunting down the problem. Could it be my database? Could it be the crawlers? Could it be one of the EC2 instances my crawlers are running on? Could it be the proxies my crawlers are using? Could it be that the website I'm crawling is down? Could it be that the website I'm crawling changed its content or format?
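The alert check itself is just the same count query with a threshold. Here's a minimal sketch, again with made-up table and column names; a real version would run on a schedule and hand `should_alert` off to smtplib or a mail service.

```python
import sqlite3
from datetime import datetime, timedelta

ALERT_THRESHOLD = 6000  # from the post: expect >6k price checks per 6 hours

conn = sqlite3.connect(":memory:")
conn.execute("create table price_checks (product_id integer, checked_at text)")

# Simulate a healthy run: 6,500 checks landed half an hour ago.
now = datetime.now()
conn.executemany(
    "insert into price_checks values (?, ?)",
    [(i, (now - timedelta(minutes=30)).isoformat()) for i in range(6500)],
)

# Count everything in the trailing 6-hour window. ISO-8601 strings
# sort chronologically, so a plain string comparison works.
cutoff = (now - timedelta(hours=6)).isoformat()
(recent,) = conn.execute(
    "select count(*) from price_checks where checked_at > ?", (cutoff,)
).fetchone()

should_alert = recent < ALERT_THRESHOLD  # if True, send the email
```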
These graphs don't necessarily isolate the problem, but they very quickly let me know when anything is out of the ordinary so I can start investigating. And after a few months with these graphs I realized that the six potential "problem areas" in the last paragraph have never actually been problems. The few hours I spent creating these graphs saved me countless days of needlessly optimizing or worrying over my architecture.
But if you can't spare a few hours...
How about free graphs?
About a month ago I was perusing the Google crawl stats graphs. I noticed a steady climb of the average time spent downloading a page. When I launched this little project of mine the average download time was around 100-200ms. Over the course of a few months it was increasing, up to around 3000ms!
Of course you're now thinking, "3 seconds to load a page! What's going on!?"
I was thinking the same thing. I did some investigating and it turned out I had completely forgotten to add an index on a column used as a key in a query that runs on every page request. Not only that, but other services were adding rows to this table on a regular basis, so the page response time was growing every day!
I slapped my head when I realized this silly mistake and added the index. A few months later I checked the crawl stats and was happy to see this (last chart at the bottom):
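The fix really is a one-liner. Here's a sketch with hypothetical table and column names, using SQLite's EXPLAIN QUERY PLAN to show the query going from a full table scan to an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table page_hits (page_key text, hit_count integer)")

# Without an index, the lookup has to scan the whole table,
# and the scan gets slower as other services keep adding rows.
plan_before = conn.execute(
    "explain query plan select * from page_hits where page_key = ?", ("home",)
).fetchall()

# The fix: index the column the per-request query filters on.
conn.execute("create index idx_page_key on page_hits (page_key)")

plan_after = conn.execute(
    "explain query plan select * from page_hits where page_key = ?", ("home",)
).fetchall()
```

The `detail` column of the plan flips from a SCAN of the table to a SEARCH using the new index, which is exactly the difference between response times growing with the table and staying flat.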
That's pretty good evidence of the problem AND the solution. Rock on!
Update on 1/16/2014:
I spent some time playing around with InfluxDB a couple days ago and... Wow. If I need pretty charts (or super specific ones) for the end-user to see, I'll use the methods described here (SQL selects appended to .txt and drawn with D3.js).
If on the other hand I just need a place to throw metrics at, to monitor later (historical or realtime), I'm most certainly going to use InfluxDB (the successor to Graphite, as far as I can tell). There's also Grafana, which I haven't had a chance to explore yet, but it looks like an awesome way to design monitors and dashboards. It's on my to-do list.
Update on 2/4/2014:
Although I wonder, since this is a little low-traffic side project of mine, if I were to get an email alert while I'm out with friends... just how proactive I would be in running home and firing up a bunch of remote terminals in a hurry, furiously crunching out SQL queries and grepping over logs...