Tuesday, February 8, 2011

It's Not the Size, It's How You Use It

Yesterday I saw a flurry of discussion surrounding a post deriding not just message queueing systems but people who choose to use them. Normally, considering the tone of the post, I would just ignore it as a rant (did he get saddled with a poor implementation somewhere?), but in this case I actually use queues in a moderate-sized system and I'd like to think I'm not a complete idiot. The post makes some statements about queues that contain some truth to them (as well as some that appear misinformed), but mostly it overgeneralizes and doesn't attempt to provide any in-depth discussion about the issues you need to consider when using (or choosing to use) queues. It certainly doesn't talk about any benefits. That's what this post is for.

First, a caveat: by no means am I an expert in queuing systems, but I'd like to think that I'm a decent programmer and I've been working with queuing systems in one form or another for about 5 years. If I misstate something here please let me know so I can correct it.

The Router Analogy

As a network engineer, I deal with queues all of the time. These aren't the message passing queues that Ted is talking about per se (I'll get to those shortly), but they share some characteristics. First and foremost, the queues in routers and switches provide elasticity, which in turn provides better link utilization. Network traffic can be quite bursty, and without queues to smooth out the bursts congestion would have a much more significant (and negative) effect than it otherwise does. The same holds true for a job processing system or any other message passing scenario: a queue allows you to plan for the steady state but deal with the occasional spike.

Another use of queues, coming from the networking perspective, is prioritization. In network equipment prioritization is typically done via 2 or more queues for a given egress port, with a variety of scheduling algorithms used to pull messages from the different queues. You could do the exact same thing with a queuing system, although many brokers support some form of prioritization within a given queue.

As Ted points out, you can achieve both of these goals with a database, syslog, or even flat text files (I repeat myself), but the question is, do you want to? Sure, I can see scenarios where you might have relatively few/relaxed requirements, but a modern message broker is not the simplistic tool that Ted makes them out to be. For all but the simplest cases you're going to be duplicating a lot of work that others have done (NIH syndrome?) and you probably won't write something as scalable or robust as what's already out there.

Dude, Where's My Message?

One of the claims that the post makes is that

Depending on your queue implementation, when you pop a message off, it's gone. The consumer acknowledges receipt of it, and the queue forgets about it. So, if your data processor fails, you've got data loss.

Well, the key there is "Depending on your queue implementation". Most brokers I've worked with offer the choice between auto-ACK and explicit ACK. They also can provide things like transactions and durable messages, but I'm not going to get into too much detail here. The way you typically deal with work that you don't want to lose is to pop the message, process it, and then ACK it after you're finished. This is called "at-least-once" processing because if your consumer falls over after it finishes processing but before the ACK, the message will go back on the queue and some other consumer will grab the message and reprocess it. For this reason it's recommended that the processing is idempotent. Without the explicit ACK messaging is basically "at-most-once". There are also cases where you really don't care if you lose a message, so no one is trying to say that one size fits all.

You can also try for "exactly-once", but things get significantly more complex. In fact, you really can't achieve it 100% of the time because, hey, we live in a non-deterministic world. There will always be some failure mode, although you can certainly approximate 100% guarantees to successive decimal points. In other words, your solution depends on what's good enough for your requirements.

As an example, let's look at the post's suggestion of using a database or syslog for collecting job messages. Publishing to syslog is unacknowledged, and clients often have buffers sitting between them and the syslog daemon (network, I/O, etc). It's entirely possible that work would get lost because the message never makes it to the syslog daemon in the first place. Once it does make it to the log you need to keep track of which messages you've processed and which ones remain. You're keeping state somewhere, so with a failure it's still possible that your consumer could reprocess messages, etc. Same thing with databases. You can use locking, transactions, etc to try and guard against lost/reprocessed messages and you're still in the same boat as with message queuing, you're just writing a lot of the logic yourself.

Shades of Blur

Another point that the post tries to make is the intractability of monitoring a queuing system. I think that the claim that using a message queue

...blurs your mental model of what's going on. You end up expecting synchronous behavior out of a system that's asynchronous.

is missing the point. What I would say instead is that you need to really understand the systems you're using. That means failure modes, limitations, and fundamentals of operation. This is me speaking not just as a developer, but as a network engineer and sysadmin. Disks will crash, RAID arrays will get corrupted, and BGP will mysteriously route your traffic from Saint Louis to Charlotte via a heavily congested T3 in DC (thanks Cogent!). I've seen syslog crash and fall over on Solaris and I've seen databases corrupt in all sorts of unique ways.

If you need strict synchronous behavior then a messaging system is probably the wrong tool. But if you understand the limitations going in then you make an informed decision. System engineering and architecture is seldom as black-and-white as this post portrays.

Keeping an Eye on Things

As I said before, queues can provide elasticity for the occasional spike, but if you're constantly running behind you need to reassess what you're doing. Add more consumers, prioritize, etc. Re-architecting for scale is not something unique to message queues. Database sharding, load balancing, etc, all attest to this.

A major component of this is monitoring, and monitoring well. Ted asks how you can monitor a queue with just the length. For starters, throughput, and therefor wait time (from production to consumption) is usually a more important metric in the systems I've worked on. Queue length could vary significantly from moment to moment depending on your production patterns, but if there's a good bound on how long messages take to process then you don't really care. If your processing time per-message stays fairly constant then queue length can be a good predictor for queue wait, but this isn't always the case.

For example, at my job I write and maintain an image processing system that renders images for our orders. Each resulting image can be comprised of one or more source images and one or more operations on those images. The time to render varies depending the script, but generally stays within a certain bound. In our usage we monitor not only the queue length but the average queue throughput. We also record the duration for each render as well as each render's individual operations (crop, compose, etc), but that's more for tuning and QA on the render engines than for indicating health of the system at this point. So far this system has served us well. We have a good idea of how big the queue gets depending on time of year, day of week, etc, and we know that with our current set of servers we should be seeing roughly the same throughput at any given time with a non-empty queue. What constitutes a healthy system is going to be different for every system whether or not it uses a queue (how big is my table, how large is the logfile, etc), so I don't see how not using a queue is going to help. In fact, most brokers I've worked with have good facilities for monitoring built-in (via SNMP, REST, etc), so again, this is something where you can either leverage other people's work or you ending up re-writing it all yourself.

Additionally, you should be monitoring the consumers. Even if you don't care about how long things take to process, you should know that things are being processed. This is not the difficult problem that the post makes it out to be, and again, you're going to want to do this no matter what mechanism you're using to move jobs around. Do you think that not using a queue is going to magically obviate the need to keep an eye on the whole system? Really?

Why I Use Queues

Thus far I've basically worked at refuting or expanding on some of the points (accusations?) made in the post. Now I would like to cover some of the things that I think make queuing systems beneficial.

First, queues not only decouple the producer and consumer, but they decouple the routing between the producer and consumer. In some systems the routing is configured on the broker itself, in others the routing can be specified by the consumers (AMQP, for example). This allows a lot of flexibility in terms of how you process messages. Because the transport between producer and consumer is abstracted, you can do interesting things like message transformation by re-routing the messages via a middle-man (see Apache Camel, among others). Not everyone will need this flexibility, but when you need it you don't have to make any changes to your producers or consumers.

Another aspect of this decoupling is the ability to transparently inspect traffic going through the queue. Much as you would configure a mirror port on a switch to do a packet capture, we can tap into a queue and see the traffic that's going through production. In my case I use this to sample the work queue to build QA data sets. That way we can run a representative sample of work without having to build our own artificial jobs to mimic the real work.

The second thing I like about queuing systems is that they can provide a unified infrastructure service for job management. Now, there is always the danger of holding a hammer and seeing everything as a nail. In our enterprise, however, we have a lot of disparate systems that do batch processing and with queuing we can get a better overall picture of the processing via a single service. We're also looking at federating 3000+ brokers so that our various sites can use a global addressing scheme to support disconnected operation (a limitation of our environment) without having to put all of the logic into each and every client and service. Given the option of writing an federated job system by myself and writing a small federation agent (about 600 lines of Scala) I'm always going to choose the latter.

Is He Done Yet?

Almost. Message queuing systems, like any other arrow in an engineer's quiver, can be misapplied. But the mis-/overuse of a given tool by the naive is no reason to tar it (or the people who use it) as useless. In the 14 or so years I've been programming I've seen misuse of a lot of techniques and tools, but that just prompts me to learn from those mistakes and try to pass what I've learned on to others. In that spirit, I hope this post succeeds in making developers better informed about queuing systems and where they work and where they don't.