How not to design distributed applications
This is an initial response to Anant Narayanan’s inquiry as to where I disagreed with his talk. So, I want to thank him for being kind enough to engage me on Twitter, which prompted me to write this. Perhaps I have misinterpreted some statements or saw a fault where there was none. I’m happy to have that discussion in the comments.
The 2013 Realtime Conference looks amazing. If you haven’t seen the talks, go ahead and treat yourself here: http://2013.realtimeconf.com/video/. It is well worth your time, inspirational, and entertaining.
However, there was one talk that stood out, which struck me intuitively as “all sorts of wrong”, and that was Anant Narayanan’s “Message Passing vs. Data Synchronization” (http://vimeo.com/77352415). This is a write up of my thoughts as Anant kindly wanted to dig into what I thought and where I disagreed.
Anant rightfully points out that simple implementations of Message Passing do not cut it. There are things to consider in the design of distributed systems and he highlights: persistence, fault tolerance, scaling, consistency, and security. All of these have been solved before, as he points out, but then he goes on to state his thesis that none of the code implementing those solutions belongs in our application code. This is where “all sorts of wrong” begins.
The entire talk serves as a warning against going along with the thesis, for the simple reason that if we abstract persistence, fault tolerance, scaling, consistency, and security to something else without understanding what we’re doing, we risk relying on a wrong implementation, as shown throughout the talk. Also, what is it that we’re building at that point and why are we getting paid to do it?
The talk states that “message passing is just a primitive” and that we should abstract away from it. This is fine, as long as the abstractions do not lose touch with asynchronous and faulty nature of message passing. More importantly, the abstractions should also be correct. It is very easy to cross the fine line from abstraction into ambiguation, and the examples presented in the talk do just that.
We are shown a strawman abstraction comparison between the progression from DOM manipulation to jQuery to Angularjs/Ember.js which is contrasted with progression from WebSockets to Socket.io to “?”. The talk goes on to suggest that the progression from jQuery to Angularjs and Ember.js is equivalent to a progression from Socket.io to persistent, fault tolerant, scalable, consistent, and secure applications. The talk proposes we are “just” one abstraction step from realizing the ultimate distributed dream. I’m sorry, but this is nowhere near a fair comparison and it overly encourages the viewer to make gross simplifications which will end up badly. Why? Because the proposed solution is “Data Synchronization.”
In describing what “Data Synchronization” is, the talk posits that “most apps observe and modify data” without pausing and asking how desirable that is in a distributed system in the first place. A common pattern does not imply a correct pattern. This is the wrong metaphor to begin with, and it leads at first to what appear on the surface to be similar solutions, but then diverges quite rapidly into the wrong types of solutions for distributed systems. To contrast, consider a distributed application that is composed of services where “services respond to events and emit commands.” I can make both metaphors fit, but the latter one is better choice for distributed system design.
Next, the talk posits that “data reflects state.” This is ok on its own, but again, the metaphor implied in this statement is the wrong kind. This is because Anant speaks as though he is implying a global application state. To contrast, there is a different kind of state, which would be distributed application state that requires no synchronization. If an application is designed using services that are properly bounded and do not require access to each other’s data, there is no data in the entire application that would require synchronization. As before, one of these metaphors is appropriate for distributed systems and one of them isn’t.
The talk proceeds to describe a chat system where the chat information is stored in some synchronized data store into which we can insert rows and which can replay rows on demand when a user needs to see them. We have systems now large enough that I can point to Twitter architecture as an existence proof that this is not a way to design a distributed chat application. It is the classic “it works until it doesn’t” design, and if we start where the talk suggests, we will end up with a grand rewrite once we reach a certain threshold. We already know how to build these systems, we could start with the correct design, but this would require us to familiarize ourselves with persistence, fault tolerance, scaling, consistency, and security concerns, which the talk attempts to convince is not necessary via statements like: “you want this layer of data synchronization”. No, actually, you don’t.
Next, the talk takes on fault tolerance by arguing that checking for errors (as a result of message passing) does not belong in our application code. This only holds until we have to receive confirmation that the command (a message) the application sent was actually processed, and at that point, we will find ourselves checking for errors. Additionally, because we are working on a distributed system, sometimes the simplest way of correcting an error will be through user action and not some form of automatic retry. This is a property of open systems, and to build those, we cannot assume the abstraction of Data Synchronization.
When the talk addresses security, I cannot fault this stance, because web application security is broken across the industry. We have abused the Access Control List (ACL) model to death and most have forgotten about (or never thought to look for) a better alternative which is Object Capabilities. It is, however, a complex topic and a subject for another post. If you are interested you can check out http://www.erights.org/elib/capability/index.html, which is a great repository on the subject. To demonstrate that this is not some fringe technology, you could think about what OAuth2 (http://tools.ietf.org/html/rfc6749) and Bearer Token usage (http://tools.ietf.org/html/rfc6750) give you and how one would implement a client on top of an API using Bearer Tokens instead of a classic RESTful web API.
The talk sums up the above examples by stating that this is why we shouldn’t concern ourselves with Message Passing. I argue that due to what I outlined above we should absolutely concern ourselves with Message Passing and other distributed concerns. Otherwise, we’ll end up implementing systems that will break from poor design.
The talk then demonstrates the “advantages” of Data Synchronization, and the first slide asks “why not directly store state?”. For an answer to that, I refer you to Nathan Marz’s Lambda Architecture talk (http://www.infoq.com/presentations/Complexity-Big-Data) and let you answer that for yourself instead of accepting Anant’s answer. Additionally, there are multiple advantages realized using Event Sourcing and CQRS (see http://en.wikipedia.org/wiki/Domain-driven_design) that one gives up by “directly storing state.” We are then shown how distributed counters are “inefficient” by demonstration of a distributed counter implemented the wrong way. For how to implement distributed counters take a look at Distributed Counters in Cassandra (http://www.datastax.com/wp-content/uploads/2011/07/cassandra_sf_counters.pdf) for how it can be done in Cassandra, or consider the problem of probabilistic counters described in Big Data Counting: How to Count a Billion Distinct Objects Using Only 1.5 KB Of Memory (http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html). It’s not an easy problem.
Anant leaves us with the question “where is the Angular or Ember of data”? I would say, that perhaps Joyent’s Manta (http://www.joyent.com/products/manta) is a good start. However, Data Synchronization, as presented, is definitely not the answer.
I would like to thank Scott Bellware for taking the time to review this post and suggesting improvements.
Thanks Tristan for this thoughtful analysis. I don’t necessarily agree with all of your points, but in general, I agree that if you know you’re building a distributed system, you probably need to carefully architect it and implement it yourself, probably on top of message passing. However, lots and lots of apps (especially web based ones) have similar requirements and data synchronization is a very useful abstraction for those types of apps.
If you’re Twitter or Facebook, you likely have the resources needed to architect a distributed system from the ground up, and in those specific cases data synchronization may not be the best fit, especially if you can design something lower level and more specific to the requirements at hand. But, there’s a huge spectrum between “Hobby Project” and “Facebook”, and not every application needs a finely tuned system, especially at the layer of message passing. I think the main message of my talk is to look to see if your application falls in that spectrum, and if it’s dealing with a data synchronization problem, then to consider writing (or using) the correct abstraction on top of message passing.
Thanks again for posting your thoughts on the talk, I’m always looking for feedback on how to improve the content and delivery!
Hey Anant, thank you for reading and commenting.
I think my thesis is that you don’t need to be Twitter or Facebook to design a distributed system and my fear is that “you can’t get there from here” from a “Hobby Project” to “Facebook”.