At Jane Street, we often write OCaml programs that communicate over
the network with each other, and as such, we need to build lots of
little protocols for those programs to use. Macro systems like
sexplib and binprot make the generation of such protocols simpler.
The basic workflow is to create a module that contains types
corresponding to the messages in the protocol. Macros can then be
used to generate the serialization and deserialization functions.
Just share the protocol module between the different programs that
need to communicate with each other, and --poof-- you have a protocol.
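To make the workflow concrete, here is a minimal sketch of what such a shared protocol module might look like. The message type and the text format are made up for illustration, and the two functions are written by hand only so the example has no dependencies; with sexplib or binprot they would be generated from the type declaration instead.

```ocaml
(* Hypothetical protocol module shared by every program that speaks it.
   The serializer and parser are hand-written stand-ins for what the
   macros would generate from the type declaration. *)
module Protocol = struct
  type message =
    | Ping
    | Quote of string * int (* symbol, price in cents *)

  (* Serialize a message to one line of text.
     Assumes symbols contain no spaces. *)
  let to_string = function
    | Ping -> "ping"
    | Quote (symbol, cents) -> Printf.sprintf "quote %s %d" symbol cents

  (* Parse a line back into a message; None on malformed input. *)
  let of_string line =
    match String.split_on_char ' ' line with
    | [ "ping" ] -> Some Ping
    | [ "quote"; symbol; cents ] ->
      Option.map (fun c -> Quote (symbol, c)) (int_of_string_opt cents)
    | _ -> None
end
```

Both ends link against the same module, so a change to the type changes the serializer and the parser together.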
This is a highly convenient idiom, and it makes it much easier to quickly throw together a networked application. But things get more complicated when you need to start changing the protocol. In some cases, you can upgrade the entire system in one fell swoop. In that case, you can just modify your protocol, install the new system, and you're off to the races.
The complicated (and more common) case is where you can't afford to upgrade the entire system at once. Then you need to deal with version mismatches. The main approach we've taken to this problem is to make components support multiple versions of a given protocol at once. To do this, we keep around the modules that describe old versions of the protocol, and keep explicit version numbers associated with each protocol module. We then write conversion functions that allow for translation between different versions of the protocol, allowing one program to speak multiple versions of a protocol. When two different components need to communicate, they first negotiate the version of the protocol they will speak to each other, picking the largest version that they both support.
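The negotiation step can be sketched in a few lines. This assumes each side simply advertises a list of integer version numbers; the function name and the representation are illustrative, not a description of our actual code.

```ocaml
(* Each side advertises the protocol versions it can speak.
   Returns the largest version both support, or None if there is none. *)
let negotiate ~(ours : int list) ~(theirs : int list) : int option =
  (* Keep only the versions both sides support... *)
  let common = List.filter (fun v -> List.mem v theirs) ours in
  (* ...and pick the largest one, if any. *)
  List.fold_left
    (fun best v ->
      match best with
      | Some b when b >= v -> best
      | _ -> Some v)
    None common
```

A result of None means the two components have no version in common and should refuse to talk rather than guess.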
This approach works reasonably well, but it has some downsides. The translation functions are somewhat tedious to write, and therefore error-prone. And even though the idea sounds simple, it's hard to get the details right. We've had to play around with a few different approaches to writing the conversion functions. One approach is to write upgrade and downgrade functions from each version to and from its successor. You can then achieve any conversion by chaining the conversion functions together. Another approach we've tried is having another set of types that are an internal model of the communication protocol, and to have conversion functions between each supported version and the model. Both approaches are workable, but each has its own advantages and disadvantages.
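As a rough illustration of the first approach, here is a sketch with three hypothetical versions of an order message, where each version only knows how to convert to and from its neighbor. The types, field names, and the "default" account are all invented for the example.

```ocaml
module V1 = struct
  type order = { symbol : string; qty : int }
end

module V2 = struct
  (* V2 adds an optional limit price. *)
  type order = { symbol : string; qty : int; limit_cents : int option }
end

module V3 = struct
  (* V3 adds an account name. *)
  type order =
    { symbol : string; qty : int; limit_cents : int option; account : string }
end

(* One upgrade and one downgrade per adjacent pair of versions. *)
let up_1_2 (o : V1.order) : V2.order =
  { V2.symbol = o.V1.symbol; qty = o.V1.qty; limit_cents = None }

let up_2_3 (o : V2.order) : V3.order =
  { V3.symbol = o.V2.symbol
  ; qty = o.V2.qty
  ; limit_cents = o.V2.limit_cents
  ; account = "default" (* assumed default for data from older versions *)
  }

(* Downgrades here silently drop the fields the older version lacks. *)
let down_3_2 (o : V3.order) : V2.order =
  { V2.symbol = o.V3.symbol; qty = o.V3.qty; limit_cents = o.V3.limit_cents }

let down_2_1 (o : V2.order) : V1.order =
  { V1.symbol = o.V2.symbol; qty = o.V2.qty }

(* Any other conversion is obtained by chaining. *)
let up_1_3 (o : V1.order) : V3.order = up_2_3 (up_1_2 o)
let down_3_1 (o : V3.order) : V1.order = down_2_1 (down_3_2 o)
```

Adding a V4 means writing two new functions rather than touching every existing version, which is the main appeal; the cost is that every hop in the chain is one more place for a mistake.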
This is a problem that I'm sure many people have grappled with. I've just given a quick overview of how we deal with it. I'd love to see other people's comments on how they've approached the same issues.
Comments
NIH?
So you've built another rehash of Google Protocol Buffers?
Facebook's Thrift was sensible: it was written by ex-Googlers, released well before Google open-sourced the original, and adds a bit of networking sugar for common cases. Cisco's Etch is less excusable; it's really yet another attempt at full-on RPC magic that happens to have its own wire format, similar in aim to protobuf and Thrift but different from both. That you don't mention any of them (even just to dismiss their suitability) is somewhat troubling.
Surely you considered using them? Your (unnamed?) system as you've described it seems to be solving the exact same problem: how to maintain a binary inter-app wire protocol with flexible typesafe concurrent versioning.
Your methodology using automatically composed version-translation functions certainly seems novel: it would let you split/tuple/collect/retype fields seamlessly unlike protobuf/thrift, at the cost of more overhead in the transition and extra code. Instead of all that protobuf just has clients see only the fields they were compiled against — it's well-suited to bigtable's EAV model, but I can see how those traits would be undesirable for more rigidly-structured data.
Was there some impediment to composing your version-translation sugar on top of one of the existing versioned binary wire protocols? Was it just that bin_prot had already been "Invented Here" and was in heavy production?
A few thoughts
Sexplib and binprot and this approach to protocol versioning were developed before Thrift and Protocol Buffers were released. If we were starting out from scratch now, we might well approach things differently. We did consider things like XDR and ASN.1, but decided they were poor fits for our needs.
That said, there are some advantages that sexplib and binprot reap by not worrying about cross-language interoperability. Rather than use some IDL that might not fit OCaml's semantics, we use OCaml type declarations themselves as the protocol spec. This makes use of such protocols quite seamless in the simple case, and unusual features of the type system, such as variants, can be handled cleanly. As far as I could tell from reading about the Thrift type system, there was inexplicably no way of encoding a variant type. (The lack of proper support for variant types in virtually all mainstream languages continues to mystify me.)
We're quite interested in understanding how other people have approached these questions. Indeed, that was part of the point of this post...
Anything's better than ASN.1
I figured you had already built your system through accretion (and had it in production) well before the flurry of press over the releases from Facebook and Google. On the other hand, protobufs were public knowledge well before that; I first learned about them from the Sawzall paper in 2005.
It would have helped to have a lot more exposition in your post, you don't even name the versioning system the post is about! Does it not have a name?
Thrift may not support sum types or heterogeneous tuples, but I could give you grief just the same over binprot not handling typeclasses or bignums :)
At least for Google, their common data structure (a multidimensional sparse map) is likely worlds apart from yours — they don't have the same need for strict automatic conversion since their data is sparse anyway. They just needed to be able to add and remove fields without modifying extant clients, since their common case is just the introduction of new fields and the slow death of old ones. Because unset properties in bigtable aren't even NULL, there isn't really an issue with old clients retrieving new data that has unset fields, as it was already expected behaviour. I can see how that wouldn't work at all for your trading apps that really do need all of the fields they were compiled against.
There's something about data encoding that naturally brings out the bikeshed instinct in everyone. As a result there's a stark difference between standards that were designed intentionally upfront versus those that were iterated over time and then later standardized incidentally. Even the worst homegrown system would be leagues better than the horror that is ASN.1!
Incidental protocols aren't perfect: they tend to be built for a specific use internally and are only widely used once middle-aged, but in most cases I'll take an opinionated narrow idiom over years of bickering standardization committees. JSON is a perfect example: it has encoding flaws, and I dislike the syntax, but I have an intense appreciation for its major innovation — they avoided the standards process entirely; there was no opportunity for decision much less debate; just an admonition to "use only legal javascript literals".
Names
I figured you had already built your system through accretion (and had it in production) well before the flurry of press over the releases from Facebook and Google. On the other hand, protobufs were public knowledge well before that; I first learned about them from the Sawzall paper in 2005.
Public knowledge is one thing, a usable implementation is something else entirely. Only the latter would have affected our own plans.
It would have helped to have a lot more exposition in your post, you don't even name the versioning system the post is about! Does it not have a name?
Not really. The versioning is just an idiom that we use, rather than a full-on framework.
Thrift may not support sum types or heterogeneous tuples, but I could give you grief just the same over binprot not handling typeclasses or bignums :)
There's a difference between not supporting typeclasses and not supporting variants. ML and Haskell have lots of lovely features, but there are a few basic building blocks that give you most of the advantage that you get over languages like Java, and I think variants are one of those fundamental building blocks (parametric polymorphism is another one.) The fact that variants have not become more common is deeply strange.
Supporting new types like bignums is actually easy with both sexplib and binprot --- I'm not sure if OCaml's bignums happen to be supported out of the box, but it's an easy thing to add.
Sounds reasonable
I don't think that there is any really good way to automate the conversion between messages of different versions of a protocol. In the general case, it is even possible that some messages are impossible to translate, because of missing semantics in one version of the protocol or other. However, you might find the papers on generic isomorphisms by Frank Atanassow interesting.
I'm also rather skeptical of the idea of chaining conversions (of protocol messages) and would rather go (and have gone) with the direct conversion to/from internal representation instead. The problem with chaining conversions is that each conversion is a point of failure. By chaining conversions you are likely to increase the chance that something goes wrong. It is also quite possible for a particular feature of a protocol to disappear and then be reintroduced (perhaps in a slightly altered form) after a few revisions of the protocol. Simple chained conversions cannot deal with such a case.
What you describe is basically what I would do and have done earlier (aside from chaining conversions). For example, a few months ago I built a small graphical editor for a specific purpose in F#. For persisting the data manipulated by the editor, I first defined a type for the external representation of the data with (version) numbered data constructors (hinge points) allowing for future modifications. To give an idea:
Then I defined functions for converting from the editor's internal representation to the external representation and back:
To actually serialize the external representation, I wrote simple pretty printing and parsing routines using reflection that could handle all the type constructors I needed (variants, tuples, records, strings, lists and some primitive types) and used the F# syntax as the textual representation.
While working on the editor, and while it was already used for production work by others, I made several enhancements and changes to the internal and external representations of the data. To preserve the ability to read old files I never deleted any data constructors from the external rep. There was no need to be able to save to older formats (simpler problem than fully supporting multiple revisions of a protocol), so the editor always saved using the latest revisions of the external data constructors.
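The scheme described in this comment can be sketched roughly as follows. This is not the commenter's code: it is written in OCaml rather than F#, and the shape type, its fields, and the constructor names are all hypothetical.

```ocaml
(* Internal representation: free to change from release to release. *)
type shape = { x : int; y : int; label : string }

(* External (persisted) representation. Each constructor carries a
   revision number and is never deleted, so old files stay readable. *)
type external_shape =
  | Shape_v1 of int * int (* x, y *)
  | Shape_v2 of int * int * string (* x, y, label; added later *)

(* Saving always uses the latest revision of the constructor. *)
let to_external (s : shape) : external_shape = Shape_v2 (s.x, s.y, s.label)

(* Loading accepts every revision ever written, filling in defaults
   for data the older revisions did not record. *)
let of_external : external_shape -> shape = function
  | Shape_v1 (x, y) -> { x; y; label = "" }
  | Shape_v2 (x, y, label) -> { x; y; label }
```

Because nothing ever needs to be written in an old format, only of_external has to know about old revisions, which is what makes this simpler than full multi-version protocol support.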
No good answer
I don't think there's a particularly good answer to this, but for reference this is the scheme I've used in the past to handle the same problem in database schemas. It only works in relatively simple cases.
We adopted a strictly "append-only" rule for modifying tables, which means that (once the initial schema was implemented) we would only add extra columns to a table. An extra column can modify the meaning of an existing column, but the data in the existing column is still required so that old clients can work.
As a very made-up example, it would be like having:
familyname | givenname | age
in the initial schema, and then realizing later that we needed to record middle initials:
familyname | givenname | age | initials
(old clients ignore the 'initials' column and continue to construct names using familyname & givenname).
Later we realize that storing 'age' was a stupid idea (age changes), so we add a date-of-birth column, but we have to keep 'age' for older clients, or at least until we know that all clients that still use it have been decommissioned:
familyname | givenname | age | initials | dob
Newer clients are able to make more accurate age calculations by using the 'dob' field directly. Clients adding data can calculate the deprecated age field from the date of birth at the time the data is inserted.
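A small sketch of the append-only idea, under some simplifying assumptions: a row is modelled as a list of (column, value) pairs, 'dob' is reduced to a birth year, and the function names are invented for the example.

```ocaml
(* A row is a list of (column, value) pairs; columns are only appended. *)
type row = (string * string) list

(* Old client: written against the initial schema. It looks up only the
   columns it knows about, so later additions are invisible to it. *)
let old_client_name (r : row) : string =
  List.assoc "givenname" r ^ " " ^ List.assoc "familyname" r

(* Newer client: uses 'initials' when the column is present and non-empty. *)
let new_client_name (r : row) : string =
  match List.assoc_opt "initials" r with
  | Some i when i <> "" ->
    List.assoc "givenname" r ^ " " ^ i ^ " " ^ List.assoc "familyname" r
  | _ -> old_client_name r

(* Writers still fill in the deprecated 'age' column, computed (roughly)
   from the birth year at insertion time, so old clients keep working. *)
let insert ~familyname ~givenname ~initials ~birth_year ~current_year : row =
  [ "familyname", familyname
  ; "givenname", givenname
  ; "age", string_of_int (current_year - birth_year)
  ; "initials", initials
  ; "dob", string_of_int birth_year
  ]
```

The same row serves both generations of client; the price is that the stale 'age' column has to be carried until the last old client is gone.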
When you get into relations between database tables, it gets harder to modify, and you get more and nastier change-control problems for the accessor code.
an end run
Sometimes I ask in a public forum the question, What's the best way to do such and such?, and often some of the responses are to the tune of Why would you want to do that?, which I usually find annoying. So, at the risk of being annoying...
Consider that you have many components in your system and for now assume there is only one protocol spoken among these components, albeit in several versions. Call the versions A, B, C, and D, from oldest to newest. I am sure you have already found that having multiple versions in play causes severe headaches, and it only gets worse when you consider the real-world example where there are numerous protocols instead of just one.
In our example, we need any version A, B, C, D to be able to talk to any other version A, B, C, D. You mentioned two approaches: (i) having a chain of maps A <-> B <-> C <-> D, and (ii) using a pseudo-version (the model) as a hub H and implementing a map from each of A, B, C, D to and from H.
Since all these maps are bidirectional it seems reasonable to call two versions equivalent if they can be mapped to one another without loss of information, and I assume these are the kinds of protocol translations you are talking about. So forgive me, here is my annoying question: If in fact the protocol versions are equivalent, why not stick with the old version? At some point, (truly) new capabilities will be added, and it will be necessary and desired to change all the relevant components at once, as a faithful translation will not be possible.
Of course, sometimes the protocol changes come from external requirements and if that's the situation you're facing, please ignore my comments. But in my experience, if you're in control of the whole system of components, it's more convenient and less error-prone to upgrade protocols only when a new feature leaves you no choice.
upgrades and downgrades
You're basically hitting on a point that I elided in my original post. Indeed, the upgrade and downgrade functions are typically not total in both directions for all versions. Usually when you have a new version, it's because there are some things you would like to represent that you can't in the old version. Protocol upgrades are indeed painful, and we avoid them when we can.
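One way to make that partiality explicit is to have the non-total direction return a result instead of a bare value. The sketch below uses two hypothetical versions of a "side" type; the names are made up for the example.

```ocaml
module Side_v1 = struct
  type t = Buy | Sell
end

module Side_v2 = struct
  (* The newer version adds a case the older one cannot express. *)
  type t = Buy | Sell | Sell_short
end

(* Upgrading is total: every old value has a counterpart. *)
let upgrade : Side_v1.t -> Side_v2.t = function
  | Side_v1.Buy -> Side_v2.Buy
  | Side_v1.Sell -> Side_v2.Sell

(* Downgrading is partial: report an error rather than guess. *)
let downgrade : Side_v2.t -> (Side_v1.t, string) result = function
  | Side_v2.Buy -> Ok Side_v1.Buy
  | Side_v2.Sell -> Ok Side_v1.Sell
  | Side_v2.Sell_short -> Error "Sell_short has no representation in v1"
```

The caller is then forced by the type checker to decide what to do when a message cannot be expressed in the older version.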
Something like Jini?
One approach you could take is how you would do this in a Jini-based system.
Have your communicating pieces go through a central lookup to find their partners. Example (using bogus trading system stuff): Executor v1 asks the locator, "I need ticker plant v6", and the locator says it's at 10.1.1.1:9999.
Your executor knows it can take ticker plant v6 protocol.
(In Jini it's for service versioning, but the concept still applies here)
When you need to upgrade to the v7 protocol, you start an additional ticker plant (easier said than done, depending on what markets you are watching and what your network infrastructure is).
In effect, for a major upgrade you deploy the new v7 protocol alongside v6. The last thing you do is upgrade the clients that kick off work (i.e., the executor).
Initially, it sounds like you may need double the hardware to accomplish this, but in reality you only need it for things receiving async data (e.g., the SIAC feed). For RPC-style services, your v7 servers won't actually be called at all (because v6 clients are in use), and as you upgrade clients, v7 gets more load while v6 gets less.
A side benefit of the locator concept is that you don't need very complicated instance configs. You can add and subtract servers at will, since only the locator has to keep track of the entire set, and the service providers (e.g., the ticker plant) just need to know how to get to the locator.
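A toy version of the locator idea, to show how little it needs to know. This is an in-memory sketch with invented names, not Jini's actual API: services register under a (name, protocol version) pair and clients look up the version they were built against.

```ocaml
(* Hypothetical locator: maps (service name, protocol version) to the
   addresses of running instances. *)
module Locator = struct
  type t = (string * int, string list) Hashtbl.t

  let create () : t = Hashtbl.create 16

  (* A service instance announces itself on startup. *)
  let register (t : t) ~service ~version ~address =
    let existing =
      Option.value (Hashtbl.find_opt t (service, version)) ~default:[]
    in
    Hashtbl.replace t (service, version) (address :: existing)

  (* A client asks for an instance speaking the version it was built for.
     Returns the most recently registered instance, if any. *)
  let lookup (t : t) ~service ~version : string option =
    match Hashtbl.find_opt t (service, version) with
    | Some (address :: _) -> Some address
    | _ -> None
end
```

Deploying v7 next to v6 is then just another register call; v6 clients never see the new instances because they never ask for them.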
Or as was also said... treat protocol changes as something you should avoid when at all possible.
Kind of depends on what you consider to be protocol.
Wire format? (e.g., key/value strings, protobuf-style compact binary)
Invocation signature? (e.g., RPC call signature)
Serialized object composition? (e.g., QuoteV1.deserialize(QuoteV2.serialize()))
None of which matters much in how I laid it out above, since the effect of that approach is to deploy a clone system and phase out the old one (which is harder when you get to external communications, e.g. FIX engines; but you can make the FIX engine super stupid so it doesn't need to be upgraded).
Long-term compatible protocols
Yes, this is a many-times-solved problem - though not always well solved. Seems like most everything I've worked on in the last twenty-odd years (or more) has had a network in the middle. A few guidelines...
1. Design for differing versions on the client and server.
In some limited cases you can keep the client and server versions in sync. For the most part - over time and over growing deployments - you cannot. Assume that the client version will often differ from the server (in either direction).
(A somewhat redundant statement in this context, but a point that I needed to make far too many times.)
2. Convert at the edges.
The types inside your application may change many times (for good reason). The structures that cross the network should change slowly (if at all). In your application, convert from internal to protocol structures at the edge.
3. Version your protocol.
This could be as simple as a single version number offered by the client to the server, and the server to the client, at the beginning of the conversation. You could do finer-grained versioning, but this is often not needed.
4. Do not be over-specific in your protocol types.
When passing a number across the net, use a general number type. Your application might internally use a small number type (8-bit or 16-bit), and you may think that is all you will ever need. Some of those numbers will grow. At the edge, when ingesting incoming data, check for values outside the range you can handle.
The same argument applies to strings, arrays, and the like.
5. Design for upwards and downwards compatibility.
When the client asks a question of the server, older servers may not understand the question, or return a smaller answer. When the client is older, the server may have a larger answer, or the question may be smaller. Most of the time, changes are incremental, and both server and client can handle the difference without explicit reference to the protocol version. Sooner or later someone will goof, and you will need to add a special case for a specific version.
Differences between versions can usually be kept small, and are easy to write and maintain. This is something like the approach "rwmjones" described in appending to database definitions.
This all might sound complex or difficult - but it is not.
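Guidelines 2 and 4 together amount to a small check at the point where data enters the application. A minimal sketch, assuming the wire carries a general integer and the application only handles unsigned 16-bit quantities (the names and the limit are illustrative):

```ocaml
(* Largest value the application can handle internally (unsigned 16-bit). *)
let max_u16 = 0xFFFF

(* Convert at the edge: accept a general integer from the wire and
   reject anything outside the range this application supports. *)
let ingest_quantity (wire_value : int) : (int, string) result =
  if wire_value < 0 || wire_value > max_u16 then
    Error
      (Printf.sprintf "quantity %d outside supported range 0..%d"
         wire_value max_u16)
  else Ok wire_value
```

When the numbers do grow, only max_u16 and the internal type change; the protocol itself stays as it was.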
BTW ... I do not belong on this list. :) Came in on a GMail web clip that mentioned closures (got me curious). My current iteration is server-side Java using introspection to generate and ingest JSON across the net (very small code for this). The client is either Java or Javascript (much use of closures) in the web browser. The web clip hit on a reply from the maintainer of TinyScheme - which is apparently the Scheme interpreter embedded in 4 million adware infested machines. (Slightly embarrassing, that - since I contributed slightly.)
Ajmani and Liskov, ECOOP 2006
The academic community has considered these issues somewhat in different contexts.
The most relevant multi-version work I can think of is Sameer Ajmani's thesis work that appeared in ECOOP 2006:
http://pmg.csail.mit.edu/pubs/ajmani06modular-abstract.html
Obviously we can try to map between the semantics of different versions, and that works plenty often, but sometimes it cannot be made to. The observation here is that messaging protocols (e.g., RPCs) differ from normal function call/return in that messages can fail. As such, we can/should design the system to expect such failure and code up mechanisms for recovery. This mechanism provides an opportunity: if a peer sends a message the remote peer does not understand (it is of an old version and the new version is no longer backward compatible), the receiver can simply fail and the client will be forced to recover.
Note that by "not understand" I don't mean syntax, but semantics. Suppose the server maintains a data structure that clients can add elements to, remove elements from, and then test for membership (all done via message passing). Suppose that in the old version this data structure was implemented as a bag, but in the new version it's a set. The message format is the same in both cases, but the semantics are different. The difference can be observed by the sequence Add(1), Add(1), Remove(1), Member?(1). In the old version the last message will return true, but in the new version it will return false. Using Ajmani's protocol, if we upgrade the server to the new version, we can allow old clients as long as the last message is made to fail, thus preserving, for that client, the old semantics.
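The bag-versus-set difference is easy to reproduce. The sketch below models both server versions as plain lists (a simplification chosen for brevity, with made-up module names) and runs the same sequence of operations against each.

```ocaml
(* Old server semantics: a bag (multiset), stored as a list with duplicates. *)
module Bag_server = struct
  let add x b = x :: b

  (* Remove a single occurrence of x, if there is one. *)
  let rec remove x = function
    | [] -> []
    | y :: rest -> if x = y then rest else y :: remove x rest

  let member x b = List.mem x b
end

(* New server semantics: a set, stored as a list without duplicates. *)
module Set_server = struct
  let add x s = if List.mem x s then s else x :: s
  let remove x s = List.filter (fun y -> y <> x) s
  let member x s = List.mem x s
end

(* The same message sequence: Add(1), Add(1), Remove(1), Member?(1). *)
let bag_result =
  [] |> Bag_server.add 1 |> Bag_server.add 1 |> Bag_server.remove 1
  |> Bag_server.member 1

let set_result =
  [] |> Set_server.add 1 |> Set_server.add 1 |> Set_server.remove 1
  |> Set_server.member 1
```

Here bag_result is true and set_result is false, even though the messages on the wire are identical in both cases.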
The paper goes into more theoretical depth about building models, indicating when failure is necessary (to provide the illusion of the old semantics), etc. His paper cites other relevant papers that might be useful, too.