Encoding and decoding C structures in Scala with scodec

While writing the Rust implementation of my Phantasy Star Online server, I ran into problems best described as "not DRY" (don't repeat yourself), particularly in parsing and encoding data. Writing that code by hand turned out to be so time-consuming that it was almost demoralizing, given how many more interesting things there were to work on.

My second attempt at this server is written in Scala. Some people might recoil: "A JVM language? But what about byte buffers, memory usage, garbage collection?" Yeah, people in server emulation tend not to like these things, but I don't care; I like breaking trends. In practice, I haven't really run into problems with any of them. But let's talk about manipulating byte buffers, because I'm avoiding that problem entirely by using a wonderful little library called scodec by Michael Pilquist.

scodec is a combinator library for Scala that lets you build complex two-way codecs between a Scala representation (a tuple, a case class, whatever you need it to be) and a bit string by combining simpler primitives into a tree-like structure. It relies heavily on Scala's type system and operates on immutable BitVectors (which, internally, may be represented by a rope of byte sequences).

Let's look at an example: a simple (packed) C structure.

// Assume everything is little endian and packed.
typedef struct {
    int16_t a;
    float   b;
} msg_t;

This is trivially represented in Scala as (Int, Float) (we're going to use Int instead of Short because it's easier to work with in-language). But if we want to read and write the literal in-memory layout that a C programmer might take advantage of, we would usually have to write a lot of verbose reader-writer code. Not with scodec!

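import scodec.Codec
import scodec.codecs._

// int16L and floatL are the little-endian 16-bit integer and 32-bit float codecs.
val msgCodec: Codec[(Int, Float)] = int16L ~ floatL

// encode returns an Attempt[BitVector]; require unwraps it or throws on failure.
msgCodec.encode((16, 0.0f)).require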

Just like that, msgCodec is a Codec that decodes a BitVector into an (Int, Float) and, conversely, encodes an (Int, Float) into a BitVector. The ~ function on Codec (here called on int16L) composes two codecs into a single codec that produces a tuple of the two codecs' type parameters. From the names of the primitive codecs, you know that these will encode and decode little-endian values, regardless of the host platform's endianness.
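To make the round trip concrete, here is a minimal sketch that encodes the tuple, prints the resulting bytes, and decodes them again. It assumes scodec-core 1.x on Scala 2 and is written script-style (a worksheet or REPL session); the names bits and result are only for illustration.

```scala
import scodec.Codec
import scodec.bits._
import scodec.codecs._

val msgCodec: Codec[(Int, Float)] = int16L ~ floatL

// 16 as a little-endian int16 is 10 00; 0.0f is four zero bytes.
val bits: BitVector = msgCodec.encode((16, 0.0f)).require
println(bits.toHex) // 100000000000

// decode returns the value together with any bits left unconsumed.
val result = msgCodec.decode(bits).require
println(result.value)             // (16,0.0)
println(result.remainder.isEmpty) // true
```

The six bytes match what a packed msg_t would occupy in memory on a little-endian machine.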

Now, this isn't tremendously useful. Hardly anyone uses tuples to store data in Scala, for a variety of reasons. Let's use a case class, since it's more idiomatic.

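case class Msg(a: Int = 0, b: Float = 0.0f)
object Msg {
  // Build an HList codec, then map it onto the fields of the case class.
  val codec: Codec[Msg] = (int16L :: floatL).as[Msg]
}

// Encode a Msg built from the default field values.
Msg.codec.encode(Msg()).require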

This takes advantage of a feature of the shapeless Scala library called HList, or heterogeneous list. This is essentially a list whose length and element types are all known at compile time, which gives us some nice properties. The :: function on Codec, like the ~ function, lets us compose codecs, but in this case an HList of the types is produced and consumed. The exact type is Codec[::[Int, ::[Float, HNil]]], but Scala allows infix notation for types with two type parameters, so we can write the inner type as Int :: Float :: HNil, where ::[H, T] is written H :: T and is right-associative.

The as[T] function on Codec uses an implicit parameter, resolved by the compiler, that automatically maps the HList to and from the apply/unapply parameters of the companion object for T. In turn, this means that scodec is able to produce a Codec[Msg] from a Codec[Int :: Float :: HNil]. This is extremely powerful, because now we can pattern match on and copy our new case class representing this C structure. If the composed HList codec does not line up with the fields of the case class, compilation fails, so a mismatch between the shape of the codec and the case class can't slip through to runtime.
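As a rough illustration of what the case class buys us, the sketch below decodes six raw bytes into a Msg, pattern matches on it, then copies it with one field changed and encodes it again. It assumes scodec-core 1.x on Scala 2 in a script or worksheet; the hex literal and the names decoded, description, updated, and bytes are made up for the example.

```scala
import scodec.Codec
import scodec.bits._
import scodec.codecs._

case class Msg(a: Int = 0, b: Float = 0.0f)
object Msg {
  val codec: Codec[Msg] = (int16L :: floatL).as[Msg]
}

// 01 00 is 1 as a little-endian int16; 00 00 80 3f is 1.0f as a little-endian float.
val decoded: Msg = Msg.codec.decode(hex"01000000803f".bits).require.value

// Pattern match on the decoded structure...
val description = decoded match {
  case Msg(a, b) if b > 0.0f => s"a=$a with positive b"
  case Msg(a, _)             => s"a=$a"
}

// ...or copy it with one field changed and encode it back to bytes.
val updated = decoded.copy(a = 2)
val bytes: ByteVector = Msg.codec.encode(updated).require.toByteVector
println(bytes.toHex) // 02000000803f
```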

There's one last basic feature to be aware of when designing codecs: the associated context string. If an encode or decode fails, scodec attaches a context string to the error that helps identify where in the process it failed. Specifying these is very simple:

val myCodec = ("a" | int16L) ~ ("b" | floatL)

The | extension function on String lets you attach a named context to any step of a composed codec. The names are included in the error message if the codec fails for any reason. In a larger, non-trivial codec, you may end up with context strings like this:

message/body/inventory/item 0/data

Now you can figure out exactly where encoding or decoding failed.
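Here is a small sketch of how a context path shows up in practice, assuming scodec-core 1.x on Scala 2. The outer "msg" label and the truncated three-byte input are invented for the example: the input holds the int16 but not the full float, so decoding should fail inside the "b" step.

```scala
import scodec.{Attempt, Codec}
import scodec.bits._
import scodec.codecs._

// Contexts nest: the outer "msg" label wraps the labelled fields.
val myCodec: Codec[(Int, Float)] =
  "msg" | (("a" | int16L) ~ ("b" | floatL))

// Three bytes: enough for the int16, not enough for the 32-bit float.
val outcome = myCodec.decode(hex"100000".bits)

// Collect the context path of the failure, outermost label first.
val context: List[String] = outcome match {
  case Attempt.Successful(_) => Nil
  case Attempt.Failure(err) =>
    println(err.messageWithContext) // the message is prefixed with "msg/b"
    err.context
}
```

The exact wording of the error message depends on the scodec version, but the context list should be List("msg", "b"), which is the path joined with slashes in the printed message.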

I've been very happy with scodec, and I find it an invaluable asset to the Scala ecosystem. There are plenty of more complicated combinators (like conditional codecs, unit codecs, and the discriminator codec builder), but the simplest primitives alone are very powerful for producing easily maintainable code. Note that codecs work on bit strings rather than whole bytes, so even for complex protocol specifications like TCP/IP, you can encode and decode into forms that are much easier to use directly in Scala code (a sequence of named Booleans for some bit flags, with no more manual bit handling). If performance becomes a problem, you can also drop down to a more basic API and write your own Codec.
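To illustrate the bit flags point, here is a sketch that maps the eight flag bits of a TCP header onto named Booleans using the one-bit bool codec. It assumes scodec-core 1.x on Scala 2; the TcpFlags case class and its field names are hypothetical and not part of scodec or of my server.

```scala
import scodec.Codec
import scodec.bits._
import scodec.codecs._

// Hypothetical case class naming the TCP flag bits, most significant bit first.
case class TcpFlags(cwr: Boolean, ece: Boolean, urg: Boolean, ack: Boolean,
                    psh: Boolean, rst: Boolean, syn: Boolean, fin: Boolean)

// Each bool codec consumes exactly one bit; the labels feed the context string.
val tcpFlags: Codec[TcpFlags] =
  (("cwr" | bool) :: ("ece" | bool) :: ("urg" | bool) :: ("ack" | bool) ::
   ("psh" | bool) :: ("rst" | bool) :: ("syn" | bool) :: ("fin" | bool)).as[TcpFlags]

// 0x12 is 0001 0010 in binary, which sets the ACK and SYN bits.
val synAck: TcpFlags = tcpFlags.decode(hex"12".bits).require.value

// Clearing ACK and re-encoding leaves only the SYN bit, 0x02.
val reencoded: BitVector = tcpFlags.encode(synAck.copy(ack = false)).require
println(reencoded.toHex) // 02
```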