Ten Cache Misses

Crushing Haskell like a Tin Can

Generics and Protocol Buffers: The Hackage Years

Last year I spent some time exploring GHC.Generics as a language for describing Protocol Buffers messages. Steve and I pushed just hard enough to get a real implementation out the door and it’s finally available on Hackage.

This package avoids the traditional Protocol Buffers flow of defining messages in a .proto file and running a preprocessor to generate code in some target language(s). Instead, we’ll define messages in Haskell and generate .proto files for interoperability [1]. A skeleton of a protoc plugin is starting to take shape too.

There are a couple of quirks when using it today. The main downside (imo) is the dependency on the type-level package. Come GHC 7.8.1 it should be possible to switch to GHC.TypeLits. I’d also like to provide a more seamless path for mapping existing datatypes to a Protocol Buffers message.

It’s an early release, but please check it out, kick the tires a bit, and let me know what you think. Send pull requests and track issues on GitHub.

The current syntax differs slightly from the original blog post. Encoding and decoding are still performed using cereal. More comprehensive docs and samples are available on Hackage or in the git repo.

Given a boilerplate module:

{-# LANGUAGE DeriveGeneric, OverloadedStrings #-}
import Data.Int
import Data.ProtocolBuffers
import Data.TypeLevel (D1, D2, D3, D4)
import Data.Serialize (runGet, runPut) -- from cereal
import Data.Text (Text)
import Data.Word
import GHC.Generics (Generic)

We can define some messages. Fields are defined using type functions such as Required, Optional, Repeated and Packed. A type-level number (D1, D2 .. Dn) defines the field tag. And the encoding style is selected with Value (for scalars and strings), Enumeration or Message.

Scalar encoding defaults to the traditional varint format unless you choose otherwise: the Value (Fixed a) (fixed-width) and Value (Signed a) (zigzag-encoded) forms are supported for integers.
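
To make the difference between the varint and zigzag forms a little more concrete, here is a rough sketch using nothing but base. It is not how the package implements them, and the names varint, zigZag and unZigZag are made up for illustration; the sketch only follows the wire format described in the Protocol Buffers encoding docs.

```haskell
import Data.Bits (shiftL, shiftR, xor, (.&.), (.|.))
import Data.Int (Int64)
import Data.Word (Word64, Word8)

-- | Base-128 varint: seven payload bits per byte, least significant
-- group first. Every byte except the last has its high bit set.
-- (Hypothetical helper, not part of the package.)
varint :: Word64 -> [Word8]
varint n
  | n < 0x80  = [fromIntegral n]
  | otherwise = (fromIntegral (n .&. 0x7F) .|. 0x80) : varint (n `shiftR` 7)

-- | Zigzag maps 0, -1, 1, -2, 2 .. to 0, 1, 2, 3, 4 .. so that values
-- of small magnitude stay small once they are varint-encoded.
-- shiftR on Int64 is an arithmetic shift, which this relies on.
zigZag :: Int64 -> Word64
zigZag n = fromIntegral ((n `shiftL` 1) `xor` (n `shiftR` 63))

-- | Inverse of 'zigZag'.
unZigZag :: Word64 -> Int64
unZigZag w = fromIntegral (w `shiftR` 1) `xor` negate (fromIntegral (w .&. 1))
```

With these definitions, varint 300 comes out as the two bytes [0xAC, 0x02]. A plain varint of -1 (reinterpreted as a Word64) takes ten bytes, while varint (zigZag (-1)) takes one, which is the usual argument for picking Signed when a field often holds small negative numbers.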

A basic message might contain a bunch of values:

data Simple = Simple
  { field1_a :: Required D1 (Value Int64) -- ^ The last field with tag = 1
  , field2_a :: Optional D2 (Value Text) -- ^ The last field with tag = 2
  , field3_a :: Repeated D3 (Value Bool) -- ^ All fields with tag = 3, ordering is preserved
  , field4_a :: Packed D4 (Value Word32) -- ^ A packed sequence, ordering is preserved
  } deriving (Generic, Show)
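
The difference between Repeated and Packed only shows up on the wire. Here is a small sketch of the two layouts for varint-encoded elements, following the standard Protocol Buffers wire format rather than the package's internals. The helper names (varint, fieldKey, repeatedVarints, packedVarints) are hypothetical and exist only for this illustration.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64, Word8)

-- | Base-128 varint, least significant group first (illustrative helper).
varint :: Word64 -> [Word8]
varint n
  | n < 0x80  = [fromIntegral n]
  | otherwise = (fromIntegral (n .&. 0x7F) .|. 0x80) : varint (n `shiftR` 7)

-- | A field key is the tag shifted left by three bits, combined with
-- the wire type (0 = varint, 2 = length-delimited).
fieldKey :: Word64 -> Word64 -> [Word8]
fieldKey tag wireType = varint ((tag `shiftL` 3) .|. wireType)

-- | Unpacked repeated field: every element carries its own key.
repeatedVarints :: Word64 -> [Word64] -> [Word8]
repeatedVarints tag = concatMap (\x -> fieldKey tag 0 ++ varint x)

-- | Packed field: one length-delimited key, the payload size in bytes,
-- then the elements back to back.
packedVarints :: Word64 -> [Word64] -> [Word8]
packedVarints tag xs =
  fieldKey tag 2 ++ varint (fromIntegral (length payload)) ++ payload
  where
    payload = concatMap varint xs
```

For the values [1..10] under tag 4, packedVarints produces 12 bytes (the key 0x22, the length 0x0A, then ten one-byte elements), while repeatedVarints produces 20 because the key is repeated before every element.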

Or we can define some regular Haskell enums and reference other messages:

data Color
  = Red
  | Green
  | Blue
    deriving (Enum, Show)

data Complex = Complex
  { field1_b :: Optional D1 (Enumeration Color) -- ^ This field is converted using Enum
  , field2_b :: Required D2 (Message Simple) -- ^ An embedded message
  } deriving (Generic, Show)
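
Since the enum field goes through Enum, it helps to remember how a derived instance numbers its constructors: from zero, in declaration order. The sketch below assumes the conversion boils down to fromEnum and toEnum, as the field comment suggests. The enumFromWire helper and the extra Bounded and Eq instances are my own additions for illustration, not something the package provides.

```haskell
data Color
  = Red
  | Green
  | Blue
    deriving (Bounded, Enum, Eq, Show)

-- | Total alternative to 'toEnum', which raises an error when the
-- number does not correspond to a constructor. (Hypothetical helper.)
enumFromWire :: (Bounded a, Enum a) => Int -> Maybe a
enumFromWire n = lookup n [(fromEnum c, c) | c <- [minBound .. maxBound]]

-- | The numbers a derived Enum assigns to Red, Green and Blue.
colorNumbers :: [Int]
colorNumbers = map fromEnum [Red, Green, Blue]
```

Here colorNumbers is [0, 1, 2], enumFromWire 2 gives Just Blue, and an unknown number such as 7 gives Nothing. If a matching .proto enum uses different numbers, a hand-written Enum instance would be needed to line them up.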

And encode them to ByteStrings:

runPut . encodeMessage $ Complex
  { field1_b = putField Green
  , field2_b = putField Simple
      { field1_a = putField 42
      , field2_a = putField "some text"
      , field3_a = putField [True, True, False, False]
      , field4_a = putField [1..10]
      }
  }

Decoding is basically just the opposite: runGet, decodeMessage and getField are the tools of choice.

  1. Eventually, at least. A proof of concept code generator is included but not yet functional.
