Ten Cache Misses

Crushing Haskell like a Tin Can

Parsing Market Data with Ragel, clang and GHC primops

From time to time, I have the need to parse relatively small, fixed width binary messages. Like a ITCH 4.1 MoldUDP64 packet from our buddies over there at NASDAQ. Parsing should also be reasonably quick. And I’m lazy. ~~Writing parsers by hand~~ Maintaining handwritten parsers is no fun.

So I’m going to define our parser in Ragel, a parser generator for regular languages (think: regular expressions). It targets C/C++ and some other languages I don’t care about. Like Objective-C, D, Java and Ruby. We’ll focus on C99.

This example parser will handle a cut-down view of the ITCH spec. The full source for this post is available over on GitHub. We’re looking for order executions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
machine ITCHv41;

# Common fields for both Order Executed messages
nanoseconds           = any{4} >{ nanos    = __builtin_bswap32(*(const uint32_t*)(p)); };
orderExecutedShares   = any{4} >{ shares   = __builtin_bswap32(*(const uint32_t*)(p)); };
orderExecutedMatchNum = any{8} >{ matchNum = __builtin_bswap64(*(const uint64_t*)(p)); };
orderExecutedRefNum   = any{8} >{ refNum   = __builtin_bswap64(*(const uint64_t*)(p)); };

orderExecutedCommon = orderExecutedRefNum
                      orderExecutedShares
                      orderExecutedMatchNum;

# 4.5.1 Order Executed Message
orderExecuted = 'E' %{ type = 1; }
                nanoseconds
                orderExecutedCommon;

# 4.5.2 Order Executed With Price Message
orderExecutedPrintable = 'Y' %{ printable = true;  }
                       | 'N' %{ printable = false; };

orderExecutedPrice = any{4} >{ price = __builtin_bswap32(*(const uint32_t*)(p)); };

orderExecutedWithPrice = 'C' %{ type = 2; }
                         nanoseconds
                         orderExecutedCommon
                         orderExecutedPrintable
                         orderExecutedPrice;

main := orderExecuted | orderExecutedWithPrice;

Ragel’s compiler does a bunch of optimization and pumps out a hot mess of GOTOs that regular C compilers like gcc, icc and clang eat for breakfast every day. This parser really isn’t very exciting:

Particularly when compared to a complete, validating ITCH parser:

An autovectorizing compiler can turn these state machines into machine code on par with some of the finest hand-rolled parsers. clang/LLVM does a decent job, adequate for now… but it also has some magic just hidden below the surface for Haskell developers. Namely an LLVM backend.

Typically we’d go ahead and consume parsers like these in GHC with a vanilla foreign import. But even with an unsafe import there is a relatively fixed amount of overhead due to switching calling conventions and loading out parameters. Normally this isn’t such a big deal, but we’re dealing with a lot of packets during simulation, billions and billions, and I’d like to dedicate some CPU to a task more productive than parsing.

To adopt GHC’s calling convention, we need to make our C code look enough like Haskell so they’ll play nice together. The first step is to define a function signature that looks like a regular STG function. The same thing that GHC generates. Like so:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
extern void
ITCHv41_run(
    int64_t* restrict baseReg,
    int64_t* restrict sp,
    int64_t* restrict hp,
    const uint8_t* restrict buffer, // R1
    int64_t length, // R2
    int64_t r3,
    int64_t r4,
    int64_t r5,
    int64_t r6,
    int64_t* restrict spLim,
    float f1,
    float f2,
    float f3,
    float f4,
    double d1,
    double d2)

This is still a ccall function but we’ll fix that later. There is currently no way to define this as cc10 (LLVM’s internal name for GHC’s calling convention) in clang.

Step two is to jump to the return address, which lives on top of the STG stack (the sp argument)… with the desired arguments, like the results of parsing, in tow. This is a regular function call that gets converted to a tailcall later on by llc, LLVM’s native compiler, when using cc10.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// define a function pointer type that matches the STG calling convention
typedef void (*HsCall)(int64_t*, int64_t*, int64_t*, int64_t, int64_t, int64_t, int64_t,
                       int64_t, int64_t, int64_t*, float, float, float, float, double, double);

// Invoke the parser defined in our Ragel code
%% write exec;

const HsCall fun = (HsCall)sp[0];

// and then "return" our parameters as an unboxed tuple back to Haskell land
return fun(
    baseReg,
    sp,
    hp,
    type,
    nanos,
    shares,
    price,
    matchNum,
    refNum,
    spLim,
    undef,
    undef,
    undef,
    undef,
    undef,
    undef);

Compile it down to LLVM assembly…

1
2
ragel -G2 ITCHv41.rl -o ITCHv41.c
clang -O3 -emit-llvm -S ITCHv41.c -o ITCHv41.ll

And run it through sed to fix up the calling convention for the code generator… this is the magic part. Note: this is also overly general and will break any legit C calls. llc then pumps out an object file that GHC will link with later:

1
2
sed -e 's/call void/call cc10 void/; s/define void/define cc10 void/;' < ITCHv41.ll > ITCHv41.llp
llc -O3 -relocation-model=static -filetype=obj ITCHv41.llp -o ITCHv41.o

The last bit is to create a foreign primop import in Haskell. Many messages don’t fit within the 5 free registers (R2-R6) that are available here and need to be partially loaded onto the stack. In this example I’m just discarding the ‘printable’ flag to make everything fit in registers. Managing the stack is more involved. Perhaps I’ll cover it in a future post.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
{-# LANGUAGE ForeignFunctionInterface #-}
{-# LANGUAGE GHCForeignImportPrim #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples  #-}
{-# LANGUAGE UnliftedFFITypes #-}

foreign import prim "ITCHv41_run"
  parseITCHv41# :: Addr# -> Word# -> (# Int#, Word#, Word#, Word#, Word#, Word# #)

data ITCHv41
  = OrderExecuted
      { nanos    :: {-# UNPACK #-} !Word64
      , refNum   :: {-# UNPACK #-} !Word64
      , shares   :: {-# UNPACK #-} !Word64
      , matchNum :: {-# UNPACK #-} !Word64 }
  | OrderExecutedWithPrice
      { nanos    :: {-# UNPACK #-} !Word64
      , refNum   :: {-# UNPACK #-} !Word64
      , shares   :: {-# UNPACK #-} !Word64
      , matchNum :: {-# UNPACK #-} !Word64
      , price    :: {-# UNPACK #-} !Word64 }
  | OtherMessage
      { status   :: {-# UNPACK #-} !Int64 }

-- | invoke the parser primop and allocate a record with the results
parseITCHv41 :: Ptr Word8 -> Word -> ITCHv41
parseITCHv41 (Ptr buffer) (W# length) = case parseITCHv41# buffer length of
  (# 1#, nanos, shares,     _, matchNum, refNum #) ->
     OrderExecuted (W64# nanos) (W64# refNum) (W64# shares) (W64# matchNum)

  (# 2#, nanos, shares, price, matchNum, refNum #) ->
     OrderExecutedWithPrice (W64# nanos) (W64# refNum) (W64# shares) (W64# matchNum) (W64# price)

  (# status, _,      _,     _,        _,      _ #) ->
     OtherMessage  (I64# status) -- insert other error handling here

Update 5/23/2012:

I spent an hour hacking together a simple benchmark suite and posted the results. If you’ve kept reading this far you might find it interesting.

Comments