Parsing Market Data with Ragel, clang and GHC primops

· 6 min read

From time to time, I have the need to parse relatively small, fixed width binary messages. Like a ITCH 4.1 MoldUDP64 packet from our buddies over there at NASDAQ. Parsing should also be reasonably quick. And I’m lazy. Writing parsers by hand Maintaining handwritten parsers is no fun.

So I’m going to define our parser in Ragel, a parser generator for regular languages (think: regular expressions). It targets C/C++ and some other languages I don’t care about. Like Objective-C, D, Java and Ruby. We’ll focus on C99.

This example parser will handle a cut-down view of the ITCH spec. The full source for this post is available over on GitHub. We’re looking for order executions:

machine ITCHv41;

# Common fields for both Order Executed messages
nanoseconds           = any{4} >{ nanos    = __builtin_bswap32(*(const uint32_t*)(p)); };
orderExecutedShares   = any{4} >{ shares   = __builtin_bswap32(*(const uint32_t*)(p)); };
orderExecutedMatchNum = any{8} >{ matchNum = __builtin_bswap64(*(const uint64_t*)(p)); };
orderExecutedRefNum   = any{8} >{ refNum   = __builtin_bswap64(*(const uint64_t*)(p)); };

orderExecutedCommon = orderExecutedRefNum
                      orderExecutedShares
                      orderExecutedMatchNum;

# 4.5.1 Order Executed Message
orderExecuted = 'E' %{ type = 1; }
                nanoseconds
                orderExecutedCommon;

# 4.5.2 Order Executed With Price Message
orderExecutedPrintable = 'Y' %{ printable = true;  }
                       | 'N' %{ printable = false; };

orderExecutedPrice = any{4} >{ price = __builtin_bswap32(*(const uint32_t*)(p)); };

orderExecutedWithPrice = 'C' %{ type = 2; }
                         nanoseconds
                         orderExecutedCommon
                         orderExecutedPrintable
                         orderExecutedPrice;

main := orderExecuted | orderExecutedWithPrice;

Ragel’s compiler does a bunch of optimization and pumps out a hot mess of GOTOs that regular C compilers like gcc, icc and clang eat for breakfast every day. This parser really isn’t very exciting:

ITCH v4.1 state machine

Particularly when compared to a complete, validating ITCH parser:

Complete ITCH v4.1 parser state machine

An autovectorizing compiler can turn these state machines into machine code on par with some of the finest hand-rolled parsers. clang/LLVM does a decent job, adequate for now… but it also has some magic just hidden below the surface for Haskell developers. Namely an LLVM backend.

Typically we’d go ahead and consume parsers like these in GHC with a vanilla foreign import. But even with an unsafe import there is a relatively fixed amount of overhead due to switching calling conventions and loading out parameters. Normally this isn’t such a big deal, but we’re dealing with a lot of packets during simulation, billions and billions, and I’d like to dedicate some CPU to a task more productive than parsing.

To adopt GHC’s calling convention, we need to make our C code look enough like Haskell so they’ll play nice together. The first step is to define a function signature that looks like a regular STG function. The same thing that GHC generates. Like so:

extern void
ITCHv41_run(
    int64_t* restrict baseReg,
    int64_t* restrict sp,
    int64_t* restrict hp,
    const uint8_t* restrict buffer, // R1
    int64_t length, // R2
    int64_t r3,
    int64_t r4,
    int64_t r5,
    int64_t r6,
    int64_t* restrict spLim,
    float f1,
    float f2,
    float f3,
    float f4,
    double d1,
    double d2)

This is still a ccall function but we’ll fix that later. There is currently no way to define this as cc10 (LLVM’s internal name for GHC’s calling convention) in clang.

Step two is to jump to the return address, which lives on top of the STG stack (the sp argument)… with the desired arguments, like the results of parsing, in tow. This is a regular function call that gets converted to a tailcall later on by llc, LLVM’s native compiler, when using cc10.

// define a function pointer type that matches the STG calling convention
typedef void (*HsCall)(int64_t*, int64_t*, int64_t*, int64_t, int64_t, int64_t, int64_t,
                       int64_t, int64_t, int64_t*, float, float, float, float, double, double);

// Invoke the parser defined in our Ragel code
%% write exec;

const HsCall fun = (HsCall)sp[0];

// and then "return" our parameters as an unboxed tuple back to Haskell land
return fun(
    baseReg,
    sp,
    hp,
    type,
    nanos,
    shares,
    price,
    matchNum,
    refNum,
    spLim,
    undef,
    undef,
    undef,
    undef,
    undef,
    undef);

Compile it down to LLVM assembly…

ragel -G2 ITCHv41.rl -o ITCHv41.c
clang -O3 -emit-llvm -S ITCHv41.c -o ITCHv41.ll

And run it through sed to fix up the calling convention for the code generator… this is the magic part. Note: this is also overly general and will break any legit C calls. llc then pumps out an object file that GHC will link with later:

sed -e 's/call void/call cc10 void/; s/define void/define cc10 void/;' < ITCHv41.ll > ITCHv41.llp
llc -O3 -relocation-model=static -filetype=obj ITCHv41.llp -o ITCHv41.o

The last bit is to create a foreign primop import in Haskell. Many messages don’t fit within the 5 free registers (R2-R6) that are available here and need to be partially loaded onto the stack. In this example I’m just discarding the ‘printable’ flag to make everything fit in registers. Managing the stack is more involved. Perhaps I’ll cover it in a future post.

{-# LANGUAGE ForeignFunctionInterface #-}
{-# LANGUAGE GHCForeignImportPrim #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples  #-}
{-# LANGUAGE UnliftedFFITypes #-}

foreign import prim "ITCHv41_run"
  parseITCHv41# :: Addr# -> Word# -> (# Int#, Word#, Word#, Word#, Word#, Word# #)

data ITCHv41
  = OrderExecuted
      { nanos    :: {-# UNPACK #-} !Word64
      , refNum   :: {-# UNPACK #-} !Word64
      , shares   :: {-# UNPACK #-} !Word64
      , matchNum :: {-# UNPACK #-} !Word64 }
  | OrderExecutedWithPrice
      { nanos    :: {-# UNPACK #-} !Word64
      , refNum   :: {-# UNPACK #-} !Word64
      , shares   :: {-# UNPACK #-} !Word64
      , matchNum :: {-# UNPACK #-} !Word64
      , price    :: {-# UNPACK #-} !Word64 }
  | OtherMessage
      { status   :: {-# UNPACK #-} !Int64 }

-- | invoke the parser primop and allocate a record with the results
parseITCHv41 :: Ptr Word8 -> Word -> ITCHv41
parseITCHv41 (Ptr buffer) (W# length) = case parseITCHv41# buffer length of
  (# 1#, nanos, shares,     _, matchNum, refNum #) ->
     OrderExecuted (W64# nanos) (W64# refNum) (W64# shares) (W64# matchNum)

  (# 2#, nanos, shares, price, matchNum, refNum #) ->
     OrderExecutedWithPrice (W64# nanos) (W64# refNum) (W64# shares) (W64# matchNum) (W64# price)

  (# status, _,      _,     _,        _,      _ #) ->
     OtherMessage  (I64# status) -- insert other error handling here

Update :

I spent an hour hacking together a simple benchmark suite and posted the results. If you’ve kept reading this far you might find it interesting.

Comments

7 comments · 3 replys · 5 participants

  1. Avatar for Joel Reymont
    Joel Reymont Permalink to comment 538691889

    I know you guys are a Haskell shop but what's wrong with using Erlang for parsing data feeds?

    1. Avatar for Nathan Howell
      Nathan Howell Permalink to comment 538698466

      If you prefer Erlang vs SomeOtherLanguage and it meets your performance criteria... go for it. I'm not very experienced in Erlang so I won't comment on how well it would work this particular problem.

  2. Avatar for datalligator
    datalligator Permalink to comment 2733742508

    This seems to not work at all with ghc 8.0.1 and llvm 3.7 -- not sure what's changed but jump/cc10 call to Sp[0] give a bus error.

    1. Avatar for Nathan Howell
      Nathan Howell Permalink to comment 2734162499

      It's quite possible that the layout or primop ffi has changed. I haven't had the chance to mess around with LLVM and GHC 8.0 yet.

      1. Avatar for datalligator
        datalligator Permalink to comment 2735634088

        Yes indeed I'm going to have a dig into that and see what's happening -- seems to me that the the top of the stack is not what's expected when we enter the foreign primop.

  3. Avatar for Brandon Simmons
    Brandon Simmons Permalink to comment 3058331388

    Is it possible to do this (use `foreign import prim` for a C function) using the native backend? I tried adapting the above but get a segfault, and I don't really know what I'm doing. EDIT: posted an SO question here if anyone wants internet points: http://stackoverflow.com/qu...