Why C still matters in 2013: a simple example

EDIT: In the original post, it wasn’t clear that I intended to perform the computations modulo 2^64 — I certainly understand these aren’t “correct” Fibonacci numbers: I’m not analyzing bignum arithmetic, but instead exploring compiler codegen and computer architecture.

Recently I’ve been hacking on Dynvm, a generic run-time for dynamic languages. As with any good language-runtime project, development is benchmark-driven. Thus, I’ve been benchmarking various algorithms in different languages to get a feel for typical speeds. One classic test is the iterative computation of Fibonacci numbers. For simplicity, I coded this algorithm modulo 2^64 in various languages.

In Python:

def fib(n):
    SZ = 2**64
    i = 0
    a, b = 1, 0
    while i < n:
        t = b
        b = (b+a) % SZ
        a = t
        i = i + 1
    return b

In C:

#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;

int main(int argc, char *argv[])
{
    ulong n = atoi(argv[1]);
    ulong a = 1;
    ulong b = 0;
    ulong t;

    for(ulong i = 0; i < n; i++) {
        t = b;
        b = a+b;
        a = t;
    }

    printf("%lu\n", b);
    return 0;
}
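
To reproduce, a typical invocation looks like this (the exact flags aren't recorded in the post, but full optimizations are assumed, and -std=c99 is needed for the declaration inside the for loop):

gcc -O3 -std=c99 fib.c -o fib
./fib 100000000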

Other code samples are on GitHub.

Dynvm has a benchmarking framework that allows comparison between different languages. With n=1,000,000 on an Intel i7-3840QM (power-scaled to 1.2 GHz):

=======================================================
  Language                           Time (in sec)
=======================================================
  Java                               0.133
  C Language                         0.006
  CPython                            0.534
  Javascript V8                      0.284

Clearly C dominates here, but the Java numbers are misleading as most of the time is due to the JIT compiler startup cost (~120 ms). At n=100,000,000 the results become much clearer:

=======================================================
  Language                           Time (in sec)
=======================================================
  Java                               0.300
  C Language                         0.172
  CPython                            47.909
  Javascript V8                      24.179

Here we get a glimpse of why C still matters in 2013 and why the coding world won’t completely jump ship to Python or V8/Node. Sometimes you need raw performance, and dynamic languages still struggle with even this very simple example. Now, I personally believe this can be overcome. There is much hope generated by projects like the JVM, V8, PyPy, LuaJIT, and others, but in 2013 we are not there yet.

However, this raises the question: why is the difference so drastic? Between C and Python there is a 278.5X performance gap! The most shocking part is that the C and Python inner loops are nearly identical from a syntax perspective.

To leave no stone unturned, I dusted off my good ol’ disassembler and found this:

0000000000400480 <main>:
247   400480:       48 83 ec 08             sub    $0x8,%rsp
248   400484:       48 8b 7e 08             mov    0x8(%rsi),%rdi
249   400488:       ba 0a 00 00 00          mov    $0xa,%edx
250   40048d:       31 f6                   xor    %esi,%esi
251   40048f:       e8 cc ff ff ff          callq  400460 <strtol@plt>
252   400494:       48 98                   cltq
253   400496:       31 d2                   xor    %edx,%edx
254   400498:       48 85 c0                test   %rax,%rax
255   40049b:       74 26                   je     4004c3 <main+0x43>
256   40049d:       31 c9                   xor    %ecx,%ecx
257   40049f:       31 f6                   xor    %esi,%esi
258   4004a1:       bf 01 00 00 00          mov    $0x1,%edi
259   4004a6:       eb 0e                   jmp    4004b6 <main+0x36>
260   4004a8:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
261   4004af:       00
262   4004b0:       48 89 f7                mov    %rsi,%rdi
263   4004b3:       48 89 d6                mov    %rdx,%rsi
264   4004b6:       48 83 c1 01             add    $0x1,%rcx
265   4004ba:       48 8d 14 3e             lea    (%rsi,%rdi,1),%rdx
266   4004be:       48 39 c8                cmp    %rcx,%rax
267   4004c1:       77 ed                   ja     4004b0 <main+0x30>
268   4004c3:       be ac 06 40 00          mov    $0x4006ac,%esi
269   4004c8:       bf 01 00 00 00          mov    $0x1,%edi
270   4004cd:       31 c0                   xor    %eax,%eax
271   4004cf:       e8 9c ff ff ff          callq  400470 <__printf_chk@plt>
272   4004d4:       31 c0                   xor    %eax,%eax
273   4004d6:       48 83 c4 08             add    $0x8,%rsp
274   4004da:       c3                      retq
275   4004db:       90                      nop

The key part here is the inner loop that computes the next Fibonacci number:

262   4004b0:       48 89 f7                mov    %rsi,%rdi
263   4004b3:       48 89 d6                mov    %rdx,%rsi
264   4004b6:       48 83 c1 01             add    $0x1,%rcx
265   4004ba:       48 8d 14 3e             lea    (%rsi,%rdi,1),%rdx
266   4004be:       48 39 c8                cmp    %rcx,%rax
267   4004c1:       77 ed                   ja     4004b0 <main+0x30>

The register assignment of variables is as follows:

a:  %rsi
b:  %rdx
t:  %rdi
i:  %rcx
n:  %rax

Lines 262 and 263 implement the variable swap, line 264 increments the loop counter, and, though odd-looking, line 265 uses lea to perform b = a+t. Then a simple compare is done, followed by a conditional jump (ja) back to the top of the loop.

Hand-decompiled, the code looks like this:

loop:   t = a
        a = b
        i = i+1
        b = a+t
        if(n > i) goto loop

The entire inner loop is implemented in just six x86-64 instructions (and likely a similar number of internal micro-ops). This is a big win over the CPython interpretation model, where 6+ machine instructions are likely required for each high-level bytecode operation! However, there are effects here that are much more subtle.

One of the major slowdowns in modern processors is access to main memory. This effect is so dire that, in microprocessor design, countless engineering hours have been poured into finding effective techniques to “hide” memory delays. Common strategies include caching, speculative prefetching, load-store dependence prediction, and executing instructions out-of-order. These approaches go a long way toward making machines fast, but nothing beats removing memory accesses altogether.
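
To make the effect concrete, here is a contrived sketch (an illustration only, not code from the benchmark): two versions of the same loop, where the second uses volatile to force the accumulator through memory on every iteration, roughly the kind of traffic an interpreter's heap-resident variables generate.

#include <stddef.h>

/* "sum" can live in a register for the entire loop. */
unsigned long sum_reg(const unsigned long *v, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* "volatile" forces a load and a store of "sum" on every iteration,
   so the same loop now round-trips through the memory hierarchy. */
unsigned long sum_mem(const unsigned long *v, size_t n)
{
    volatile unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}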

In the assembly code above, memory is never accessed: the variables are completely resident in CPU registers. Now consider CPython: everything is an object on the heap, and all methods are dynamic. The dynamic nature is so pervasive that we have no way to determine whether a+b will execute integer_add(a, b), string_concat(a, b), or even a user-defined function! This means a lot of time is spent figuring it out at runtime. Dynamic JIT runtimes attempt to learn this information as the program runs and generate x86 code on-the-fly, but this is not always straightforward (I expected the V8 runtime to fare much better, but oddly it ran only about 2X faster than CPython). Since CPython is a pure interpreter, it re-resolves these dynamic properties on every loop iteration, requiring many unnecessary memory accesses.
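
As an illustration (a toy layout I made up, not CPython's actual object model), a dynamic a+b boils down to something like the following: two dependent pointer loads and an indirect call before a single bit of arithmetic happens.

typedef struct object object_t;
typedef object_t *(*binop_fn)(object_t *, object_t *);

struct type_info {
    const char *name;
    binop_fn add;               /* which "add" to run is decided at runtime */
};

struct object {
    struct type_info *type;     /* every value drags its type around */
    long value;
};

object_t *dynamic_add(object_t *a, object_t *b)
{
    /* Load a->type from the heap, load ->add from the heap, then make an
       indirect call (versus one register-to-register add in the C loop). */
    return a->type->add(a, b);
}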

In addition to memory tricks, modern superscalar out-of-order cores fetch several instructions into the machine and execute them “when most convenient”, that is: in any order the machine likes, provided the result is still correct. These machines can also execute multiple independent instructions in parallel during the same clock cycle. The Intel Sandy Bridge CPU can re-order 168 instructions at a time, issue (begin execution of) up to 6 instructions per cycle, and retire (complete execution of) up to 4 instructions per cycle! Crudely applied to the Fibonacci example above, this machine can re-order instructions across 28 iterations of the inner loop and complete almost an entire iteration every cycle! This all sounds great, but as always the devil is in the details.

This efficiency assumes that the instructions are independent and that we can actually fetch enough instructions into the core to take advantage of the re-ordering. The latter requirement is complicated by branches, that is, if-else statements and loops. The typical solution is branch prediction: the CPU dynamically uses previous branch outcomes to guess future outcomes and fetches the instructions it believes will be executed. When this speculation is incorrect, the CPU goes into recovery mode, often flushing out the fetched instructions and restarting; this recovery can have a big effect on performance. Thus, correctly predicting branches is another area where many hours of engineering are invested.
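
The textbook scheme behind "use previous outcomes to guess future outcomes" is a 2-bit saturating counter per branch. Here is a minimal sketch (real CPUs layer far more elaborate history-based predictors on top of this idea):

typedef enum {
    STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN
} pred_state_t;

static pred_state_t state = WEAK_TAKEN;

/* Guess the outcome before the branch actually resolves. */
int predict_taken(void)
{
    return state >= WEAK_TAKEN;
}

/* Train on the real outcome: it takes two wrong guesses in a row to
   flip a "strong" prediction, so one-off anomalies are absorbed. */
void update(int taken)
{
    if (taken && state != STRONG_TAKEN)
        state++;
    else if (!taken && state != STRONG_NOT_TAKEN)
        state--;
}

A counter like this saturates at STRONG_TAKEN on a long-running backward loop branch and then guesses right on essentially every iteration.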

Now, not all branches are created equal: some are easy to predict perfectly, and others are nearly impossible to predict at all. Backward loop branches — like the one on line 267 of the disassembly — are among the easiest to predict: that branch jumps backwards 100,000,000 times in a row.

A hard-to-predict branch can be found in the code below:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    for(int i = 0; i < 1000000; i++) {
        /* rand() alone never returns a negative value, so recenter it
           around zero to make the sign of n a coin flip. */
        int n = rand() - RAND_MAX/2;
        if(n >= 0) {
            printf("positive!\n");
        } else {
            printf("negative!\n");
        }
    }
    return 0;
}

If rand() were truly random (far from true in C), then this if-else could not be predicted any better than a coin flip. Fortunately, most branches are not this bad, but few are as trivial as backward loop branches.

Back to our example: the C code for Fibonacci results in only one very easy-to-predict branch. On the other hand, the CPython code will be much worse. First, all pure interpreters have a “dispatch” loop, something like:

void exec_instruction(instruction_t *inst)
{
    switch(inst->opcode) {
        case ADD:    // do add
            break;
        case LOAD:   // do load
            break;
        case STORE:  // do store
            break;
        case GOTO:   // do goto
            break;
    }
}

No matter how the compiler handles this, at least one indirect branch is required, which will be tricky to predict.

Next, recall that dynamic languages have to figure out things as basic as “what does ADD mean?” at runtime. This of course requires — you guessed it — more tricky branches.
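
For instance, the ADD handler in such an interpreter might look roughly like this (a hypothetical sketch reusing the toy object_t from above; int_type, make_int, str_concat, and call_user_add are made-up helpers, not CPython's API):

/* Hypothetical stand-ins for the runtime's type objects and helpers. */
extern struct type_info int_type, str_type;
extern object_t *make_int(long v);
extern object_t *str_concat(object_t *a, object_t *b);
extern object_t *call_user_add(object_t *a, object_t *b);

object_t *do_add(object_t *a, object_t *b)
{
    /* Every type test below is a branch the CPU must predict before
       it ever reaches the one-instruction integer add. */
    if (a->type == &int_type && b->type == &int_type)
        return make_int(a->value + b->value);   /* fast path */
    else if (a->type == &str_type && b->type == &str_type)
        return str_concat(a, b);
    else
        return call_user_add(a, b);             /* user-defined add */
}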

All of this adds up, and in the end we have a 278.5X performance gap! Now, this is certainly a simple example, and it only gets trickier from here. Still, this single example is enough to motivate the importance of low-level static languages (like C) in modern software stacks. I’m not the biggest fan of C in 2013, but it still dominates its niche of low-level control and performance-critical applications.

38 thoughts on “Why C still matters in 2013: a simple example”

  1. Edgar Cabrera

    Thank you for the post. I don’t have the data to support it, but I believe that C is still the most important language in embedded systems, for obvious reasons. In web development (which I use for my argument because it is probably the most contrasting environment to embedded systems), C doesn’t seem as important because interpreted languages seem to work ‘fast enough’, but I believe C can be quite important in the web for two reasons:

    * Huge amounts of data and huge amounts of requests. Speed is particularly important here, and no matter how much you hack your favorite framework (django, rails, express, etc.) you can’t beat a difference in performance of, as mentioned here, 278.5X.
    * Increasing importance of more complex mathematical operations. Machine learning, big data, NLP and many other terms are gaining relevance in web applications, and most of them demand a lot of computational power. C can do a pretty good job to help here.

    Of course, I’m not saying that people should start building entire webapps in C. It’s possible but probably not very convenient. I think the best solution is to implement the most time-consuming operations in C (when it makes sense to do so), and still use a web framework to unify the whole application. That’s one of the reasons why I think C is still relevant and will be for some years more. This article shows that when it comes to speed, there’s not much point in avoiding it.

    Also, pointers are fun.

    1. ajbonkoski@gmail.com Post author

      Good points here!

      I’m certainly not suggesting C should be the primary language for many tasks. However, performance is certainly a major concern as you begin scaling an application. Investing the engineering time to port from a dynamic language to C/C++ for performance critical inner loops is still a huge win, despite C’s lack of contemporary abstractions. The industry can hate on C/C++, but this usage will continue until higher-level languages can actually compete in this space.

      1. ajbonkoski@gmail.com Post author

        I don’t. This is one of the annoying things about JS…
        The options are either “foreign function interface”, bigint, modulo 2^32, or modulo 2^64 and accept the 53-bit precision.

        For my purposes here, it’s negligible: the JS performance appears the same in each case.
        For example, “b = (b+a) % (1<<20)” takes the same time as “b = (b+a) % Math.pow(2, 64);”
        The arithmetic size doesn’t appear to be the bottleneck. Although I did expect the V8/Node performance to approach the speed of C (since it’s a JIT and this is an “easy” piece of code)…

  2. Steve Hernandez

    I just wanted to say that this is a very nice analysis. I learned to code in ‘C’ first, and it is still a skill that I look for in all the programmers and developers that I hire. I couldn’t care less if they know Ruby/Python/Flavor-of-the-week, as long as they have a solid foundation in C/C++.

    1. ajbonkoski@gmail.com Post author

      The intent was a simple analysis of the execution environment between different languages… driven directly from curiosity. I do not intend to build a detailed database of benchmarks, but instead to explore and experiment. If you desire those detailed benchmarks, this is the wrong place for you. Also, I encourage you to develop and publish the benchmarks you seek, instead of begging others to do the work for you ;-)

      1. Isaac Gouy

        > I do not intend to build a detailed database of benchmarks…

        I haven’t suggested that you should; I suggested you compare programs that do something real.

        > Between C and Python there is a 278.5X performance gap!

        For a 20 line program that should be a table lookup. Whoopee!

        > Also, I encourage you to develop and publish the benchmarks you seek, instead of begging others to do the work for you

        From those words, I guess you already know that for years I’ve developed and published the benchmarks game — to keep the world safe from dumb fib comparisons ;-)

        1. CAI

          Why would fib be a bad benchmark? Isn’t any program a valuable benchmark? The 60m dash hasn’t been supplanted by the 10,000m.

          1. ajbonkoski@gmail.com Post author

            Well, it’s a very “low-hanging fruit” benchmark. There is a lot that isn’t tested here.

  3. Hank

    Aren’t you just overflowing that unsigned long like mad? Here’s a program I wrote in C that actually prints the correct result for n=1000000 fib:

    http://www.ralree.com/2009/09/09/1000000th-fibonacci-number-one-liner-in-c/

    Note I had to use gmp to do that. As long as you have that and gcc installed, the “one-liner” works (just tested it). It runs in about 30 seconds on my server.

    Here’s the number it should print: http://pastebin.com/yvqhUs10

    Are you overflowing an integer in the other languages too? I think python might actually work for this since it uses a bigint implementation.

  4. someguy

    I think the future for the web is Go (Golang). You get high performance, a baked-in concurrency model, static typing, and a super fast compiled language that is very concise and feels like a hybrid between high-level dynamic languages and C. IMO it’s better than Java, and nodejs/python/ruby look like toys in comparison.

    1. ajbonkoski@gmail.com Post author

      I’ve heard a lot of great things about Go, but haven’t had a chance to do anything with it yet. I’ve been working with the D language lately (Dynvm is written in it).

  5. Joa Ebert

    Would you mind sharing the JS code? I wonder why V8 performs so badly here. The only reason I can think of is that you make a case for 64-bit integers, which are boxed in V8 (if that is still true). And as someone already mentioned: your benchmark is broken. I wonder what point you want to prove? This is a microbenchmark at best, far from being useful, computing false results.

    1. ajbonkoski@gmail.com Post author

      On github: https://github.com/ajbonkoski/dynvm/tree/master/lang/fib
      Let me know if you figure out the main slowdown in V8/Node; I didn’t have time to look into it… I expected it to be much better.
      Also, the 64-bit integer overflow is intentional (added clarification to the post).
      I’m not really trying to prove any point here: if I were, I’d write an actual paper. The content here is the result of being shocked at these performance numbers and wondering why this was happening…

      1. Joa Ebert

        Thanks for sharing. However, I cannot reproduce your results. The JavaScript and C code produce different values for me, and if the results are not the same, we might as well be comparing two completely different programs.
        But even using your code with a D8 release build I get 1.6 sec, which is far off from the 24 sec value in your post. Of course, my computer is different from yours and node is not d8, but that number smells fishy.

        I also implemented the benchmark for 32-bit signed integers. The results for the JavaScript and C versions are equal, and the performance difference is what you would expect: JavaScript takes 500ms, and clang -O3 results in 100ms for me. The gist contains the actual source and generated assembly code. https://gist.github.com/joa/6389914

        1. ajbonkoski@gmail.com Post author

          Thanks for this. I’ll have to look into this more. I suspected there was some issue with my JS numbers. Perhaps I can now write a follow-up where I explore the D8 x86-64 code!

  6. Nathan Kurz

    Nice article, and good intro. Sorry it didn’t get traction on HN — I submitted it as well.

    A couple quibbles: Sandy Bridge has 6 execution units instead of the 8 that Haswell has. Only 3 are used for logic and arithmetic operations, so without using SIMD you’ll be limited to 3 data manipulations per cycle. You can add two loads and a store to this per cycle, but you can only retire 4 ops per cycle steady state. Also, your text description for the LEA was a little off; it should be “b = a + t” as you have in the figure. And ‘ja’ translates as “jump if above”: if (n > i) goto LOOP;

    Since you’re working on a virtual machine, you probably know this, but “First, all pure interpreters have a ‘dispatch’ loop, something like…” is a bit out of date. There are a number of techniques that improve branch prediction. The easiest is to put a mini-dispatch at the end of each opcode handler, so that patterns in opcodes can be better predicted (each gets its own prediction table).

    Python got a boost switching to this a few years ago: http://bugs.python.org/issue4753
    Good overview with pictures of the approach here: http://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-berndl.pdf
    There are some potentially faster less explored techniques as well:
    http://www.cs.toronto.edu/~matz/dissertation/matzDissertation-latex2html/node6.html

    Keep writing!

    1. ajbonkoski@gmail.com Post author

      Thanks for the feedback. I’ll make some of these fixes!

      “Since you’re working on a virtual machine, you probably know this…”
      I’m new to the area, so the osmosis is still in full effect, but yes, my example is definitely the naive approach…
      However, even with mitigations, software dispatch still hurts, especially versus the hardware dispatch you get with a JIT.
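
      For the curious, here is a minimal sketch of the replicated-dispatch idea Nathan describes, using GCC's computed-goto ("labels as values") extension. The opcodes and instruction_t layout are hypothetical stand-ins, not Dynvm's actual dispatcher:

      typedef struct { unsigned opcode; } instruction_t;
      enum { OP_ADD, OP_LOAD, OP_STORE, OP_HALT };

      void exec(instruction_t *prog)
      {
          static void *handlers[] = { &&ADD, &&LOAD, &&STORE, &&HALT };
          instruction_t *ip = prog;

          /* Each handler ends with its own indirect jump, so the branch
             predictor keeps a separate prediction slot per opcode instead
             of one shared, highly-contended slot in a central switch. */
          #define DISPATCH() goto *handlers[(ip++)->opcode]

          DISPATCH();
      ADD:    /* do add */    DISPATCH();
      LOAD:   /* do load */   DISPATCH();
      STORE:  /* do store */  DISPATCH();
      HALT:   return;
      }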

      “Keep writing!”
      Thanks! I’m new to the whole blogging thing; it’s a lot of fun.
      I’m also realizing that the internet is a pretty brutal and unfiltered form of peer review, although not all feedback is created equal…

  7. dodgy_coder

    Yes, C still matters, especially in embedded work. It was only last year that one of the products my company makes finally got rid of its dependence on an old assembler codebase (on one particular microcontroller)… it’s now 100% in C. C has both high-level and low-level features, which makes it ideal when real-time speed and performance are required.

  8. thodu

    Do check out Julia – http://julialang.org/

    the second run (Julia is JIT compiled and hence the first run takes longer) of

    function fib()
        a::Int128 = 1
        b::Int128 = 0
        i::Int128 = 0
        while i < 100000000
            t = b
            b = a + b
            a = t
            i += 1
        end
        b
    end

    julia> @time fib()
    78399324664570641339289480035983813691
    elapsed time: 0.150133585 seconds (256 bytes allocated)

    while the C code on my machine takes 0.267 seconds.

    Having said that, the Julia time does not include its startup time; reducing it is on the developers’ to-do list.

    1. ajbonkoski@gmail.com Post author

      @thodu I’ve heard a lot about Julia recently (yet, another language on my list to check out)
      This result is curious. It appeared to me that the x86-64 above is essentially optimal (perhaps except for loop unrolling). Are you sure you ran the C compiler with full optimizations enabled?

      Also, I wouldn’t mind adding some Julia code to the DynVM testsuite ;-)

  9. Pierre

    C language is outdated and only useful for neck-beard types who waste their time with command-line programs. Meanwhile the world has moved on, and Javascript is now the systems programming language of choice. Hardware today is so fast that C is not necessary for most tasks. Worse, C programmers have a very limited understanding of advanced software architectures.

    1. ajbonkoski@gmail.com Post author

      What an incredibly short-sighted comment! Try telling this to the scientists (here and here) who develop long-running numerical simulations, where a speed-up of 20% might mean they save a whole week! Try telling OS and embedded developers to write JS to interact with hardware (who writes the VM?). Or try telling a large web service with 1000s of servers after they’ve cut their energy bill in half!

    2. Makoto

      Apparently, Linux developers didn’t get this memo since they are still heavily developing in C.

  10. Uncompetative

    Your last code example overlooks a simple optimisation. Rather than using a CASE to decode bytecodes and dispatch their arguments to the appropriate C function implementations, you replace the bytecodes with BLs (i.e. Branch with Link instructions to leaf subroutines that optionally push the register state onto a data stack, and the value in the ARM’s link register onto a separate return stack, either of limited depth or growing from the other end of the memory map). This is done when the code is downloaded, or when it is run for the first time if it originates on that machine (supposing that platform independence is a requirement). Each branch directly calls assembly code implementing that opcode, which may pop its arguments from the top of the data stack and push back its results. Stack-based VMs are slower than register-based VMs, but a CISC approach to opcodes, such as providing a built-in Fibonacci function via an opcode, means the VM would spend most of its time within that inner loop just as C does. End-user compiled extensions can be made to these CISC opcodes, with those that aren’t ‘frozen’ calling assembly code that has more room around it to expand and contract, or exploratory ‘melted’ bytecodes offering the opportunity for rapid application development through live programming. None of this is new, of course:

    http://en.wikipedia.org/wiki/Leaf_subroutine

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/Cihfddaf.html
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0214b/CHDGHAAG.html

    http://en.wikipedia.org/wiki/Threaded_code
    http://en.wikipedia.org/wiki/Stack_machine

    Obviously, you wouldn’t BL to an ADD function if you could do it inline. Indeed, a profiler could work out from running code where the “hot spots” are and inline all trivial linked functions within that inner loop. The use of variables that conform to traditional mathematics (i.e. they cannot be destructively reassigned) supports the simplification of complex expressions and often reduces the number of opcodes needed in algorithms.

    Thank you for writing such an interesting article.

    1. ajbonkoski@gmail.com Post author

      Great follow-up! I’m not sure BL has much of a direct effect here. Most C compilers will turn this enum-style switch into a jump table for you, something like:

      void *JMPTABLE[] = { &&ADD, &&LOAD, &&STORE, &&GOTO };
      goto *JMPTABLE[inst->opcode];
      ADD:   // do add
      LOAD:  // do load
      STORE: // do store
      GOTO:  // do goto

      This will result in a single indirect branch.
      But, there are indeed ways to optimize this further to improve the branch prediction: see Nathan Kurz’s excellent comment.

      1. Uncompetative

        Rereading your code, I think you have missed my point. I was advocating simple compilation to ARM instructions, the majority of which would be BLs to optimised, extensible opcode implementations that rely on a consistent twin-stack model. For example, the ADD operator would translate to a BL addLabel which, when encountered by the CPU, would call the routine at that label’s location to implement an ADD of the top two 32-bit words on the data stack, before returning from the routine by moving the value in the link register into the program counter.

        goto *JMPTABLE[inst->opcode];

        surely needs to get a new inst from some “program memory” and be run in an interpretive loop?

        1. ajbonkoski@gmail.com Post author

          Ah, yes! I’m assuming that it’s a “pure” interpreter that does no JIT compilation whatsoever, such as CPython. Compilation of the dispatch loop to assembly has many benefits. Since I’m hacking on Dynvm (intended ultimately to be a full JIT), I’ll likely write some JIT analysis posts as I discover cool new things. This post was only intended as a high-level view of the problems, and to understand why the C code resulted in such optimal assembly, not a deep-dive into the potential mitigations available to interpreted languages (via JIT).

  11. Makoto

    I’d be interested to see how much faster the Python code would run in PyPy. I’ve seen dramatic improvements in execution speed when using PyPy.

  12. Pierre Chapuis

    By choosing to compute modulo 2^64 you bias the benchmark entirely, because C doesn’t need to do it. Moreover, this is not what you would write in Python, and what you would write is faster. Oh, and by using a “low” number of iterations you measure interpreter startup time too.

    See https://gist.github.com/catwell/6389992 for a *slightly less biased* version (mod 32, more iterations). I still think Fibonacci will never be a good benchmark anyway. I threw LuaJIT into the mix to troll a little (it is “faster than C” on this).

    1. ajbonkoski@gmail.com Post author

      Your criticisms are correct: take the benchmark with a grain of salt. I enjoy fiddling with assembly and computer architecture, thus this post. It shouldn’t be considered an official rigorous result! Fibonacci is an awful benchmark, but I was perplexed as to why the gap was so great…

  13. J

    You picked a dynamic language implementation that is explicitly not designed for performance and then criticized its performance. Compare C to a native-code-generating JIT if you want to be fair, not that this comparison or post would provide any insight anyway…

    1. ajbonkoski@gmail.com Post author

      Clearly, I was never trying to do an apples-to-apples comparison… I was attempting to understand and describe the factors that contribute to pure-interpreter slowdown on a modern architecture, and C is typically the “gold standard” used as the performance baseline. To me, it’s counter-intuitive that the gap should be this great on such a simplistic test; I would have expected a 10-20X slowdown, not 200X+. I will likely write more on JIT compiler code vs C in the future. In the meantime, maybe this is closer to what you’re interested in: http://benchmarksgame.alioth.debian.org/

    1. ajbonkoski@gmail.com Post author

      On which system, and with which compile options? Note: these numbers were taken on my laptop unplugged, so Linux had scaled the frequency down to 1.2 GHz to save power.
