This is curious. Their trace-based approach looks like Mozilla's Tracemonkey - e...

kmavm · on Nov 30, 2012

It isn't tracing. Don't be fooled by the name "tracelet".

Unlike trace trees, tracelets contain no control flow. It is just a type-specialized basic block. This means that the jit output does not combinatorially explode when we encounter polymorphism. In many ways, our approach is the opposite of tracing; we jit the first time we hit any code at all. And we look terrible on loopy microbench code, but run FB's multi million LOC application very well.

pbiggar · on Nov 30, 2012

Very interesting. Rereading with that insight makes your approach clearer - side exits aren't control flow, they're for the weird shit that PHP can throw at you. Aliased local variables are one of the major differences between PHP and Javascript, and that's a really good way of handling them.

So it avoid trace explosion, and I would describe this as a sort-of local (in the compiler sense of "basic block") version of a method compiler. Can you compare the approaches?

masklinn · on Nov 30, 2012

So it's closer to V8's "basic" JIT?

kmavm · on Nov 30, 2012

Our approach is different from other dynamic language JITs that I'm aware of. While there is lots of diversity out there, most other dynamic language JITs can be roughly categorized as either having a method-at-a-time strategy, or a tracing strategy. Our system is neither tracing jit nor method-at-a-time, but basic-block-at-a-time.

The systems that it has the most in common with are actually binary translators; e.g., VMware's software x86 hypervisor, or the Dynamo system. Those systems run basic block at a time to solve a number of problems; e.g., disambiguation of code and data. We're basic-block-at-a-time for a different reason: closure under type inference.

nickik · on Nov 30, 2012

Im not a expert but I think the V8 compiles methods that contain controlflow even in the basic compiler.

pbiggar · on Nov 30, 2012

One more thing - the use of a stack-based bytecode is also curious. PHP itself (the Zend engine) uses a register-based bytecode (well, it uses a horror-show of an in-memory bytecode-like-thing, which could best be described as a register-based bytecode). The PHP engine details leak into the language a lot, so the difficulties they describe aren't unexpected.

Not only that, but bleeding-edge JITs typically use register-based bytecode. Java's JIT and all the modern versions of Javascript VMs use register-based (in many cases actually translating from stack-based bytecode that the parser outputs, to register-based bytecode consumed by the JIT compilers). Since 2003, when my PhD advisor and his co-authors (Google "David Gregg" and "Anton Ertl") showed register VMs could be much more efficient that stack-based ones, the jury has settled on register-based VMs.

So I'm curious as to why a stack-based instruction set was used in the VM design?

dparoski · on Nov 30, 2012

As noted elsewhere in this thread, a stack-based design typically produces more compact bytecode. Compactness was a concern for us because of the size of FB's PHP code base. Also, generally speaking a stack-based design tends to be easier to deal with when working to get a prototype VM up and running quickly.

Many of the advantages of register-based designs (ability to optimize by rewriting the program at the bytecode level, ability to write faster interpreters, ability to map bytecode registers to physical registers, etc.) weren't particularly attractive to us because we knew we were going to build an x64 JIT that did its own analysis and optimization to take advantage of type information observed at run time.

Thus, we drafted a stack-based design for HipHop bytecode. It captured PHP's semantics correctly and happened to fit in fairly well with PHP's evaluation order, so we ran with it and here we are.

pbiggar · on Nov 30, 2012

Makes total sense, thanks!

Would using a register-based bytecode not have been useful for the x64 JIT?

kmavm · on Nov 30, 2012

Check out our HHIR work. It's an SSA-based IR, that gets us most of the advantages of a register representation. But it is at a much lower level than the bytecode.

sb · on Nov 30, 2012

AFAICT, two reasons come to mind:

1) A stack-based ISA is still more space-compact. (AFAIR the Shi et al. paper mentions something like a 40+% increase in space requirement for the instructions.)

2) The performance improvement of a register-based ISA is only visible for interpreters that suffer most from instruction dispatch [1]. PHP is a rather complex programming language that is most certainly not bound by instruction dispatch at all. So I guess it could very well make sense to stick with a simple stack-based ISA, which incidentally is also easier to compile from the AST.

[1] for the sake of completeness: the stack architecture emits many instructions to push operands onto the operand stack. A register-based interpreter does not need those. Hence, the overall number of dispatches is lower, and if dispatch cost is your bottleneck the overall performance increases. OTOH, you need more space, because in addition to the opcodes, you need to specify/encode which registers to take operands from and put results to. (Hint: quadruple code and the likes.)

pbiggar · on Nov 30, 2012

1) Yes, that's my recollection too. I think 40% is exactly right.

2) But this is exactly my point. Zend PHP uses a register-based bytecode. The HipHop JIT could choose anything - why choose stack-based - its not only different, but also worse! If I had to guess, I'd say it was chosen to make the HipHop interpreter easier to implement, and then was legacy code when creating the JIT, and it was easier to keep it than to replace it.

masklinn · on Nov 30, 2012

> all the modern versions of Javascript VMs use register-based (in many cases actually translating from stack-based bytecode that the parser outputs, to register-based bytecode consumed by the JIT compilers)

Pretty sure V8 doesn't as it has no interpreter at all, it has a fast JIT and a slow JIT, but it only ever runs native code.

pbiggar · on Nov 30, 2012

Sorry, I didn't mean to imply they all use interpreters. The more compilery parts use three-address code and SSA, which are IRs that are conceptually similar to a register-based bytecode.

eropple · on Nov 30, 2012

Java's HotSpot is a stack machine, AFAIK. Dalvik is a register machine, but I wouldn't want to point to that as a positive. ;)

pbiggar · on Nov 30, 2012

Internally, Hotspot doesn't use a stack representation. It converts it to an IR (internal representation) which is not stack based, which is my point. See Fig 2 from http://www-leland.stanford.edu/class/cs343/resources/java-ho...

eropple · on Dec 3, 2012

Huh, so it's not. Thanks for the interesting read!

sb · on Nov 30, 2012

This is somewhat different from my memory. AFAIR, TraceMonkey was a trace-based JIT. JaegerMonkey was a JIT (however not just-in-time like V8 with a template-based JIT, but still using an interpreter initially). IonMonkey is--to the best of my knowledge--JaegerMonkey plus type inference (there was a paper by two Mozilla employees at PLDI'12 about their type inference.) So I guess it's not really using anything from trace compilation.

I think a trace-JIT still gives you a lot of bang for the buck and is (in theory at least) easier to implement. Two known projects using trace-compilation are LuaJIT2 (usually well known) and Dalvik VM's JIT compiler (not so well known, needed to watch the Google I/O 2010 announcement.)

rpearl · on Nov 30, 2012

There's a bunch of not-quite-correct information here... Let me try to summarize. Jaegermonkey is a JIT. It takes spidermonkey bytecode, and when functions grow hot, it compiles them to native code in a way quite similar to v8's full AST compiler (wit no optimizations).

IonMonkey is also a JIT. However, instead of just doing translations directly from bytecode to native code, it builds up an SSA-based IR, does some optimizations (last I checked, loop-invariant code motion, range analysis, global value numbering, on-stack replacement to be able to jump to JITted code in the middle of a hot loop, and some other small optimizations). It then does register allocation using LSRA. It takes advantage of type inference to aggressively type-specialize these optimizations.

This is all more expensive than just blatting out native code, but the code performs better.

In IonMonkey-enabled Firefox (Firefox 18+), the compilation strategy is to interpret, then use JM on hot functions. Particularly hot functions will get IM compiled.

This is similar to what "crankshaft", v8's version, does. The compilation strategy is quite similar, but v8 has no interpreter--it has two compilations: a fast, non-optimized one and a slow, optimized one.

Trace-based computation is...I would say... more complicated than a traditional compiler approach. It is certainly differently-complicated than the usual compiler problem. Recording traces requires some relatively complicated state, and significantly muddle up the interpreter. Removing the tracer from the Mozilla code base made things substantially cleaner--the tracer was holding back the method-based JITs.

monopede · on Nov 30, 2012

Note that Dalvik is actually not a very good trace JIT. There's a recent paper comparing Dalvik to an embedded version of Hotspot. The Dalvik interpreter was slightly faster (~5%), but the JIT speedup was only 3-4x whereas it was ~12x for the method JIT. The reason is that Dalvik's traces are way too short and cannot even span method boundaries. That means little scope for optimisation and a lot of code just to load the state from the stack into registers and then write it back. Weirdly enough, Dalvik also generates more code than the method JIT, yet low amount of compilation overhead was their main goal.

Fortunately for Dalvik, most applications actually spend most of their time in native code.

The paper (behind ACM paywall): http://dl.acm.org/citation.cfm?doid=2388936.2388956