JIT Compiler Development
Posted: April 2, 2016
I don't consider myself an expert in compilers or anything, but since I've recently been working on a project called Easy Match, I thought maybe others could benefit from what I've learned. Instead of focusing on the tokenizer and parser and such, I just want to focus on how I went about allocating executable memory and writing assembly language instructions for x86_64, x86, and ARM into it.
So I think the easiest thing to do is to start with a bit of a simpler example. The following program will take 4 parameters from the command line (actually it will take more or less and probably crash all over the place, but I wanted to keep the program short so there's no error checking) and reorder the bytes in the array some_data, copying them into dest_data in the new order based on user input from the command line:
This program hardcodes 4 ints stored in an array, but the user wants to decide the order they get copied into a second array. In this case, there is a for loop that simply remaps where the element from the source array goes in the destination. If the array was large and needed to be run many times in the exact same pattern, this could benefit from being compiled into a single assembly language function as shown below:
On Windows, mmap() can be substituted with VirtualAlloc(). Also note that the ABI on Windows for x86_64 is different than Linux. The above example expects parameter 0 to be passed in RDI and and parameter 1 to be passed in RSI. On Windows x86_64, parameter 0 will be passed in RCX and parameter 1 in RDX. Luckily on x86 and ARM the calling conventions between Windows and Linux should be the same. Check the Wikipedia page on x86 calling conventions for more info.
Having to look up and manually type in opcodes for assembly language instructions is extremely tedious and error prone. I wrote a Python script (along with an ARM version) for my Easy Match project to make creating a JIT easier:
The x86 script takes two parameters, the first being a 32 or 64 depending on if the opcodes should be for 32 x86 or 64 bit x86_64. The second parameter is the assembly language code to be assembled. The x86 script uses nasm to assemble the code and retrieve the opcodes. The ARM script uses naken_asm. So for example if I needed to know what the opcodes were for "mov eax, [rdi]" I would do:
python scripts/get_opcodes_x86.py 64 "mov [rdi], eax"
And the resulting output that I can cut and paste into my compiler would be:
// mov [rdi], eax: 0x89,0x07 generate_code(generate, 2, 0x89, 0x07);
A little more complicated is mov [rdi+offset], eax since if the offset is more than 127 (or less than -128) the instruction will need more bytes of code memory:
After running the script on "mov [rdi+1], eax", analyzing the 3 bytes, it's pretty obvious the 3rd byte represents the signed offset from rdi. In the example above I replaced the 0x01 with the offset the user entered in the command line. For the "mov [rdi+128], eax" case, the last four bytes (128, 0, 0, 0) are clearly the representation of the 4 byte signed integer 128 in little endian format.
It is possible to use 6 byte version of mov [rdi+offset], eax, even for offsets between -128 and 127, but then code density sucks. It will take more bytes than necessary to represent this instruction which will take longer to load into the CPU from DRAM and less of the executable will be able to fit into the CPU's cache causing it possibly have to repeat fetching the instructions from DRAM. In Easy Match I take care of it with code like this:
Or for ALU operations:
Branches in x86 are a bit awkward also since there is a short branch that can add between -128 to 127 to the program counter or a 32 bit signed value. That's the difference between between an instruction taking 2 bytes and an instruction taking 6.
After generating all the executable code, it would be a good idea to use mprotect() (or on Windows VirtualProtect()) to set the memory to remove read/write permission and make it execute only.
In order to keep the compiler source code from getting ugly with a bunch of #ifdef's for the difference between Unix and Windows memory allocations, I created some macros:
Be careful with these macros, they work well for my use-case but might fail in other situations.
An interesting note, it's actually possible to do the same thing in Java. I have an example here where I generate a Java class in memory with Java assembly opcodes: Java Class Generator. The Java JVM will run the code in this class through the interpreter or decide to compile it into native code with the Java JIT. This opens a lot of possibilities including connecting C and Java code by bypassing a lot of JNI calls.
Anyway, if something isn't clear here let know and I'll write a little more about it.
Copyright 1997-2021 - Michael Kohn