Mike Kohn!

CONTENTS

YouTube
BlueSky
GitHub
LinkedIn

JIT Compiler Development

Posted: April 2, 2016

Introduction

I don't consider myself an expert in compilers or anything, but since I've recently been working on a project called Easy Match, I thought maybe others could benefit from what I've learned. Instead of focusing on the tokenizer and parser and such, I just want to focus on how I went about allocating executable memory and writing assembly language instructions for x86_64, x86, and ARM into it.

So I think the easiest thing to do is to start with a bit of a simpler example. The following program will take 4 parameters from the command line (actually it will take more or less and probably crash all over the place, but I wanted to keep the program short so there's no error checking) and reorder the bytes in the array some_data, copying them into dest_data in the new order based on user input from the command line:


int main(int argc, char *argv[])
{
  uint32_t some_data[] = { 1, 2, 3, 4 };
  uint32_t dest_data[4];
  int n;

  for (n = 1; n < argc; n++)
  {
    // Convert next int from command line from text to int.
    int s = atoi(argv[n]);

    // d is the offset to the next element in the destination array.
    int d = n - 1;

    // Copy element of source array into destination based on user's input.
    dest_data[d] = some_data[s];
  }

  // Print the original array.
  for (n = 0; n < 4; n++) { printf(" %d", some_data[n]); }
  printf("\n");

  // Print the reordered array.
  for (n = 0; n < 4; n++) { printf(" %d", dest_data[n]); }
  printf("\n");

  return 0;
}

This program hardcodes 4 ints stored in an array, but the user wants to decide the order they get copied into a second array. In this case, there is a for loop that simply remaps where the element from the source array goes in the destination. If the array was large and needed to be run many times in the exact same pattern, this could benefit from being compiled into a single assembly language function as shown below:


// Define the prototype for the JIT compiled function.
typedef int (*shuffle_t)(uint32_t *dst, uint32_t *src);

int main(int argc, char *argv[])
{
  uint32_t some_data[] = { 1, 2, 3, 4 };
  uint32_t dest_data[4];
  uint8_t *code;
  int ptr = 0;
  shuffle_t shuffle;
  int n;

  memset(dest_data, 0, sizeof(dest_data));

  // Allocate executable memory for new function.
  shuffle = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  code = (uint8_t *)shuffle;

  // Generate code based on command line arguments
  for (n = 1; n < argc; n++)
  {
    // Convert next int from command line from text to int.
    int s = atoi(argv[n]);

    // d is the offset to the next element in the destination array.
    int d = n - 1;

    if (s == 0)
    {
      // mov eax, [rsi]
      code[ptr++] = 0x8b;
      code[ptr++] = 0x06;
    }
      else
    {
      // mov eax, [rsi+(s*sizeof(int))]
      code[ptr++] = 0x8b;
      code[ptr++] = 0x46;
      code[ptr++] = s * sizeof(int);
    }

    if (d == 0)
    {
      // mov [rdi], eax
      code[ptr++] = 0x89;
      code[ptr++] = 0x07;
    }
      else
    {
      // mov [rdi+(d*sizeof(int))], eax
      code[ptr++] = 0x89;
      code[ptr++] = 0x47;
      code[ptr++] = d * sizeof(int);
    }
  }

  // ret
  code[ptr] = 0xc3;

  // Call shuffle to copy memory.
  printf("shuffle()=%d\n", shuffle(dest_data, some_data));

  // Print the original array.
  for (n = 0; n < 4; n++) { printf(" %d", some_data[n]); }
  printf("\n");

  // Print the reordered array.
  for (n = 0; n < 4; n++) { printf(" %d", dest_data[n]); }
  printf("\n");

  // Free memory.
  munmap(shuffle, 4096);

  return 0;
}

On Windows, mmap() can be substituted with VirtualAlloc(). Also note that the ABI on Windows for x86_64 is different than Linux. The above example expects parameter 0 to be passed in RDI and and parameter 1 to be passed in RSI. On Windows x86_64, parameter 0 will be passed in RCX and parameter 1 in RDX. Luckily on x86 and ARM the calling conventions between Windows and Linux should be the same. Check the Wikipedia page on x86 calling conventions for more info.

Having to look up and manually type in opcodes for assembly language instructions is extremely tedious and error prone. I wrote a Python script (along with an ARM version) for my Easy Match project to make creating a JIT easier:

get_opcodes_x86.py
get_opcodes_arm.py

The x86 script takes two parameters, the first being a 32 or 64 depending on if the opcodes should be for 32 x86 or 64 bit x86_64. The second parameter is the assembly language code to be assembled. The x86 script uses nasm to assemble the code and retrieve the opcodes. The ARM script uses naken_asm. So for example if I needed to know what the opcodes were for "mov eax, [rdi]" I would do:

python scripts/get_opcodes_x86.py 64 "mov [rdi], eax"

And the resulting output that I can cut and paste into my compiler would be:

  // mov [rdi], eax: 0x89,0x07
  generate_code(generate, 2, 0x89, 0x07);

A little more complicated is mov [rdi+offset], eax since if the offset is more than 127 (or less than -128) the instruction will need more bytes of code memory:


  // mov [rdi+1], eax: 0x89,0x47,0x01
  generate_code(generate, 3, 0x89, 0x47, 0x01);

  // mov [rdi+128], eax: 0x89,0x87,0x80,0x00,0x00,0x00
  generate_code(generate, 6, 0x89, 0x87, 0x80, 0x00, 0x00, 0x00);

After running the script on "mov [rdi+1], eax", analyzing the 3 bytes, it's pretty obvious the 3rd byte represents the signed offset from rdi. In the example above I replaced the 0x01 with the offset the user entered in the command line. For the "mov [rdi+128], eax" case, the last four bytes (128, 0, 0, 0) are clearly the representation of the 4 byte signed integer 128 in little endian format.

It is possible to use 6 byte version of mov [rdi+offset], eax, even for offsets between -128 and 127, but then code density sucks. It will take more bytes than necessary to represent this instruction which will take longer to load into the CPU from DRAM and less of the executable will be able to fit into the CPU's cache causing it possibly have to repeat fetching the instructions from DRAM. In Easy Match I take care of it with code like this:


  if (n == 0)
  {
    // mov rdx, [rdi]: 0x48 0x8b 0x17
    generate_code(generate, 3, 0x48, 0x8b, 0x17);
  }
    else
  if (n < 128)
  {
    // mov rdx, [rdi+1]: 0x48 0x8b 0x57 0x01
    generate_code(generate, 4, 0x48, 0x8b, 0x57, n);
  }
    else
  {
    // mov rdx, [rdi+128]: 0x48 0x8b 0x97 0x80 0x00 0x00 0x00
    generate_code(generate, 7, 0x48, 0x8b, 0x97,
      n & 0xff, (n >> 8) & 0xff, (n >> 16) & 0xff, (n >> 24) & 0xff);
  }

Or for ALU operations:


  if (len < 128)
  {
    // sub rdi, 100: 0x48 0x83 0xef 0x64
    generate_code(generate, 4, 0x48, 0x83, 0xef, len);
  }
    else
  if (len < 32768)
  {
    // sub rdi, 128: 0x48 0x81 0xef 0x80 0x00 0x00 0x00
    generate_code(generate, 7, 0x48, 0x81, 0xef,
      len & 0xff, (len >> 8) & 0xff, (len >> 16) & 0xff, (len >> 24) & 0xff);
  }

Branches in x86 are a bit awkward also since there is a short branch that can add between -128 to 127 to the program counter or a 32 bit signed value. That's the difference between between an instruction taking 2 bytes and an instruction taking 6.


  // Calculate how many bytes from the current PC to
  // the jump destination.
  distance = generate->ptr - label;

  if (distance + 2 <= 128)
  {
    // This is always a backwards jump so the distance is
    // negative.  Add 2 because the number of bytes in this
    // instruction is 2 and the distance is calculated from
    // where the PC will be after the jump instruction is read
    // in.
    distance = -(distance + 2);

    // jnz label: 0x75 label
    generate_code(generate, 2, 0x75, distance);
  }
    else
  {
    // This is always a backwards jump so the distance is
    // negative.  Add 6 because the number of bytes in this
    // instruction is 6 and the distance is calculated from
    // where the PC will be after the jump instruction is read
    // in.
    distance = -(distance + 6);

    // jnz label: 0x0f 0x85 label
    generate_code(generate, 6, 0x0f, 0x85,
      distance & 0xff,
      (distance >> 8) & 0xff,
      (distance >> 16) & 0xff,
      (distance >> 24) & 0xff);
  }

After generating all the executable code, it would be a good idea to use mprotect() (or on Windows VirtualProtect()) to set the memory to remove read/write permission and make it execute only.


mprotect(match, MAX_CODE_SIZE, PROT_EXEC);

In order to keep the compiler source code from getting ugly with a bunch of #ifdef's for the difference between Unix and Windows memory allocations, I created some macros:


// Compatibility macros for Windows
#ifdef WINDOWS
#define PROT_EXEC PAGE_EXECUTE
#define PROT_WRITE PAGE_EXECUTE_READWRITE
#define PROT_READ 0
#define MAP_FAILED NULL

#define mmap(addr, length, prot, flags, fd, offset) \
  VirtualAlloc(addr, length, MEM_RESERVE | MEM_COMMIT, prot);
#define mprotect(addr, len, prot) VirtualProtect(addr, len, prot, NULL)
#define munmap(addr, len) VirtualFree(addr, 0, MEM_RELEASE)
#define alloca _alloca
#endif

Be careful with these macros, they work well for my use-case but might fail in other situations.

An interesting note, it's actually possible to do the same thing in Java. I have an example here where I generate a Java class in memory with Java assembly opcodes: Java Class Generator. The Java JVM will run the code in this class through the interpreter or decide to compile it into native code with the Java JIT. This opens a lot of possibilities including connecting C and Java code by bypassing a lot of JNI calls.

Anyway, if something isn't clear here let know and I'll write a little more about it.