CONTENTS

Home
Updates
Software
Electronics
Music
Resume
Contact


YouTube
Twitter
GitHub
LinkedIn


Easy Match

Posted: March 11, 2016

Introduction

Easy Match is a replacement for regular expression libraries like PCRE. The two goals are to create a matching language simpler than regex and compiling the matching code into pure machine code using a JIT.

The following platforms are supported

  • x86_64: Linux, MacOSX, FreeBSD, Windows
  • x86: Linux, Windows
  • ARM: Linux, Windows
  • ARM64: Linux (planned)

Easy Match was tested for ARM on Raspberry Pi with both Linux and Windows 10 IoT. For Windows 10 IoT, I built it with Visual Studio 2015 using nmake with Makefile.windows in the build directory, but it's currently not self-configuring with the ./configure script. For x86_64, I built the Windows x86_64 version with Mingw's gcc.

To do:

  • Need to add support in the compiler for parenthesis.
  • The x86 generator doesn't have short-circuiting optimization.
  • Code size is limited to 4k. Need to add growing support.
  • Use mprotect() / VirtualProtect() to make exe memory execute only.
  • Write lots of tests.
  • Add ARM64 (need docs and a board).

Download

git clone https://github.com/mikeakohn/easy_match.git

Tests

I have a test program to compare the functions of Easy Match against strncmp() and PCRE (with their JIT turned on as far as I can tell). The tests are run against src/generate_x86_64.c. The "count=" is the number of matches found (showing that all 4 versions of pattern matching are working properly) and "msec=" is the time in milliseconds it took to complete the operation on the entire file. All tests here are for x86_64..

[ starts_with('int ') ^int ]

  • Easy Match count=11 msec=0.026282
  • Easy Match (len) count=11 msec=0.026197
  • PCRE count=11 msec=0.122402
  • strncmp() count=11 msec=0.029939

[ ends_with(');') );$ ]

  • Easy Match count=124 msec=0.064317
  • Easy Match (len) count=124 msec=0.014167
  • PCRE count=124 msec=0.202526
  • strncmp() count=124 msec=0.075861

[ starts_with('int ') or ends_with(');') ^int|);$ ]

  • Easy Match count=135 msec=0.053906
  • Easy Match (len) count=135 msec=0.015211
  • PCRE count=135 msec=0.305322

[ contains('int') int ]

  • Easy Match count=57 msec=0.123988
  • Easy Match (len) count=57 msec=0.059791
  • PCRE count=57 msec=0.146438
  • strstr() count=57 msec=0.087332

[ equals(' return 0;') ^ return 0;$ ]

  • Easy Match count=16 msec=0.025463
  • Easy Match (len) count=16 msec=0.024647
  • PCRE count=16 msec=0.098309
  • strcmp() count=16 msec=0.021507

[ equals(' return 0') ^ return 0$ ]

  • Easy Match count=0 msec=0.025164
  • Easy Match (len) count=0 msec=0.024221
  • PCRE count=0 msec=0.096943
  • strcmp() count=0 msec=0.029123

API

Three functions are exported in this library:

match_t match = compiler_generate(char *code);
int result = match(char *line);
compiler_free(match_t match);

To generate a matching function, a set of commands are passed in with the following matching functions available:

  • startswith('text')
  • endswith('text')
  • match_at('text',index)
  • equals('text')
  • contains('text')

Each one of these functions can be prefixed with the word "not". Multiple matching functions can be combined using the keywords "and" and "or". An example could be the following:

match_t match = compiler_generate("startswith('int') and not endswith('}')");

Using a JIT compiler, this will generate a matching function that can now be used like this:

int result = match("int a;");

Because the passed in line starts with "int" and ends with ";", the result of calling match should be a 1. If the line doesn't match all the requirements the result would be 0.

To free the memory allocated by this generated function:

compiler_free(match);

Explanation

This ends up being pretty similar to Java Grinder, except Easy Match generates code in memory instead into .asm files that need to be assembled. For an explanation on creating a JIT, read my JIT Compiler Development page.

In order to make Easy Match more portable, it was written in pure C. There should therefore not be any extra dependencies to build or distribute. To make the code easier to read it was written in a kind of objected oriented way. There is a tokens.c module that is used to tokenize the user's matching code. The compiler.c module parses one token at a time and calls functions in generate.c to build an assembly language representation of the function in executable memory space. The generate.c module has several submodules based on what platform the matching function needs to run on. For example, if Easy Match is compiled on x86_64, the generate_linux_x86_64 module will be used. Which generate.c submodule to use is determined when Easy Match itself is compiled.

An example of a pattern match I've been testing by running it against part of the Easy Match source code:

./easy_match src/generate.h "starts_with('int ') or starts_with('#def') and ends_with('1')"

Generates the following function in x86_64:

; Save rbx
00000000  48895C24F8        mov [rsp-0x8],rbx

; Perform strlen(rdi) and store in rsi register
00000005  4889FE            mov rsi,rdi
00000008  803E00            cmp byte [rsi],0x0
0000000B  7405              jz 0x12
0000000D  48FFC6            inc rsi
00000010  EBF6              jmp short 0x8
00000012  4829FE            sub rsi,rdi

; starts_with('int ')
; First make sure strlen(rdi) > 4
00000015  4883FE04          cmp rsi,byte +0x4
00000019  7C0F              jl 0x2a
; Check if 4 bytes pointed to by rdi are "int "
0000001B  813F696E7420      cmp dword [rdi],0x20746e69
00000021  7507              jnz 0x2a
00000023  B801000000        mov eax,0x1
00000028  EB02              jmp short 0x2c
0000002A  31C0              xor eax,eax
; Result of starts_with is in eax.  If it's true
; short circuit to end of block of "or" operations.
0000002C  85C0              test eax,eax
0000002E  754B              jnz 0x7b

; starts_with('#def')
; First make sure strlen(rdi) > 4
00000030  4883FE04          cmp rsi,byte +0x4
00000034  7C10              jl 0x46
; Check if 4 bytes pointed to by rdi are "#def"
00000036  813F23646566      cmp dword [rdi],0x66656423
0000003C  7508              jnz 0x46
0000003E  41B801000000      mov r8d,0x1
00000044  EB03              jmp short 0x49
00000046  4D31C0            xor r8,r8
; Result of starts_with is in r8.  If it's false
; short circuit to end of block of "and" operations.
00000049  4D85C0            test r8,r8
0000004C  742A              jz 0x78

; ends_with('0')
; First make sure strlen(rdi) > 1
0000004E  4883FE01          cmp rsi,byte +0x1
00000052  7C19              jl 0x6d
; Save rdi
00000054  48897C24F0        mov [rsp-0x10],rdi
; Point rdi to the end of string - 1
00000059  4801F7            add rdi,rsi
0000005C  4883EF01          sub rdi,byte +0x1
00000060  803F31            cmp byte [rdi],0x31
00000063  7508              jnz 0x6d
00000065  41B901000000      mov r9d,0x1
0000006B  EB03              jmp short 0x70
0000006D  4D31C9            xor r9,r9
; Restore rdi
00000070  488B7C24F0        mov rdi,[rsp-0x10]
; Perform "and" operation on last 2 results
00000075  4D21C8            and r8,r9
; Perform "or" operation on that with first result
00000078  4C09C0            or rax,r8

; Restore rbx. Register rax holds return value
0000007B  488B5C24F8        mov rbx,[rsp-0x8]
00000080  C3                ret

Copyright 1997-2024 - Michael Kohn