Easy MatchPosted: March 11, 2016 Introduction Easy Match is a replacement for regular expression libraries like PCRE. The two goals are to create a matching language simpler than regex and compiling the matching code into pure machine code using a JIT. The following platforms are supported
Easy Match was tested for ARM on Raspberry Pi with both Linux and Windows 10 IoT. For Windows 10 IoT, I built it with Visual Studio 2015 using nmake with Makefile.windows in the build directory, but it's currently not self-configuring with the ./configure script. For x86_64, I built the Windows x86_64 version with Mingw's gcc. To do:
Download git clone https://github.com/mikeakohn/easy_match.git Tests I have a test program to compare the functions of Easy Match against strncmp() and PCRE (with their JIT turned on as far as I can tell). The tests are run against src/generate_x86_64.c. The "count=" is the number of matches found (showing that all 4 versions of pattern matching are working properly) and "msec=" is the time in milliseconds it took to complete the operation on the entire file. All tests here are for x86_64.. [ starts_with('int ') ^int ]
[ ends_with(');') );$ ]
[ starts_with('int ') or ends_with(');') ^int|);$ ]
[ contains('int') int ]
[ equals(' return 0;') ^ return 0;$ ]
[ equals(' return 0') ^ return 0$ ]
API
Three functions are exported in this library: To generate a matching function, a set of commands are passed in with the following matching functions available:
Each one of these functions can be prefixed with the word "not". Multiple matching functions can be combined using the keywords "and" and "or". An example could be the following: match_t match = compiler_generate("startswith('int') and not endswith('}')"); Using a JIT compiler, this will generate a matching function that can now be used like this: int result = match("int a;"); Because the passed in line starts with "int" and ends with ";", the result of calling match should be a 1. If the line doesn't match all the requirements the result would be 0. To free the memory allocated by this generated function: compiler_free(match); Explanation This ends up being pretty similar to Java Grinder, except Easy Match generates code in memory instead into .asm files that need to be assembled. For an explanation on creating a JIT, read my JIT Compiler Development page. In order to make Easy Match more portable, it was written in pure C. There should therefore not be any extra dependencies to build or distribute. To make the code easier to read it was written in a kind of objected oriented way. There is a tokens.c module that is used to tokenize the user's matching code. The compiler.c module parses one token at a time and calls functions in generate.c to build an assembly language representation of the function in executable memory space. The generate.c module has several submodules based on what platform the matching function needs to run on. For example, if Easy Match is compiled on x86_64, the generate_linux_x86_64 module will be used. Which generate.c submodule to use is determined when Easy Match itself is compiled. An example of a pattern match I've been testing by running it against part of the Easy Match source code:
./easy_match src/generate.h "starts_with('int ') or starts_with('#def') and ends_with('1')"
Generates the following function in x86_64:
; Save rbx
00000000 48895C24F8 mov [rsp-0x8],rbx
; Perform strlen(rdi) and store in rsi register
00000005 4889FE mov rsi,rdi
00000008 803E00 cmp byte [rsi],0x0
0000000B 7405 jz 0x12
0000000D 48FFC6 inc rsi
00000010 EBF6 jmp short 0x8
00000012 4829FE sub rsi,rdi
; starts_with('int ')
; First make sure strlen(rdi) > 4
00000015 4883FE04 cmp rsi,byte +0x4
00000019 7C0F jl 0x2a
; Check if 4 bytes pointed to by rdi are "int "
0000001B 813F696E7420 cmp dword [rdi],0x20746e69
00000021 7507 jnz 0x2a
00000023 B801000000 mov eax,0x1
00000028 EB02 jmp short 0x2c
0000002A 31C0 xor eax,eax
; Result of starts_with is in eax. If it's true
; short circuit to end of block of "or" operations.
0000002C 85C0 test eax,eax
0000002E 754B jnz 0x7b
; starts_with('#def')
; First make sure strlen(rdi) > 4
00000030 4883FE04 cmp rsi,byte +0x4
00000034 7C10 jl 0x46
; Check if 4 bytes pointed to by rdi are "#def"
00000036 813F23646566 cmp dword [rdi],0x66656423
0000003C 7508 jnz 0x46
0000003E 41B801000000 mov r8d,0x1
00000044 EB03 jmp short 0x49
00000046 4D31C0 xor r8,r8
; Result of starts_with is in r8. If it's false
; short circuit to end of block of "and" operations.
00000049 4D85C0 test r8,r8
0000004C 742A jz 0x78
; ends_with('0')
; First make sure strlen(rdi) > 1
0000004E 4883FE01 cmp rsi,byte +0x1
00000052 7C19 jl 0x6d
; Save rdi
00000054 48897C24F0 mov [rsp-0x10],rdi
; Point rdi to the end of string - 1
00000059 4801F7 add rdi,rsi
0000005C 4883EF01 sub rdi,byte +0x1
00000060 803F31 cmp byte [rdi],0x31
00000063 7508 jnz 0x6d
00000065 41B901000000 mov r9d,0x1
0000006B EB03 jmp short 0x70
0000006D 4D31C9 xor r9,r9
; Restore rdi
00000070 488B7C24F0 mov rdi,[rsp-0x10]
; Perform "and" operation on last 2 results
00000075 4D21C8 and r8,r9
; Perform "or" operation on that with first result
00000078 4C09C0 or rax,r8
; Restore rbx. Register rax holds return value
0000007B 488B5C24F8 mov rbx,[rsp-0x8]
00000080 C3 ret
Copyright 1997-2025 - Michael Kohn
|