Easy MatchPosted: March 11, 2016 Introduction Easy Match is a replacement for regular expression libraries like PCRE. The two goals are to create a matching language simpler than regex and compiling the matching code into pure machine code using a JIT. The following platforms are supported
Easy Match was tested for ARM on Raspberry Pi with both Linux and Windows 10 IoT. For Windows 10 IoT, I built it with Visual Studio 2015 using nmake with Makefile.windows in the build directory, but it's currently not self-configuring with the ./configure script. For x86_64, I built the Windows x86_64 version with Mingw's gcc. To do:
Download git clone https://github.com/mikeakohn/easy_match.git Tests I have a test program to compare the functions of Easy Match against strncmp() and PCRE (with their JIT turned on as far as I can tell). The tests are run against src/generate_x86_64.c. The "count=" is the number of matches found (showing that all 4 versions of pattern matching are working properly) and "msec=" is the time in milliseconds it took to complete the operation on the entire file. All tests here are for x86_64.. [ starts_with('int ') ^int ]
[ ends_with(');') );$ ]
[ starts_with('int ') or ends_with(');') ^int|);$ ]
[ contains('int') int ]
[ equals(' return 0;') ^ return 0;$ ]
[ equals(' return 0') ^ return 0$ ]
API
Three functions are exported in this library: To generate a matching function, a set of commands are passed in with the following matching functions available:
Each one of these functions can be prefixed with the word "not". Multiple matching functions can be combined using the keywords "and" and "or". An example could be the following: match_t match = compiler_generate("startswith('int') and not endswith('}')"); Using a JIT compiler, this will generate a matching function that can now be used like this: int result = match("int a;"); Because the passed in line starts with "int" and ends with ";", the result of calling match should be a 1. If the line doesn't match all the requirements the result would be 0. To free the memory allocated by this generated function: compiler_free(match); Explanation This ends up being pretty similar to Java Grinder, except Easy Match generates code in memory instead into .asm files that need to be assembled. For an explanation on creating a JIT, read my JIT Compiler Development page. In order to make Easy Match more portable, it was written in pure C. There should therefore not be any extra dependencies to build or distribute. To make the code easier to read it was written in a kind of objected oriented way. There is a tokens.c module that is used to tokenize the user's matching code. The compiler.c module parses one token at a time and calls functions in generate.c to build an assembly language representation of the function in executable memory space. The generate.c module has several submodules based on what platform the matching function needs to run on. For example, if Easy Match is compiled on x86_64, the generate_linux_x86_64 module will be used. Which generate.c submodule to use is determined when Easy Match itself is compiled. An example of a pattern match I've been testing by running it against part of the Easy Match source code: ./easy_match src/generate.h "starts_with('int ') or starts_with('#def') and ends_with('1')" Generates the following function in x86_64: ; Save rbx 00000000 48895C24F8 mov [rsp-0x8],rbx ; Perform strlen(rdi) and store in rsi register 00000005 4889FE mov rsi,rdi 00000008 803E00 cmp byte [rsi],0x0 0000000B 7405 jz 0x12 0000000D 48FFC6 inc rsi 00000010 EBF6 jmp short 0x8 00000012 4829FE sub rsi,rdi ; starts_with('int ') ; First make sure strlen(rdi) > 4 00000015 4883FE04 cmp rsi,byte +0x4 00000019 7C0F jl 0x2a ; Check if 4 bytes pointed to by rdi are "int " 0000001B 813F696E7420 cmp dword [rdi],0x20746e69 00000021 7507 jnz 0x2a 00000023 B801000000 mov eax,0x1 00000028 EB02 jmp short 0x2c 0000002A 31C0 xor eax,eax ; Result of starts_with is in eax. If it's true ; short circuit to end of block of "or" operations. 0000002C 85C0 test eax,eax 0000002E 754B jnz 0x7b ; starts_with('#def') ; First make sure strlen(rdi) > 4 00000030 4883FE04 cmp rsi,byte +0x4 00000034 7C10 jl 0x46 ; Check if 4 bytes pointed to by rdi are "#def" 00000036 813F23646566 cmp dword [rdi],0x66656423 0000003C 7508 jnz 0x46 0000003E 41B801000000 mov r8d,0x1 00000044 EB03 jmp short 0x49 00000046 4D31C0 xor r8,r8 ; Result of starts_with is in r8. If it's false ; short circuit to end of block of "and" operations. 00000049 4D85C0 test r8,r8 0000004C 742A jz 0x78 ; ends_with('0') ; First make sure strlen(rdi) > 1 0000004E 4883FE01 cmp rsi,byte +0x1 00000052 7C19 jl 0x6d ; Save rdi 00000054 48897C24F0 mov [rsp-0x10],rdi ; Point rdi to the end of string - 1 00000059 4801F7 add rdi,rsi 0000005C 4883EF01 sub rdi,byte +0x1 00000060 803F31 cmp byte [rdi],0x31 00000063 7508 jnz 0x6d 00000065 41B901000000 mov r9d,0x1 0000006B EB03 jmp short 0x70 0000006D 4D31C9 xor r9,r9 ; Restore rdi 00000070 488B7C24F0 mov rdi,[rsp-0x10] ; Perform "and" operation on last 2 results 00000075 4D21C8 and r8,r9 ; Perform "or" operation on that with first result 00000078 4C09C0 or rax,r8 ; Restore rbx. Register rax holds return value 0000007B 488B5C24F8 mov rbx,[rsp-0x8] 00000080 C3 ret
Copyright 1997-2024 - Michael Kohn
|