SIMD Image Processing

Posted: April 2007

Introduction

This page is an example of how to use SSE (more correctly, SSE2), which exists in Pentium 4 and AMD64 CPUs, to improve the performance of two image processing functions: increasing brightness and converting YUV422 to RGB. I also have examples here of how to use Altivec on the PowerPC CPU and WMMX on XScale. SSE (basically a 128 bit version of the 64 bit MMX instruction set) is Intel's SIMD (Single Instruction, Multiple Data) instruction set, introduced on the Pentium III and extended to SSE2 on the Pentium 4. SSE2 adds 128 bit integer math (and double precision floats) to SSE's single precision floating point instructions. The asm code here assembles with the NASM assembler.

For anyone wanting to play around with the YUV colorspace, I created a small Javascript program that can convert between RGB and YUV and display what the color values look like on this page: https://www.mikekohn.net/file_formats/yuv_rgb_converter.php. I have another possibly interesting SSE project here: Mandelbrots with SIMD Assembly.

Update December 12, 2018: I started cleaning up the code, changing it to 64 bit and adding some AVX2 versions of the code (still working on it). I put everything in a git repository linked to at the bottom of this page.
Explanation Of Brightness

The brighter test program here reads in a BMP file and converts it to grayscale. The image data is stored in a buffer of width * height bytes where each byte represents the brightness of one pixel (0 is black, 255 is white, and all numbers in between are shades of gray). To increase the brightness of an image, every byte in the image buffer is increased by some value. To make the image darker, every byte in the buffer is decreased. For a color image, to be technically correct, the image needs to be in YUV format so the Y portion can be treated like a grayscale image using this function. If this function were used on a standard RGB buffer, I don't think it would work properly, especially at the saturation points of the buffer, but it's worth a try? Converting between RGB and YUV is pretty computationally expensive.

Explanation Of YUV422 to RGB

YUV is another colorspace that can be used to represent an image. An explanation of YUV can be found on Wikipedia's YUV page. YUV422 planar stores Y as single bytes in the first part of the buffer, followed by U at 1/2 the horizontal resolution of Y, and then V at 1/2 the horizontal resolution of Y. For every 2 Y (brightness) bytes, there is 1 U (color) and 1 V (color).

I wrote 3 different versions of YUV422 to RGB. The first follows the exact formula for YUV as described on Wikipedia, the second uses an integer / shifting trick to get rid of some of the multiplication and floating point math, and the third is based on the floating point version but written entirely in assembly language using SSE. I was actually able to almost double the speed of the original integer / shift version by using some simple lookup tables to get rid of all the multiplication and saturation, but I haven't posted that version. Maybe one of these days an SSE integer version would be interesting to try.
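To make the planar layout and the floating point formula concrete, here is a minimal C sketch of the first (floating point) approach. It is not the code from the download: the function name yuv422_to_rgb_float, the packed 3 bytes per pixel RGB output, the even width, and the subtraction of 128 from U and V are all assumptions made for illustration. The coefficients are the ones used in the vectors later on this page.

```c
#include <stddef.h>

/* Clamp a float to the 0..255 range of an 8 bit channel. */
static unsigned char clamp_u8(float x)
{
  if (x < 0.0f) { return 0; }
  if (x > 255.0f) { return 255; }
  return (unsigned char)x;
}

/* Hypothetical helper: convert planar YUV422 to packed RGB.
   Assumed layout: width*height Y bytes, then width*height/2 U bytes,
   then width*height/2 V bytes. width is assumed to be even. */
void yuv422_to_rgb_float(const unsigned char *yuv, unsigned char *rgb,
                         int width, int height)
{
  size_t pixels = (size_t)width * (size_t)height;
  const unsigned char *y_plane = yuv;
  const unsigned char *u_plane = yuv + pixels;
  const unsigned char *v_plane = u_plane + pixels / 2;
  size_t i;

  for (i = 0; i < pixels; i++)
  {
    /* Two neighboring Y samples share one U and one V sample. */
    float y = (float)y_plane[i];
    float u = (float)u_plane[i / 2] - 128.0f;  /* assumed chroma bias */
    float v = (float)v_plane[i / 2] - 128.0f;  /* assumed chroma bias */

    rgb[i * 3 + 0] = clamp_u8(y + 1.13983f * v);
    rgb[i * 3 + 1] = clamp_u8(y - 0.39466f * u - 0.58060f * v);
    rgb[i * 3 + 2] = clamp_u8(y + 2.03211f * u);
  }
}
```

With U and V both at 128 the chroma terms vanish, so every pixel comes out gray with R = G = B = Y, which is a quick sanity check for the indexing.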
How SIMD helps Brightness

SSE adds eight 128 bit registers (xmm0 to xmm7) to the x86 instruction set. These registers can load from and store to memory 128 bits at a time (well, in one instruction at least), but for math operations a register gets divided up into either sixteen single bytes, eight 16 bit words, four 32 bit double-words, four single precision floats, or two double precision floats. In the brightness example, every byte of the xmm1 register is loaded with the same single byte value. In the brightness loop, xmm0 is loaded with the next 16 bytes in the buffer. Using paddusb (packed add unsigned bytes with saturation), xmm1 is added to xmm0: each byte of xmm1 is added to the corresponding byte of xmm0. Because paddusb uses "saturation", if the resulting byte would overflow it's simply set to 255. xmm0 is then written back to memory.

Example: If the value passed to the function was 3, and the memory at the start of the image was 00 01 02 03 04 05 06 07 248 249 250 251 252 253 254 255 (values in decimal, a 16 pixel gradient from black to white):
xmm1 = 0x03030303030303030303030303030303
After the movdqa xmm0, [edi] instruction:
xmm0 = 0xfffefdfcfbfaf9f80706050403020100 (x86 is little endian, so the first byte in memory is the lowest byte of the register)
After paddusb xmm0, xmm1:
xmm0 = 0xfffffffffefdfcfb0a09080706050403
After movdqa [edi], xmm0 the memory at the address pointed to by edi would be:
03 04 05 06 07 08 09 10 251 252 253 254 255 255 255 255
How SIMD helps YUV422 to RGB

For YUV422, I use SSE to process 4 pixels at one time. I set up "vectors" of 4 floating point values. In my example I have the following vectors:
VecY = [ Y0, Y1, Y2, Y3 ]
VecU = [ U0, U0, U1, U1 ]
VecV = [ V0, V0, V1, V1 ]
VecConst1 = [ 1.13983, 1.13983, 1.13983, 1.13983 ]
VecConst2 = [ -0.39466, -0.39466, -0.39466, -0.39466 ]
VecConst3 = [ -0.58060, -0.58060, -0.58060, -0.58060 ]
VecConst4 = [ 2.03211, 2.03211, 2.03211, 2.03211 ]
VectN128 = [ -128, -128, -128, -128 ]
Vec255 = [ 255, 255, 255, 255 ]
Vec0 = [ 0, 0, 0, 0 ]
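One way to read these vectors is as a plain C emulation where each "register" is an array of 4 floats and every operation runs on all 4 lanes. The sketch below is only an illustration of the idea, not the assembly from the download: the vec4 type, the v4_* helpers, and the function name yuv422_to_rgb_4px are hypothetical, and using VectN128 to remove a 128 bias from U and V is an assumption about how that vector is meant to be used.

```c
/* Four float lanes standing in for one xmm register (illustrative type). */
typedef struct { float f[4]; } vec4;

static vec4 v4_splat(float x)          /* same value in all 4 lanes */
{ vec4 r; int i; for (i = 0; i < 4; i++) { r.f[i] = x; } return r; }

static vec4 v4_add(vec4 a, vec4 b)     /* lane-wise add, like addps */
{ int i; for (i = 0; i < 4; i++) { a.f[i] += b.f[i]; } return a; }

static vec4 v4_mul(vec4 a, vec4 b)     /* lane-wise multiply, like mulps */
{ int i; for (i = 0; i < 4; i++) { a.f[i] *= b.f[i]; } return a; }

static vec4 v4_clamp(vec4 a, vec4 lo, vec4 hi)  /* like maxps then minps */
{
  int i;
  for (i = 0; i < 4; i++)
  {
    if (a.f[i] < lo.f[i]) { a.f[i] = lo.f[i]; }
    if (a.f[i] > hi.f[i]) { a.f[i] = hi.f[i]; }
  }
  return a;
}

/* Convert 4 pixels: y[0..3] share the chroma pairs u[0..1] and v[0..1].
   rgb receives 12 bytes of packed RGB. */
void yuv422_to_rgb_4px(const unsigned char *y, const unsigned char *u,
                       const unsigned char *v, unsigned char *rgb)
{
  vec4 vec_y, vec_u, vec_v, r, g, b;
  vec4 vec0 = v4_splat(0.0f);
  vec4 vec255 = v4_splat(255.0f);
  int i;

  for (i = 0; i < 4; i++)
  {
    vec_y.f[i] = (float)y[i];      /* [ Y0, Y1, Y2, Y3 ] */
    vec_u.f[i] = (float)u[i / 2];  /* [ U0, U0, U1, U1 ] */
    vec_v.f[i] = (float)v[i / 2];  /* [ V0, V0, V1, V1 ] */
  }

  /* Assumed use of VectN128: remove the 128 bias from U and V. */
  vec_u = v4_add(vec_u, v4_splat(-128.0f));
  vec_v = v4_add(vec_v, v4_splat(-128.0f));

  r = v4_add(vec_y, v4_mul(vec_v, v4_splat(1.13983f)));
  g = v4_add(vec_y, v4_add(v4_mul(vec_u, v4_splat(-0.39466f)),
                           v4_mul(vec_v, v4_splat(-0.58060f))));
  b = v4_add(vec_y, v4_mul(vec_u, v4_splat(2.03211f)));

  /* Vec0 and Vec255 saturate each lane to the 0..255 range. */
  r = v4_clamp(r, vec0, vec255);
  g = v4_clamp(g, vec0, vec255);
  b = v4_clamp(b, vec0, vec255);

  for (i = 0; i < 4; i++)
  {
    rgb[i * 3 + 0] = (unsigned char)r.f[i];
    rgb[i * 3 + 1] = (unsigned char)g.f[i];
    rgb[i * 3 + 2] = (unsigned char)b.f[i];
  }
}
```

Each v4_* helper corresponds to a single SSE instruction working on a whole register, which is where the speedup over the one pixel at a time C loop comes from.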
So using the YUV to RGB formulas as described on the Wikipedia page, it's pretty simple to do the math on all the vectors in assembly language to get the RGB pixels. In the image_proc download at the bottom of the page, image_proc_sse.asm has a pretty well commented example of the finished function. Btw, I haven't finished optimizing this function yet, so I might get some more speed out of it later.

C Version Of Brightness
void brightness(unsigned char *buffer, int len, int v)
{
  int t, r;

  if (v > 0)
  {
    for (t = 0; t < len; t++)
    {
      r = buffer[t] + v;
      if (r > 255) { r = 255; }
      buffer[t] = r;
    }
  }
  else
  {
    for (t = 0; t < len; t++)
    {
      r = buffer[t] + v;
      if (r < 0) { r = 0; }
      buffer[t] = r;
    }
  }
}
SSE2 Version Of Brightness
global brightness_sse
section .text
bits 32
; void brightness_sse(unsigned char *image, int len, int v)
brightness_sse:
push ebp
push edi
mov ebp, esp
mov edi, [ebp+12] ; unsigned char *image
mov ecx, [ebp+16] ; int len
mov eax, [ebp+20] ; int v
test eax, eax ; mov doesn't set flags, so test v explicitly
jns bright_not_neg ; skip the negate if v is not negative
neg al ; make al abs(v)
bright_not_neg:
shr ecx, 4 ; count = image_len / 16
mov ah, al ; make xmm1 = (v,v,v,v ,v,v,v,v, ,v,v,v,v, v,v,v,v)
pinsrw xmm1, ax, 0
pinsrw xmm1, ax, 1
pinsrw xmm1, ax, 2
pinsrw xmm1, ax, 3
pinsrw xmm1, ax, 4
pinsrw xmm1, ax, 5
pinsrw xmm1, ax, 6
pinsrw xmm1, ax, 7
test eax, 0xff000000 ; if v was negative, make it darker by abs(v)
jnz dark_loop
bright_loop:
movdqa xmm0, [edi] ; load the next 16 bytes (edi must be 16 byte aligned)
paddusb xmm0, xmm1 ; add v to each of the 16 bytes; any byte that would
movdqa [edi], xmm0 ; overflow (more than 255) is set to 255 (saturation)
add edi, 16 ; ptr=ptr+16
loop bright_loop ; while (count>0)
jmp bright_exit
dark_loop:
movdqa xmm0, [edi] ; same as above but subtract v from each of the
psubusb xmm0, xmm1 ; 16 bytes that make up xmm0. if a byte will
movdqa [edi], xmm0 ; become negative, set it to 0 (saturation)
add edi, 16 ; ptr=ptr+16
loop dark_loop ; while (count>0)
bright_exit:
pop edi
pop ebp
ret ; return
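The same loop can also be sketched in C with compiler intrinsics instead of hand written assembly. This is an illustrative counterpart, not code from the download: the name brightness_intrin is hypothetical, v is assumed to be in the range -255 to 255, and unlike the assembly above it uses unaligned loads and stores (movdqu rather than movdqa) and handles a length that isn't a multiple of 16 with a scalar tail. The #if guard is only there so the sketch still builds on a compiler or CPU target without SSE2.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical intrinsics counterpart of brightness_sse().
   Assumes -255 <= v <= 255. */
void brightness_intrin(unsigned char *image, size_t len, int v)
{
  size_t i = 0;

#ifdef HAVE_SSE2
  /* abs(v) copied into all 16 bytes, like xmm1 in the assembly. */
  unsigned char mag = (unsigned char)(v < 0 ? -v : v);
  __m128i delta = _mm_set1_epi8((char)mag);

  for (; i + 16 <= len; i += 16)
  {
    /* Unaligned load, so the buffer needs no special alignment. */
    __m128i px = _mm_loadu_si128((const __m128i *)(image + i));

    if (v < 0)
    {
      px = _mm_subs_epu8(px, delta);  /* psubusb: saturates at 0 */
    }
    else
    {
      px = _mm_adds_epu8(px, delta);  /* paddusb: saturates at 255 */
    }

    _mm_storeu_si128((__m128i *)(image + i), px);
  }
#endif

  /* Scalar tail (or the whole buffer when SSE2 is unavailable). */
  for (; i < len; i++)
  {
    int r = image[i] + v;
    if (r > 255) { r = 255; }
    if (r < 0) { r = 0; }
    image[i] = (unsigned char)r;
  }
}
```

Calling it with v = 3 on the 16 byte example from earlier on this page gives 03 04 05 06 07 08 09 10 251 252 253 254 255 255 255 255, the same values as the worked paddusb example.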
Altivec on PowerPC

I've started translating the SSE/x86 code to Altivec/PowerPC for MacOSX and the Cell CPU found in the Playstation 3. After benchmarking the C code on Playstation 3 Linux, I was kinda disappointed with the results, so I translated it to straight PPC assembly and PPC+Altivec. Unfortunately, I did all the development on MacOSX using the "as" assembler, which doesn't appear to be compatible with the "as" assembler on Playstation 3 Linux, so I have to rewrite it. The benchmark on the Mac G4 looks pretty good tho, I'll post the results soon. In the future I'm hoping to translate the code to one of the Cell's SPUs. I also plan on adding Altivec YUV422 to RGB.

Note: I added PowerPC and Cell to naken_asm and wrote some sample SIMD Mandelbrot code.

Source Code: image_proc_altivec.asm

Performance

The following table shows the difference between the C and SSE2 versions of the brighter() function. The time represents how long it took to read in the bmp, call the brighter routine 100,000 times, and then write out a modified bmp.

Note: Performance differences could be due to memory bus speed and not processor speed. I can't remember what speed of memory is in these two boxes, but the AMD64 box is a laptop, and laptops typically have slower memory.

Brightness Adjust (100,000 iterations on a 352x240 image)
YUV422 to RGB (10,000 iterations on a 704x480 image)
Note: -m32 tells gcc to compile for a 32 bit cpu while -m64 says to compile for 64 bit.

I made a multithreaded version of yuv2rgb.c. It breaks up the 10,000 iterations over multiple threads.

YUV422 to RGB (10,000 iterations on a 704x480 image) 64 bit compiled C code only
Download
image_proc-2007-04-23.tar.gz

I've recently started cleaning up this code and moving it to a git repository. I added some AVX2 examples and plan on possibly adding AVX-512.

https://github.com/mikeakohn/simd_examples
Copyright 1997-2024 - Michael Kohn