The first installment is an example of using parallel (SIMD) ARM assembler in Xcode to do basic processing of an ARGB image. I'll post a follow-up with more detail. In summary, for each iteration, the routine
- reads 16 bytes into four 32-bit registers (r2-r5) with a single ldmia (load multiple) instruction,
- processes the 16 bytes, which equate to 4 ARGB pixels, with 4 uqadd8 instructions (each one performs four unsigned saturating 8-bit additions, one per channel of a pixel),
- stores the 16 bytes back to memory and increments the buffer pointer with a single stmia (store multiple) instruction.
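
To make the per-iteration steps concrete, here is a rough portable C model of the loop. It is only a sketch and not the assembler routine itself: `uqadd8_model` emulates what a uqadd8 instruction does to one 32-bit word, and `brighten_argb`, its `delta` operand, and the requirement that the pixel count be a multiple of 4 are assumptions made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable model of ARM's UQADD8: four independent unsigned 8-bit
 * additions on the bytes of two words, each clamped to 255 instead
 * of wrapping around. */
static uint32_t uqadd8_model(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        if (sum > 0xFFu)
            sum = 0xFFu;            /* saturate this byte lane */
        result |= sum << shift;
    }
    return result;
}

/* Hypothetical wrapper (not from the original routine): adds `delta`
 * to every channel of every pixel, four pixels per iteration.
 * Assumes pixel_count is a multiple of 4; leftover pixels are skipped. */
static void brighten_argb(uint32_t *pixels, size_t pixel_count, uint32_t delta)
{
    for (size_t i = 0; i + 4 <= pixel_count; i += 4) {
        /* Step 1: load 16 bytes into four "registers" (the ldmia). */
        uint32_t r2 = pixels[i];
        uint32_t r3 = pixels[i + 1];
        uint32_t r4 = pixels[i + 2];
        uint32_t r5 = pixels[i + 3];

        /* Step 2: one saturating byte-wise add per pixel (4 x uqadd8). */
        r2 = uqadd8_model(r2, delta);
        r3 = uqadd8_model(r3, delta);
        r4 = uqadd8_model(r4, delta);
        r5 = uqadd8_model(r5, delta);

        /* Step 3: store 16 bytes back; the loop index plays the role
         * of the pointer increment done by the stmia. */
        pixels[i]     = r2;
        pixels[i + 1] = r3;
        pixels[i + 2] = r4;
        pixels[i + 3] = r5;
    }
}
```

Calling `brighten_argb(buffer, count, 0x00202020)` would, under these assumptions, add 0x20 to three of the four channels of each pixel and leave the top byte of each word alone. Which byte holds alpha depends on how the image is laid out in memory, and channels that would overflow stay at 0xFF rather than wrapping to a dark value.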