Update benchmarks to focus on conversion performance.
Benchmark each interpolation function for RGBA and YCbCr images.
Reduce the image size used when benchmarking to reduce the influence of
memory performance on results.
Optimize data-locality for a huge increase in processing speed.
This is a complete rewrite! The tight scaling loop needs data locality for optimal performance. The old version used lots of pointer redirections to access image data which was bad for data locality. By providing the complete loop for each image type, this problem is solved. Unfortunately this increases code duplication but the result should be worth it: I could measure a ~6x speed-up for certain test cases!