I am in the process of writing a bunch of applications for my desktop environment. The plan is to have, at a minimum:
- a status bar
- a menu
- a notification daemon
- a terminal
Obviously I am aware the terminal will probably take a very long time, but I plan on working on it little by little over a long period. These applications are not meant for anything other than my own personal use, but if you would like to follow their development, just have a look at this repository.
In any case, all the above applications require text rendering, which is in itself a big can of worms. We will ignore all of that and instead focus only on the final step: alpha blending the glyph's grayscale bytemap onto a background.
## Using floats
Alpha blending two 8-bit values using an 8-bit alpha channel can be done like so:
```rust
fn alpha_blend_with_float(dst: &mut u8, src: u8, alpha: u8) {
// put alpha in 0..1.0 range
let a = alpha as f32 / 255.0;
// get the complement
let ac = 1.0 - a;
// multiply each component by how much they should
// "contribute" to the final value
*dst = ((*dst as f32 * ac) + (src as f32 * a)).round() as u8;
}
```
The code is straightforward and intuitive: we are doing a simple weighted mean. The alpha channel tells us the weights; we normalize it to the 0.0..1.0 range and then just take the mean. We can make it a little faster by multiplying by the inverse (1.0 / 255.0) instead of dividing. In many cases this could cause rounding differences, but it works fine with the values we will be using. An interesting feature of this problem is that the problem space is very small: 256^3 = 16,777,216 total cases, meaning we can easily exhaustively verify all of them (which is what I did).
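Since the input space is that small, verification can be brute force. Here is a minimal sketch of such an exhaustive check (my own reconstruction, comparing against an f64 reference; the author's actual harness may differ):

```rust
fn verify_exhaustively() {
    for dst in 0..=255u8 {
        for src in 0..=255u8 {
            for alpha in 0..=255u8 {
                let mut got = dst;
                alpha_blend_with_float(&mut got, src, alpha);
                // reference blend in f64, rounded to the nearest integer
                let a = alpha as f64 / 255.0;
                let want =
                    (dst as f64 * (1.0 - a) + src as f64 * a).round() as u8;
                assert_eq!(got, want, "dst={dst} src={src} alpha={alpha}");
            }
        }
    }
}
```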
This code can be rewritten with SIMD instructions, so that it blends multiple values at once (here we multiply by the inverse):
```rust
/// Parameters are colors in RGBA format as u32 (one byte per channel)
unsafe fn alpha_blend_float_sse41(dst: &mut u32, color: u32, alpha: u8) {
use std::arch::x86_64 as intr;
// constants
let one = intr::_mm_set1_ps(1.0);
let mul = intr::_mm_set1_ps(1.0 / 255.0);
// setup alpha value and its complement
let simd_alpha = intr::_mm_set1_epi32(alpha as i32);
let simd_alpha = intr::_mm_cvtepi32_ps(simd_alpha);
let simd_alpha = intr::_mm_mul_ps(simd_alpha, mul);
let simd_alpha_compl = intr::_mm_sub_ps(one, simd_alpha);
// load each color and turn it into a float
let simd_color = intr::_mm_set1_epi32(color as i32);
let simd_color = intr::_mm_cvtepu8_epi32(simd_color);
let simd_color = intr::_mm_cvtepi32_ps(simd_color);
let simd_dst = intr::_mm_set1_epi32(*dst as i32);
let simd_dst = intr::_mm_cvtepu8_epi32(simd_dst);
let simd_dst = intr::_mm_cvtepi32_ps(simd_dst);
// do the operation
let dst_mul = intr::_mm_mul_ps(simd_dst, simd_alpha_compl);
let src_mul = intr::_mm_mul_ps(simd_color, simd_alpha);
let sum = intr::_mm_add_ps(dst_mul, src_mul);
// Transform the result back into u8s.
let result = intr::_mm_cvtps_epi32(sum);
let result = intr::_mm_packus_epi32(result, result);
let result = intr::_mm_packus_epi16(result, result);
let mut tmp = 0u64;
intr::_mm_storel_epi64(std::ptr::addr_of_mut!(tmp).cast(), result);
*dst = tmp as u32;
}
```
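As a quick sanity check (my own example, not from the post), blending a color with three 0x80 channels over a white background at alpha 0x40 should give 223 (0xDF) in those channels:

```rust
let mut dst = 0xFFFF_FFFFu32; // white background
// three 0x80 channels and one 0xFF channel, blended at alpha = 0x40
unsafe { alpha_blend_float_sse41(&mut dst, 0x8080_80FF, 0x40) };
// each 0x80 channel: round(128 * (64/255) + 255 * (191/255)) = 223 = 0xDF
assert_eq!(dst, 0xDFDF_DFFF);
```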
This will by itself give us a nice performance boost (all measurements were taken with an AMD Ryzen 9 7950X CPU):
| Method | Throughput |
| --- | --- |
| Float Scalar | 0.58802708 GB/s |
| Float SIMD sse4.1 | 3.29492197 GB/s |
A surprising improvement: we got around 6x more performance despite operating on only 4 values at a time. The main reason is that the Rust compiler (version 1.89.0) unfortunately does not generate very good code for the scalar version, making the difference larger than one would expect.
Extending this code to work with avx2 is a little annoying because we need to make sure the alpha values are in the correct bytes before converting them into floats:
```rust
let shuffle_mask = intr::_mm256_set_epi8(
-128, -128, -128, 1, -128, -128, -128, 1,
-128, -128, -128, 1, -128, -128, -128, 1,
-128, -128, -128, 0, -128, -128, -128, 0,
-128, -128, -128, 0, -128, -128, -128, 0,
);
let simd_alpha =
    intr::_mm256_set1_epi16(alpha.as_ptr().cast::<i16>().read_unaligned());
let simd_alpha = intr::_mm256_shuffle_epi8(simd_alpha, shuffle_mask);
let simd_alpha = intr::_mm256_cvtepi32_ps(simd_alpha);
```
And then we also need to put in some extra work when converting the values back:
```rust
let result = intr::_mm256_cvtps_epi32(sum);
let hi = intr::_mm256_extracti128_si256::<1>(result);
let lo = intr::_mm256_castsi256_si128(result);
let packed = intr::_mm_packus_epi32(lo, hi);
let packed = intr::_mm_packus_epi16(packed, packed);
```
This extra shuffling is the main reason we cannot achieve exactly a 2x improvement with avx2, but we get reasonably close:
| Method | Throughput |
| --- | --- |
| Float Scalar | 0.58802708 GB/s |
| Float SIMD sse4.1 | 3.29492197 GB/s |
| Float SIMD avx2 | 5.99929942 GB/s |
Now, the most natural next step would be to continue extending this and implement the same code using avx512 and its 512-bit registers. But we can do better than that. The main problem with this method is that it must use 4 bytes (an f32) to blend a single byte. If we could use just 2 bytes instead, we could double the number of colors we blend at once. So, let's try doing that.
## Using u16
We can calculate the same weighted mean through integer multiplication and division, making sure to do the correct rounding (here y is the source value and x the destination), like so:
```
(y * alpha + (x * (255 - alpha)) + 127) / 255
```
Multiplying two bytes gives a result with at most 16 bits, so we can run the above algorithm with half the number of bytes. Unfortunately, we are now using an integer division, which is pretty slow and, more importantly, does not exist in x86_64 SIMD. However, there are ways of calculating this division without actually performing a division instruction. The method we will use is described here; it does the correct rounding and is composed of the following operations:
```
x += 0x80;
((x >> 8) + x) >> 8
```
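A quick way to convince yourself the trick is exact (my own check, not from the post): it agrees with correctly rounded division by 255 for every value the blend can produce before dividing, i.e. everything up to 255 * 255:

```rust
for x in 0..=(255u32 * 255) {
    let y = x + 0x80;
    // the shift trick vs. rounded integer division by 255
    assert_eq!(((y >> 8) + y) >> 8, (x + 127) / 255);
}
```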
Now, because we are blending colors with 4 channels (RGBA), we will always have at least 4 bytes to blend. This would take a total of 8 bytes, which suggests we can use a SWAR approach to blend all 4 bytes at once. Here's what that looks like:
```rust
/// transforms 4 bytes RGBA into 8 bytes 0R0G0B0A
fn as_color16(color: u32) -> u64 {
let x = color as u64;
let x = ((x & 0xFFFF0000) << 16) | (x & 0xFFFF);
((x & 0x0000FF000000FF00) << 8) | (x & 0x000000FF000000FF)
}
fn alpha_blend_swar(dst: &mut u32, color: u32, alpha: u8) {
let alpha = alpha as u64;
let alpha_compl = 0xFF ^ alpha;
let new = as_color16(color);
let orig = as_color16(*dst);
let res16 = new * alpha + orig * alpha_compl + 0x0080008000800080;
let res8 = res16 + ((res16 >> 8) & 0x00FF00FF00FF00FF);
// transform the result back to 32 bits
let res = (res8 >> 8) & 0x00FF00FF00FF00FF;
let res = (res | (res >> 8)) & 0x0000FFFF0000FFFF;
let res = res | (res >> 16);
*dst = (res & 0x00000000FFFFFFFF) as u32;
}
```
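As a quick sanity check (my own example), the same inputs as the sse4.1 float example give the same result:

```rust
let mut dst = 0xFFFF_FFFFu32; // white background
alpha_blend_swar(&mut dst, 0x8080_80FF, 0x40);
// per 0x80 channel: (0x80 * 0x40 + 0xFF * 0xBF + 127) / 255 = 223 = 0xDF
assert_eq!(dst, 0xDFDF_DFFF);
```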
This new method is actually slower than the Float SIMD with sse4.1:
Float Scalar | 0.58802708 GB/s |
---|---|
Float SIMD sse4.1 | 3.29492197 GB/s |
Float SIMD avx2 | 5.99929942 GB/s |
SWAR | 2.59967324 GB/s |
It is still around 4.5 times faster than the original float scalar version, but its biggest advantage is that it does not depend on sse4.1 instructions. In fact, it would work on any processor that supports 64-bit operations. It is possible to gain a 2% speedup by using `pdep` and `pext` to transform the values into 64 bits and back, but that is not enough to beat the sse4.1 version, and any processor with those instructions also supports sse4.1 (a much earlier instruction set) anyway, so there is little point in doing that.
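For the curious, the `pdep`/`pext` version of the byte spreading would look roughly like this (a sketch of the idea, not the benchmark's code; requires the BMI2 extension):

```rust
#[target_feature(enable = "bmi2")]
unsafe fn as_color16_pdep(color: u32) -> u64 {
    // deposit each color byte into the low byte of a 16-bit lane
    std::arch::x86_64::_pdep_u64(color as u64, 0x00FF00FF00FF00FF)
}

#[target_feature(enable = "bmi2")]
unsafe fn as_color8_pext(color16: u64) -> u32 {
    // gather the low byte of each 16-bit lane back into 4 bytes
    std::arch::x86_64::_pext_u64(color16, 0x00FF00FF00FF00FF) as u32
}
```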
Instead, this new method shines when we implement it using SIMD, because now we can operate on twice as many elements at once:
```rust
/// Blends two RGBA pixels (8 channel bytes) at once.
unsafe fn blend_swar_sse41(dst: &mut [u32], color: &[u32], alpha: &[u8]) {
use std::arch::x86_64 as intr;
let dst = dst.as_mut_ptr().cast::<u64>();
let shuffle_mask = intr::_mm_set_epi8(
-128, 1, -128, 1, -128, 1, -128, 1,
-128, 0, -128, 0, -128, 0, -128, 0,
);
let simd_alpha =
    intr::_mm_set1_epi16(alpha.as_ptr().cast::<i16>().read_unaligned());
let simd_alpha = intr::_mm_shuffle_epi8(simd_alpha, shuffle_mask);
let ones = intr::_mm_set1_epi16(0x00FF);
let simd_alpha_compl = intr::_mm_xor_si128(ones, simd_alpha);
let simd_color = intr::_mm_loadl_epi64(color.as_ptr().cast());
let simd_color = intr::_mm_cvtepu8_epi16(simd_color);
let simd_dst = intr::_mm_loadl_epi64(dst.cast());
let simd_dst = intr::_mm_cvtepu8_epi16(simd_dst);
let e1 = intr::_mm_set1_epi16(0x0080);
let e2 = intr::_mm_set1_epi16(0x0101);
let color_alpha = intr::_mm_mullo_epi16(simd_color, simd_alpha);
let dst_alpha = intr::_mm_mullo_epi16(simd_dst, simd_alpha_compl);
let sum = intr::_mm_adds_epu16(color_alpha, dst_alpha);
let sum = intr::_mm_adds_epu16(sum, e1);
// This mulhi is equivalent to the ((x >> 8) + x) >> 8 operation.
// (can you see why?)
let res = intr::_mm_mulhi_epu16(sum, e2);
let final_i = intr::_mm_packus_epi16(res, res);
intr::_mm_storel_epi64(dst.cast(), final_i);
}
```
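To answer the comment's question: multiplying by `0x0101` computes `(x << 8) + x` in one go, and `_mm_mulhi_epu16` keeps the high 16 bits of the 32-bit product, so the whole thing evaluates `((x << 8) + x) >> 16`, which equals `((x >> 8) + x) >> 8` for every 16-bit input. A quick scalar check (my own):

```rust
for x in 0..=u16::MAX as u32 {
    let mulhi = (x * 0x0101) >> 16;   // what _mm_mulhi_epu16 computes
    let shifts = ((x >> 8) + x) >> 8; // the scalar SWAR formulation
    assert_eq!(mulhi, shifts);
}
```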
By using `_mm_cvtepu8_epi16`, we can do the `RGBA` → `0R0G0B0A` transformation without all the bit operations we had to use in the scalar version. The reverse transform is also a simple matter of calling `_mm_packus_epi16` at the end. As a result, despite only doubling the number of processed bytes, we end up with almost a 3x speedup:
| Method | Throughput |
| --- | --- |
| Float Scalar | 0.58802708 GB/s |
| Float SIMD sse4.1 | 3.29492197 GB/s |
| Float SIMD avx2 | 5.99929942 GB/s |
| SWAR | 2.59967324 GB/s |
| SWAR SIMD sse4.1 | 6.97543461 GB/s |
Extending this to work with avx2 and avx512 involves changing the shuffle mask and the final conversion code. I will spare you the details this time (the full code is available here) and just give you the final results:
| Method | Throughput |
| --- | --- |
| Float Scalar | 0.58802708 GB/s |
| Float SIMD sse4.1 | 3.29492197 GB/s |
| Float SIMD avx2 | 5.99929942 GB/s |
| SWAR | 2.59967324 GB/s |
| SWAR SIMD sse4.1 | 6.97543461 GB/s |
| SWAR SIMD avx2 | 13.70369486 GB/s |
| SWAR SIMD avx512 | 19.46538487 GB/s |
The scaling from sse4.1 to avx2 is nearly perfect, but avx512 falls short of another 2x. The reason could be the extra work we have to do at the very end (we have to permute the bytes because they are spread across multiple lanes), or the fact that zen4's avx512 implementation isn't really the greatest. Still, a 1.5x improvement isn't too bad!
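For the curious, here is roughly what the avx2 SWAR version might look like, blending 4 pixels at once (my own reconstruction following the same approach; the repository has the real code):

```rust
unsafe fn blend_swar_avx2(dst: &mut [u32], color: &[u32], alpha: &[u8]) {
    use std::arch::x86_64 as intr;
    // spread one alpha byte per 16-bit word within each 128-bit lane:
    // low lane gets a0 a0 a0 a0 a1 a1 a1 a1, high lane a2 a2 a2 a2 a3 a3 a3 a3
    let shuffle_mask = intr::_mm256_set_epi8(
        -128, 3, -128, 3, -128, 3, -128, 3,
        -128, 2, -128, 2, -128, 2, -128, 2,
        -128, 1, -128, 1, -128, 1, -128, 1,
        -128, 0, -128, 0, -128, 0, -128, 0,
    );
    let a = alpha.as_ptr().cast::<i32>().read_unaligned();
    let simd_alpha = intr::_mm256_set1_epi32(a);
    let simd_alpha = intr::_mm256_shuffle_epi8(simd_alpha, shuffle_mask);
    let ones = intr::_mm256_set1_epi16(0x00FF);
    let simd_alpha_compl = intr::_mm256_xor_si256(ones, simd_alpha);
    // load 4 RGBA colors (16 bytes) and widen each byte to a 16-bit word
    let simd_color = intr::_mm_loadu_si128(color.as_ptr().cast());
    let simd_color = intr::_mm256_cvtepu8_epi16(simd_color);
    let simd_dst = intr::_mm_loadu_si128(dst.as_ptr().cast());
    let simd_dst = intr::_mm256_cvtepu8_epi16(simd_dst);
    // same arithmetic as the sse4.1 version, on 16 words instead of 8
    let e1 = intr::_mm256_set1_epi16(0x0080);
    let e2 = intr::_mm256_set1_epi16(0x0101);
    let color_alpha = intr::_mm256_mullo_epi16(simd_color, simd_alpha);
    let dst_alpha = intr::_mm256_mullo_epi16(simd_dst, simd_alpha_compl);
    let sum = intr::_mm256_adds_epu16(color_alpha, dst_alpha);
    let sum = intr::_mm256_adds_epu16(sum, e1);
    let res = intr::_mm256_mulhi_epu16(sum, e2);
    // packus works within 128-bit lanes, so the 16 result bytes end up
    // in qwords 0 and 2; permute them into the low half before storing
    let packed = intr::_mm256_packus_epi16(res, res);
    let packed = intr::_mm256_permute4x64_epi64::<0b00_00_10_00>(packed);
    let lo = intr::_mm256_castsi256_si128(packed);
    intr::_mm_storeu_si128(dst.as_mut_ptr().cast(), lo);
}
```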
## Can we do better?
There may be better ways of writing this SIMD code, though I believe it is unlikely we will see any major improvements. Rather, the best way forward would be to find a way of doing alpha blending using just the 8 bits. Then, we could once again potentially double our throughput. There is no `mullo` (or any multiply, for that matter) for 8-bit lanes among the x86_64 intrinsics. There is, however, an average instruction that rounds up, which could be useful. So we would have to implement alpha blending with roughly just the following operations (the corresponding intrinsics are noted after the list):
- right and left shifts
- XOR, OR and AND bit operations
- addition and subtraction mod 256
- saturating addition and subtraction
- max and min
- an average instruction that rounds up
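For reference, these map to real sse2 intrinsics (my mapping, not from the post): the bit operations are `_mm_and_si128`, `_mm_or_si128` and `_mm_xor_si128`; addition and subtraction mod 256 are `_mm_add_epi8` and `_mm_sub_epi8`; the saturating variants are `_mm_adds_epu8` and `_mm_subs_epu8`; max and min are `_mm_max_epu8` and `_mm_min_epu8`; and the rounding-up average is `_mm_avg_epu8`, which computes `(a + b + 1) >> 1`. One caveat: x86_64 has no native 8-bit shifts, so those have to be emulated with 16-bit shifts plus masking.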
There are of course far more instructions in x86_64 that do pretty complicated operations, but these are the ones I believe would be most useful. Regrettably, try as I might, I was not able to come up with a function that works using just these operations. If anyone knows of such a method, please contact me!
Full benchmark code available here.