This module exposes vendor-specific intrinsics that typically correspond to a single machine instruction. The arch module is intended to be a low-level implementation detail for higher-level APIs.
In order to call these APIs safely, there are a number of mechanisms available to ensure that the correct CPU feature is available when an intrinsic is called. The first option available to us is to conditionally compile code via the `#[cfg]` attribute.
Here we're using `#[cfg(target_feature = "avx2")]` to conditionally compile this function into our module. The unsafe block can be justified because `#[cfg(target_feature = "avx2")]` ensures the code is only compiled in situations where the safety guarantees are upheld.
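A minimal sketch of this compile-time pattern (the function name and the operation are illustrative, not from the original text): the intrinsic-using version only exists in builds where AVX2 is statically enabled, and a scalar fallback covers every other build.

```rust
// Only compiled when the whole crate is built with AVX2 enabled,
// e.g. `RUSTFLAGS="-C target-feature=+avx2" cargo build`.
#[cfg(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "avx2"
))]
fn sum_quad(x: f64) -> f64 {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    // The unsafe block is justified: `#[cfg(target_feature = "avx2")]`
    // guarantees this body only exists on AVX2-capable builds.
    unsafe {
        let v = _mm256_set1_pd(x);          // [x, x, x, x]
        let doubled = _mm256_add_pd(v, v);  // [2x, 2x, 2x, 2x]
        let mut out = [0.0f64; 4];
        _mm256_storeu_pd(out.as_mut_ptr(), doubled);
        out.iter().sum()                    // 8x
    }
}

// Scalar fallback for builds without AVX2.
#[cfg(not(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "avx2"
)))]
fn sum_quad(x: f64) -> f64 {
    4.0 * (x + x)
}
```

Both versions compute the same result, so the rest of the program never needs to know which one was compiled in.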
Instead, you might want to build a portable binary that runs across a variety of CPUs but selects the most optimized implementation available at runtime. This allows you to build a "least common denominator" binary in which certain sections are more optimized for different CPUs.
Taking our previous example, we're going to compile our binary without AVX2 support, but we'd like to enable it for just one function. The standard library provides the `is_x86_feature_detected!` macro, which performs the necessary runtime detection to determine whether the CPU the program is running on supports the specified feature.
In this case the macro will expand to a boolean expression evaluating whether the local CPU has the AVX2 feature or not. To ensure we don't hit this error on other architectures, a statement-level `#[cfg]` is used so the macro is only compiled on x86 / x86_64.
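The runtime-dispatch pattern described above can be sketched like this (function names are ours, not from the original text): the binary is compiled without AVX2, one function opts in via `#[target_feature]`, and it is only called after the runtime check succeeds.

```rust
fn sum_of_products(a: &[f64], b: &[f64]) -> f64 {
    // Statement-level #[cfg] so the macro only exists on x86/x86_64.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: we just verified AVX2 support at runtime.
            return unsafe { sum_of_products_avx2(a, b) };
        }
    }
    // Portable fallback for every other CPU.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_of_products_avx2(a: &[f64], b: &[f64]) -> f64 {
    // With AVX2 enabled for this function only, LLVM is free to
    // auto-vectorize this loop with 256-bit instructions even though
    // the rest of the binary is compiled without AVX2.
    let mut total = 0.0;
    for (x, y) in a.iter().zip(b.iter()) {
        total += x * y;
    }
    total
}
```

The caller stays safe Rust; only the dispatch site needs an `unsafe` block, justified by the runtime check immediately before it.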
The primary purpose of this module is to enable stable crates on crates.io to build up much more ergonomic abstractions which end up using SIMD under the hood. Over time these abstractions may also move into the standard library itself, but for now this module is tasked with providing the bare minimum necessary to use vendor intrinsics on stable Rust.
First, let's take a look at not using any intrinsics at all, instead relying on LLVM's auto-vectorization to produce optimized, vectorized code for AVX2 and also for the default platform. If you read the announcement, you will have seen that SIMD should bring performance enhancements to our applications if we learn how to use it properly.
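As a sketch of what auto-vectorization works on (the function is illustrative): a plain scalar loop with simple bounds is exactly the shape LLVM can vectorize. Built with default flags it targets SSE2; built with `-C target-feature=+avx2` or `-C target-cpu=native`, the same source can compile to AVX2 instructions.

```rust
// A plain scalar loop that LLVM's auto-vectorizer can turn into
// SIMD code. No intrinsics involved; the codegen flags decide
// which instruction set is used.
fn mul_add(a: &[f64], b: &[f64], c: &[f64], out: &mut [f64]) {
    // A single explicit bound helps the vectorizer elide per-index
    // bounds checks inside the loop.
    let n = a.len().min(b.len()).min(c.len()).min(out.len());
    for i in 0..n {
        out[i] = a[i] * b[i] + c[i];
    }
}
```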
If you feel fairly comfortable with Rust but are still having issues following this text, you might want to read my book about improving the performance of your Rust applications. Now, we usually don't have only one multiplication in our code; most of the time we perform these operations in loops, so it would be nice to be able to run them in parallel, handling 2, 4, 8 or even more of them with a single instruction.
Different kinds of SIMD instructions allow us to do that for our various operations. This algorithm has 6 variants, but for our example here, we will only take the main variant into account.
At least if we don't need to calculate the position of the planet many times per second, which could be a real use case in a simulation, for example. First, we should know that the VSOP87 algorithm provides some huge data sets of constants that are used in the calculation of those variables.
For each variable (a0, a1, a2, a3, a4, λ0, λ1, λ2, …) we have one bi-dimensional matrix or array for each planet. Each variable is computed as a sum of cosine terms over the rows of its matrix:

v = a1 * (b1 + c1 * t).cos() + a2 * (b2 + c2 * t).cos() + … + aN * (bN + cN * t).cos()

where v is one of a0, a1, …, λ0, λ1, … and N is the number of rows in the matrix / array. This formula might look a bit complex, but let's see what it's doing.
For the 3 elements in each matrix / array row (we call them v(n,a), v(n,b) and v(n,c), or simply a, b and c in the code), we calculate a * (b + c * t).cos() (note that this is Rust notation), and then we sum all of them. And this is where what we saw before comes in handy: this function can be optimized with SIMD, since we are performing multiple operations that could be done in parallel.
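As a scalar sketch before any SIMD is involved (function names are ours, not from the original crate), one term and one full row sum look like this:

```rust
// One VSOP87-style term: a * cos(b + c * t).
fn term(a: f64, b: f64, c: f64, t: f64) -> f64 {
    a * (b + c * t).cos()
}

// A variable is the sum of this term over every (a, b, c) row.
fn variable(rows: &[(f64, f64, f64)], t: f64) -> f64 {
    rows.iter().map(|&(a, b, c)| term(a, b, c, t)).sum()
}
```

This row-by-row sum is exactly the loop that the SIMD version later evaluates four rows at a time.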
SIMD is the general name given to the various parallel-computing instruction sets implemented by different CPUs. In the case of Intel, we have the SSE and AVX implementations, each of them with different versions (SSE, SSE2, SSE3, SSE4, AVX, AVX2 and AVX-512); ARM has NEON instructions, and so on.
Rust enables SSE and SSE2 optimizations for x86 and x86_64 targets by default. In any case, these optimizations are done automatically by the compiler, and they are not always as good as what we as programmers can achieve by hand.
With Rust 1.27, we can use SSE3, SSE4, AVX and AVX2 manually in the stable channel. AVX functions start with _mm256_, followed by the name of the operation (add, mul or abs, for example) and then the type they operate on (_pd for doubles or 64-bit floats, _ps for 32-bit floats, _epi32 for 32-bit integers and so on).
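A small sketch of the naming convention in action (the function is illustrative; the runtime check makes it safe on any x86_64 CPU):

```rust
// _mm256_: 256-bit AVX register; add: the operation; _pd: packed doubles.
#[cfg(target_arch = "x86_64")]
fn demo() -> [f64; 4] {
    use std::arch::x86_64::*;
    if !is_x86_feature_detected!("avx") {
        return [3.0; 4]; // scalar fallback: 1.0 + 2.0 per lane
    }
    // Safe: AVX support was just verified at runtime.
    unsafe {
        let x = _mm256_set1_pd(1.0);
        let y = _mm256_set1_pd(2.0);
        let sum = _mm256_add_pd(x, y); // four additions in one instruction
        let mut out = [0.0f64; 4];
        _mm256_storeu_pd(out.as_mut_ptr(), sum);
        out
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn demo() -> [f64; 4] {
    [3.0; 4]
}
```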
This will iterate through the variable array, called var, and for each row add the result of a * (b + c * t).cos(), just what we need. The is_x86_feature_detected!() macro will check at runtime whether the current CPU has the AVX instruction set.
This function, as you can see, receives 4 tuples (a, b, c) and the t variable. It will return the 4 intermediate terms after computing a * (b + c * t).cos() for each of the tuples.
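A sketch of such a function, assuming the shape described above (the exact body may differ from the original crate): the four b + c * t sums are done with one AVX multiply and one AVX add, and the cosine is then computed lane by lane, since std::arch has no _mm256_cos_pd.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn vector_term(
    (a1, b1, c1): (f64, f64, f64),
    (a2, b2, c2): (f64, f64, f64),
    (a3, b3, c3): (f64, f64, f64),
    (a4, b4, c4): (f64, f64, f64),
    t: f64,
) -> (f64, f64, f64, f64) {
    use std::arch::x86_64::*;

    // Note: _mm256_set_pd places its FIRST argument in the HIGHEST lane.
    let b = _mm256_set_pd(b4, b3, b2, b1);
    let c = _mm256_set_pd(c4, c3, c2, c1);
    let tv = _mm256_set1_pd(t);

    // bct = b + c * t, four terms at once.
    let bct = _mm256_add_pd(b, _mm256_mul_pd(c, tv));

    // No _mm256_cos_pd in std::arch: unpack and use scalar cosine.
    let mut buf = [0.0f64; 4];
    _mm256_storeu_pd(buf.as_mut_ptr(), bct); // lowest lane lands at index 0

    (
        a1 * buf[0].cos(),
        a2 * buf[1].cos(),
        a3 * buf[2].cos(),
        a4 * buf[3].cos(),
    )
}
```

The function is unsafe because `#[target_feature(enable = "avx")]` makes it valid only on AVX-capable CPUs; callers must check `is_x86_feature_detected!("avx")` first.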
Then, we need to compute the cosine of the 4 results, but Rust does not provide the Intel _mm256_cos_pd() intrinsic yet. Here, we have to take something into account: x86/x86_64 is a little-endian architecture, which means that the least significant byte is stored at the lowest address.
This means that when we unpack the bct vector, the first element will be b4 + c4 * t, instead of b1 + c1 * t. The way to do it is to use the chunks() iterator on the array, so that we can get 4 rows at a time.
We are asking the Rust compiler to enable the AVX feature for this particular function. The first and obvious pattern is having a chunk of 4 tuples that we can directly use in the vector_term() function we defined earlier.
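The chunking pattern can be sketched as follows. To keep this self-contained and runnable on any CPU, a scalar expression stands in where the four tuples of a full chunk would be handed to the AVX vector_term() function (the function name calculate_var is ours):

```rust
fn calculate_var(rows: &[(f64, f64, f64)], t: f64) -> f64 {
    let mut total = 0.0;
    // chunks(4) yields full 4-row chunks plus one final shorter chunk.
    for chunk in rows.chunks(4) {
        if let &[r1, r2, r3, r4] = chunk {
            // Full chunk: these four tuples are what the AVX path
            // (vector_term) would receive. Scalar stand-in here:
            for &(a, b, c) in &[r1, r2, r3, r4] {
                total += a * (b + c * t).cos();
            }
        } else {
            // Final chunk of 1-3 rows: plain scalar fallback.
            for &(a, b, c) in chunk {
                total += a * (b + c * t).cos();
            }
        }
    }
    total
}
```

The if let on a 4-element slice pattern is the "first and obvious pattern": a full chunk goes to the SIMD path, and only the remainder (at most 3 rows) is computed one term at a time.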
Note that this final sum could itself be SIMD-optimized by taking elements 8 by 8, adding them 4 by 4, then 2 by 2 and so on, but that is out of the scope of this explanation. AVX-512 should clearly improve this benchmark, and being able to compute the cosine in AVX should also help.