All we have to do is track whether each buffer ends on a space or a non-space character and keep a running tally of bytes, words and lines. We’re still iterating over everything twice, but at this point, we’re pretty much toe-to-toe with the C wc utility in terms of both performance and memory usage.
Now that we’ve reached parity, the question becomes: how can we achieve even better performance than wc? However, I realized the approach used by Chris in Haskell would translate very nicely to the functional programming world of Rust, and I quickly set about creating a “Flux” type as Chris did.
When operating on two Option
A functional approach to computing the span over an entire byte string by spanning its constituents. Obviously, we don’t expect this to be any faster than before as it doesn’t change our level of parallelism. However, by re-expressing the problem in a way that lends itself to arbitrary levels of parallelism, we can take advantage of the easiest performance boost you’re likely ever to see: Rayon.
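To make the idea of "spanning constituents" concrete, here is a minimal sketch using only the standard library. The names `CharType`, `flux_of`, `span`, `count_words` and `count_words_chunked` are my own illustrative choices and not necessarily what the original article used, and the chunked version only simulates the split that a parallel runtime would perform. The point is that `span` is associative, so chunks can be summarized independently and combined in any grouping.

```rust
/// Kind of byte found at one end of a chunk.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CharType {
    IsSpace,
    NotSpace,
}

/// Summary of a non-empty chunk: kind of its first and last byte,
/// plus the number of words counted inside it.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Flux {
    left: CharType,
    words: usize,
    right: CharType,
}

/// Summary of a single byte (hypothetical helper).
fn flux_of(byte: u8) -> Flux {
    if byte.is_ascii_whitespace() {
        Flux { left: CharType::IsSpace, words: 0, right: CharType::IsSpace }
    } else {
        Flux { left: CharType::NotSpace, words: 1, right: CharType::NotSpace }
    }
}

/// Combines two adjacent summaries; `None` stands for an empty chunk.
fn span(a: Option<Flux>, b: Option<Flux>) -> Option<Flux> {
    match (a, b) {
        (None, x) => x,
        (x, None) => x,
        (Some(l), Some(r)) => {
            // A word that straddles the boundary was counted on both sides.
            let straddles = l.right == CharType::NotSpace && r.left == CharType::NotSpace;
            Some(Flux {
                left: l.left,
                words: l.words + r.words - if straddles { 1 } else { 0 },
                right: r.right,
            })
        }
    }
}

/// Sequential fold over every byte.
fn count_words(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .map(|&b| Some(flux_of(b)))
        .fold(None, span)
        .map_or(0, |f| f.words)
}

/// Summarizes fixed-size chunks independently, then combines the summaries.
/// `chunk_size` must be non-zero.
fn count_words_chunked(bytes: &[u8], chunk_size: usize) -> usize {
    bytes
        .chunks(chunk_size)
        .map(|chunk| chunk.iter().map(|&b| Some(flux_of(b))).fold(None, span))
        .fold(None, span)
        .map_or(0, |f| f.words)
}
```

Because the per-chunk summaries do not depend on each other, the outer `map` is the part a work-stealing library could run on several cores.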
With just a simple three-line change to the function above, we can leave it to Rayon to automatically spread the work performed here across all available CPU cores. For me, these types of explorations are valuable because they force me to think outside the box.
Even if we don’t have time to go out and learn every programming language, exploring some of the most “different” ones out there, understanding their approaches and the value they each bring to the table, and incorporating the good bits into our work is how we continue to grow as engineers. We strive to enable teams to work at the speed of thought, and our multiplayer syncing engine is a critical part of this vision.
Rust is similar to C++ in performance and low-level ability but has a type system which automatically prevents whole classes of nasty bugs that are common in C++ programs. We chose Rust for this rewrite because it combines best-in-class speed with low resource usage while still offering the safety of standard server languages.
Low resource usage was particularly important to us because some performance issues with the old server were caused by the garbage collector. We think this is an interesting case study of using Rust in production and want to share the issues we encountered and the benefits we achieved in the hope that it will be useful to others considering a similar rewrite.
That means each worker is responsible for some fraction of currently open Figma documents. The main problem with the old server was the unpredictable latency spikes during syncing.
The server was written in TypeScript and, being single-threaded, couldn’t process operations in parallel. That meant a single slow operation would lock up the entire worker until it was complete.
This kept the service up but meant we had to continually look out for crazy documents and move them over to the heavy worker pool by hand. It bought us enough time to solve these problems for real, which we did by moving the performance-sensitive parts of the multiplayer server into a separate child process.
And serialization time is now over 10x faster so the service is now acceptably fast even in the worst case. While Rust helped us write a high performance server, it turns out the language wasn’t as ready as we thought.
As a result, we dropped our initial plan to rewrite our whole server in Rust and chose to focus solely on the performance-sensitive part instead. Rust combines fine-grained control over memory layout with the lack of a GC and has a very minimal standard library.
Rust’s slices make passing raw pointers around easy, ergonomic, and safe, and we used that a lot to avoid copying data during parsing. Rust comes with cargo built-in, which is a build tool, package manager, test runner, and documentation generator.
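As an illustration of how slices avoid copies during parsing, here is a small sketch. The functions `split_field` and `parse_fields` and the `;`-delimited input are hypothetical; this is not the server's actual parser. Every returned field is a view into the original buffer, so no bytes are duplicated.

```rust
/// Splits `input` at the first `delim`, returning two borrowed sub-slices
/// of the same buffer. If `delim` is absent, the second slice is empty.
fn split_field(input: &[u8], delim: u8) -> (&[u8], &[u8]) {
    match input.iter().position(|&b| b == delim) {
        Some(i) => (&input[..i], &input[i + 1..]),
        None => (input, &input[input.len()..]),
    }
}

/// Collects every field as a slice pointing into `input`.
fn parse_fields(mut input: &[u8], delim: u8) -> Vec<&[u8]> {
    let mut fields = Vec::new();
    while !input.is_empty() {
        let (field, rest) = split_field(input, delim);
        fields.push(field);
        input = rest;
    }
    fields
}
```

The borrow checker ties each field's lifetime to the input buffer, so the buffer cannot be freed or mutated while the fields are still in use.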
Rust is more complex than other languages because it has an additional piece, the borrow checker, with its own unique rules that need to be learned. People have put a lot of effort into making the error messages readable and it really shows.
This guarantees safety but is overly restrictive since the variable may not be needed any more by the time the mutation happens. Even as someone who has been following Rust from the start, who writes compilers for fun, and who knows how to think like the borrow checker, it’s still frustrating to have to pause your work to solve the little unnecessary borrow checker puzzles that can come up regularly as you work.
Then a pointer will no longer prevent the mutation of the thing it points to for the rest of the scope, which will eliminate many borrow checker false-positives. Error-handling in Rust is intended to be done by returning a value called “Result” that can represent either success or failure.
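For readers who have not seen it, this is roughly what returning a `Result` looks like. The `ParseError` type and the `parse_port`, `parse_pair` and `port_or_default` functions are made up for illustration; only `Result`, `map_err` and the `?` operator come from the language and standard library.

```rust
/// Hypothetical error type for the example.
#[derive(Debug, PartialEq)]
enum ParseError {
    Empty,
    NotANumber(String),
}

/// Returns `Ok(port)` on success or `Err(reason)` on failure.
fn parse_port(text: &str) -> Result<u16, ParseError> {
    let trimmed = text.trim();
    if trimmed.is_empty() {
        return Err(ParseError::Empty);
    }
    trimmed
        .parse::<u16>()
        .map_err(|_| ParseError::NotANumber(trimmed.to_string()))
}

/// `?` returns early with the error, so failures propagate to the caller.
fn parse_pair(a: &str, b: &str) -> Result<(u16, u16), ParseError> {
    Ok((parse_port(a)?, parse_port(b)?))
}

/// The caller decides what a failure means; here it falls back to a default.
fn port_or_default(text: &str) -> u16 {
    match parse_port(text) {
        Ok(port) => port,
        Err(_) => 8080,
    }
}
```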
We tried using two separate Rust compression libraries that were both used by Servo, Mozilla’s next-generation browser prototype, but both had subtle correctness issues that would have resulted in data loss. Our multiplayer server talks over WebSockets and makes HTTP requests every so often.
Our multiplayer server is a small amount of performance-critical code with minimal dependencies, so rewriting it in Rust even with the issues that came up was a good trade off for us. This post explores exactly this question, detailing first the steps needed for generating PGOed versions of rustc (in two flavors), and then taking a look at the resulting performance implications.
This article will explore the thread-per-core model with its advantages and challenges, and introduce Scipio (you can also find it on crates.io), our solution to this problem. Scipio allows Rust developers to write thread-per-core applications in an easy and manageable way.
Back in spring 2020 at Gout, we were looking to replace our Spring-Tomcat duo with a more lightweight framework to power our future Kotlin microservices. We did some detailed (at times philosophical) theoretical comparisons that I much enjoyed, but these cannot substitute for hands-on experience.
We decided to implement proof-of-concept microservices using the most viable frameworks, stressing them in a benchmark along the way. While Kotlin was the main language, I saw this as an opportunity to have some fun at home and test (my proficiency with) Rust, which is touted for being fast.
In the previous post, I showed how processing file data in parallel can either boost or hurt performance depending on the workload and device capabilities. This is typically solved by scheduling tasks of different types on dedicated thread pools.
Others are around doubt about whether intermediate layers are inflating or shifting numbers in unfair ways. When I find that I can't easily wedge in a profile, I get a bit sad and then turn to crude solutions.
Your average battle-tested firmware developer has accrued a healthy distrust of the abstract, probably born of watching shiny Platonic constructs crash and burn with painfully real and concrete error traces. It is sobering, having to chase a hard fault on a tiny MCU across enough tables and template code to make Herb Sutter puke angle brackets.
No wonder modern approaches are met with some resistance unless the Loads and the Stores are in clear view. I felt this way too when someone suggested to me, back in 2014, that an up-and-coming language called Rust showed promise in the embedded field.
Even though I had been playing with it already, my profoundly ingrained bit-twiddling instincts told me not to trust a language that supported functional programming, or one that dared to have an opinion on how I managed my memory. That's a good amount of time to become familiar with some troubles of the language, the tooling, and the ecosystem.
Our stack is made up of services written in Node.js, Ruby, Elixir, and a handful of others in addition to all the languages our agent library supports. This post goes into some details that caused the need to change languages, as well as some decisions we made along the way.
This guide covers timeless ideas that are helpful to keep in mind while working with systems where performance matters. Many of these ideas are fairly “durable” and will apply regardless of what hardware, programming language, operating system, or decade you are working in.
You can add layers of abstraction, e.g. wrapping a primitive value in a struct and providing a specialized API for it, without adding run-time overhead. Most Rust programmers have heard of Rayon, a crate that makes it almost magically easy to introduce parallelism to a program.
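A quick sketch of what "wrapping a primitive value in a struct" can look like. The `Meters` type and its methods are invented for this example. With `#[repr(transparent)]` the wrapper is guaranteed to have the same layout as the `f64` it contains, and the small methods are normally inlined in optimized builds.

```rust
use std::mem::size_of;

/// A distance in meters; a hypothetical newtype over `f64`.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
struct Meters(f64);

impl Meters {
    fn new(value: f64) -> Meters {
        Meters(value)
    }

    /// Only distances can be added to distances.
    fn add(self, other: Meters) -> Meters {
        Meters(self.0 + other.0)
    }

    fn as_f64(self) -> f64 {
        self.0
    }
}

/// The wrapper occupies exactly as many bytes as the primitive it wraps.
fn wrapper_is_free() -> bool {
    size_of::<Meters>() == size_of::<f64>()
}
```

The specialized API lives entirely in the type system: passing a bare `f64` where `Meters` is expected is a compile error, yet nothing extra is stored at run time.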
In this article we’ll examine how to apply Rayon to basic stream processing. Each Tuesday I have been looking at the performance results of all the PRs merged in the past week.
The goal of this is to ensure that regressions are caught quickly and appropriate action is taken, and to raise awareness of performance issues in general. Data-oriented design is an approach to optimizing programs by carefully considering the memory layout of data structures, and their implications for auto-vectorisation and use of the CPU cache.
I highly recommend watching Mike Acton’s “Data-Oriented Design and C++” talk if you haven’t seen it already. As part of an Undergraduate Research Assistant Scheme in my first year of university I was tasked with parallelizing a piece of shallow water simulation software written in FORTRAN by Dr David Ditches of the Vortex Dynamics Research Group, under supervision of Dr. Alexander Inovalon, at the University of St Andrews.
There were secondary goals such as improving the testing infrastructure, setting up CI/CD, estimating progress and allowing the computation to be paused and resumed. Forewarning: I have essentially zero domain knowledge in this project (and fluid dynamics simulation isn’t exactly the kind of topic you can catch up to research level on over a weekend) so I approached this project from a pure software engineering perspective.
The async/await keywords in modern Rust make building high-throughput daemons pretty straightforward, but as I learned that doesn’t necessarily mean “easy.” Last month on the Scribd tech blog I wrote about a daemon named hotdog which we deployed into production: Ingesting production logs with Rust. In this post, I would like to write about some technical challenges I encountered getting the performance tuned for this async-std based Rust application.
We’ll take a look at using compiler intrinsics to do it in log(n) time. In the previous article on auto-vectorization we looked at the different SIMD instruction set families on x86-64.
We saw how the target-feature compiler flag and #[target_feature] attribute gave us more control over the instructions used in the generated assembly. There is a related compiler flag target-cpu we didn’t touch on, so it’s worth taking a look at how it affects the generated code.
Needless to say, it turned out a bit different from originally planned, but I’ve something I’d like to share. It’s an early release and will need some more work, but it can be experimented with (it has documentation and probably won’t eat any kittens when used).
If I don't have the money for a faster laptop, maybe I could build a really fast Rust development server on AWS? This is a handy pattern to use for efficiently escaping text, and it's also a good demonstration of Rust's Cow type.
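Here is a minimal sketch of that pattern. The `escape_html` function and the three characters it escapes are my own example, not necessarily what the linked article does. The function returns `Cow::Borrowed` when nothing needs escaping, so the common case allocates nothing, and only builds a new `String` when it has to.

```rust
use std::borrow::Cow;

/// Escapes `<`, `>` and `&`. Borrows the input when it is already clean.
fn escape_html<'a>(input: &'a str) -> Cow<'a, str> {
    // Fast path: no special characters, so hand back the original slice.
    let first = match input.find(|c: char| c == '<' || c == '>' || c == '&') {
        None => return Cow::Borrowed(input),
        Some(index) => index,
    };

    // Slow path: copy the clean prefix, then escape the remainder.
    let mut out = String::with_capacity(input.len() + 8);
    out.push_str(&input[..first]);
    for c in input[first..].chars() {
        match c {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            other => out.push(other),
        }
    }
    Cow::Owned(out)
}
```

Callers can treat the result as a `&str` either way, since `Cow<str>` dereferences to `str`.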
Rust comes with several new idioms and structures in the language I am not used to, and being a performance enthusiast, I always get interested in what such constructs translate to. If our goal is to get the best performance we need to take advantage of all the possible SIMD instructions on our hardware.
* Look at the compiler output when targeting the different SIMD instruction set families. Along the way we're going to learn about optimization using Single Instruction Multiple Data CPU instructions, how to quickly check the assembler output of the compiler, and simple changes we can make to our Rust code to produce faster programs.
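As one example of the kind of "simple change" that can help the compiler here, the sketch below shows two ways of writing an element-wise addition so that per-iteration bounds checks are easier to eliminate. The function names are hypothetical, and whether the loop is actually vectorized depends on the compiler version, optimization level and target features, so treat this as something to verify in the assembly output rather than a guarantee.

```rust
/// Re-slices all three buffers to a common length up front, so the
/// compiler can see that every index inside the loop is in bounds.
fn add_indexed(dst: &mut [f32], a: &[f32], b: &[f32]) {
    let n = dst.len().min(a.len()).min(b.len());
    let (dst, a, b) = (&mut dst[..n], &a[..n], &b[..n]);
    for i in 0..n {
        dst[i] = a[i] + b[i];
    }
}

/// Same computation with iterators; `zip` stops at the shortest input,
/// so there is no indexing and no bounds check to remove.
fn add_zipped(dst: &mut [f32], a: &[f32], b: &[f32]) {
    for ((d, x), y) in dst.iter_mut().zip(a).zip(b) {
        *d = x + y;
    }
}
```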
When working on Rust applications or CLIs that need to show something to the end user as fast as possible I often find that a significant chunk of the time is usually spent not in doing any computations, but in dropping large data structures at the end of the function. There are various reasons why Rust binaries are generally bigger than ones produced with lower level languages such as C.
The main one is Cargo, Rust's package manager and build tool, producing static binaries by default. While larger binaries are generally not much of an issue for desktop or server applications, it may become more of a problem on embedded systems where storage and/or memory may be very limited.
GStreamer is used extensively at Collabora to help our clients to build embedded multimedia solutions. With Rust gaining traction among the GStreamer community as an alternative to C to write GStreamer applications and plugins, we began wondering if the size of such Rust plugins would be a problem for embedded systems, and what could be done to reduce sizes as much as possible.
This has resulted in some large performance wins (e.g. 10x-100x) and a significant improvement in code quality. The story involves Rust, Python, executable and debug info formats, Taskcluster, and many unexpected complications.
Fast iteration times are something that many game developers consider to be of utmost importance. Keeping build times short is a major component of quick iteration for a programmer.
Rust compile times are known to be a bit slow compared to many other languages, and I didn’t want to pour fuel on to that particular fire. As part of writing glam I also wrote mathbench, so I could compare performance with similar libraries.
I also always wanted to include build time comparisons as part of mathbench and I’ve finally got around to doing that with a new tool called buildbench. In this post I'd like to introduce the next major change that will be released in nalgebra at the end of this month (March 2020).
To give you an idea, SIMD AoSoA is actually what the recent ultraviolet crate has been using to achieve its amazing performance. Let's continue with the enormous SVG from the last time, a map extracted from OpenStreetMap.
According to Massif, peak memory consumption for that file occurs at the following point during the execution of rsvg-convert. In my last post I mentioned that pa'i was faster than Olin’s cwa binary written in Go without giving any benchmarks.
A lot of existing benchmark tools simply do not run in WebAssembly as is, not to mention inside the Olin ABI. To collect profiling statistics for Rust programs like TiKV, we developed pprof-rs, which samples, analyzes, and visualizes performance data in one step.
This makes it easier for developers and online users to find TiKV's performance bottlenecks. We got a bug with a gigantic SVG of a map extracted from OpenStreetMap, and it has about 600,000 elements.
To continue with last time's topic, let's see how to make librsvg's DOM nodes smaller in memory. And with it the whole concept of learning software development outside the education system (e.g., the good old courses, exercises and sitting in a class being taught by a teacher).
Please remember that the following suggestions do not replace actual profiling and optimizations! I also think it goes without saying that the only way to detect if any of these helps is having benchmarks that represent how your application behaves under real usage.
I recently ran some benchmarks on a Threadripper 3960X system and the results surprised me quite a bit. Prior to that I had read Daniel Lemire's notes on the suboptimal performance of simdjson on Zen 2, which is heavily used in the benchmark, but the suggested drop was a few percent, not half.
This is harder than it looks, since we can't just allow the parallel threads to update the auxiliary vectors. This blog post and git describes my search for a way to parallelize simulation of cars, intersections and roads in a computer game project.
I wrote down some details on the steps I take when optimizing the Rust compiler, using an improvement I just made to LEB128 reading/writing as an example. In this blog article, I discuss this specific feature, and give you an example of how Rust is using it to deliver optimized code of your abstracted projects.
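For context on the LEB128 example mentioned above, here is a straightforward sketch of unsigned LEB128 encoding and decoding. It is a plain reference version written for this digest, not the optimized code from the Rust compiler, and the function names are my own. Each byte carries seven bits of the value, and the high bit says whether more bytes follow.

```rust
/// Appends `value` to `out` in unsigned LEB128 form.
fn write_uleb128(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last group: high bit clear.
            out.push(byte);
            break;
        }
        // More groups follow: set the continuation bit.
        out.push(byte | 0x80);
    }
}

/// Decodes one value from the front of `bytes`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends early or has too many continuation bytes for a `u64`.
/// Note: excess high bits in a tenth byte are dropped, not rejected.
fn read_uleb128(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut result = 0u64;
    let mut shift = 0u32;
    for (i, &byte) in bytes.iter().enumerate() {
        if shift >= 64 {
            return None;
        }
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((result, i + 1));
        }
        shift += 7;
    }
    None
}
```

Small values, which dominate in practice, fit in a single byte, which is why the hot loop's branch structure matters so much for performance.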
This piqued my interest in looking into what kind of assembly code is being generated in the final binary. In a previous post, we saw how Rust WebAssembly can be integrated into a JavaFX project using the Asmble tool.
This post explains why it made sense for us to reimplement the service, how it was done, and the resulting performance improvements. I started looking for a performant Damerau-Levenshtein crate, and ended up writing my own collection of optimized edit distance algorithms, Eddie.
This post will focus on a couple of such benchmarks pertaining to blocking operations on otherwise asynchronous runtimes. Along the way I’ll give only sparse background on these projects I’ve been working on, but plenty of links if you are interested in reading further.
Luca Bruno asked me to give more context; this is the belated start of a set of blog posts about optimizing Rust code. In particular, I'm going to rely on compiler auto-vectorization to produce a program that is shorter, simpler, portable, and significantly faster ... and without any unsafe.
(at least on commodity desktop Linux with stock settings) In the comments, I was pointed to this interesting article, which made me realize that there’s another misconception, “For short critical sections, spin locks perform better”.
With the web service implementations in place, I evaluated them using an off-the-shelf benchmarking tool. We've just released cargo-bisect-rustc, a tool that makes it super easy to find exactly when the regression happened.
This article presents a comparison of HTTP client apps in Node.js and Rust and looks at different aspects of those projects such as CPU and memory metrics, build time and distribution package size. What you need to learn is the example on the README of PyO3 and rewriting the related function in Python to Rust.
As a matter of fact, I first used rust-cpython, but I ran into some issues on macOS, hence I switched to PyO3 and it works smoothly. One of the main challenges in writing a 64K intro is to squeeze all the code and assets into 64K of memory.
There are several tiny frameworks written in C++ that set up a Windows app with a modern OpenGL context. I could not find anything similar for Rust, so I decided to create the smallest possible bare-bones app that does just that.
Today the language of choice for Machine Learning is Python (unless your working environment has some unusual constraints). Hopefully, at the end of it, using Rust as a training backend and deployment platform will not look as crazy or confusing as it sounds.
Rav1e’s memory footprint makes it a good starting point for new use-cases like software encoding on ARM devices and real-time streaming, while libaom and SVT-AV1 are either too slow or resource-intensive. In the end, we are getting around 12-20% improvement in encoding time and FPS, which is the first step toward adoption of AV1 on mobile devices.
Our priority was proving the infrastructure for merging, testing and benching on ARM Devices feasible and now it's more realistic. With Moore’s law coming to an end, optimizing code to avoid performance pitfalls is becoming more and more useful.
When that is not sufficient, profilers like perf are useful to measure where the code is slow and therefore which algorithms and data structures should be optimized. Slip space drives are how the ships in the Halo universe travel so quickly to different sectors of the galaxy through something called Slipstream Space, so I thought it was cool for a name requiring awesome warp API speeds.
In this tutorial, we will implement a Rust program that attempts to utilize 100% of the theoretical capacity of three relatively modern, mid-range CPUs. We'll use an existing, highly efficient C++ implementation as a reference point to compare how our Rust program is doing.
Although it is perhaps less natural to think about, it is more efficient than incrementing the bump pointer and allocating from lower addresses up to higher ones. It’s something of a meme lately to see whether your programming language of choice can take on the venerable wc, and what that might look like.
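To see why bumping downward can be cheaper, compare the arithmetic in this sketch. It works on plain `usize` addresses rather than real pointers, and `bump_up` and `bump_down` are hypothetical names. Rounding up needs an add, a mask and two overflow checks, while rounding down needs a subtract and a mask with a single check.

```rust
/// Upward bump: round `cursor` up to `align`, then advance by `size`.
/// Returns (start of allocation, new cursor), or `None` if it does not fit.
/// `align` must be a power of two.
fn bump_up(cursor: usize, end: usize, size: usize, align: usize) -> Option<(usize, usize)> {
    debug_assert!(align.is_power_of_two());
    let start = cursor.checked_add(align - 1)? & !(align - 1);
    let new_cursor = start.checked_add(size)?;
    if new_cursor <= end {
        Some((start, new_cursor))
    } else {
        None
    }
}

/// Downward bump: subtract `size`, then round down to `align`.
/// The result is both the start of the allocation and the new cursor.
/// `align` must be a power of two.
fn bump_down(cursor: usize, region_start: usize, size: usize, align: usize) -> Option<usize> {
    debug_assert!(align.is_power_of_two());
    let new_cursor = cursor.checked_sub(size)? & !(align - 1);
    if new_cursor >= region_start {
        Some(new_cursor)
    } else {
        None
    }
}
```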
So I’ve decided to dust off a back burner project for a while, and figure out just where rust spends most of its time. I chose a floating point benchmark to implement in both languages in order to see the performance difference.
I give commentary on why the performance is that way, and some potential fixes Rust could implement to close the gap. We’ve been hard at work on the next major revision of Tokio, Rust’s asynchronous runtime.
Lately, there has been talk about improving build times, with a focus on reducing bloat, like regex breaking out logic into features that can be disabled, cargo-bloat going on a diet, and new cargo features to identify slow-to-build dependencies. The idea behind it is that you want to measure how a speed up of a certain function would impact the runtime as a whole, which can be very counterintuitive in today’s multi-threaded world.
This blog post might interest three types of readers: people interested in tantivy: You’ll learn how tantivy uses SIMD instructions to decode posting lists, and what happens on platforms where the relevant instruction set is not available. Lucene core devs (yeah it is a very select club) who might be interested in a possible (unconfirmed) optimization opportunity.
Version 0.3 provides a number of new features including preliminary support for plugging in custom measurements (e.g. Rust already has popular crates (tokio, actix) that provide asynchronous concurrency, but the async syntax coming to stable in 1.39 is much, much more approachable.
My experience has been that you can produce and reason about application flow much more easily, which has made me significantly more productive when dealing with highly concurrent systems. To kick the tires of this new syntax I dug into the nightly branch, and built a high-performance TCP client called clobber.
The issue to stabilize an initial version of async/await in Rust has left final comment period. One of the blockers mentioned in the RFC is the size of the state machines emitted by async fn.
I’ve spent the last few months tackling this problem, and wanted to give people a window into the process of writing these optimizations, with all the intricacies involved. Glam is a simple and fast Rust linear algebra library for games and graphics.
Mathbench is a set of unit tests and benchmarks comparing the performance of glam with the popular Rust linear algebra libraries cgmath and nalgebra. The following is a table of benchmarks produced by mathbench comparing glam performance to cgmath and nalgebra on f32 data.
I converted the fastest (dating early 2019) n-body C implementation (#4) to Rust (#7) in a one-to-one fashion, gaining a performance enhancement by a factor of 1.6, to my own surprise. This is a subjective, primarily developer-ergonomics-based comparison of the three languages from the perspective of a Python developer, but you can skip the prose and go to the code samples, the performance comparison if you want some hard numbers, the takeaway for the tl;dr, or the Python, Go, and Rust diffing implementations.
This series of blog posts measures and compares the performance of rustls (a TLS library in Rust) and OpenSSL. The project I chose as mentioned in the title is jieba-rs, the Rust implementation of a popular Chinese word segmentation library: Jieba.
There is much exploration, and a number of promising projects, but I also think we don’t yet know the recipe to make GUI truly great. In my work, I have come across a problem that is as seemingly simple, yet as difficult to get right, as making decent tea: handling smooth window resizing.
It’s also pretty easy to test (as opposed to requiring sophisticated latency measurements, which I also plan to develop). For a recent personal project, I had only needed a fairly simple Node.js server to do exponential and costly computing tasks.
To be honest, I could have switched the entire tech stack, but I estimated that the development time of such a choice wasn’t worth it… Still, I had some functions taking ages to compute. Programming ecstasy is a double-edged sword and writing slow Ruby is as easy as it is pleasant.
As part of a project I’m working on, I sometimes find myself having to deal with quite large X12 files. Since I’m dealing with large source files it would also be nice if it was at least as fast as standard tools like sed.
Benchmarking parallel query execution by manually creating one execution context per parquet partition and running on a thread, just to get an idea of expected performance, and comparing results to Apache Spark (running in local mode). When optimizing code, one thing I’m always looking for is memory layout and access patterns.
One such pattern is an arena: Reserve some sufficiently large space to put your objects in, then allocate by incrementing a pointer. I recently decided to learn more about Rust, and wrote a high performance Raptor (RFC6330) library.
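The arena pattern described above can be sketched in safe Rust as follows. This toy `Arena` hands out byte ranges into one pre-reserved buffer instead of raw pointers, and it ignores alignment; both are simplifications of mine, and real arena crates work differently. Allocation is just a bounds check plus an addition, and everything is released at once by resetting the cursor.

```rust
use std::ops::Range;

/// A fixed-capacity bump arena over a byte buffer (illustrative only).
struct Arena {
    buf: Vec<u8>,
    next: usize,
}

impl Arena {
    /// Reserves the whole space up front.
    fn with_capacity(capacity: usize) -> Arena {
        Arena { buf: vec![0; capacity], next: 0 }
    }

    /// Allocates `size` bytes by advancing the cursor.
    /// Returns the range of the allocation, or `None` when full.
    fn alloc(&mut self, size: usize) -> Option<Range<usize>> {
        let end = self.next.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        let start = self.next;
        self.next = end;
        Some(start..end)
    }

    /// Accesses the bytes behind a previously returned range.
    fn bytes_mut(&mut self, range: Range<usize>) -> &mut [u8] {
        &mut self.buf[range]
    }

    /// Frees every allocation at once.
    fn reset(&mut self) {
        self.next = 0;
    }

    fn used(&self) -> usize {
        self.next
    }
}
```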
I recently finished a detailed review of hashbrown, which will likely become the new implementation for Rust's std::collections::HashMap. But, the key limitation I found was the lack of an ergonomic linear algebra library.
Yet, I found none of them at the time ergonomic to work with, nor fast in comparison to writing the lower-level SIMD, BLAS, and LAPACK code (I have picked up ndarray more in recent weeks and months). Several models that make it easier to reason about concurrent programs have been envisioned over time.
I don't intend to give an extensive analysis of each solution, or make a formal comparison between them. My intention is to simply explain the basics of each solution and how they can be used in practice (with code samples that show off what the result of using the models might look like), so that other developers may have an easier time understanding them and deciding which solution, or language, might be better applicable to their particular problems.
I ended up watching some livestreams and had the idea of porting the stackcollapse-xdebug.php file to Rust, potentially so it could be included in the project in the future. Algorithms and optimization to solve all Advent of Code 2018 puzzles in under one total second.
Naturally, when you are new to microcontrollers (like me), you may have a few questions: When we upload a program on this development board, at what speed it is actually running? I’ve been getting into bioinformatics algorithms lately and ran across an interesting pull request that improved performance by changing a Rust match expression to a lookup.
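I do not know the exact code in that pull request, so the following is only a sketch of the general technique, using a made-up DNA complement function. The `match` version branches on each byte, while the lookup version precomputes all 256 answers once and then does a single indexed load per byte.

```rust
/// Branching version: maps a nucleotide to its complement.
fn complement_match(base: u8) -> u8 {
    match base {
        b'A' => b'T',
        b'T' => b'A',
        b'C' => b'G',
        b'G' => b'C',
        other => other,
    }
}

/// Builds a 256-entry table holding the answer for every possible byte.
fn build_table() -> [u8; 256] {
    let mut table = [0u8; 256];
    for i in 0..256 {
        table[i] = complement_match(i as u8);
    }
    table
}

/// Lookup version: one array index, no branches. Indexing a 256-entry
/// array with a `u8` can never go out of bounds.
fn complement_lookup(table: &[u8; 256], base: u8) -> u8 {
    table[base as usize]
}
```

Whether the table wins depends on the data and the CPU, so it is worth benchmarking both forms on real input.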
By that time there were already some D and Rust versions floating about as a result of the Reddit thread, so fortunately for lazy me “all” I had to do next was to benchmark the lot of them. Advent of Code really helps show off the things that make Rust shine, demonstrating the power and utility of many community-created crates as well as the language itself.
A gentle comparison between Rust & Python from multiple perspectives against a small, relatively simple problem. The rewrite took a fair bit longer than expected, but the results were good (about 9 times faster and ½ the memory usage).
The parser I implemented wasn't correct, it led to rounding error for most representations, by using floats for intermediate values. Furthermore, the comparisons used data unlikely to be encountered in real-world datasets, overstating the performance differences by forcing Rust to use slower algorithms.
A while back, there was a discussion comparing the performance of using the hashbrown crate (based on Google’s SwissTable implementation) in the Rust compiler. In the last RustFest, Amanieu was experimenting on integrating his crate into stdlib, which turned out to have some really promising results.
While the integration is still ongoing, there’s currently no blog post out there explaining SwissTable at the moment. So, I thought I’d dig deeper into the Rust implementation to try and explain how its (almost) identical twin hashbrown::HashMap works.
A friend recently told me about a puzzle, which is a great excuse to explore programming craft. The only major difference is that it takes advantage of mutability (which is idiomatic in Rust, unlike in Clojure).
Niko Matsakis recently blogged about the Rust compiler’s new borrow checker, which implements non-lexical lifetimes (NLL). So every day, we generate about 500 files that are 85 MB on disk, and contain about a million streaming JSON objects that take up 1 GB when uncompressed.
Very rarely is memory usage considered in Python, and I likewise wasn’t paying too much attention when sparse was first being built. To explore this, I ran some statemap rendering tests on SmartOS on a single-socket Haswell server (Xeon E3-1270 v3) running at 3.50GHz.
The input file (~30 MB compressed) contains 3.5M state changes, and in the default config will generate a ~6 MB SVG. It then runs them and highlights potential performance regressions in the standard library and the output of the compiler.
Each toolchain’s run is summarized with a list of likely candidates, as seen in the image below, and we’re now getting started using these to safeguard the performance of Rust programs. This release also marks the first time that one of my original test files can be pretty printed in less than a second.
The experience was fun, so I thought I’d write up a little about the algorithm I’ve used and some interesting stats about how it performs. In my last post I wrapped up the patches to improve perceived performance of screenshots on the Linux GNOME desktop.
With that done, why not implement my crazy plan for parallel PNG encoding to speed the actual save time? The basic idea of a performance test here is to send many HTTP requests to the web service (the reverse proxy in this case) and measure how fast the responses arrive back.
Comparing the results from Punish and Varnish should give us an idea if our performance expectations are holding up. This document is a compilation of various benchmarks and comparisons between code counters, namely tokei, cloc, scc, and loc.
Polyglot is not currently included as it was unable to be installed on the machine at the time of writing. In fact, it meant that the fastest one, tokei, was suddenly faster than my scc for almost all tests.
In addition, a new project, polyglot, written in ATS, a language I had never heard of, popped up, which is also now faster than my Go program for any repository when running on a machine with less than 8 cores. If you read the announcement, you will see that SIMD should bring performance enhancements to our applications if we learn how to use it properly.
The first thing I tried was to implement an SSE2 version of sin_cos based off of Julien Pommier’s code that I found via a bit of googling. The other part of this post is about Rust’s runtime and compile time CPU feature detection and some wrong turns I took along the way.
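The difference between the two kinds of feature detection can be sketched like this. The function names and the `"sse2"`/`"scalar"` labels are my own; the `cfg!` macro and `is_x86_feature_detected!` are standard. The compile-time check reflects the flags the binary was built with, while the runtime check asks the CPU the program is actually running on.

```rust
/// Compile time: true only if the build enabled SSE2 for the whole binary
/// (for example through the target's defaults or a target-feature flag).
fn sse2_at_compile_time() -> bool {
    cfg!(target_feature = "sse2")
}

/// Run time on x86/x86_64: queries the CPU the program is running on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn sse2_at_runtime() -> bool {
    is_x86_feature_detected!("sse2")
}

/// Other architectures have no SSE2, so always report false.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn sse2_at_runtime() -> bool {
    false
}

/// Picks an implementation label; a real program would dispatch to a
/// function compiled with the matching target feature instead.
fn pick_impl() -> &'static str {
    if sse2_at_runtime() {
        "sse2"
    } else {
        "scalar"
    }
}
```

Runtime detection lets one binary carry both a fast path and a portable fallback, at the cost of a check before dispatch.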
Following on from path tracing in parallel with Rayon I had a lot of other optimizations I wanted to try. He’d done a fair amount of optimizing so it seemed like a good target to aim for.
To get a better comparison I copied his scene and also added his light sampling approach which he talks about here. First, it shouldn’t introduce too much (preferably zero) overhead for its abstractions and be fast out of the box.
I’ve played with it before (in my master’s thesis), it’s relatively simple and the effects of optimizing it can be great. For simplicity, I’ve decided to multiply only square matrices with power-of-two sizes, but these restrictions can be lifted in a real implementation without significantly losing performance; only the code gets somewhat more complex and hairy.
I’d like to share a quick story about the sheer power of LLVM and the benefits of using higher-level languages over assembly. Therefore, one of the plans I described at the end of the blog post was a little tool for Copy Propagation on Tree files.
Also, in the meantime Rust’s build system has been replaced and its benchmark suite has been overhauled. At Chain, we (Henry de Valence, Cathie Yun and Oleg Andreev) have been working on a pure-Rust Bulletproofs implementation, whose initial version we are publishing today, together with a set of notes.
This blog post will describe the first set of improvements that were implemented for this use-case, together with a minimal benchmark and the results. Also, parallelization works better and usage of different CPU cores is more controllable, allowing for better scalability.
Writing a debugger for C++ on Linux, you spend a lot of time examining pretty-printed DWARF debug information using tools like readelf, objdump or dwarfdump. TL;DR: I reduced the dump time from 506s to 26s by fixing some simple issues and taking advantage of Rust’s “fearless parallelism”.
Recently, I came across an ad for a job that had a precondition for application: it required you to first solve a programming challenge: Clear output, a simple API and reasonable defaults make it easy to use even for developers without a background in statistics.
Writing an explicit AVX2-accelerated version of baseds encoder and decoder, then realizing I'd have to do the same thing again to see the speedups on my Ivy Bridge desktop, pushed me to make this library.