The Detroit Post
Saturday, 16 October, 2021

Rust For Data Science

author
Elaine Sutton
• Wednesday, 11 November, 2020
• 9 min read

This isn’t something that I would do very often, but a call was made, and I would like to take that chance to fill in some ideas with another context in mind. Doing actual science and obtaining results fast and productively is extremely important, since we are often evaluated by our scientific publication output.


On the other hand, many concerns of conceiving production-ready solutions with that state of the art are frequently left as a second priority, given the technical debt that not many research groups worry enough to overcome. Research workflows often involve a cycle in which models are designed and trained, measurements are made, observations are taken, parameters are fine-tuned, and back to step 1 or 2 we go.

Without extending the introduction any further, here are the points that, in my opinion, should be considered when working with Rust in these (mostly academic) fields. With that said, let’s stop that thought for a moment and keep in mind that many mature technologies for data science exist today.

We even have Julia, which I like to call MATLAB’s cool younger cousin, and it boasts some interesting perks of its own. And many people would rather keep defying gravity than choose a stack without the necessary tools for the job.

By creating new Rust tools for data scientists, we could be taking the unnecessary risk of “competing” with all the others without a clear reason why someone should be switching other than “just because” (or are you really throwing in the argument that it’s safe and provides fearless concurrency?). For example, the Leaf project didn’t quite work out, but we can use TensorFlow today, or at least enough to load saved models and serve them through a Rust stack, thanks to the actively maintained bindings.

The only approach known to work pretty well is not to use C++ APIs at all: just create pure C headers and the respective wrapper implementation. 2018-04-04 Update: If you wish to learn more about writing Rust bindings to C++ libraries, consider reading my story on Taking the long road.


Oftentimes, the web API can be as simple as sending serialized objects (with serde, of course!). For instance, the Khronos Group has recently released a provisional specification of the Neural Network Exchange Format (NNEF), intended to harmonize neural network tools and inference engines.

2018-04-04 Update: one initiative of writing a pure Rust parser of NNEF files was made last month. Way before we think about making new tools for data scientists and the like, we should consider the means through which we can add solutions written in Rust.

Think of it as a sandwich, where we can use Rust to make a native implementation of demanding algorithms, and at the same time serve these solutions with production-ready servers. One of them, although not necessarily one that would strike you as a major flaw, is reading and writing to files in the HDF5 format.

The site www.arewelearningyet.com is the de facto aggregation of machine learning tools for Rust developers, and is worth keeping an eye on. Moreover, consider visiting the ecosystem Working Group, which is focused on the sustainability and maturity of Rust.

I will end with a semi-open question: what makes an ideal tool or library for data scientists? They provide extendable interfaces, so that more algorithms and components can be easily coupled together in a single script.


The Python ecosystem does this by using common data structures and by “mimicking” those interfaces in custom types (namely NumPy and pandas, to name the most important ones). While we can claim that Rust code is pretty well optimized, the difference is less relevant when relying on GPU-accelerated computation APIs such as CUDA and OpenCL.

If you do not wish to deal with low-level intrinsics, how about using a middle-level crate such as faster, or even BLAS and LAPACK bindings? This bullet point can refer to what so many other Rust2018 blog posts have stated about the future of Rust.

The sheer number of available high-quality analytic libraries and its massive developer community make Python an easy choice for many data scientists. These “lower-level” language implementations are used to mitigate some common criticisms of Python, specifically execution time and memory consumption.

Bounding execution time and memory consumption simplifies scalability, which is critical for cost reduction. If we can write performant code to accomplish data science tasks, then integration with Python is a major advantage.

The intersection of data science and malware analysis requires not only fast execution time but also efficient use of shared resources for scalability. Getting good performance on modern processors requires parallelism, typically via multiple threads, but efficient execution time and memory usage are also necessary.


While external platform-specific libraries exist, the onus is clearly on the developer to preserve thread safety. Malware often manipulates file format data structures in unanticipated ways to cause analysis utilities to fail.

One relatively common Python parsing pitfall is caused by the lack of strong type safety. Python’s gratuitous acceptance of None values where a byte array was expected can easily lead to general mayhem unless the code is littered with None checks.
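To make the pitfall concrete, here is a minimal sketch. The helper names `read_section` and `section_length` are hypothetical and not taken from any real parsing library. The reader returns None when a header points past the end of the buffer, and the caller has to remember to check for it.

```python
from typing import Optional


def read_section(data: bytes, offset: int, size: int) -> Optional[bytes]:
    """Return the requested slice, or None if it falls outside the buffer."""
    if offset < 0 or size < 0 or offset + size > len(data):
        return None  # e.g. a malformed header pointing past the end of the file
    return data[offset:offset + size]


def section_length(data: bytes, offset: int, size: int) -> int:
    """Length of a section, treating an unreadable section as empty."""
    section = read_section(data, offset, size)
    if section is None:  # without this check, len(None) raises TypeError
        return 0
    return len(section)
```

Nothing in the language forces the `is None` check. Forgetting it only shows up at run time, typically on the one malformed sample that triggers it.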

The Rust language makes many claims that align well with an ideal solution to the potential problems identified above: execution time and memory consumption comparable to C and C++, along with providing extensive thread safety. The Rust language offers additional beneficial features, such as strong memory-safety guarantees and no runtime overhead.

No runtime overhead simplifies Rust code integration with other languages, including Python. Data science is a very broad field with far too many applications to discuss in a single blog post.

An example of a simple data science task is to compute information entropy for byte sequences. Then we calculate the negative of the weighted sum of the probability of a particular value, $x_i$, occurring ($P_X(x_i)$) and the so-called self-information ($\log_2 P_X(x_i)$).
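Written out as a formula, and assuming the usual Shannon definition with the sum taken over the byte values that actually occur in the sequence, the description above reads:

$$H(X) = -\sum_{i} P_X(x_i) \log_2 P_X(x_i)$$

As a small worked case (an illustration, not part of the original benchmark), the four-byte sequence "aabb" contains two distinct values, each with probability 1/2, so

$$H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit per byte}.$$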


This is a simplistic assessment of how performant Rust can be for data science applications, not a criticism of Python or the excellent libraries available. In these tests, we will generate a custom C library from Rust code that we can import from Python.

We start with a simple pure Python function (in entropy.py) to calculate the entropy of a byte array only using the standard library math module. Again, we’re just going for a (hopefully not too) slow test drive to see the performance of Rust compiled libraries imported from Python.
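The original listing is not reproduced here, so the following is only a sketch of what such a function might look like. It assumes the standard Shannon definition and uses nothing beyond the math module; the name `entropy` is an assumption.

```python
import math


def entropy(data: bytes) -> float:
    """Shannon entropy of a byte sequence, in bits per byte."""
    if not data:
        return 0.0
    # Count how often each of the 256 possible byte values occurs.
    counts = [0] * 256
    for byte in data:
        counts[byte] += 1
    total = len(data)
    result = 0.0
    for count in counts:
        if count:  # values that never occur contribute nothing
            p = count / total
            result -= p * math.log2(p)
    return result
```

A constant buffer gives 0 bits per byte, and a buffer in which all 256 byte values occur equally often gives the maximum of 8.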

Then we simply call the provided library function we specified earlier when we initialized the Python module with the py_module_initializer! macro. At this point, we have a single Python module (entropy.py) that includes functions to call all of our entropy calculation implementations.

We measured the execution time of each function implementation with test benchmarks computing entropy over 1 million random bytes. Finally, we made separate, simple driver scripts for each method for calculating entropy.
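The benchmark harness itself is not shown, so the sketch below is just one way such a measurement could be set up with the standard timeit module. The helper `best_time` is a hypothetical name, and the `entropy` function here is a compact stand-in for whichever implementation is under test.

```python
import math
import os
import timeit
from collections import Counter


def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (stand-in for the function under test)."""
    total = len(data)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def best_time(func, data: bytes, repeat: int = 3) -> float:
    """Best wall-clock time, in seconds, of a single call to func(data)."""
    return min(timeit.repeat(lambda: func(data), number=1, repeat=repeat))


if __name__ == "__main__":
    sample = os.urandom(1_000_000)  # 1 million random bytes, as in the text
    print(f"entropy: {best_time(entropy, sample):.4f} s")
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes. The absolute numbers will differ from machine to machine.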

All methods repeat the calculations 100 times in order to simplify capturing memory usage data. The Rust version exhibited only slightly better performance than SciPy/NumPy, but the results confirmed what we had already expected: pure Python is vastly slower than compiled languages, and extensions written in Rust can be extremely competitive with those written in C (even beating them in this microbenchmark).


We could have added type hints and used Cython to generate a library we could import from Python. While the pure Python and Rust implementations have very similar maximum resident set sizes, the SciPy/NumPy implementation uses measurably more memory in this benchmark, presumably due to additional capabilities loaded into memory when the packages are imported.

In either case, calling Rust code from Python does not appear to add a substantial amount of memory overhead. In our admittedly brief assessment, our Rust implementation performance was comparable to the underlying C code from the SciPy and NumPy packages.

Not only was Rust performant in execution time, but its additional memory overhead was also minimal in these tests. The execution time and memory utilization characteristics should prove ideal for scalability.
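The text does not say how the maximum resident set size was captured. One possibility, sketched here as an assumption, is to read it from inside each driver script with the standard resource module, which is available on Unix-like systems only. The helper name `peak_rss_bytes` and the workload are hypothetical.

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process, in bytes (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


if __name__ == "__main__":
    data = bytes(range(256)) * 4096  # hypothetical workload of about 1 MB
    for _ in range(100):  # repeated 100 times, as in the text
        len(set(data))
    print(f"peak RSS: {peak_rss_bytes() / 2**20:.1f} MiB")
```

Because the figure is a high-water mark for the whole process, it includes the interpreter and every imported package, which would be consistent with heavier imports showing a larger footprint.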

The performance of the SciPy and NumPy C FFI implementations is certainly comparable, but Rust provides additional benefits that C and C++ do not. External libraries exist to provide this functionality for C, but the onus of correctness is entirely on the developer.

Rust checks for thread safety issues, such as race conditions, at compile time with its ownership model, and the standard library offers a suite of concurrency mechanisms, including channels, locks and reference-counting smart pointers. We are not advocating that anyone port SciPy or NumPy to Rust, because these are already heavily optimized packages with robust support communities.


On the other hand, we would strongly consider porting to Rust any pure Python code that is not otherwise available in high-performance libraries. For data science applications in the security space, Rust seems like a compelling alternative given its speed and safety guarantees.

Rust-Powered Command-Line Utilities to Increase Your Productivity

by Shining Oneida | Towards Data Science

Last month, I wrote an article sharing seven Rust-powered command-line utilities.

Those are modern and fast tools you can use every day with your terminal. Since publishing that original article, I’ve been searching for more Rust-powered command-line utilities, and I discovered more gems that I’m excited to share with you today.

Dust gives you an instant overview of directories and their disk usage, and its commands are simpler. In the author's screenshot, the app directory occupies 57 MB (29%) and the target directory occupies 139 MB (71%) of disk space.

Dutree is another du alternative for analyzing disk usage. Hyperfine is a Rust-powered command-line benchmarking tool and an alternative to time.


Line search is possible with the sk command. In Linux, you can use sed to perform basic text replacement (the original image here came from sd). In Linux, top is an interactive process viewer.

When you use another computer, server, or system, you will be using Linux commands. It is a good idea to keep using Linux commands even if you are using Rust-powered alternatives.
