When I was little, programming was simple. My friend had a computer and there were Basic and Assembly. You could write your program in Basic, which was easier to do but your program would be slow, or you could write something in Assembly, which was harder, but your program would run significantly faster.
The explanation for this was simple too. Basic was an interpreter: to run your program, it had to go through your code every time you invoked it and interpret it line by line. If a line said “PRINT X”, the interpreter had to find a variable named “X”, find a routine that does printing, and call the found routine on the found variable.
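To make the per-run cost concrete, here is a toy sketch (hypothetical, not real Basic) of the work a line-by-line interpreter repeats on every single invocation: re-reading the source, looking names up, and dispatching to routines.

```python
def run(source, variables):
    """Interpret a tiny PRINT-only toy language line by line."""
    output = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        command, _, name = line.partition(" ")
        if command == "PRINT":
            # Find the variable by name, find the printing routine, call it.
            output.append(str(variables[name]))
        else:
            raise ValueError(f"unknown command: {command}")
    return output

print(run("PRINT X\nPRINT Y", {"X": 10, "Y": 20}))  # prints ['10', '20']
```

All of this lookup and dispatch happens again on every run; an assembler does the equivalent translation exactly once.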
Assembly was, well, assembly. It interpreted your program in a way too, but only once, when you ran the assembler. After that, your program would be executable without any interpretation. Programs that need interpretation run slower than programs that don’t. If, of course, they are equivalent programs.
And in Basic and Assembly, they usually were. Basic is an imperative language, not even too friendly to structured programming. Even something as fundamental as a “function” is not a built-in language construct there but a pattern: “GOSUB … RETURN”, pretty much like “call … ret” in Assembly.
Now fast-forward 30 years. Languages are plenty. Computers are everywhere. Programming is not simple anymore. My department earns its bread by rewriting researchers’ code, originally written in Python, in C++ for performance, because the common knowledge is that Python is interpreted and slow while C++ is compiled and fast. But somehow, every year it gets harder and harder to win any performance from this rewrite. Something is changing, and changing fast. The common knowledge, however, doesn’t change, so we keep rewriting.
But now we’re forced to optimize everything like crazy just to justify what we do. An algorithm comes in written in Python; we rewrite it in C++ equivalently, and suddenly it runs 3 times slower. That’s… not what we’re here for. So we re-engineer the algorithm to get the performance boost we promised. And most of the time this works, since researchers don’t care about performance at all, and algorithm-wise they do leave some low-hanging fruit behind.
Still, this whole business now looks like a scam. We make code slower by rewriting it in C++ just so we can make it faster by re-engineering it. Why don’t we re-engineer it directly in Python, then? Ah! The thing is, we don’t know Python. We know a little Python, enough to read and understand, but not enough to write ultra-fast programs in it.
So what’s there to know?
Most of the popular Python libraries are written in C or Fortran. NumPy’s core is written in C; Pandas’, in Cython and C; SciPy’s, in Fortran, C, and partially C++. They have no reason to be slower than anything written in C++, Rust, or Julia. They can be faster, though.
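The same principle shows up even in the standard library: CPython built-ins like `sum` are implemented in C, so merely pushing a loop below the interpreter level is already a speedup. A minimal illustration (the printed timings are machine-dependent, so treat them as a demo, not a benchmark):

```python
import timeit

def python_sum(numbers):
    """Sum with an explicit interpreter-level loop."""
    total = 0
    for n in numbers:
        total += n
    return total

data = list(range(100_000))
assert python_sum(data) == sum(data)  # same result...

# ...but the C-implemented built-in keeps the loop out of the interpreter:
loop_time = timeit.timeit(lambda: python_sum(data), number=20)
c_time = timeit.timeit(lambda: sum(data), number=20)
print(f"explicit loop: {loop_time:.3f}s, built-in sum: {c_time:.3f}s")
```

NumPy, Pandas, and SciPy apply the same trick on a much larger scale: the hot loops live in compiled C or Fortran, and Python only orchestrates them.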
In our company, we cater to both cloud services and desktop applications. And desktop users get angry when a new version of their favorite application stops working on their hardware for apparently no reason. So we keep our desktop build targets old. Really old, like pre-Nehalem old. This way, nobody gets angry, but nobody gets to enjoy SSE4.2 either.
Of course, a computational library built for a proper target will be generally faster than an equivalent library built for a generic 15-year-old computer with limited superscalar capabilities.
The good news is, if you’re building for the cloud, you can set your target platform to be exactly the machine you procure, and then your C++ libraries will run at the same speed as the Python ones, and maybe even a little faster.
To be honest, the whole argument about which language is faster is ridiculous. A language is not a compiler or an interpreter. It is what it is: a language, a set of rules that specify how we should tell a computer what we want it to do. A language is just a set of rules, a specification. Nothing else.
The very distinction between interpretation and compilation belongs to the last century. Nowadays, there are C interpreters like IGCC, PicoC, or CCons, and there are Python compilers: JIT compilers such as PyPy and classic compile-before-you-run compilers such as Codon (which also has JIT capabilities if you want only part of your code compiled).
Codon is built upon LLVM, the same infrastructure Rust, Julia, and Clang are built upon. Code built with Codon runs, give or take, at the same performance level as code built with any of those. There might be performance disadvantages due to Python’s garbage collection or large native data types, but we’re not talking about 100x or even 10x anymore. LLVM does its magic: it turns Python code into machine code for you.
There are also myths about just-in-time compilation, or JIT. Some say it is superior to the usual compile-before-you-run technique because it always compiles for the architecture the user has, and thus exploits it optimally. Others say there is still compilation overhead that, with just-in-time compilation, also falls on the user: programs run slowly because they have to both run and compile themselves at the same time.
The problem with both myths is that they are both true and both unhelpful. Yes, JIT generally compiles to better machine code, unless you build your binaries explicitly for the target machine, which, by the way, happens quite regularly when you deploy in the cloud. And yes, there is a compilation penalty at runtime, but it is negligible if your runtime is measured in months, which, again, when you deploy in the cloud, is not unheard of.
So there are pros and cons either way. What’s important is that Python (Codon, specifically) supports both compile-before-you-run and JIT modes, so you can choose whichever suits your needs best. Traditional compilers, such as Clang, do not offer a JIT option.
Numba and its kernel model
Speaking of JIT, Numba is probably the most game-changing technology in the world of ultra-fast Python programming. It is a compiler, but it targets only selected kernels, not the whole program. You, of course, get to select what should be compiled and for which platform. In this setup, you can run some pieces of your code on the CPU and others on a GPGPU.
Technically, one could create backends for other specialized devices too, such as Google’s TPU or even Lightmatter’s photonic accelerator. There is no such backend yet; those folks decided to roll out their own library instead. But, tellingly, they also chose to provide the interface to their photonic computer in Python, so you can interact with PyTorch, TensorFlow, or ONNX seamlessly.
So Lightmatter is not there yet. But Nvidia is. They provide a CUDA backend for Numba, and you can now write kernels in Python and run them on Nvidia hardware with maximum efficiency. C++ doesn’t have that. There is, however, the CUDA dialect, coming from Nvidia of course, that extends C++ for this very purpose. In Python, you don’t have to extend the language itself. Since Numba works as a JIT kernel compiler, adding a backend is just a matter of patching a library.
So, the kernel model targets heterogeneous computing. You can run pieces of code on different devices, which is nice in itself. But there is another dimension of heterogeneity you might not have thought about. With the kernel model, you can target different kernels to different computation contexts, not necessarily different hardware devices. This means that if you want one kernel to be fast but not particularly precise, you can build it with a “fast-math” option. But if, in some other context, you want the same kernel to be precise rather than fast, you can rebuild the very same code without the trade-off.
This is something that is hard to achieve with traditional compilers where you can’t change compilation options in the middle of a translation unit. Well, with the kernel model, every kernel is its own translation unit.
Python is not slow. Neither is it fast. It is just a language, a set of rules and keywords. But there are lots of people who are used to these rules and these keywords. They feel comfortable writing in Python, and they are interested in making Python better for them.
This user base is large enough to attract both new startups with revolutionary technologies, such as Lightmatter and their photonic computers, and well-established companies with decades of expertise in high-performance computing, such as Nvidia. All these people are heavily invested in making Python a better… not language, of course, but environment. An environment in which writing ultra-fast programs is only marginally harder than writing slow throwaway ones.
Altogether, they are making huge progress. Python is getting faster every year. At this point, photonic computers aside, programs written in Python often run on par with ones written in Julia, C++, or Rust. And Python won’t stop there. It is gaining speed faster than traditional compilers are gaining user-friendliness.
Be prepared to see, in a few years, Python compilers such as PyPy, Numba, or something completely new borrow techniques from Spiral or Herbie to generate code so effectively that no traditional compiler could possibly come close. After all, writing a new JIT backend in Python is way easier than reimagining the whole LLVM infrastructure.
By: Oleksandr Kaleniuk