Monday, May 7, 2012

STM update: back to threads?

Hi again,

Here is another update on the status of Software Transactional Memory on PyPy.

Those of you who have been closely following this blog since last year know that, from the very first post about STM, I have explored various design ideas for the API that we should offer for programming in Python.

I have gone full circle, and I am now back where I started (with one important difference: a roughly working implementation of pypy-stm).

What I realized is that the "thread" module is not that bad after all --- I mean, yes, it is a horribly low-level interface, but it is general enough to build various interesting things on top of it. What the "stm-thread" branch of PyPy contains is, basically, the regular "thread" module in which the GIL was replaced with STM. It gives multicore capabilities to any program based on multiple threads. (This is so far exactly the same idea as the one being investigated for Hardware Transactional Memory. It is also roughly what you would get if you managed to convince GCC 4.7 to compile CPython using STM.)
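To make this concrete, here is a minimal sketch of the kind of plain thread-based program that would gain multicore capabilities under pypy-stm without any change. Nothing in it is STM-specific, and the work function is made up for the example:

    import threading

    def work(results, i):
        # an ordinary CPU-bound job: under the GIL, only one thread at
        # a time makes progress; under pypy-stm the same threads can
        # run on multiple cores, with conflicts handled by the STM
        # machinery
        total = 0
        for x in range(1000000):
            total += x * i
        results[i] = total

    results = {}
    threads = [threading.Thread(target=work, args=(results, i))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print results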

Now, while this might already be quite interesting to some people, here is how it relates to what I said previously: namely, that threads are bad, and that some new "transaction" module would be a better idea.

There is one new core functionality in the "stm-thread" branch: it is "thread.atomic", a context manager that can be used in a "with" statement (exact name subject to change). In terms of the GIL, it prevents the GIL from being released in the "with" block. In terms of STM, it prevents a "transaction break", which means that the whole "with" statement runs in one single transaction. (From the Python programmer's point of view, the net effect is the same.)
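In code, a minimal sketch looks like this (assuming the stm-thread branch; remember that the exact name is subject to change):

    import thread   # on the stm-thread branch, thread.atomic exists

    counter = [0]

    def add_many(n):
        with thread.atomic:
            # no transaction break can occur in this block: all n
            # increments form one single transaction, so other threads
            # never observe an intermediate value of the counter
            for i in range(n):
                counter[0] += 1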

So far, no ground-breaking news. But what I missed previously is that this is enough to give multicore capabilities even to a program that is not using threads so far. It is possible to rewrite an equivalent of the old transaction module in a few pages of pure Python, using "thread.atomic". Something along the following lines: start N threads, each of which reads the next job to do from a Queue.Queue() and runs it in a "with thread.atomic" block. The STM version of PyPy is then able to run these atomic blocks concurrently. The key point is that the slightly delicate handling of threads should be nicely hidden inside the new "transaction" module, and from the outside the observed behavior would be exactly as if the transactions that we schedule were run serially.
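Here is a hedged sketch of such a module, under the assumptions above. The helper names add() and run() are my own invention, and this simple version does not let a running transaction schedule further transactions:

    import thread       # thread.atomic: stm-thread branch only
    import threading
    import Queue

    _queue = Queue.Queue()

    def add(f, *args):
        """Schedule f(*args) to run later, as one single transaction."""
        _queue.put((f, args))

    def _worker():
        while True:
            job = _queue.get()
            if job is None:           # sentinel: this worker is done
                break
            f, args = job
            with thread.atomic:       # the whole job is one transaction
                f(*args)

    def run(num_threads=4):
        """Run all scheduled transactions on num_threads threads."""
        threads = [threading.Thread(target=_worker)
                   for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            _queue.put(None)          # one sentinel per worker
        for t in threads:
            t.join()

Used as add(do_stuff, x) for each piece of work, followed by one call to run(), the observed behavior is as if the jobs ran serially, while pypy-stm actually runs the non-conflicting ones concurrently.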

The point I kept missing is that, yes, this sounds like nonsense: it seems that we create N threads just to serialize their work again in "thread.atomic" sections. And it would indeed be nonsense in any model that would "just" remove the GIL to let multiple threads run concurrently without crashing. You would have multiple threads, but their atomic blocks would again act as a sort of GIL: only one of them would run at a time. And this is indeed the model of execution that you get even with STM --- but not the model of performance. The performance with STM scales with the number of cores, as long as there is enough non-conflicting work to do.

So in summary, the complete circle back to the starting point is that threads might be a good low-level model. It lends itself naturally to, say, a kind of program in which the main thread polls file descriptors using select() or the Linux epoll(), and the work received is split among N other threads --- which is the kind of program you would naturally write in other languages that don't have a GIL, say Java. The other threads can then use "thread.atomic" blocks to protect sections of their work. The traditional Transactional Memory point of view is that you use such blocks to guard the short sections of code that communicate with other threads or modify global state, but nothing prevents you from using much larger sections: you should be able to scale them up to the size of a natural "unit of work", so that every unit is naturally atomic. And then it's only a matter of design: you can tweak an existing module that does the thread pooling to add one "with thread.atomic"; or do it yourself from scratch; or (if the design is compatible enough) just plug in the proposed pure-Python "transaction" module. Or, if you feel like it, you can even use threads directly (but keep in mind that using threads too explicitly is not a composable abstraction, whereas higher-level designs typically are).
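For illustration, here is a hedged sketch of that select()-based structure. The function names, the pool size, and the buffer size are all made up for the example; only select() and thread.atomic come from the text above:

    import select
    import thread        # thread.atomic: stm-thread branch only
    import threading
    import Queue

    jobs = Queue.Queue()

    def worker():
        while True:
            handler, data = jobs.get()
            with thread.atomic:   # protect the whole unit of work
                handler(data)

    def main_loop(sockets, handler, num_threads=4):
        for _ in range(num_threads):
            t = threading.Thread(target=worker)
            t.daemon = True       # workers exit with the main thread
            t.start()
        while True:
            ready, _, _ = select.select(sockets, [], [])
            for sock in ready:
                jobs.put((handler, sock.recv(4096)))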

At the end of the day, you can write or reuse programs whose global structure you are already familiar with, for example with a thread pool (that can be hidden in a library if you prefer), or any other structure with or without explicit threads. But you can do so without all the mess that comes with threads, like locks and deadlocks. From that angle it is really similar to Garbage Collection: e.g. the Boehm GC (now used by GCC itself) lets you write C code like you are used to, but forgetting all you had to learn about careful explicit memory management.