Bypassing the Python GIL with ctypes

I recently read an interesting article (actually, the slides it linked to) about the horror that is the Global Interpreter Lock in Python, especially with multicore CPUs. And I agree: in these cases, the GIL is painful.

My favorite way of bypassing the GIL is to use ctypes, a wonderful library that allows you to easily link to dynamic libraries and call their C functions from Python, with only a small amount of boilerplate (to map function calls, argument types, and return types).

The best feature of ctypes is that it releases the GIL while a program is executing a ctypes function call. This means that if you have more than one thread, and one of them is busy with a ctypes call, the other threads can go along their merry way.
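
As a small illustration of that behavior, here is a sketch that needs no custom library. It assumes a POSIX system, where ctypes.CDLL(None) exposes the C library's usleep(); the names blocking_call, worker, and ticks are only for this example. One thread blocks inside a C call while the main thread keeps running ordinary Python code.

```python
import ctypes
import threading

# Assumption: a POSIX system, where CDLL(None) exposes the C library's symbols.
libc = ctypes.CDLL(None)
libc.usleep.argtypes = [ctypes.c_uint]
libc.usleep.restype = ctypes.c_int

def blocking_call():
    # Blocks inside C for half a second; ctypes drops the GIL for the call.
    libc.usleep(500000)

worker = threading.Thread(target=blocking_call)
worker.start()

# Pure-Python work in the main thread while the C call is in flight.
ticks = 0
while worker.is_alive():
    ticks += 1
worker.join()

print("Main thread iterations during the C call: %d" % ticks)
```

If the GIL were held for the whole call, the main thread would barely get to run and the count would stay close to zero. That is roughly what should happen if the library is loaded with ctypes.PyDLL instead, which is documented as not releasing the GIL.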

In the slides above, he shows that CPU-intensive multithreaded Python applications slow down as the number of CPUs increases. Well, I decided to try a quick counterexample.

First, I create a C file to do some work for me, called test.c:

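/* Sum the multiples of 3 in [from, to). */
int test(int from, int to)
{
  int i;
  /* Unsigned, so the sum wraps around on large ranges instead of
     overflowing a signed int; only the run time matters here. */
  unsigned int s = 0;

  for (i = from; i < to; i++)
    if (i % 3 == 0)
      s += i;

  return (int)s;
}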

To compile this as a dynamic shared library under OS X, the following two commands can be used:

gcc -g -fPIC -c -o test.o test.c
ld -dylib -o libtest.dylib test.o

(Under Linux, replace this last line with ld -shared -o libtest.so test.o)

Then, we can use the following Python program to load the dynamic library and run a quick test (should work in Linux or OS X):

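import ctypes
import ctypes.util
import os
import threading
import time

# find_library() does not always search the current directory (notably on
# Linux), so fall back to the library built above, next to this script.
testname = ctypes.util.find_library('test')
if testname is None:
  here = os.path.dirname(os.path.abspath(__file__))
  for candidate in ('libtest.dylib', 'libtest.so'):
    if os.path.exists(os.path.join(here, candidate)):
      testname = os.path.join(here, candidate)
      break
testlib = ctypes.cdll.LoadLibrary(testname)

# Declare the C signature: int test(int from, int to)
test = testlib.test
test.argtypes = [ctypes.c_int, ctypes.c_int]
test.restype = ctypes.c_int

def t():
  # The GIL is released for the duration of this call into C.
  test(0, 1000000000)

if __name__ == '__main__':
  # Two calls back to back in the main thread.
  start_time = time.time()
  t()
  t()
  print("Sequential run time: %.2f seconds" % (time.time() - start_time))

  # The same two calls, each in its own thread.
  start_time = time.time()
  t1 = threading.Thread(target=t)
  t2 = threading.Thread(target=t)
  t1.start()
  t2.start()
  t1.join()
  t2.join()
  print("Parallel run time: %.2f seconds" % (time.time() - start_time))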

On my quad-core OS X box, I get the following output:

Sequential run time: 13.27 seconds
Parallel run time: 6.66 seconds

A pretty solid doubling of performance, which is what we would hope for.
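
To go beyond two threads, one option is to split the range into chunks and give each chunk its own thread. The following is only a sketch: split_range, run_in_threads, and stand_in are hypothetical names, and stand_in is a pure-Python placeholder so the example runs without the compiled library. Passing the ctypes function test in its place is what would let the chunks run in parallel.

```python
import threading

def split_range(start, stop, parts):
    """Split [start, stop) into up to `parts` contiguous (lo, hi) chunks.

    Assumes stop > start and parts >= 1.
    """
    step = (stop - start + parts - 1) // parts  # ceiling division
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]

def run_in_threads(func, chunks):
    """Call func(lo, hi) for each chunk in its own thread; keep result order."""
    results = [None] * len(chunks)

    def worker(index, lo, hi):
        results[index] = func(lo, hi)

    threads = [threading.Thread(target=worker, args=(i, lo, hi))
               for i, (lo, hi) in enumerate(chunks)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Placeholder for testlib.test, so the sketch runs without the library.
def stand_in(lo, hi):
    return sum(i for i in range(lo, hi) if i % 3 == 0)

chunks = split_range(0, 1000, 4)
partials = run_in_threads(stand_in, chunks)
total = sum(partials)
print(chunks)   # [(0, 250), (250, 500), (500, 750), (750, 1000)]
print(total)    # 166833
```

With the placeholder, the threads still take turns under the GIL, so this version shows the plumbing rather than any speedup. Each thread writes to its own slot in the results list, so no lock is needed around it.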