[c¼h] Parallelised numerics in Python: An introduction to Bohrium

A week ago Thursday I gave a brief introductory talk at our Heidelberg Chaostreff about the Bohrium project. Especially after the HPC day at the Niels Bohr Institute during my recent visit to Copenhagen, I became rather enthusiastic about Bohrium and wanted to pass on some of my experiences.

The main idea of Bohrium is to build a fully numpy-compatible framework for high-performance computing, which can automatically parallelise numpy array operations and/or execute them on general-purpose graphics cards. The hope is that this eliminates the step of rewriting a prototypical Python implementation of a scientific model in lower-level languages like C++ or CUDA before tackling the actual real-world problem at hand.

In practice Bohrium achieves this by translating the Python code (via some intermediate steps) into small pieces of C or CUDA code. These are then automatically compiled at script runtime, taking the current hardware setup into account, and afterwards executed. The results of such a just-in-time-compiled kernel are again available as numpy-like arrays and can be passed to other code for post-processing, e.g. plotting in matplotlib.

It is important to note that the effect of Bohrium is limited to array operations. Ordinary Python for loops, for example, are not touched. This is, however, hardly a problem if one follows the practice of so-called array programming. In array programming one avoids plain for loops and similar traditional Python language elements in favour of special syntax which operates on whole blocks (numpy arrays) of data at once. Examples of such operations are pretty much the typical numpy workflow:

  • views and slices: array[1:3]
  • broadcasting: array[:, np.newaxis]
  • elementwise operations: array1 * array2
  • reduction: np.sum(array1, axis=1)
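
To illustrate the difference, here is a minimal sketch (not taken from the talk) of the same normalisation written once as an explicit for loop, which Bohrium leaves untouched, and once as a single array expression, which it can turn into a compiled kernel:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Traditional for-loop style -- Bohrium cannot accelerate this:
result_loop = np.empty_like(data)
for i in range(data.size):
    result_loop[i] = (data[i] - data.mean()) / data.std()

# Array-programming style -- a single elementwise expression
# operating on the whole array at once:
result_array = (data - data.mean()) / data.std()

assert np.allclose(result_loop, result_array)
```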

A slightly bigger drawback of Bohrium is that the just-in-time compilation takes time, during which no results are produced. In other words, Bohrium only starts to pay off at larger problem sizes, or if exactly the same sequence of instructions is executed many times.
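
This warm-up behaviour can be observed by timing repeated, identical computations. The sketch below (the timed helper is just for illustration) runs under plain numpy, where all runs take roughly the same time; under Bohrium the first run would additionally include the kernel compilation, while later runs reuse the cached kernel:

```python
import time
import numpy as np


def timed(fn, *args):
    """Run fn(*args) and return its result together with the wall-clock time."""
    start = time.time()
    result = fn(*args)
    return result, time.time() - start


a = np.random.rand(100_000)
for i in range(3):
    # Under Bohrium, run 0 would be dominated by just-in-time compilation.
    _, elapsed = timed(np.var, a)
    print("run {}: {:.4f}s".format(i, elapsed))
```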

In my c¼h I demonstrate Bohrium by means of the following example script:

#!/usr/bin/env python3

import numpy as np
import sys
import time


def moment(n, a):
    """Compute the n-th central moment of the data in a."""
    avg = np.sum(a) / a.size
    return np.sum((a - avg)**n) / a.size


def compute(a):
    """Compute mean and 2nd to 4th central moments, printing results and timing."""
    start = time.time()

    mean = np.sum(a) / a.size
    var  = moment(2, a)
    m3   = moment(3, a)
    m4   = moment(4, a)

    end = time.time()

    fmt = "After {0:8.3f}s: {1:8.3f} {2:8.3f} {3:8.3f} {4:8.3f}"
    print(fmt.format(end - start, float(mean), float(var),
                     float(m3), float(m4)))


def main(size, repeat=6):
    for i in range(repeat):
        compute(np.random.rand(size))


if __name__ == "__main__":
    size = 30
    if len(sys.argv) >= 2:
        size = int(sys.argv[1])
    main(size)

which is also available for download. The script performs a very simple analysis of the (randomly generated) input data: it computes some statistical moments and displays them to the user. For bigger arrays the single-threaded numpy version starts to get very slow, whereas the multi-threaded Bohrium version wins even though it needs to compile first. Running the script with Bohrium does not require changing even a single line of code! Just

python3 -m bohrium ./2017.07.13_moments.py

kicks off the automatic parallelisation.
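
As a quick sanity check (not part of the talk), the moment function from the script can be verified against numpy's built-in np.var, since the second central moment is exactly the population variance. The check behaves identically under plain numpy and under Bohrium:

```python
import numpy as np


def moment(n, a):
    """Compute the n-th central moment of the data in a (as in the script)."""
    avg = np.sum(a) / a.size
    return np.sum((a - avg)**n) / a.size


a = np.random.rand(1000)

# The second central moment equals the population variance (ddof=0):
assert np.isclose(moment(2, a), np.var(a))

# The first central moment is zero by construction:
assert np.isclose(moment(1, a), 0.0)
```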

The talk has been recorded and is available on YouTube. Note that the title and description of the video are in German, but the talk itself is in English.