Writing Cython

In this notebook, we'll take a look at how to implement a simple function using Cython. The operation we'll implement is the first-order diff, which takes in an array of length $n$:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix}$$

and returns the following:

$$\mathbf{y} = \begin{bmatrix} x_2 - x_1 \\ x_3 - x_2 \\ \vdots \\ x_n - x_{n-1} \end{bmatrix}$$

First we'll import everything we'll need and generate some data to work with.

In [1]:
import numpy as np
x = np.random.randn(10000)

Below is a simple implementation using a plain Python loop (the array is still NumPy, but no vectorized NumPy operations are used). The %timeit magic command lets us see how long it takes the function to run on the 10,000-element array defined above.

In [2]:
def py_diff(x):
    n = x.size
    y = np.zeros(n-1)
    for i in range(n-1):
        y[i] = x[i+1] - x[i]
    return y

%timeit py_diff(x)
3.57 ms ± 8.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
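Before optimizing, a quick sanity check (not part of the original timing run) confirms the loop matches NumPy's built-in diff on a small example; the function is repeated here so the snippet stands alone:

```python
import numpy as np

def py_diff(x):
    # Same pure-Python loop as above.
    n = x.size
    y = np.zeros(n - 1)
    for i in range(n - 1):
        y[i] = x[i + 1] - x[i]
    return y

x_small = np.array([1.0, 4.0, 9.0, 16.0])
print(py_diff(x_small))                                 # [3. 5. 7.]
print(np.allclose(py_diff(x_small), np.diff(x_small)))  # True
```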

Now we'll use the exact same function body, adding only the %%cython magic at the top of the code cell so that the cell is compiled with Cython. How much of a difference does compilation alone make?

In [3]:
%load_ext cython
In [4]:
%%cython
import numpy as np
def cy_diff_naive(x):
    n = x.size
    y = np.zeros(n-1)
    for i in range(n-1):
        y[i] = x[i+1] - x[i]
    return y
In [5]:
%timeit cy_diff_naive(x)
2.88 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So it didn't make much of a difference. Compilation alone doesn't help much here because every variable is still a generic Python object; Cython really shines when you declare static data types, which we do by annotating the variables used in the function with cdef <type> .... Let's see how much this improves things.

Note: array types (like for the input arg x) can be declared using the memoryview syntax double[::1] or using np.ndarray[cnp.float64_t, ndim=1] (the latter requires cimport numpy as cnp).
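For reference, the same function written with the ndarray declaration syntax might look like the sketch below (an untested illustration; it assumes the NumPy C headers are available to cimport):

```cython
%%cython
import numpy as np
cimport numpy as cnp

def cy_diff_ndarray(cnp.ndarray[cnp.float64_t, ndim=1] x):
    cdef int n = x.size
    cdef cnp.ndarray[cnp.float64_t, ndim=1] y = np.zeros(n - 1)
    cdef int i
    for i in range(n - 1):
        y[i] = x[i + 1] - x[i]
    return y
```

The memoryview form double[::1] is generally preferred in modern Cython: the ::1 asserts the buffer is C-contiguous, which lets element access compile down to plain pointer arithmetic.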

In [6]:
%%cython
import numpy as np
def cy_diff(double[::1] x):
    cdef int n = x.size
    cdef double[::1] y = np.zeros(n-1)
    cdef int i
    for i in range(n-1):
        y[i] = x[i+1] - x[i]
    return y
In [7]:
%timeit cy_diff(x)
24.5 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That made a huge difference! There are a couple more things we can do to speed up our diff implementation, including disabling some of Cython's safety checks: bounds checking (verifying that you don't try to access an index of an array that doesn't exist) and wraparound (support for negative indices). Disabling both can really improve things when we are sure neither condition will occur. Let's try that.
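To make concrete what these two checks guard against, here is the plain Python/NumPy indexing behavior they preserve (and that disabling them gives up inside compiled Cython loops):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])

# Wraparound: Python and NumPy treat negative indices as counting from
# the end, so x[-1] is the last element. With @wraparound(False), Cython
# indexes with raw C arithmetic, and a negative index would read outside
# the buffer instead.
print(x[-1])  # 30.0

# Bounds checking: out-of-range access raises IndexError in Python. With
# @boundscheck(False), the same access in compiled code would silently
# read past the end of the array.
try:
    x[3]
except IndexError:
    print("IndexError")
```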

In [8]:
%%cython
from cython import wraparound, boundscheck
import numpy as np

@boundscheck(False)
@wraparound(False)
def cy_diff2(double[::1] x):
    cdef int n = x.size
    cdef double[::1] y = np.zeros(n-1)
    cdef int i
    for i in range(n-1):
        y[i] = x[i+1] - x[i]
    return y
In [9]:
%timeit cy_diff2(x)
24.8 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Disabling the checks made no measurable difference in this run. Finally, let's see how NumPy's built-in diff performs for comparison.

In [10]:
def np_diff(x):
    return np.diff(x)

%timeit np_diff(x)
15 µs ± 603 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Why is NumPy's diff implementation still faster? Try passing the --annotate flag to the %%cython magic to peek at the C code generated by our latest Cython implementation; lines highlighted in yellow are the ones that still interact with the Python interpreter.
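Part of the answer: for 1-D input, np.diff amounts to a single vectorized slice subtraction executed entirely inside NumPy's C core, with no per-element loop at the Python or Cython level at all. A minimal sketch of that idea:

```python
import numpy as np

def slice_diff(x):
    # One vectorized operation: the array shifted by one, minus itself.
    return x[1:] - x[:-1]

x_small = np.array([2.0, 3.0, 5.0, 8.0])
print(slice_diff(x_small))                                  # [1. 2. 3.]
print(np.array_equal(slice_diff(x_small), np.diff(x_small)))  # True
```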