Python examples demonstrating performance improvements using cython and numba

I came across an old post by jakevdp on Numba vs Cython. I thought I will revisit this topic because both Numba and Cython has matured significantly over this time period. In this post I am going to do two examples:

  1. Pairwise distance estimation example that Jake discusses. The intention is to see how the maturity of these projects has contributed to improvements.
  2. A simple cashflow payment calculation of an amortizing bond or mortgage payments. This a calculation that cannot be vectorized in a numpy sense. So the speedups would have to come from optimizing loops using tools like Numba or Cython.
In [1]:
import numpy as np
import numba 
import cython
%load_ext cython
import pandas as pd

numba.__version__, cython.__version__, np.__version__
Out[1]:
('0.31.0', '0.24.1', '1.11.3')

Pairwise Distance Estimation

In [2]:
X = np.random.random((1000, 3))
In [3]:
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

%timeit -n10 pairwise_python(X)
10 loops, best of 3: 2.47 s per loop
In [4]:
def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

%timeit -n10 pairwise_numpy(X)
10 loops, best of 3: 38.3 ms per loop
In [5]:
pairwise_numba = numba.jit(pairwise_python)

%timeit -n10 pairwise_numba(X)
The slowest run took 13.98 times longer than the fastest. This could mean that an intermediate result is being cached.
10 loops, best of 3: 4.04 ms per loop
In [6]:
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
In [7]:
%timeit -n10 pairwise_cython(X)
10 loops, best of 3: 6.6 ms per loop

The timiing for the results in Jake's post (2013) and the results from this post (2017) are summarized below.

In [8]:
df1 = pd.DataFrame({"Time (ms)": [13400,111, 9.12, 9.87], "Speedup": [1, 121, 1469, 1357]}, 
             index=["Python", "Numpy", "Numba", "Cython"])
df2 = pd.DataFrame({"Time (ms)": [2470, 38.3, 4.04, 6.6], "Speedup": [1, 65, 611, 374]}, 
             index=["Python", "Numpy", "Numba", "Cython"])
df = pd.concat([df1, df2], axis = 1, keys=(["2013", "2017"]))
df
Out[8]:
2013 2017
Speedup Time (ms) Speedup Time (ms)
Python 1 13400.00 1 2470.00
Numpy 121 111.00 65 38.30
Numba 1469 9.12 611 4.04
Cython 1357 9.87 374 6.60

The timings are speedup number for the 2013 and 2017 runs are very different due to differences in versions and perhaps even the python version. This post is using Py35 running in Windows. The take away here is that the numpy is atleast 2 orders of magnitude faster than python. And the numba and cython snippets are about an order of magnitude faster than numpy in both the benchmarks.

I will not rush to make any claims on numba vs cython. It is unclear what kinds of optimizations is used in the cython magic. I would expect the cython code to be as fast as C and perhaps some tweaking will help us get there. It is really interesting how easy it is to get performance boost from numba. From an ease of use point of view, numba is hands down winner in this simple example.

Amortizing Payments

Here lets look at one more example. This is an amortizing payment calculation, such as in mortgage payments.

In [9]:
def amortize_payments_py(B0, R, term, cpr=0.0):
    smm = 1. - pow(1 - cpr/100., 1/12.)
    r = R/1200.
    S = np.zeros(term)
    P = np.zeros(term)
    I = np.zeros(term)
    B = np.zeros(term)
    Pr = np.zeros(term)
    Bt = B0
    pow_term = pow(1+r, term)
    A = Bt*r*pow_term/(pow_term - 1)
    for i in range(term):
        n = term-i
        
        I[i] = Bt * r
        Pr[i] = smm*Bt
        S[i] = A-I[i] if Bt>1e-2 else 0.
        P[i] = S[i] + Pr[i]
        Bt = max(Bt - P[i], 0.0)
        
        B[i] = Bt
    return S,I, Pr,P, B

Here is the equivalent Cython function.

In [10]:
%%cython
cimport cython 
import numpy as np
from libc.math cimport pow

@cython.boundscheck(False)
@cython.wraparound(False)
def amortize_payments_cy(double B0,double R,int term,double cpr=0.0):
    cdef double smm = 1. - pow(1 - cpr/100., 1/12.)
    cdef double r = R/1200.
    cdef double[:] D = np.empty(term, dtype=np.float64)
    cdef double[:] S = np.empty(term, dtype=np.float64)
    cdef double[:] P = np.empty(term, dtype=np.float64)
    cdef double[:] I = np.empty(term, dtype=np.float64)
    cdef double[:] B = np.empty(term, dtype=np.float64)
    cdef double[:] Pr = np.empty(term, dtype=np.float64)
    cdef double Bt = B0
    cdef double pow_term = pow(1+r, term)
    cdef double A = Bt*r*pow_term/(pow_term - 1.)
    cdef double n = term
    cdef int i=0
    for i in range(term):
        n = term-i       
        I[i] = Bt * r
        Pr[i] = smm*Bt
        S[i] = A-I[i] if Bt>1e-2 else 0.
        P[i] = S[i] + Pr[i]
        Bt = max(Bt - P[i], 0.0)        
        B[i] = Bt
    return np.asarray(S),np.asarray(I), np.asarray(Pr),np.asarray(P), np.asarray(B)

Here is the Numba version

In [11]:
amortize_payments_nb = numba.njit(cache=True)(amortize_payments_py)

Let's compare the performance of the three function types.

In [12]:
B0 = 500000.
R = 4.0
term = 360

Python

In [17]:
%timeit -n1000 S,I, Pr,P, B = amortize_payments_py(B0, R, term, cpr=10)
1000 loops, best of 3: 369 µs per loop

Numba

In [18]:
%timeit -n1000 S,I, Pr,P, B = amortize_payments_nb(B0, R, term, cpr=10)
1000 loops, best of 3: 8.64 µs per loop

Cython

In [19]:
%timeit -n1000 S,I, Pr,P, B = amortize_payments_cy(B0, R, term, cpr=10)
1000 loops, best of 3: 28.7 µs per loop

Once again we see that cython and numba methods are orders of magnitude (roughly 15X) faster than python version. The numba version is almost 3X faster than Cython. There are probably some performance cython tweaks that I am missing perhaps.

Conclusion

In this post, I have redone Jake's pairwise distance example to see how the same code performs 3 years forward. I also discuss a common finance example of amortizing cashflow generation. The numba and cython snippets are orders of magnitude faster than a pure python version. Surprisingly, numba is 20% to 300% faster than cython on these examples. There may very well be some cython tweaks I might be missing. But nevertheless these examples show how one can easily get performance boost using numba module.




   python   programming   development   pandas   numpy   numba  

Related Post