The Need for Speed

When time is of the essence, Python is not always the best choice. For a very long time this wasn’t a concern for me but lately I have had reasons to speed up my code, especially my Monte Carlo simulations. There’s a lot you can do in Python to speed up things, like spending time thinking about your code to find bottlenecks, vectorization, JIT compilers like Numba, Cython to name a few.

After doing some research I settled on Cython as I figured it would be a nice introduction to C-style coding. Eventually I decided I could just cut out the middle man and skip over to C++ as I noted it was fairly easy to move the computation-heavy part of my code to C++ and use a Cython wrapper to make it work within Python, and that even mediocre C++ could would likely be a lot faster than solid Python code.

To show the speedups possible with the different approaches, below are some code in Python, Cython and C++ (with a Cython wrapper) solving the exact same problems.

The Problem of Points

We actually won’t start with a simulation, though the code does contain some loops that will be sped up considerably when moving away from Python. Below is the analytical solution to the classic Problem of Points. I’ll let Wikipedia do the introduction before we move on to the code:

The problem of points, also called the problem of division of the stakes, is a classical problem in probability theory. One of the famous problems that motivated the beginnings of modern probability theory in the 17th century, it led Blaise Pascal to the first explicit reasoning about what today is known as an expected value.

The problem concerns a game of chance with two players who have equal chances of winning each round. The players contribute equally to a prize pot, and agree in advance that the first player to have won a certain number of rounds will collect the entire prize. Now suppose that the game is interrupted by external circumstances before either player has achieved victory. How does one then divide the pot fairly? It is tacitly understood that the division should depend somehow on the number of rounds won by each player, such that a player who is close to winning will get a larger part of the pot. But the problem is not merely one of calculation; it also involves deciding what a “fair” division actually is.

Python

Nowadays, a seasoned gambler would immediately realise that the fair division is that each player get a share of the pot equal to his chances of winning the game, given the current score. This wasn’t so obvious in the 15th century when the problem was first discussed but only realized far later by Pascal and Fermat. So how to compute each player’s chances of winning then? Given the nature of the game this is fairly trivial and below is a somewhat efficient Python solution using combinatorics. It could very likely be improved in terms of speed of execution but will serve its purpose as our baseline to compare other approaches with.

def binomial_coefficient(n, k):
    result = 1
    for i in range(k):
        result *= (n - i) / (i + 1)
    return result

def points_python(a, b, N):
    total_needed = N - a
    total_games = total_needed + (N - b) - 1
    probability = sum(
        binomial_coefficient(total_games, i)
        for i in range(total_needed, total_games + 1)
    )
    
    prob_a_wins = probability / (2 ** total_games)
    
    return prob_a_wins

So let’s try it out: given the score 10-8 in favor of player A, what is his chances of winning if they are set to play until someone reaches 16 points?

points_python(10,8,16)

0.70947265625

In this case, if the game were to be interrupted at this point, player A should receive roughly 71% of the pot. What if the game was played to 32 points instead?

points_python(10,8,32)

0.6170040878776035

This time player A only have about a 62% chance of winning if the game would have continued, which makes sense as he now needs 12 more points to win instead of 6.

import timeit

a, b, N = 10, 8, 32

def time_python():
    p = points_python(a, b, N)

python_time = timeit.timeit(time_python, number=100000)

print(f"Python time: {python_time:.2f}s")

Python time: 11.30s

With a working Python solution, we’ll run the code 100,000 times to see how long it takes and to set our baseline to compare with. 11 seconds might seem like a very fast solution for such a large number of games, but we can actually do much better with Cython.

Cython

Cython compiles Python code to C, which if done right can speed up things considerably. It’s a bit more complicated that just writing pure Python but there’s a lot of help along the way to optimize the code.

For example, doing %load_ext Cython in a separate Jupyter notebook cell allows you start a regular Python cell with %%cython and then proceed with you Cython code in that cell. Including the -a flag also give you a nice HTML file you can investigate for possible bottlenecks in your code. Here’s the Cython version of our previous Python code:

%%cython

cimport cython
from libc.math cimport pow

@cython.cdivision(True)
cdef double binomial_coefficient(int n, int k) nogil:
    cdef double result = 1
    cdef int i
    for i in range(k):
        result *= (n - i)
        result /= (i + 1)
    return result

@cython.cdivision(True)
cpdef float points_cython(int a, int b, int N):
    cdef int total_needed = N - a
    cdef int total_games = total_needed + (N - b) - 1
    cdef int i
    cdef double probability = 0
    cdef double term

    for i in range(total_needed, total_games + 1):
        term = binomial_coefficient(total_games, i)
        probability += term

    cdef double prob_a_wins = probability / pow(2, total_games)
    return prob_a_wins

Let’s make sure we get the same answer from this version as from the Python version:

points_cython(10,8,32)

0.6170040965080261

That’s very, very, very close to the number we got from Python so all good so far. Let’s see how fast the Cython code is:

def time_cython():
    p = points_cython(a, b, N)

cython_time = timeit.timeit(time_cython, number=100000)

print(f"Python time: {python_time:.2f}s")
print(f"Cython time: {cython_time:.2f}s")
print(f"\tSpeedup: {python_time / cython_time:.2f}x")

Python time: 11.33s
Cython time: 0.28s
    Speedup: 40.68x

That’s a whooping 40x speedup just by moving to Cython! Very cool, but can we do better? Let’s move on and see what C++ can do.

C++

Going from Python to Cython is fairly easy as Cython is basically just a very Python-like bridge to C++. Writing C++ code however is very different to Python so a bit more research is needed to get where we want.

// problem_of_points.cpp
#include <cmath>

double binomial_coefficient(int n, int k) {
    double result = 1;
    for (int i = 0; i < k; ++i) {
        result *= static_cast<double>(n - i) / (i + 1);
    }
    return result;
}

double points_cpp(int a, int b, int N) {
    int total_needed = N - a;
    int total_games = total_needed + (N - b) - 1;
    double probability = 0;

    for (int i = total_needed; i <= total_games; ++i) {
        probability += binomial_coefficient(total_games, i);
    }

    return probability / std::pow(2, total_games);
}

After writing our C++ implementation of the Problem of Points solution and saving it as .cpp file we can go ahead and write our Cython wrapper, again using the %%cython magic command, allowing us to compile our code inside the notebook without creating any additional files. In our Cython code we basically just import our C++ file and use it to create a Python function, still leveraging all the good stuff from C++, like the speed.

%%cython
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11
# distutils: extra_link_args = -std=c++11
# cpp_problem_of_points.pyx

cdef extern from "/path/to/your/problem_of_points.cpp":
    double points_cpp(int a, int b, int N)

def points_py_cpp(int a, int b, int N):
    cdef double prob_a_wins = points_cpp(a, b, N)
    return prob_a_wins

So now we are finally ready to compare all three methods and see how much we can speed up the original Python code.

def time_cpp():
    p = points_py_cpp(a, b, N)
    
cpp_time = timeit.timeit(time_cpp,number=100000)

print(f"Python time: {python_time:.2f}s")
print(f"Cython time: {cython_time:.2f}s")
print(f"\tSpeedup: {python_time / cython_time:.2f}x")
print(f"c++ time: {cpp_time:.2f}s")
print(f"\tSpeedup: {python_time / cpp_time:.2f}x")

Python time: 11.30s
Cython time: 0.28s
    Speedup: 40.60x
c++ time: 0.14s
    Speedup: 80.72x

If the speedup by Cython was great, switching to C++ is fantastic. And we’ve done nothing fancy here, the C++ code is very basic and can probably be optimized even further. But we are gamblers and not developers so we’ll take the easy 80x speedup and go back to actually using our improved models to gain an edge.