How to Make Python Faster Without Trying That Much
This article is based on a demo I gave at a Theoretical Systems Biology Group session on the 4th of February.
The example focuses on optimising a small program that computes the best alignment for two protein sequences.
Starting code
The code for this demo is available directly from GitHub.
Please check out the starting-code tag to begin with:
git clone git@github.com:sauliusl/python-optimisation-example.git
git checkout starting-code
Alternatively, you can browse the code here.
Take a look at the alignment directory; it contains three files:
align.py – the target of our optimisation today, as it contains the routine that computes the dynamic programming matrix;
alignment.py – the main executable that calls this function and prints the output;
sequences.py – a file that provides a convenient way to get the sequences.
Baseline
Before starting any optimisations, let’s verify whether the code works at all:
> python alignment.py
Score: 315
Alignment:
MEE-PQSD-PS-VE-PPLSQETFS-DLWKLL-PE--NNVL-SP--LPSQAMDD-LMLSP-
MEES-QSDI-SL-EL-PLSQETFSG-LWKLLPPEDI---LPSPHC-----MDDLL-L-PQ
<... snip ...>
KE-PG-GSRAHSS-HLK-SKKGQSTSRHKK-LM--FK-TEGPDSD
-ES-GD-SRAHSSY-LKT-KKGQSTSRHKKT-MVK-KV--GPDSD
As you can see, the code aligns the sequences in some way.
The output seems reasonable given the simplicity of the alignment scores, which is great news for us.
We can now move on to establishing a baseline for how fast the code runs.
The Unix command time is perfect for this task.
> time python alignment.py
Running this command in the terminal reports the baseline to be around 0.502s.
At this point it is interesting to measure the runtime under an optimising Python interpreter, PyPy. Download and install PyPy according to the instructions on their download page. For those lucky enough to use OSX, Homebrew provides a simple way to install PyPy:
> brew install pypy
Since we know PyPy is an optimising Python implementation, we would expect the code to run faster on it than on CPython.
To test this assumption, we can again use time:
> time pypy alignment.py
This reports the code runs for 0.255s – almost twice as fast as under the standard Python interpreter. Thinking about it, it is quite impressive that we were able to nearly halve the runtime without any significant effort on our side (besides installing PyPy, of course).
Profiling the Baseline
Now let’s see if we can speed things up even further.
Let's fire up cProfile and investigate where the program spends the majority of its time.
The standard way to run cProfile is given below:
> python -m cProfile -o profile.pstats alignment.py
The -o profile.pstats option tells the profiler to save the stats in the profile.pstats file, which we are going to view using the gprof2dot script and graphviz.
If you do not yet have these packages, you should be able to install gprof2dot using pip directly:
> pip install gprof2dot
If your computer does not recognise the pip command, try installing it first using these instructions.
Since graphviz is not a Python package it cannot be installed via pip; follow the instructions on its download page for guidance on how to obtain it. Alternatively, you can use your distribution's package manager. For instance, the aforementioned Homebrew on OSX can do this (brew install graphviz).
Once you have both modules, run the following two-part command:
> gprof2dot -f pstats profile.pstats -n 0 -e 0 | dot -Tpdf -o profile.pdf
Here -n 0 -e 0 just tells the software to draw all nodes (by default it cuts a few unimportant ones off for visual clarity). The piped dot command generates profile.pdf with the profile of function calls for the program.
Open this profile.pdf file with your favourite PDF viewer to see the results.
Improving the code based on profile results
From the profile above we can see that the majority of the time is spent in the align function, just as we would expect, and that 35% of it is spent in the max function.
Out of this, 9.75% of the time goes to the lambda function lambda x: x[0], which simply tells max to compare candidates by their first element.
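For context, the call in question presumably looks something like this (a sketch reconstructed from the profile; the variable values here are made up purely for illustration):

```python
# Pick the best of three (score, traceback_position) candidates;
# the lambda tells max to compare candidates by their first element.
i, j = 3, 5
match_score, deletion, insertion = 4, 2, 1  # illustrative values only

candidates = [(match_score, (i - 1, j - 1)),
              (deletion, (i - 1, j)),
              (insertion, (i, j - 1))]

max_score, traceback_pos = max(candidates, key=lambda x: x[0])
print(max_score, traceback_pos)  # → 4 (2, 4)
```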
Judging from these results, one must ask whether using the max function for just three elements is overkill. Maybe we were too clever here and could get away with simpler code? Let's replace the call to max with a block of nested if statements:
if match_score >= deletion and match_score >= insertion:
    max_score = match_score
    traceback_pos = i-1, j-1
elif deletion > match_score and deletion >= insertion:
    max_score = deletion
    traceback_pos = i-1, j
else:
    max_score = insertion
    traceback_pos = i, j-1

matrix[i, j] = max_score
traceback[i, j] = traceback_pos
To get this code just check out the simplified-initial tag:
> git checkout simplified-initial
Alternatively, browse it here.
Let's time this new code:
> time python alignment.py
0m0.327s
> time pypy alignment.py
0m0.249s
We can see that the CPython version of the code became much faster (remember, it was around 0.5s before), whereas the PyPy runtime stayed roughly the same, suggesting that PyPy had already applied this optimisation on its own.
Running cProfile again, we can see that the align function is still the bottleneck. Unfortunately, the profile no longer tells us what is making it slow.
Profiling line-by-line
We will use a line profiler to get more granularity for our profiles.
First install it if you haven’t already:
> pip install line_profiler
Now, in order for the profiler to work, we need to tell it which functions we want to profile. This is done by decorating said functions with the @profile decorator.
Open align.py in your favourite text editor and add the line @profile on top of the def align ... line, i.e.:
# align.py
# < snip ... >
@profile
def align(left, right):
matrix = {}
#< snip ... >
Don't worry about importing this decorator; the profiler will make sure to inject it for you automagically.
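If you also want the script to keep working when run outside kernprof, a common trick is to define a no-op fallback whenever the decorator has not been injected. A sketch (on Python 2, the builtins module is called __builtin__):

```python
try:
    import builtins                  # Python 3
except ImportError:
    import __builtin__ as builtins   # Python 2

# kernprof injects `profile` into builtins; outside it, fall back to a no-op
# decorator so the @profile-decorated code still runs unchanged.
if not hasattr(builtins, "profile"):
    def profile(func):
        return func

@profile
def square(x):
    return x * x

print(square(3))  # → 9
```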
The actual profiling is done by running
> kernprof.py --line-by-line alignment.py
This stores the results in the alignment.py.lprof file, which can then be viewed with:
> python -m line_profiler alignment.py.lprof
The first command in this set generates the profile, and the second one prints it in a readable format. This is the output:
Timer unit: 1e-06 s
File: align.py
Function: align at line 6
Total time: 1.38775 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
6 @profile
7 def align(left, right):
8
9 1 3 3.0 0.0 matrix = {}
10
11 # Initialise
12 1 1 1.0 0.0 matrix[0, 0] = 0
13 1 1 1.0 0.0 traceback = {}
14 394 369 0.9 0.0 for i in xrange(len(left)):
15 393 525 1.3 0.0 matrix[i+1, 0] = (i+1) * DELETION_SCORE
16 393 624 1.6 0.0 traceback[i+1, 0] = i, 0
17
18 388 405 1.0 0.0 for i in xrange(len(right)):
19 387 477 1.2 0.0 matrix[0, i+1] = (i+1) * INSERTION_SCORE
20 387 532 1.4 0.0 traceback[0, i+1] = 0, i
21
22
23
24 394 338 0.9 0.0 for i, symbol_left in enumerate(left):
25 152484 140777 0.9 10.1 for j, symbol_right in enumerate(right):
26 152091 125126 0.8 9.0 match_score = matrix[i, j] + MATCH_SCORE if symbol_left == symbol_right else MISMATCH_SCORE
27 152091 155781 1.0 11.2 deletion = matrix[i, j+1] + DELETION_SCORE
28 152091 146020 1.0 10.5 insertion = matrix[i+1, j] + INSERTION_SCORE
29 152091 111502 0.7 8.0 if match_score >= deletion and match_score >= insertion:
30 9563 6786 0.7 0.5 max_score = match_score
31 9563 7496 0.8 0.5 traceback_pos = i, j
32 142528 114515 0.8 8.3 elif deletion > match_score and deletion >= insertion:
33 87713 62775 0.7 4.5 max_score = deletion
34 87713 72069 0.8 5.2 traceback_pos = i, j+1
35 else:
36 54815 38668 0.7 2.8 max_score = insertion
37 54815 44198 0.8 3.2 traceback_pos = i+1, j
38
39 152091 171714 1.1 12.4 matrix[i+1, j+1] = max_score
40 152091 187046 1.2 13.5 traceback[i+1, j+1] = traceback_pos
41
42 1 2 2.0 0.0 return matrix[len(left), len(right)], traceback
We can see that we spend a significant amount of time in the lines that access the matrix, i.e. lines 26-28 and 39. Similarly, a significant amount of time is spent in the comparisons and other parts of the loop. Optimising these lines will provide the most benefit. Let's see if we can find a way to do so.
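For readers following along without the repository, the function in the listing above can be reproduced as a self-contained sketch. It is ported to Python 3 (range instead of xrange), the score constants are assumptions (the repository's actual values may differ), and the match/mismatch expression is parenthesised so that both branches add to matrix[i, j], which is the standard recurrence:

```python
MATCH_SCORE, MISMATCH_SCORE = 2, -1      # assumed scoring values
DELETION_SCORE = INSERTION_SCORE = -1    # assumed scoring values

def align(left, right):
    # Dynamic programming matrix and traceback pointers, keyed by (row, col).
    matrix = {(0, 0): 0}
    traceback = {}
    for i in range(len(left)):
        matrix[i + 1, 0] = (i + 1) * DELETION_SCORE
        traceback[i + 1, 0] = i, 0
    for j in range(len(right)):
        matrix[0, j + 1] = (j + 1) * INSERTION_SCORE
        traceback[0, j + 1] = 0, j

    for i, symbol_left in enumerate(left):
        for j, symbol_right in enumerate(right):
            match_score = matrix[i, j] + (
                MATCH_SCORE if symbol_left == symbol_right else MISMATCH_SCORE)
            deletion = matrix[i, j + 1] + DELETION_SCORE
            insertion = matrix[i + 1, j] + INSERTION_SCORE
            if match_score >= deletion and match_score >= insertion:
                max_score, traceback_pos = match_score, (i, j)
            elif deletion >= insertion:
                max_score, traceback_pos = deletion, (i, j + 1)
            else:
                max_score, traceback_pos = insertion, (i + 1, j)
            matrix[i + 1, j + 1] = max_score
            traceback[i + 1, j + 1] = traceback_pos

    return matrix[len(left), len(right)], traceback

score, _ = align("AB", "AB")
print(score)  # → 4 with the assumed scores
```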
Cythonising the code
Cython can be considered a compiler for Python.
It is able to convert Python code into C, which should provide a significant performance boost for simple statements such as element access in a matrix.
In order to start using it, we need to install it first; pip can handle this for us:
> pip install Cython
Once the installation is complete, it is ridiculously easy to convert the code to Cython format:
> cp align.py align.pyx
You got it right: we just copied the file to one with a different extension (.pyx is the Cython file extension).
In order to run it, however, we need to compile this file. This is done by creating a setup.py.
The following code provides the minimal skeleton for such a file:
1 # setup.py
2 from distutils.core import setup
3 from distutils.extension import Extension
4 from Cython.Distutils import build_ext
5
6 setup(
7     cmdclass={'build_ext': build_ext},
8     ext_modules=[Extension("align_cythonised", ["align.pyx"])]
9 )
Line eight in this file roughly translates to: compile the file align.pyx into a library called align_cythonised.
Now let’s quickly compile our code using this setup file:
> python setup.py build_ext --inplace
This should generate align.c and align_cythonised.so files in your local directory (do not forget the --inplace flag, as otherwise the output would go into the build/ directory).
We can now use the timeit module to compare the runtimes of these functions.
It is convenient to use IPython (http://ipython.org/, pip install ipython) for timeit tests due to its magic command %timeit. I use it here:
In [1]: from sequences import P53_HUMAN, P53_MOUSE
In [2]: from align import align
In [3]: from align_cythonised import align as align_cythonised
In [4]: %timeit align(P53_HUMAN, P53_MOUSE)
1 loops, best of 3: 258 ms per loop
In [5]: %timeit align_cythonised(P53_HUMAN, P53_MOUSE)
1 loops, best of 3: 189 ms per loop
We can see we got a 69 ms improvement just by compiling the code, without changing a single line – not bad. Note that PyPy runs this function in 168 ms – 21 milliseconds faster still. One could overstretch this result and say that Python runs faster than C in this scenario (hehe).
Let’s see if we can make Cythonised code beat this though.
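If you do not have IPython handy, the stdlib timeit module gives comparable numbers. A sketch timing a toy stand-in function (the alignment functions would be timed the same way):

```python
import timeit

def slow_sum(n):
    # A stand-in for the align call, just to show the timing pattern.
    return sum(range(n))

# min() over repeat() mirrors %timeit's "best of 3" reporting: the fastest
# run is the least disturbed by other processes on the machine.
best = min(timeit.repeat(lambda: slow_sum(10000), repeat=3, number=10))
print("best of 3: %.6f s per 10 loops" % best)
```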
Numpy and Cython
Cython collaborates with the numpy folks a lot and can use numpy's C interface natively. We can exploit this native interface to our advantage.
To do so, we will change our matrix to be a numpy n-dimensional array rather than a dictionary.
First of all, we need to install numpy; pip can be used for this yet again:
> pip install numpy
You might need a Fortran compiler to do so (on OSX you can obtain one from Homebrew: brew install gfortran). If you are having trouble with this step, consult the downloads page for how to get the precompiled library directly.
Once this is all set, copy align.pyx into align_numpy.pyx and change it slightly to use numpy.
The modified file is available from GitHub.
The Git tag cython-final can be used to check out this code:
> git checkout cython-final
In the first couple of lines of align_numpy.pyx we import both the Python and C bindings for numpy:
import numpy as np # Import python numpy bindings
cimport numpy as np # Import C-level numpy bindings
Then, we define the matrix to be a numpy
array (rather than a dictionary), and initialise it:
cdef np.ndarray[np.float_t, ndim=2] matrix
matrix = np.empty((len_left+1, len_right+1))
Then, we explicitly declare our iterators, i and j, to be unsigned integers:
cdef unsigned int i, j
This step is crucial, as numpy array accesses are optimised to native C speeds only for typed indexes.
We add a couple of other cdef hints, and we are ready to go.
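In plain numpy terms, the switch replaces dictionary lookups with array indexing. A minimal sketch of the matrix initialisation (the score constants and sequences are assumptions for illustration; the repository's values may differ):

```python
import numpy as np

DELETION_SCORE = INSERTION_SCORE = -1  # assumed scoring values
left, right = "ACGT", "AGT"            # toy sequences

# Pre-allocate the whole DP matrix instead of growing a dict key by key.
matrix = np.empty((len(left) + 1, len(right) + 1))
matrix[0, 0] = 0
matrix[1:, 0] = np.arange(1, len(left) + 1) * DELETION_SCORE    # first column
matrix[0, 1:] = np.arange(1, len(right) + 1) * INSERTION_SCORE  # first row

print(matrix[:, 0])  # first column: 0, -1, -2, -3, -4
```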
In order to compile this new file, setup.py needs to change a little as well; the updated file is available on GitHub.
All we did was add another Extension entry to make setup.py aware of the new file:
Extension("align_cythonised_numpy", ["align_numpy.pyx"],
include_dirs=[numpy.get_include()])]
We also make sure to include the numpy headers in the compilation; the include_dirs parameter takes care of this.
We can now check how fast this code runs using the %timeit magic function, after appropriately importing the numpy-cythonised function:
In [7]: %timeit align_cythonised_numpy(P53_HUMAN, P53_MOUSE)
10 loops, best of 3: 101 ms per loop
As you can see, the total runtime is 101 milliseconds. This is 157 ms faster than the original version of the code, and all we did was start using numpy and add a couple of type hints.
This Cythonised code can be further optimised by enabling potentially dangerous options such as disabling bounds checking. Furthermore, if you want to take optimisation matters into your own hands, you can call native C code from Cython. I refer the reader to the Cython documentation and the various tutorial videos from SciPy conferences for more information on these advanced topics.
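For reference, the bounds-checking switches mentioned above are typically enabled with per-file compiler directives placed at the very top of the .pyx file (shown here for orientation; check the Cython documentation for the directives supported by your version):

```python
# cython: boundscheck=False
# cython: wraparound=False
```

boundscheck=False skips index bounds checks on typed array accesses, and wraparound=False disables Python-style negative indexing; both make out-of-range accesses undefined behaviour, so only enable them once the code is known to be correct.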
Conclusions
In conclusion, this guide provides a reference for some quick optimisations you can make to your code without much significant effort. We were able to significantly improve the runtime of Python code by profiling where the bottlenecks are and making simple, incremental changes.
A word of warning, though: these alternative interpreters/compilers for Python should be considered only as a last resort. More often than not, there is a way to obtain a sufficient performance boost by simply changing bits of the code, as we did in the beginning. If that proves insufficient for your needs, however, these optimisation tools already exist and will continue to improve over time.