Advanced and Parallel Python

Finding Bottlenecks

Now that we have seen some techniques requiring little or no code modification, we may want to tweak the code itself, since there is a good chance we can do better. Determining which part of our code needs attention is called profiling. Profilers come in a variety of shapes and functionalities; what we'll use here are two deterministic (event-based) profilers, as opposed to statistical ones. This means that instead of relying on partial data obtained by sampling at regular intervals, hit counts are exact and reproducible. The downside of an event-based profiler is that it can slow down your code, sometimes significantly.
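To make that distinction concrete, here is a minimal sketch (a toy, not a usable profiler) of the kind of hook a deterministic profiler installs, using sys.setprofile:

import sys
from collections import Counter

calls = Counter()

def hook(frame, event, arg):
    # The interpreter invokes this hook on every call/return event,
    # which is why hit counts are exact -- and why every single call
    # pays the profiling overhead.
    if event == "call":
        calls[frame.f_code.co_name] += 1

def square(x):
    return x * x

sys.setprofile(hook)
sum(square(i) for i in range(1000))
sys.setprofile(None)

print(calls["square"])  # exactly 1000: every call was observed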

Here are some things to keep in mind before optimizing:

  • Make sure your code works/is correct beforehand.
  • Make sure you focus on the right code by profiling.
  • Make and keep performance measurements along the way (see the timeit sketch after this list).
  • Make the smallest changes possible at a time.
  • Keep track of your code changes using Git.
  • Always make sure your unit tests/your output is still valid.
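
For the measurement step, the standard timeit module gives more stable numbers than a single wall-clock run. Here is a minimal sketch with a placeholder function (the function name and workload are made up for illustration):

import timeit

def workload():
    # Placeholder for the code you are measuring.
    return sum(i * i for i in range(100000))

# repeat() runs the workload several times; keeping the minimum gives the
# least noisy estimate. Record this number before and after every change.
best = min(timeit.repeat(workload, repeat=3, number=10))
print("best of 3 (10 runs each): {0:.3f} s".format(best))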

We’ll start with a simple example that can be found in the profile.py file:

import sys
import time
import random

random.seed(time.time())

def gen_data(n):
    numbers = []
    for i in range(n):
        numbers.append(random.random())
    return numbers

def sum_nexts(numbers):
    sums = []
    for i in range(len(numbers)):
        for j in range(i+1, len(numbers)):
            if len(sums) < i+1:
                sums.append(0.)
            sums[i] = sums[i] + numbers[j]
    return sums

def main(n):
    numbers = gen_data(n)
    sums = sum_nexts(numbers)
    return sums

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.stderr.write("Usage: {0} <n>\n".format(sys.argv[0]))
        sys.exit(1)

    n = int(sys.argv[1])
    main(n)

This small program takes a number n as input, generates n random elements in a list called numbers, and computes a list of n elements called sums such that:

sums[i] = numbers[i+1] + numbers[i+2] + ... + numbers[n-1],    for i = 0, 1, ..., n-1

(with sums[n-1] = 0, since there is nothing left to add).

The first thing we want to do is make note of the time it takes to run the program. We’ll start small, with a run of 10000 generated elements.

$ time python profile.py 10000
real    0m9.787s
user    0m9.745s
sys 0m0.025s

We’re now ready to make our first profiling measurement to find our (possible) bottlenecks. For that, we’ll use cProfile, the profiler recommended by the Python documentation:

$ python -m cProfile profile.py 10000
         50045059 function calls in 15.072 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 __future__.py:48(<module>)
        1    0.000    0.000    0.000    0.000 __future__.py:74(_Feature)
        7    0.000    0.000    0.000    0.000 __future__.py:75(__init__)
        6    0.000    0.000    0.000    0.000 hashlib.py:100(__get_openssl_constructor)
        1    0.016    0.016    0.016    0.016 hashlib.py:56(<module>)
        1    0.003    0.003   15.072   15.072 profile.py:1(<module>)
        1   11.894   11.894   15.042   15.042 profile.py:13(sum_nexts)
        1    0.000    0.000   15.046   15.046 profile.py:22(main)
        1    0.003    0.003    0.004    0.004 profile.py:7(gen_data)
        2    0.000    0.000    0.001    0.000 random.py:100(seed)
        1    0.006    0.006    0.024    0.024 random.py:40(<module>)
        1    0.000    0.000    0.000    0.000 random.py:655(WichmannHill)
        1    0.000    0.000    0.000    0.000 random.py:72(Random)
        1    0.000    0.000    0.000    0.000 random.py:805(SystemRandom)
        1    0.000    0.000    0.001    0.001 random.py:91(__init__)
        1    0.000    0.000    0.000    0.000 {_hashlib.openssl_md5}
        1    0.000    0.000    0.000    0.000 {_hashlib.openssl_sha1}
        1    0.000    0.000    0.000    0.000 {_hashlib.openssl_sha224}
        1    0.000    0.000    0.000    0.000 {_hashlib.openssl_sha256}
        1    0.000    0.000    0.000    0.000 {_hashlib.openssl_sha384}
        1    0.000    0.000    0.000    0.000 {_hashlib.openssl_sha512}
        1    0.000    0.000    0.000    0.000 {binascii.hexlify}
        2    0.001    0.000    0.001    0.000 {function seed at 0x1018aa398}
        6    0.000    0.000    0.000    0.000 {getattr}
        6    0.000    0.000    0.000    0.000 {globals}
 50005002    2.881    0.000    2.881    0.000 {len}
        1    0.000    0.000    0.000    0.000 {math.exp}
        2    0.000    0.000    0.000    0.000 {math.log}
        1    0.000    0.000    0.000    0.000 {math.sqrt}
    19999    0.001    0.000    0.001    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    10000    0.001    0.000    0.001    0.000 {method 'random' of '_random.Random' objects}
        1    0.000    0.000    0.000    0.000 {method 'union' of 'set' objects}
        1    0.000    0.000    0.000    0.000 {posix.urandom}
    10002    0.266    0.000    0.266    0.000 {range}
        1    0.000    0.000    0.000    0.000 {time.time}

You must remember that you can’t compare run timings from profiled and non-profiled code: profiling incurs an overhead and is usually a lot slower (about 50% slower here). The default output is also not very helpful, since entries are sorted by file and function name rather than by cost. What we want most of the time is the function calls sorted by decreasing cumulative time. This can be accomplished as follows:

$ python -m cProfile -o output.cprofile profile.py 10000
$ python
>>> import pstats; s = pstats.Stats("output.cprofile")
>>> s.sort_stats('cumulative').print_stats(10)
Wed Feb 10 10:28:35 2016    output.cprofile

         50045059 function calls in 15.208 seconds

   Ordered by: cumulative time
   List reduced from 36 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   15.208   15.208 profile.py:1(<module>)
        1    0.000    0.000   15.202   15.202 profile.py:22(main)
        1   12.022   12.022   15.197   15.197 profile.py:13(sum_nexts)
 50005002    2.910    0.000    2.910    0.000 {len}
    10002    0.265    0.000    0.265    0.000 {range}
        1    0.002    0.002    0.005    0.005 /Users/laurent/anaconda/envs/python2/lib/python2.7/random.py:40(<module>)
        1    0.003    0.003    0.005    0.005 profile.py:7(gen_data)
        1    0.002    0.002    0.002    0.002 /Users/laurent/anaconda/envs/python2/lib/python2.7/hashlib.py:56(<module>)
    19999    0.001    0.000    0.001    0.000 {method 'append' of 'list' objects}
    10000    0.001    0.000    0.001    0.000 {method 'random' of '_random.Random' objects}

Saving the output to a (binary) file with the -o option lets us manipulate the stats data afterwards; this is almost always what you want when profiling real software, since there will be many functions to display. We then loaded the file with pstats and used the print_stats method, its argument of 10 restricting the report to the top 10 entries. From this it’s easy to see that most of the computation time, as expected, is spent in the sum_nexts function. Beware that it won’t always be this predictable: you must profile before you start optimizing your code.
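For quick, one-off inspections you don't even need the interactive session: running python -m cProfile -s cumulative profile.py 10000 lets cProfile sort the report for you (the -s option accepts the same sort keys as pstats). The saved stats file can also be sliced in other useful ways, for example:

import pstats

s = pstats.Stats("output.cprofile")
s.strip_dirs()                            # shorten long file paths
s.sort_stats('tottime').print_stats(10)   # time spent inside each function itself
s.print_callers('sum_nexts')              # which callers account for those calls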

The problem with cProfile is that its view is very high-level: we know where to start looking, but not what happens inside the function. To dig further, we need a line-based profiler called line_profiler. It requires manual instrumentation, by adding a @profile decorator to the functions we want to look at in more detail. This is mainly for performance reasons, as profiling the entire program at line level would generally be too costly.

@profile
def sum_nexts(numbers):
    sums = []
    for i in range(len(numbers)):
        for j in range(i+1, len(numbers)):
            if len(sums) < i+1:
                sums.append(0.)
            sums[i] = sums[i] + numbers[j]
    return sums
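
A practical note on the decorator: profile is not imported from anywhere, it is injected into the builtins by kernprof when it runs the script, so running the file without kernprof would raise a NameError. (If line_profiler is not installed yet, it typically comes from pip install line_profiler.) An optional guard like the following, not part of the original example, keeps the script runnable both ways:

try:
    profile  # defined by kernprof at run time
except NameError:
    # No-op fallback so the script also runs without kernprof.
    def profile(func):
        return func

With this in place, the decorator can stay put while you alternate between plain timing runs and kernprof runs.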

We can start the profiler and run our program using the following:

$ kernprof -l -v profile.py 10000
Wrote profile results to profile.py.lprof
Timer unit: 1e-06 s

Total time: 31.2661 s
File: profile.py
Function: sum_nexts at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                           @profile
    14                                           def sum_nexts(numbers):
    15         1            3      3.0      0.0      sums = []
    16      2939         1714      0.6      0.0      for i in range(len(numbers)):
    17  25068105      8827075      0.4     28.2          for j in range(i+1, len(numbers)):
    18  25065166     10874013      0.4     34.8              if len(sums) < i+1:
    19      2939         3654      1.2      0.0                  sums.append(0.)
    20  25065166     11559601      0.5     37.0              sums[i] = sums[i] + numbers[j]
    21                                               return sums

This makes it clear that a lot of the time is consumed by the innocuous-looking if len(sums) < i+1 statement. It is not slow per se, but called 25 million times it becomes significant. A simple change we can make is to pre-initialize our result list with zeros. This change would look like this:

def sum_nexts(numbers):
    sums = [0.]*len(numbers)
    for i in range(len(numbers)):
        for j in range(i+1, len(numbers)):
            sums[i] = sums[i] + numbers[j]
    return sums

We removed the @profile decorator since we want to measure the impact of our change by comparing against the original run time:

$ time python profile.py 10000
real    0m5.433s
user    0m5.410s
sys 0m0.017s

This small modification alone nearly halved the run time. Let’s see if we can do better. We add the @profile decorator back and run the line-level profiler again:

$ kernprof -l -v profile.py 10000
Wrote profile results to profile.py.lprof
Timer unit: 1e-06 s

Total time: 8.33541 s
File: profile.py
Function: sum_nexts at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                           @profile
    14                                           def sum_nexts(numbers):
    15         1           28     28.0      0.0      sums = [0.]*len(numbers)
    16      1152          566      0.5      0.0      for i in range(len(numbers)):
    17  10853166      3649381      0.3     43.8          for j in range(i+1, len(numbers)):
    18  10852014      4685433      0.4     56.2              sums[i] = sums[i] + numbers[j]
    19                                               return sums

That inner loop (over j) sure seems to be a problem. Instead of iterating from i+1 to len(numbers)-1, we could use Python slicing and see if we get better performance.

def sum_nexts(numbers):
    sums = [0.]*len(numbers)
    for i in range(len(numbers)):
        sums[i] = sum(numbers[i+1:])
    return sums

Having removed the @profile decorator again, we can test our hypothesis:

$ time python profile.py 10000
real    0m0.472s
user    0m0.460s
sys 0m0.010s

That’s great! A 20x improvement since the beginning of the process. Let’s see what our line-based profiler has to say:

@profile
def sum_nexts(numbers):
    sums = [0.]*len(numbers)
    for i in range(len(numbers)):
        sums[i] = sum(numbers[i+1:])
    return sums

$ kernprof -l -v profile.py 10000
Wrote profile results to profile.py.lprof
Timer unit: 1e-06 s

Total time: 0.4534 s
File: profile.py
Function: sum_nexts at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                           @profile
    14                                           def sum_nexts(numbers):
    15         1           27     27.0      0.0      sums = [0.]*len(numbers)
    16     10001         3639      0.4      0.8      for i in range(len(numbers)):
    17     10000       449733     45.0     99.2          sums[i] = sum(numbers[i+1:])
    18         1            1      1.0      0.0      return sums

This is what we want to see: almost all of the run time is spent in the actual computation (sum), which is great. But still, since we are working with lists, why not use Numpy arrays and see what we can achieve? Here is a modified version using Numpy arrays instead of lists:

import numpy

numpy.random.seed(int(time.time()))

def gen_data(n):
    return numpy.random.random(n)

def sum_nexts(numbers):
    sums = numpy.zeros(len(numbers))
    for i in range(len(numbers)):
        sums[i] = sum(numbers[i+1:])
    return sums

But running it yields an unexpected result. We are far worse than we were before those modifications:

$ time python profile.py 10000
real    0m4.331s
user    0m4.292s
sys 0m0.034s

Why? Let’s ask our profiler:

$ kernprof -l -v profile.py 10000
Wrote profile results to profile.py.lprof
Timer unit: 1e-06 s

Total time: 4.21296 s
File: profile.py
Function: sum_nexts at line 10

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    10                                           @profile
    11                                           def sum_nexts(numbers):
    12         1           85     85.0      0.0      sums = numpy.zeros(len(numbers))
    13     10001         4114      0.4      0.1      for i in range(len(numbers)):
    14     10000      4208765    420.9     99.9          sums[i] = sum(numbers[i+1:])
    15         1            0      0.0      0.0      return sums

This is again expected: we spend all of our time doing the sum. But why is it slower? The answer is that we are still using the Python built-in sum function and not its Numpy counterpart, so the sum iterates over the Numpy array one element at a time, converting each value back into a Python object. We can keep the computation inside the Numpy library by using the array’s own sum method:

def sum_nexts(numbers):
    sums = numpy.zeros(len(numbers))
    for i in range(len(numbers)):
        sums[i] = numbers[i+1:].sum()
    return sums

$ time python profile.py 10000
real    0m0.145s
user    0m0.112s
sys 0m0.029s

So in the end, with only a handful of small, measured changes, we made the program about 65 times faster.
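
As a final aside (not required for the exercise below), the remaining Python-level loop can be removed entirely: each sums[i] is simply the grand total minus the running sum of the first i+1 elements, so a single vectorized expression gives the same result:

def sum_nexts(numbers):
    # sums[i] = total - (numbers[0] + ... + numbers[i])
    #         = numbers[i+1] + ... + numbers[n-1]
    return numbers.sum() - numpy.cumsum(numbers)

Note that the summation order changes, so the floating-point results may differ very slightly from the slice-based version.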

More Data Processing

Try optimizing the exercices/process_data.py script by first using a profiler and finding hotspots.

Tip: Before running the script, generate a random sample.

cd exercices/ && python gen_inputs.py

A possible solution can be found in the solutions/process_data.py file.