Brief Introduction to Profiling

Let’s start with a quick introduction to what profiling is. Profiling is a run-time program analysis technique. Generally, a certain level of instrumentation is required to retrieve some kind of tracing information while the program is running. This usually takes the form of tracing instructions interleaved with the lines of your source code, much like debug statements, enriched with timestamps or other relevant details, such as memory usage.

One normally distinguishes between two main categories of profilers:

event-based (or deterministic)

and

statistical (or sampling).

Profilers in the first category make use of hooks that allow registering event callbacks. At the lowest level, these hooks are provided directly by the operating system and allow you to trace events like function calls and returns. Virtual machines and interpreters, like JVM and CPython, provide software hooks instead, for generally the same events, but also for language-specific features, like class loading for instance. The reason why profilers in this category are called deterministic is that, by listening to the various events, you can get a deterministic view of what is happening inside your application.

In contrast, statistical profilers tend to provide approximate figures only, obtained by, e.g., sampling the call stack at regular intervals of time. These samples can then be analysed statistically to provide meaningful metrics for the profiled target.
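To make the idea concrete, here is a minimal, illustrative sketch of a sampling profiler: a background thread takes a snapshot of the frame stacks of all running threads at (roughly) regular intervals and counts how often each stack is seen. The names sampler and busy_work are made up for this example, and a real tool would be far more careful about overhead and accuracy.

import collections
import sys
import threading
import time

samples = collections.Counter()
stop = threading.Event()


def sampler(interval=0.001):
    # Periodically snapshot every thread's frame stack and tally it.
    while not stop.is_set():
        for frame in sys._current_frames().values():
            stack = []
            while frame:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            samples[" -> ".join(reversed(stack))] += 1
        time.sleep(interval)


def busy_work():
    return sum(i * i for i in range(2_000_000))


if __name__ == "__main__":
    threading.Thread(target=sampler, daemon=True).start()
    busy_work()
    stop.set()
    for stack, count in samples.most_common(5):
        print(count, stack)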

One might get the impression that deterministic profilers are a better choice than statistical profilers. However, both categories come with pros and cons. For example, statistical profilers usually require little instrumentation, if any at all, and introduce less overhead in the profiled target program. Therefore, if a statistical profiler can guarantee a certain accuracy on the metrics derived from its samples, then it is usually a better choice than a more accurate deterministic profiler that introduces higher overhead.

Python Profiling

There is quite a plethora of profiling tools available for Python, either deterministic or statistical. The official documentation describes the use of the Python profiling interface through two different implementations:

profile

and

cProfile.

The former is a pure Python module and, as such, introduces more overhead than the latter, which is a C extension that implements the same interface as profile. They both fit into the category of deterministic profilers and make use of the Python C API PyEval_SetProfile to register event hooks.

Standard Python Profiling

Let’s have a look at how to use cProfile, as this will be the standard choice for a deterministic profiler. Here is an example that will profile the call-stack of psutil.process_iter.

# File: process_iter.py

import cProfile
import psutil

cProfile.run(
  '[list(psutil.process_iter()) for i in range(1_000)]',
  'process_iter'
)

The above code runs psutil.process_iter 1000 times under cProfile and writes the collected statistics to the process_iter file in the current working directory. A good reason to save the result to a file is that one can then use a tool like gprof2dot to provide a graphical representation of the collected data. This tool turns the output of cProfile into a DOT graph which can then be visualised to make better sense of it. E.g., these are the commands required to collect the data and visualise it in the form of a DOT graph inside a PDF document:

$ python3 process_iter.py
$ gprof2dot -f pstats process_iter | dot -Tpdf -o process_iter.pdf

This is what the result will look like. The colours help us identify the branches of execution where most of the time is spent.

[Figure: the gprof2dot call graph generated from the process_iter profiling data]
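Besides the graphical view, the saved statistics can also be inspected directly from Python with the standard pstats module, for example to print the ten entries with the highest cumulative time:

import pstats

stats = pstats.Stats("process_iter")
stats.strip_dirs().sort_stats("cumulative").print_stats(10)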

A Look Under the Bonnet

The output of a tool like gprof2dot can be quite intuitive to understand, especially if you have had some prior experience with profilers. However, in order to better appreciate what is still to come, it is best if we have a quick look at some of the basics of the Python execution model.

Python is an interpreted language and the reference implementation of its interpreter is CPython. As the name suggests, it is written in C, and it offers a C API that can be used to write C extensions.

One of the fundamental objects of CPython is the interpreter itself, which has a data structure associated with it, namely PyInterpreterState. In principle, there can be many instances of PyInterpreterState within the same process, but for the sake of simplicity, we shall ignore this possibility here. One of the fields of this C data structure is tstate_head, which points to the first element of a doubly-linked list of instances of the PyThreadState structure. As you can imagine, this other data structure represents the state of a thread of execution associated with the referring interpreter instance. We can navigate this list by following the references of its field next (and navigate back with prev).
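The C structures themselves are not reachable from pure Python, but sys._current_frames gives a Python-level view of the same information: one entry per thread state, mapping the thread identifier to that thread's current frame. The following short snippet, with a made-up worker thread, is only meant to mirror the picture described above.

import sys
import threading
import time


def worker():
    time.sleep(1)


threading.Thread(target=worker, name="worker", daemon=True).start()
time.sleep(0.1)  # give the worker thread a chance to start

# One entry per thread state: thread identifier -> current (topmost) frame.
for thread_id, frame in sys._current_frames().items():
    code = frame.f_code
    print(thread_id, f"{code.co_name} ({code.co_filename}:{frame.f_lineno})")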

Each instance of PyThreadState points to the current execution frame, which is the object that bears the information about the execution of a code block, via the field frame. This is described by the PyFrameObject structure, whose instances are themselves chained into a linked list. In fact, this list is the stack that we are after. Each frame will have, in general, a parent frame that can be retrieved by means of the f_back pointer on the PyFrameObject structure. The picture produced by gprof2dot in the previous section is the graphical representation of this stack of frames. The information contained in the first row of each box comes from the PyCodeObject structure, which can be obtained from every instance of PyFrameObject via the f_code field. In particular, PyCodeObject allows you to retrieve the name of the file that contains the Python code being executed in that frame, as well as the current line number and the name of the context (e.g. the current function).

Sometimes the C API changes between releases, but the following image is a fairly stable representation of the relations between the above-mentioned structures that are common among many of the major CPython releases.

[Figure: the relations between the PyInterpreterState, PyThreadState, PyFrameObject and PyCodeObject structures]

The loop around PyFrameObject, which represents its field f_back, creates the structure of a singly-linked list of frame objects. This is precisely the frame stack.
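These structures have direct Python-level counterparts: every frame object exposes f_back and f_code, and every code object exposes co_filename and co_name. As an illustration, the snippet below walks the frame stack of the current thread by hand (sys._getframe is a CPython implementation detail, used here purely for demonstration):

import sys


def print_frame_stack():
    frame = sys._getframe()  # the frame of print_frame_stack itself
    while frame:
        code = frame.f_code
        print(f"{code.co_filename}:{code.co_name}@{frame.f_lineno}")
        frame = frame.f_back  # move to the caller's frame


def outer():
    print_frame_stack()


if __name__ == "__main__":
    outer()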

The Python profiling API can be demonstrated with some simple Python code. The following example declares a decorator, @profile, that can be used to extract the frame stack generated by the execution of a function. In this case, we apply it to the factorial function.

import sys


def profile(f):
    def profiler(frame, event, arg):
        # Skip C function events (c_call, c_return, c_exception).
        if "c_" in event:
            return

        # Walk the frame stack by following the f_back references.
        stack = []
        while frame:
            code = frame.f_code
            stack.append(f"{code.co_name}@{frame.f_lineno}")
            frame = frame.f_back
        print("{:12} {}".format(event.upper(), " -> ".join(stack[::-1])))

    def wrapper(*args, **kwargs):
        # Install the profiling hook only for the duration of the call,
        # restoring any previously registered profiler afterwards.
        old_profiler = sys.getprofile()

        sys.setprofile(profiler)
        r = f(*args, **kwargs)
        sys.setprofile(old_profiler)

        return r

    return wrapper


@profile
def factorial(n):
    if n == 0:
        return 1

    return n * factorial(n-1)


if __name__ == "__main__":
    factorial(3)

Note that the profiler function can be simplified considerably by using the inspect module:

import inspect

...

  def profiler(frame, event, arg):
      if "c_" in event:
          return

      stack = [f"{f.function}@{f.lineno}" for f in inspect.stack()[1:]]
      print("{:8} {}".format(event.upper(), " -> ".join(stack[::-1])))

...

Statistical Profiling

For a profiler from the statistical category, we have to look for external tools. In this case, the “standard” approach is to make use of a system call like setitimer, which arranges for a signal to be delivered to the process at regular intervals of time. The general idea is to register a handler for that signal which takes a snapshot of the current frame stack whenever it is triggered. An example of a profiler that works like this is vmprof.
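The following is a minimal, illustrative sketch of this signal-based approach on a Unix-like system, using the standard signal module; busy_work is a made-up workload, and real profilers such as vmprof are of course far more sophisticated.

import signal

samples = []


def handler(signum, frame):
    # `frame` is the frame that was executing when the signal arrived.
    stack = []
    while frame:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    samples.append(" -> ".join(reversed(stack)))


def busy_work():
    return sum(i * i for i in range(2_000_000))


signal.signal(signal.SIGPROF, handler)
# Fire SIGPROF after 1 ms of consumed CPU time, then every 1 ms thereafter.
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)

busy_work()

signal.setitimer(signal.ITIMER_PROF, 0)  # disarm the timer
print(len(samples), "samples collected")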

Some drawbacks of this approach are: 1. the signal handler runs in the same process as the Python interpreter, and generally in the main thread; 2. signals can interrupt system calls, which can cause stalls in the running program.

There are other approaches that can be taken in order to implement a statistical profiler, though. An example is pyflame, which is more in the spirit of a debugging tool and uses ptrace-like system calls. The situation is a bit more involved here, since the profiler is now an external process. The general idea is to use ptrace to pause the running Python program, read its virtual memory and reconstruct the frame stack from it. Here, the main challenges are 1. finding the location of the relevant CPython data structures in memory and 2. parsing them to extract the frame stack information. The differences between Python 2 and Python 3, and the occasional changes of the CPython ABI within the same minor release, add to the complexity of the task.
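As a taste of the first challenge, on Linux one can read /proc/<pid>/maps to discover where the Python binary or libpython is mapped in the target's address space, which is the natural starting point for locating the interpreter state. This is only a hedged sketch of that very first step; everything that follows (finding the actual PyInterpreterState instance and parsing the structures) is where the real work lies.

import sys

# Hypothetical target process id, passed on the command line for this sketch.
pid = int(sys.argv[1])

with open(f"/proc/{pid}/maps") as maps:
    for line in maps:
        # Look for the memory regions backed by the Python binary or libpython.
        if "python" in line:
            start = int(line.split("-")[0], 16)
            print(hex(start), line.split()[-1])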

Once all of this has been taken care of, though, a statistical profiler of this kind has the potential to lower the overhead caused by source instrumentation even further, so the payoff is generally worth the effort.

Enter Austin

We just saw that with a tool like pyflame we can get away with no instrumentation. An objection that can be raised against it, though, is that it still halts the profiled program in order to read the interpreter state. System calls like ptrace were designed for debugging tools, for which it is desirable to stop the execution at some point, inspect memory, step over one instruction or a whole line of source code at a time, and so on. Ideally, we would like our profiler to interfere as little as possible with the profiled program.

This is where a tool like Austin comes into play. Austin is, strictly speaking, not a full-fledged profiler on its own. In fact, Austin is merely a frame stack sampler for CPython. Concretely, this means that all Austin does is to sample the frame stack of a running Python program at (almost) regular intervals of time.

A similar approach is followed by py-spy, another Python profiler, written in Rust and inspired by rbspy. However, Austin tends to provide higher performance in general, for two main reasons. One is that it is written in pure C, with no external dependencies other than the standard C library. The other is that Austin is just a frame stack sampler: it focuses on dumping the relevant parts of the Python interpreter state as quickly as possible and delegates any data aggregation and analysis to external tools. In theory, Austin offers you higher sampling rates at virtually no expense to the profiled process. This makes Austin the ideal choice for profiling production code at run-time, with not even a single line of instrumentation required!

So, how does Austin read the virtual memory of another process without halting it? Many platforms offer system calls to do just that. On Linux, for example, the system call is process_vm_readv. Once one has located the instance of PyInterpreterState, everything else follows automatically, as we saw in the discussion of the CPython execution model above.
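To give a flavour of what this looks like, here is a hedged sketch of calling process_vm_readv from Python via ctypes on Linux. The pid and address arguments are placeholders; actually locating the PyInterpreterState instance in the target process, and doing all of this quickly and in C, is the part that Austin takes care of. Note also that reading another process's memory requires adequate permissions (e.g. the CAP_SYS_PTRACE capability or a permissive ptrace_scope setting).

import ctypes
import ctypes.util


class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.process_vm_readv.restype = ctypes.c_ssize_t


def read_remote_memory(pid, address, size):
    """Copy `size` bytes starting at `address` from the virtual memory of `pid`."""
    buf = ctypes.create_string_buffer(size)
    local = iovec(ctypes.cast(buf, ctypes.c_void_p), size)
    remote = iovec(ctypes.c_void_p(address), size)
    nread = libc.process_vm_readv(
        pid, ctypes.byref(local), 1, ctypes.byref(remote), 1, 0
    )
    if nread < 0:
        raise OSError(ctypes.get_errno(), "process_vm_readv failed")
    return buf.raw[:nread]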

On Your Marks

At this point you might have started, quite understandably, to be a bit concerned with concurrency issues. What can we actually make of a memory dump from a running process we have no control over? What guarantees do we have that the moment we decide to peek at the Python interpreter state, we find all the relevant data structures in a consistent state? The answer to this question lies in the difference in execution s