Lecture 27: Process control, multiprocessing, and fast code

CBIO (CSCI) 4835/6835: Introduction to Computational Biology

Overview and Objectives

As a final lecture, we'll go over how to extend the reach of your Python code beyond the confines of the script itself and interact with the computer and other programs. Additionally, we'll look at ways of speeding up your Python code. By the end of this lecture, you should be able to:

  • Implement entire pipelines of Python and external programs, using Python as the "glue"
  • Use subprocess to invoke arbitrary external programs
  • Reroute input and output to go through Python, and understand the issues that arise
  • Explain the Dining Philosophers problem in the context of multiprocessing
  • Use the multiprocessing and joblib libraries to write embarrassingly parallel code
  • Use the numba library to speed up computation-heavy code

Part 1: Going outside the [Python] box

Python has lots of tools and packages to help you perform whatever analysis or function you want to do.

But sometimes you need to integrate with programs that don't have a Python interface.

Or maybe you just love the command prompt that much.

Especially in computational biology, there are frequent examples of needing to interface with programs outside of Python.

  • Running molecular dynamics simulations
  • Retrieving digital information from a database
  • Moving files and folders around a computer
  • Invoking some external, specialized program to run an analysis, then using Python to synthesize the results

Python has a versatile subprocess module for calling and interacting with other programs.

First, though, the venerable os.system command:

In [1]:
import os
os.system("curl www.cnn.com -o cnn.html")
Out[1]:
0

For simple commands, this is great. But it quickly wears out its welcome when you care about what the command produces: the return value of os.system is the exit code, not what is printed to the screen.

In [2]:
f = open('cnn.html')
len(f.read())
Out[2]:
136828

What exit code indicates success?

  • -1
  • 0
  • 1
  • empty string

subprocess

The subprocess module replaces the following older modules and functions (so don't use them):

  • os.system
  • os.spawn*
  • os.popen*
  • popen2.*
  • commands.*

Think of subprocess as a more powerful version of all these.

The most basic function in the subprocess module is run.

The first and only required argument is a list: it's the command-line command and all its arguments.

Remember commands like ls? cd? pwd? Think of these commands as functions--they can also have arguments.

  • If you don't want to give any arguments, you can give subprocess.run a list with one element: just the command you want to run.
  • If you do provide arguments, you give subprocess.run a list with multiple elements, where the first element is the command to run, and the subsequent elements are the arguments to that command.

Examples

Let's see some examples, shall we?

In [3]:
import subprocess
subprocess.run(["ls"])
Out[3]:
CompletedProcess(args=['ls'], returncode=0)
In [4]:
subprocess.run(["touch", "test.txt"])
Out[4]:
CompletedProcess(args=['touch', 'test.txt'], returncode=0)
In [5]:
subprocess.run(["echo", "something", ">>", "test.txt"])
Out[5]:
CompletedProcess(args=['echo', 'something', '>>', 'test.txt'], returncode=0)
In [6]:
subprocess.run(["cat", "test.txt"])
Out[6]:
CompletedProcess(args=['cat', 'test.txt'], returncode=0)
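One subtlety in In [5]: without a shell, the ">>" is handed to echo as just another argument rather than being treated as output redirection, so nothing was actually appended to test.txt. If you really want shell features like redirection, a minimal sketch (not part of the original cells) is to give the shell a single command string:

import subprocess

# With shell = True and one command string, the shell itself interprets ">>",
# so "something" really is appended to test.txt.
subprocess.run("echo something >> test.txt", shell = True)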

If there's some kind of oddity with the command you're trying to run, you'll get a nonzero exit code back. You might expect trouble, for instance, from a filename with spaces in it:

In [7]:
subprocess.run(['ls','file with spaces'])
Out[7]:
CompletedProcess(args=['ls', 'file with spaces'], returncode=0)

What's wrong here?

Nothing, technically: file with spaces is a single list element, so it reaches ls as one argument, and since that file actually exists the exit code is 0. The real question is whether that's what you meant. (You will sometimes see shell = True added, as in the next cell; be aware that shell = True routes the command through the shell, where you would normally pass the whole command as one string and quote the filename yourself.)

In [8]:
subprocess.run(['ls', 'file with spaces'], shell = True)
Out[8]:
CompletedProcess(args=['ls', 'file with spaces'], returncode=0)

If you're trying to run ls on three separate files: file, with, and spaces, you need to separate them into distinct list elements:

In [9]:
subprocess.run(['ls', 'file', 'with', 'spaces'])
Out[9]:
CompletedProcess(args=['ls', 'file', 'with', 'spaces'], returncode=0)

This is all well and good--I can run commands and see whether or not they worked--but usually when you run external programs, it's to generate some kind of output that you'll then want Python to use.

How do we access this output?

First, we'll need to introduce a concept common to all programming languages: input and output streams.

Standard streams

Otherwise known as "standard input", "standard output", and "standard error", or in programming parlance:

  • stdin - standard input is usually from the keyboard
  • stdout - standard output, usually from print() statements
  • stderr - standard error, usually from errors

stdin, stdout, and stderr specify the executed program's standard input, standard output and standard error file handles, respectively.

We have to redirect these streams within subprocess so we can see them from inside Python.

Redirecting to files

In [10]:
f = open('dump','w')
subprocess.run('ls', stdout = f)

f = open('dump','r') #this would be a very inefficient way to get the stdout of a program
print(f.readlines())
['Final-Project.ipynb\n', 'Lecture1.ipynb\n', 'Lecture12.ipynb\n', 'Lecture13.ipynb\n', 'Lecture14\n', 'Lecture14.ipynb\n', 'Lecture15.ipynb\n', 'Lecture19\n', 'Lecture19.ipynb\n', 'Lecture2.ipynb\n', 'Lecture20\n', 'Lecture20.ipynb\n', 'Lecture21\n', 'Lecture21.ipynb\n', 'Lecture22\n', 'Lecture22.ipynb\n', 'Lecture23\n', 'Lecture23.ipynb\n', 'Lecture24.ipynb\n', 'Lecture26\n', 'Lecture26.ipynb\n', 'Lecture27.ipynb\n', 'Lecture3.ipynb\n', 'Lecture4.ipynb\n', 'Lecture5.ipynb\n', 'Lecture6\n', 'Lecture6.ipynb\n', 'Lecture7\n', 'Lecture7.ipynb\n', 'Lecture8.ipynb\n', 'Lecture9\n', 'Lecture9.ipynb\n', 'MA0004.1.sites\n', 'args.py\n', 'bnip3.fasta\n', 'brca1.fasta\n', 'cnn.html\n', 'dump\n', 'file\n', 'file with spaces\n', 'hydra.fasta\n', 'hydra179.dnd\n', 'p53.gb\n', 'spaces\n', 'test.txt\n', 'with\n']
In [11]:
f = open('dump','w')
subprocess.run(['ls','nonexistantfile'], stdout = f, stderr = subprocess.STDOUT) #you can redirect stderr to stdout
print(open('dump').read())
ls: nonexistantfile: No such file or directory

Yes, that's the error message from the command line. Rather than showing up on the command line, we've captured stderr to be the stdout of the Python subprocess, and then redirected the stdout to the file. Hence, the error message is in the file!

What can the targets of the stdout and stderr arguments be? Valid targets, as we've seen, include

  • an existing file object, created with open()
  • nothing at all, in which case the program will default to the existing stdin/stdout/stderr
  • subprocess.PIPE, which enables communication directly between your script and the program (BE CAREFUL WITH THIS; see the sketch below)
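As a quick sketch of that last option: passing subprocess.PIPE as stdout to run captures the program's output on the returned CompletedProcess object, as bytes (this cell is an added illustration, not from the original notebook):

import subprocess

# Capture the command's stdout instead of letting it go to the terminal.
result = subprocess.run(["ls"], stdout = subprocess.PIPE)
print(result.returncode)    # 0 on success
print(result.stdout[:50])   # the captured output, as bytes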

subprocess.Popen

All the previous functions are just convenience wrappers around the Popen object.

In addition to accepting the command and its arguments as a list, and the stdout and stderr optional arguments, Popen includes the cwd argument, which sets the working directory of the process, or defaults to the current working directory of the Python script.

In [12]:
proc = subprocess.Popen(["ls"], stdout = subprocess.PIPE)
print(proc.stdout.readlines())
[b'Final-Project.ipynb\n', b'Lecture1.ipynb\n', b'Lecture12.ipynb\n', b'Lecture13.ipynb\n', b'Lecture14\n', b'Lecture14.ipynb\n', b'Lecture15.ipynb\n', b'Lecture19\n', b'Lecture19.ipynb\n', b'Lecture2.ipynb\n', b'Lecture20\n', b'Lecture20.ipynb\n', b'Lecture21\n', b'Lecture21.ipynb\n', b'Lecture22\n', b'Lecture22.ipynb\n', b'Lecture23\n', b'Lecture23.ipynb\n', b'Lecture24.ipynb\n', b'Lecture26\n', b'Lecture26.ipynb\n', b'Lecture27.ipynb\n', b'Lecture3.ipynb\n', b'Lecture4.ipynb\n', b'Lecture5.ipynb\n', b'Lecture6\n', b'Lecture6.ipynb\n', b'Lecture7\n', b'Lecture7.ipynb\n', b'Lecture8.ipynb\n', b'Lecture9\n', b'Lecture9.ipynb\n', b'MA0004.1.sites\n', b'args.py\n', b'bnip3.fasta\n', b'brca1.fasta\n', b'cnn.html\n', b'dump\n', b'file\n', b'file with spaces\n', b'hydra.fasta\n', b'hydra179.dnd\n', b'p53.gb\n', b'spaces\n', b'test.txt\n', b'with\n']
In [13]:
proc = subprocess.Popen(['ls'], stdout = subprocess.PIPE, cwd = "/Users/squinn")
print(proc.stdout.readlines())
[b'Applications\n', b'Desktop\n', b'Documents\n', b'Downloads\n', b'Dropbox\n', b'Google Drive\n', b'Library\n', b'Movies\n', b'Music\n', b'Pictures\n', b'Programming\n', b'Public\n', b'SpiderOak Hive\n', b'metastore_db\n', b'nltk_data\n', b'rodeo.log\n']

subprocess.PIPE

So what is this mysterious PIPE attribute?

"Pipes" are a common operating system term for avenues of communication between different programs. In this case, a pipe is established between your Python program and whatever program you're running through subprocess.

If stdout/stdin/stderr is set to subprocess.PIPE, then that input/output stream of the external program is accessible through a file object in the returned object.

It's a regular ol' file object, so you have access to all the Python file object functions you know and love:

In [14]:
proc = subprocess.Popen(['ls'], stdout = subprocess.PIPE)
In [15]:
print(proc.stdout.readline())
b'Final-Project.ipynb\n'
In [16]:
print(proc.stdout.readline())
b'Lecture1.ipynb\n'
In [17]:
for elem in proc.stdout.readlines():
    print(elem)
b'Lecture12.ipynb\n'
b'Lecture13.ipynb\n'
b'Lecture14\n'
b'Lecture14.ipynb\n'
b'Lecture15.ipynb\n'
b'Lecture19\n'
b'Lecture19.ipynb\n'
b'Lecture2.ipynb\n'
b'Lecture20\n'
b'Lecture20.ipynb\n'
b'Lecture21\n'
b'Lecture21.ipynb\n'
b'Lecture22\n'
b'Lecture22.ipynb\n'
b'Lecture23\n'
b'Lecture23.ipynb\n'
b'Lecture24.ipynb\n'
b'Lecture26\n'
b'Lecture26.ipynb\n'
b'Lecture27.ipynb\n'
b'Lecture3.ipynb\n'
b'Lecture4.ipynb\n'
b'Lecture5.ipynb\n'
b'Lecture6\n'
b'Lecture6.ipynb\n'
b'Lecture7\n'
b'Lecture7.ipynb\n'
b'Lecture8.ipynb\n'
b'Lecture9\n'
b'Lecture9.ipynb\n'
b'MA0004.1.sites\n'
b'args.py\n'
b'bnip3.fasta\n'
b'brca1.fasta\n'
b'cnn.html\n'
b'dump\n'
b'file\n'
b'file with spaces\n'
b'hydra.fasta\n'
b'hydra179.dnd\n'
b'p53.gb\n'
b'spaces\n'
b'test.txt\n'
b'with\n'

Core Popen functions

  • Popen.poll() - check whether the process has terminated (returns None while it's still running)
  • Popen.wait() - wait for the process to terminate (basically, ask your Python program to hang until the command is finished; see the sketch after this list)
  • Popen.terminate() - terminate the process (ask nicely)
  • Popen.kill() - kill the process with extreme prejudice
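Here's a quick sketch of poll and wait in action (an added example, assuming a Unix-like system where a sleep command is available):

import subprocess

proc = subprocess.Popen(["sleep", "2"])   # any command that takes a little while

print(proc.poll())   # None -- the process hasn't finished yet
proc.wait()          # hang here until it does
print(proc.poll())   # 0 -- the exit code, now that it's done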
In [18]:
proc = subprocess.Popen('cat', stdin = subprocess.PIPE, stdout = subprocess.PIPE)

We've created two pipes between our Python program and the cat command--an input pipe to stdin, and an output pipe from stdout.

In [19]:
proc.stdin.write(bytes("Hello", encoding = 'utf-8'))
proc.stdin.close()

Here, we've written some raw bytes to the input--to the cat program, this looks like it's coming from the keyboard!

In [20]:
print(proc.stdout.read())
b'Hello'

Now we're reading the output of the cat program!

What would happen if in the previous code we omitted the proc.stdin.close() call?

  • Nothing would change
  • It would not print anything
  • It would print Hello after a pause
  • It would hang

Warning!

Managing simultaneous input and output is tricky and can easily lead to deadlocks.

For example, your script may be blocked waiting for output from the process which is blocked waiting for input.
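The standard way around this is Popen.communicate(), which writes the input, reads stdout/stderr to completion, and waits for the process to exit, all without you juggling the streams by hand. A minimal sketch, rewriting the earlier cat example:

import subprocess

proc = subprocess.Popen('cat', stdin = subprocess.PIPE, stdout = subprocess.PIPE)

# communicate() sends the input, closes stdin, reads all the output, and waits.
out, err = proc.communicate(input = bytes("Hello", encoding = 'utf-8'))
print(out)   # b'Hello'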

The Dining Philosophers

Five philosophers sit around a table with a single fork between each pair; each philosopher needs two forks to eat. If all five pick up the fork on their left and wait for the one on their right, nobody ever eats. It's the classic illustration of how concurrent processes sharing resources can deadlock.

Part 2: Multitasking and parallel programming

There are several parallel programming models enabled by a variety of hardware (multicore, cloud computing, supercomputers, GPU).

Threads vs. Processes

A thread of execution is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler.

A process is an instance of a computer program.

The long and short of threads versus processes in Python is...

...always use processes.

Blah blah blah Global Interpreter Lock, blah blah blah only 1 thread allowed per Python process, blah blah blah just use multiprocessing.

Multiple simultaneous processes

Or multiprocessing.

This is the concept of parallel programming, or having your program do multiple things at the very same time.

With multi-core processors now the norm, most computers these days can easily handle at least 4 parallel processes at once. Why not take advantage of that?

However, writing correct, high performance parallel code can be difficult: look no further than the Dining Philosophers problem.

...but in some cases, it's trivial.

A problem is embarrassingly parallel if it can be easily separated into independent subtasks, each of which does a substantial amount of computation.

Fortunately, this happens a lot!

  • Apply this filter to 1000 images
  • Process these 5000 protein structures
  • Compute RMSDs of all frames in a trajectory

In cases like these, using Pools will get you a significant speedup (by however many cores you have).

Pools

multiprocessing supports the concept of a pool of workers. You initialize with the number of processes you want to run in parallel (the default is the number of CPUs on your system) and they are available for doing parallel work:
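For instance, a minimal sketch (the full worked example follows below):

import multiprocessing

# A pool of 4 worker processes; omit the argument to use every CPU on the machine.
pool = multiprocessing.Pool(processes = 4)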

Then (and this is the somewhat-tricky part), you call map() and pass in two arguments:

  • the function you want to call in parallel, and
  • the values you want the function to evaluate in parallel

Clear as mud? Let's see an example.

Pool Example

In [21]:
import multiprocessing
In [22]:
def f(x):
    return x ** 2

pool = multiprocessing.Pool(processes = 4)

numbers_to_evaluate = range(20)

print(pool.map(f, numbers_to_evaluate))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
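One housekeeping note not shown above: when you're finished with a pool, shut it down, either explicitly or by letting a with-block handle the cleanup. A quick sketch:

# Explicit cleanup:
pool.close()   # no more work will be submitted
pool.join()    # wait for the worker processes to exit

# Or, equivalently, use the pool as a context manager:
with multiprocessing.Pool(processes = 4) as pool:
    print(pool.map(f, range(20)))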

Inter-process communication

While 90% of the work you'll likely do is embarrassingly parallel, some multiprocessing can't be done just with Pools.

Or, perhaps, you'll need to be able to communicate intermediate results between processes.

We can do this through queues and pipes.

Queues

multiprocessing.Queue provides a simple first-in-first-out messaging queue between Python processes.

  • put: put an element on the queue. This will block if the queue has filled up.
  • get: get an element from the queue. This will block if the queue is empty.

Queues are great if all you want to do is basically "report" on the progress of a process. The process puts in updates, e.g. "20% finished", and the main Python script gets these updates and prints them out to you.
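Here's a minimal sketch of that "progress report" pattern (the worker function and its fake progress updates are made up purely for illustration):

import multiprocessing

def worker(queue):
    # Pretend to do work in stages, reporting progress as we go.
    for pct in (20, 40, 60, 80, 100):
        queue.put("%d%% finished" % pct)

queue = multiprocessing.Queue()
proc = multiprocessing.Process(target = worker, args = (queue, ))
proc.start()

for _ in range(5):
    print(queue.get())   # blocks until the worker puts the next update
proc.join()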

Pipes

A pipe is a communication channel between processes that can send and receive messages.

Pipe() returns a pair of Connection objects representing the two ends of the pipe. Each Connection object has the following methods:

  • send: sends data to other end of the pipe
  • recv: waits for data from other end of the pipe (unless pipe closed, then EOFError)
  • close: close the pipe

Unlike queues, pipes can support two-way communication between processes.

In [23]:
import multiprocessing

def chatty(conn): #this takes a Connection object representing one end of a pipe
    msg = conn.recv()
    conn.send("you sent me '" + msg + "'")
In [24]:
# Create the two ends of the pipe.
(c1,c2) = multiprocessing.Pipe()
In [25]:
# Spin off the process that runs the "Chatty" function.
p1 = multiprocessing.Process(target = chatty, args = (c2, ))
p1.start()
In [26]:
# Send a message to the process, and receive its response.
c1.send("Hello!")
result = c1.recv()
p1.join()

print(result)
you sent me 'Hello!'

joblib

joblib is a wonderful package that uses multiprocessing on the backend, but simplifies things greatly by removing a lot of boilerplate.

The most common source of this kind of repeated, parallelizable work is loops.

In [27]:
import numpy as np

def identity(value):
    return np.sqrt(value ** 2)
In [28]:
# Now do some computation.
array = range(100000)
retval = []
for i in array:
    retval.append(identity(i))

An important observation: no element of the array depends on any other, so the order in which these computations are performed doesn't matter. Plus, it takes forever to run this on 100,000 numbers, one after another.

So why not perform them at the same time?

With multiprocessing, we had to set up a Pool and a map. Not with joblib:

In [29]:
from joblib import Parallel, delayed

retval = Parallel(n_jobs = 8, verbose = 1)(delayed(identity)(i) for i in array)
[Parallel(n_jobs=8)]: Done 10176 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 100000 out of 100000 | elapsed:    0.9s finished

This is a bit tricky at first, but I promise it's more straightforward than multiprocessing.

Let's take the code bit by bit.

retval = Parallel(n_jobs = 8, verbose = 1)(delayed(identity)(i) for i in array)

  • Parallel(n_jobs = 8, verbose = 1): This sets up the parallel computation, specifying that we want to use 8 separate processes. verbose is just a logging argument--the higher the number, the more debugging output it spits out.
  • delayed(identity): This is a little syntax trick by joblib, but basically: whatever function you want to run in parallel, you pass in to the delayed() function.
  • (i): This is the argument you want to pass to the function you're running in parallel.
  • for i in array: and this is the loop through the data you want to process in parallel.

All the same pieces are there as in multiprocessing!

joblib just streamlines the process by assuming loops are your primary source of repetition (a good assumption), so it bakes its machinery into the loop structure.

Anytime you do parameter scans, data point preprocessing, anything that is "embarrassingly parallel" or that would use multiprocessing.Pool, use joblib instead.

Part 3: Extreme Python

Just a quick look at one of the more cutting-edge Python packages: numba

Let's say you're trying to compute the Frobenius norm on an alignment matrix from a molecular dynamics simulation.

(that's just a fancy way of saying "element-wise Euclidean distance")
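Written out, for a matrix $A$ with entries $a_{ij}$, the Frobenius norm is

$$\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$$

which is exactly what the nested loops below compute.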

In [30]:
def frob(matrix):
    rows = matrix.shape[0]
    cols = matrix.shape[1]
    
    frob_norm = 0.0
    for i in range(rows):
        for j in range(cols):
            frob_norm += matrix[i, j] ** 2
    
    return np.sqrt(frob_norm)

Let's see how it works.

In [31]:
import numpy as np

x1 = np.random.random((10, 10))  # A 10x10 random matrix
f1 = frob(x1)
print(f1)
5.71937796683

Cool. Seems to have worked reasonably well. Out of sheer curiosity, how long did that take to run?

In [32]:
%timeit frob(x1)
10000 loops, best of 3: 46 µs per loop

Not bad: tens of microseconds (on the order of $10^{-5}$ seconds) per run.

How well does it scale if the matrix is an order of magnitude larger?

In [33]:
x2 = np.random.random((100, 100))  # A 100x100 random matrix!
f2 = frob(x2)
print(f2)
57.7465642187
In [34]:
%timeit frob(x2)
100 loops, best of 3: 4.12 ms per loop

Yikes--an order of magnitude in data size, but two orders of magnitude in runtime increase.

Let's try one more data size increase. I have a bad feeling about this...

In [35]:
x3 = np.random.random((1000, 1000))  # Yikes
f3 = frob(x3)
print(f3)
577.384362088
In [36]:
%timeit frob(x3)
1 loop, best of 3: 397 ms per loop

Another order of magnitude on the data, another two orders of magnitude on the runtime. Clearly not a good trend. Maybe a quadratic trend, in fact?

Point being, this code doesn't scale. At all.

Of course, the problem lies in the fact that you could be using NumPy array broadcasting. But let's say you didn't know about it.

Or, much more likely, it's a very small part of a much larger scientific program--complete with subprocesses and multiprocessing--and it's going to be tough to isolate a single part and optimize it.

"Just-in-time" compilation to the rescue!

In [37]:
from numba import jit

@jit
def frob2(matrix):
    rows = matrix.shape[0]
    cols = matrix.shape[1]
    
    frob_norm = 0.0
    for i in range(rows):
        for j in range(cols):
            frob_norm += matrix[i, j] ** 2
    
    return np.sqrt(frob_norm)

I promise--other than the @jit decorator on top of the function definition, the code for frob2 is identical to that of frob.

Let's test this out on the third and largest test data!

In [38]:
%timeit frob(x3)
1 loop, best of 3: 410 ms per loop
In [42]:
%timeit frob2(x3)
1000 loops, best of 3: 854 µs per loop

Woo! A speedup of nearly three orders of magnitude!

For the sake of completeness, let's see how this compares to a full NumPy array broadcasting version.

In [40]:
def frob3(matrix):
    s = (matrix ** 2).sum()
    return np.sqrt(s)
In [43]:
%timeit frob3(x3)
1000 loops, best of 3: 1.06 ms per loop

Yes, ladies and gentlemen: in this case, numba-optimized Python code is even a bit faster than doing everything right with NumPy magic.

Just-in-time compilation

numba works its magic by selectively compiling portions of Python code so they run really, really fast.
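(A hedged aside: numba also has a "nopython" mode--@jit(nopython = True), or its shorthand @njit--which raises an error instead of quietly falling back to slower code when it can't compile the whole function. It's usually what you want for numeric loops like frob2; the function name below is just for illustration.)

from numba import njit

@njit   # same as @jit(nopython = True): compile everything, or fail loudly
def frob_njit(matrix):
    frob_norm = 0.0
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            frob_norm += matrix[i, j] ** 2
    return frob_norm ** 0.5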

"Interpreted" languages versus "compiled" languages was one of the first concepts we discussed early on.

  • Interpreted languages "tend" to be easier to use, but run more slowly
  • Compiled languages "tend" to run really fast, but are harder to program in

The key phrase is tend to: packages like numba are blurring this demarcation.

Summary

  • You don't need to ditch Python if you need to run an alignment using PyMol--just use subprocess to call external programs (I use it to interface with our class' Slack chat through Python!).
  • If you're testing out a bunch of parameter values on the same data, don't just use a loop. It's an embarrassingly parallel problem, so use multiprocessing or joblib for magically fast speedups through parallelism.
  • If your code is still running slow, first make sure you're breaking it apart into functions. Then see if numba's just-in-time compiler can add any speed.

Administrivia

  • How is Assignment "456" going?
  • How are projects going?
  • Course evaluation website is up! Please submit! Any feedback will be helpful, even if it's "this class sucked because Python sucks": http://eval.franklin.uga.edu/. The website is only open until April 26 (that's next Wednesday), so get your evaluations in sooner rather than later!

Additional Resources