CBIO (CSCI) 4835/6835: Introduction to Computational Biology
As a final lecture, we'll go over how to extend the reach of your Python code beyond the confines of the script itself and interact with the computer and other programs. Additionally, we'll look at ways of speeding up your Python code. By the end of this lecture, you should be able to:
- Use subprocess to invoke arbitrary external programs

Python has lots of tools and packages to help you perform whatever analysis or function you want to do.
But sometimes you need to integrate with programs that don't have a Python interface.
Or maybe you just love the command prompt that much.
Especially in computational biology, there are frequent examples of needing to interface with programs outside of Python.
Python has a versatile subprocess module for calling and interacting with other programs.
First, however, the venerable system command:
import os
os.system("curl www.cnn.com -o cnn.html")
0
For simple commands, this is great. But where it quickly wears out its welcome is how it handles what comes back from the commands: the return value of the system command is the exit code, not what is printed to screen.
f = open('cnn.html')
len(f.read())
136828
What exit code indicates success?
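An exit code of 0 conventionally means success; anything nonzero signals some kind of failure. As a minimal sketch of checking it (the command here is arbitrary):

status = os.system("ls")
if status == 0:
    print("success!")
else:
    print("something went wrong; exit code was", status)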
subprocess
The subprocess module replaces the following modules (so don't use them):
os.system
os.spawn*
os.popen*
popen2.*
commands.*
Think of subprocess as a more powerful version of all these.
The most basic function in the subprocess module is run.
The first and only required argument is a list: it's the command-line command and all its arguments.
Remember commands like ls? cd? pwd? Think of these commands as functions--they can also have arguments.
- Give subprocess.run a list with one element: just the command you want to run.
- Give subprocess.run a list with multiple elements, where the first element is the command to run, and the subsequent elements are the arguments to that command.

Let's see some examples, shall we?
import subprocess
subprocess.run(["ls"])
CompletedProcess(args=['ls'], returncode=0)
subprocess.run(["touch", "test.txt"])
CompletedProcess(args=['touch', 'test.txt'], returncode=0)
subprocess.run(["echo", "something", ">>", "test.txt"])
CompletedProcess(args=['echo', 'something', '>>', 'test.txt'], returncode=0)
subprocess.run(["cat", "test.txt"])
CompletedProcess(args=['cat', 'test.txt'], returncode=0)
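One caveat worth flagging (an aside, not from the original notes): without shell = True, the >> above is handed to echo as a literal argument rather than interpreted as shell redirection, so nothing is actually appended to test.txt. If you genuinely want shell features like redirection, one option is to pass the whole command as a single string with shell = True:

# Sketch: let a shell interpret the redirection (use shell = True sparingly).
subprocess.run("echo something >> test.txt", shell = True)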
If there's some kind of oddity with the command you're trying to run, you'll get a nonzero exit code back.
subprocess.run(['ls','file with spaces'])
CompletedProcess(args=['ls', 'file with spaces'], returncode=0)
What's wrong here?
If the filename really has spaces, you need to "escape" the filename by using shell = True:
subprocess.run(['ls', 'file with spaces'], shell = True)
CompletedProcess(args=['ls', 'file with spaces'], returncode=0)
If you're trying to run ls on three separate files: file, with, and spaces, you need to separate them into distinct list elements:
subprocess.run(['ls', 'file', 'with', 'spaces'])
CompletedProcess(args=['ls', 'file', 'with', 'spaces'], returncode=0)
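If you want your script to notice when a command fails, you can inspect the returncode attribute of the CompletedProcess object that comes back, or pass check = True so that subprocess.run raises an exception on a nonzero exit code. A quick sketch (the missing filename is just an example):

result = subprocess.run(["ls", "nonexistantfile"])
if result.returncode != 0:
    print("command failed with exit code", result.returncode)

# Alternatively, have subprocess raise subprocess.CalledProcessError for you:
# subprocess.run(["ls", "nonexistantfile"], check = True)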
This is all well and good--I can run commands and see whether or not they worked--but usually when you run external programs, it's to generate some kind of output that you'll then want Python to use.
How do we access this output?
First, we'll need to introduce a concept common to all programming languages: input and output streams.
Otherwise known as "standard input", "standard output", and "standard error", or in programming parlance:
- stdin: standard input, usually from the keyboard
- stdout: standard output, usually from print() statements
- stderr: standard error, usually from errors

stdin, stdout, and stderr specify the executed program's standard input, standard output and standard error file handles, respectively.
We have to redirect these streams within subprocess so we can see them from inside Python.
f = open('dump','w')
subprocess.run('ls', stdout = f)
f = open('dump','r') #this would be a very inefficient way to get the stdout of a program
print(f.readlines())
['Final-Project.ipynb\n', 'Lecture1.ipynb\n', 'Lecture12.ipynb\n', 'Lecture13.ipynb\n', 'Lecture14\n', 'Lecture14.ipynb\n', 'Lecture15.ipynb\n', 'Lecture19\n', 'Lecture19.ipynb\n', 'Lecture2.ipynb\n', 'Lecture20\n', 'Lecture20.ipynb\n', 'Lecture21\n', 'Lecture21.ipynb\n', 'Lecture22\n', 'Lecture22.ipynb\n', 'Lecture23\n', 'Lecture23.ipynb\n', 'Lecture24.ipynb\n', 'Lecture26\n', 'Lecture26.ipynb\n', 'Lecture27.ipynb\n', 'Lecture3.ipynb\n', 'Lecture4.ipynb\n', 'Lecture5.ipynb\n', 'Lecture6\n', 'Lecture6.ipynb\n', 'Lecture7\n', 'Lecture7.ipynb\n', 'Lecture8.ipynb\n', 'Lecture9\n', 'Lecture9.ipynb\n', 'MA0004.1.sites\n', 'args.py\n', 'bnip3.fasta\n', 'brca1.fasta\n', 'cnn.html\n', 'dump\n', 'file\n', 'file with spaces\n', 'hydra.fasta\n', 'hydra179.dnd\n', 'p53.gb\n', 'spaces\n', 'test.txt\n', 'with\n']
f = open('dump','w')
subprocess.run(['ls','nonexistantfile'], stdout = f, stderr = subprocess.STDOUT) #you can redirect stderr to stdout
print(open('dump').read())
ls: nonexistantfile: No such file or directory
Yes, that's the error message from the command line. Rather than showing up on the command line, we've captured stderr to be the stdout of the Python subprocess, and then redirected that stdout to the file. Hence, the error message is in the file!
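As an aside (a sketch using subprocess.PIPE, which we'll meet properly in a moment): subprocess.run can also capture a command's output directly into Python, with no temporary file, by redirecting stdout to subprocess.PIPE and reading the stdout attribute of the returned object:

result = subprocess.run(["ls"], stdout = subprocess.PIPE)
print(result.stdout.decode().splitlines()[:5])  # first few lines of output, as strings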
What can the targets of the stdout and stderr arguments be? Valid targets, as we've seen, include:
- file objects created by open()
- the existing stdin/stdout/stderr streams
- subprocess.PIPE, which enables communication directly between your script and the program (BE CAREFUL WITH THIS)

subprocess.Popen
All the previous functions are just convenience wrappers around the Popen object.
In addition to accepting the command and its arguments as a list, and the stdout and stderr optional arguments, Popen includes the cwd argument, which sets the working directory of the process; if omitted, it defaults to the current working directory of the Python script.
proc = subprocess.Popen(["ls"], stdout = subprocess.PIPE)
print(proc.stdout.readlines())
[b'Final-Project.ipynb\n', b'Lecture1.ipynb\n', b'Lecture12.ipynb\n', b'Lecture13.ipynb\n', b'Lecture14\n', b'Lecture14.ipynb\n', b'Lecture15.ipynb\n', b'Lecture19\n', b'Lecture19.ipynb\n', b'Lecture2.ipynb\n', b'Lecture20\n', b'Lecture20.ipynb\n', b'Lecture21\n', b'Lecture21.ipynb\n', b'Lecture22\n', b'Lecture22.ipynb\n', b'Lecture23\n', b'Lecture23.ipynb\n', b'Lecture24.ipynb\n', b'Lecture26\n', b'Lecture26.ipynb\n', b'Lecture27.ipynb\n', b'Lecture3.ipynb\n', b'Lecture4.ipynb\n', b'Lecture5.ipynb\n', b'Lecture6\n', b'Lecture6.ipynb\n', b'Lecture7\n', b'Lecture7.ipynb\n', b'Lecture8.ipynb\n', b'Lecture9\n', b'Lecture9.ipynb\n', b'MA0004.1.sites\n', b'args.py\n', b'bnip3.fasta\n', b'brca1.fasta\n', b'cnn.html\n', b'dump\n', b'file\n', b'file with spaces\n', b'hydra.fasta\n', b'hydra179.dnd\n', b'p53.gb\n', b'spaces\n', b'test.txt\n', b'with\n']
proc = subprocess.Popen(['ls'], stdout = subprocess.PIPE, cwd = "/Users/squinn")
print(proc.stdout.readlines())
[b'Applications\n', b'Desktop\n', b'Documents\n', b'Downloads\n', b'Dropbox\n', b'Google Drive\n', b'Library\n', b'Movies\n', b'Music\n', b'Pictures\n', b'Programming\n', b'Public\n', b'SpiderOak Hive\n', b'metastore_db\n', b'nltk_data\n', b'rodeo.log\n']
subprocess.PIPE
So what is this mysterious PIPE attribute?
"Pipes" are a common operating system term for avenues of communication between different programs. In this case, a pipe is established between your Python program and whatever program you're running through subprocess.
If stdout/stdin/stderr is set to subprocess.PIPE, then that input/output stream of the external program is accessible through a file object in the returned object.
It's a regular ol' file object, so you have access to all the Python file object functions you know and love:
proc = subprocess.Popen(['ls'], stdout = subprocess.PIPE)
print(proc.stdout.readline())
b'Final-Project.ipynb\n'
print(proc.stdout.readline())
b'Lecture1.ipynb\n'
for elem in proc.stdout.readlines():
print(elem)
b'Lecture12.ipynb\n' b'Lecture13.ipynb\n' b'Lecture14\n' b'Lecture14.ipynb\n' b'Lecture15.ipynb\n' b'Lecture19\n' b'Lecture19.ipynb\n' b'Lecture2.ipynb\n' b'Lecture20\n' b'Lecture20.ipynb\n' b'Lecture21\n' b'Lecture21.ipynb\n' b'Lecture22\n' b'Lecture22.ipynb\n' b'Lecture23\n' b'Lecture23.ipynb\n' b'Lecture24.ipynb\n' b'Lecture26\n' b'Lecture26.ipynb\n' b'Lecture27.ipynb\n' b'Lecture3.ipynb\n' b'Lecture4.ipynb\n' b'Lecture5.ipynb\n' b'Lecture6\n' b'Lecture6.ipynb\n' b'Lecture7\n' b'Lecture7.ipynb\n' b'Lecture8.ipynb\n' b'Lecture9\n' b'Lecture9.ipynb\n' b'MA0004.1.sites\n' b'args.py\n' b'bnip3.fasta\n' b'brca1.fasta\n' b'cnn.html\n' b'dump\n' b'file\n' b'file with spaces\n' b'hydra.fasta\n' b'hydra179.dnd\n' b'p53.gb\n' b'spaces\n' b'test.txt\n' b'with\n'
Popen functions
- Popen.poll(): check to see if the process has terminated
- Popen.wait(): wait for the process to terminate (basically, ask your Python program to hang until the command is finished)
- Popen.terminate(): terminate the process (ask nicely)
- Popen.kill(): kill the process with extreme prejudice

proc = subprocess.Popen('cat', stdin = subprocess.PIPE, stdout = subprocess.PIPE)
We've created two pipes between our Python program and the cat command--an input pipe to stdin, and an output pipe from stdout.
proc.stdin.write(bytes("Hello", encoding = 'utf-8'))
proc.stdin.close()
Here, we've written some raw bytes to the input--to the cat program, this looks like it's coming from the keyboard!
print(proc.stdout.read())
b'Hello'
Now we're reading the output of the cat program!
What would happen if in the previous code we omitted the proc.stdin.close() call?
Managing simultaneous input and output is tricky and can easily lead to deadlocks.
For example, your script may be blocked waiting for output from the process which is blocked waiting for input.
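The usual way around that deadlock (a hedged sketch; this isn't in the original notes) is Popen.communicate(), which sends the input, reads all the output, and waits for the process to finish in one safe step:

proc = subprocess.Popen('cat', stdin = subprocess.PIPE, stdout = subprocess.PIPE)
out, err = proc.communicate(bytes("Hello", encoding = 'utf-8'))  # err is None since stderr wasn't piped
print(out)  # b'Hello'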
There are several parallel programming models enabled by a variety of hardware (multicore, cloud computing, supercomputers, GPU).
A thread of execution is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler.
A process is an instance of a computer program.
The long and short of threads versus processes in Python is...
...always use processes.
Blah blah blah Global Interpreter Lock, blah blah blah only 1 thread allowed to run at a time per Python process, blah blah blah just use multiple processes. Or multiprocessing.
This is the concept of parallel programming, or having your program do multiple things at the very same time.
With the rising popularity of multi-core computers, most computers these days can easily handle at least 4 parallel processes at once. Why not take advantage of that?
However, writing correct, high performance parallel code can be difficult: look no further than the Dining Philosophers problem.
...but in some cases, it's trivial.
A problem is embarrassingly parallel if it can be easily separated into independent subtasks, each of which does a substantial amount of computation.
Fortunately, this happens a lot!
In cases like these, using Pools will get you a significant speedup (up to roughly the number of cores you have).
multiprocessing supports the concept of a pool of workers. You initialize with the number of processes you want to run in parallel (the default is the number of CPUs on your system) and they are available for doing parallel work.
Then (and this is the somewhat-tricky part), you call map() and pass in two arguments:
- the function you want to run in parallel, and
- the list (or other iterable) of data you want to feed to that function.
Clear as mud? Let's see an example.
import multiprocessing
def f(x):
    return x ** 2
pool = multiprocessing.Pool(processes = 4)
numbers_to_evaluate = range(20)
print(pool.map(f, numbers_to_evaluate))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
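A small aside (a stylistic sketch, not from the original): it's common to create the Pool with a with statement so its worker processes are cleaned up automatically when you're done:

with multiprocessing.Pool(processes = 4) as pool:
    print(pool.map(f, range(20)))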
While 90% of the work you'll likely do is embarrassingly parallel, some multiprocessing can't be done just with Pools.
Or, perhaps, you'll need to be able to communicate intermediate results between processes.
We can do this through queues and pipes.
multiprocessing.Queue provides a simple first-in-first-out messaging queue between Python processes.
- put: put an element on the queue. This will block if the queue has filled up.
- get: get an element from the queue. This will block if the queue is empty.

Queues are great if all you want to do is basically "report" on the progress of a process. The process puts in updates, e.g. "20% finished", and the main Python script gets these updates and prints them out to you.
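Here's a minimal sketch of that progress-report pattern (the worker function and its messages are made up for illustration):

import multiprocessing

def worker(queue):
    # Pretend to do some work, reporting progress along the way.
    for pct in (20, 40, 60, 80, 100):
        queue.put(str(pct) + "% finished")

q = multiprocessing.Queue()
p = multiprocessing.Process(target = worker, args = (q, ))
p.start()
for _ in range(5):
    print(q.get())  # blocks until the worker puts the next update
p.join()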
A pipe is a communication channel between processes that can send and receive messages.
Pipe() returns a tuple of Connection objects representing the two ends of the pipe. Each connection object has the following methods:
- send: sends data to the other end of the pipe
- recv: waits for data from the other end of the pipe (if the pipe has been closed, this raises EOFError)
- close: close the pipe

Unlike queues, pipes can support two-way communication between processes.
import multiprocessing
def chatty(conn): #this takes a Connection object representing one end of a pipe
    msg = conn.recv()
    conn.send("you sent me '" + msg + "'")
# Create the two ends of the pipe.
(c1,c2) = multiprocessing.Pipe()
# Spin off the process that runs the "Chatty" function.
p1 = multiprocessing.Process(target = chatty, args = (c2, ))
p1.start()
# Send a message to the process, and receive its response.
c1.send("Hello!")
result = c1.recv()
p1.join()
print(result)
you sent me 'Hello!'
joblib
joblib is a wonderful package that uses multiprocessing on the backend, but simplifies things greatly by removing a lot of boilerplate.
The most likely source of duplicated effort is in loops.
import numpy as np
def identity(value):
    return np.sqrt(value ** 2)
# Now do some computation.
array = range(100000)
retval = []
for i in array:
    retval.append(identity(i))
An important observation: no value in the array depends on any other, so the order in which these computations are performed doesn't matter. Plus, it takes forever to run this on 100,000 numbers one after another.
So why not perform them at the same time?
With multiprocessing, we had to set up a Pool and a map. Not with joblib:
from joblib import Parallel, delayed
retval = Parallel(n_jobs = 8, verbose = 1)(delayed(identity)(i) for i in array)
[Parallel(n_jobs=8)]: Done 10176 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 100000 out of 100000 | elapsed: 0.9s finished
This is a bit tricky at first, but I promise it's more straightforward than multiprocessing.
Let's take the code bit by bit.
retval = Parallel(n_jobs = 8, verbose = 1)(delayed(identity)(i) for i in array)
- Parallel(n_jobs = 8, verbose = 1): This sets up the parallel computation, specifying we want to use 8 separate processes. The verbose argument is just for logging--the higher the number, the more debugging output it spits out.
- delayed(identity): This is a little syntax trick by joblib, but basically: whatever function you want to run in parallel, you pass in to the delayed() function.
- (i): This is the argument you want to pass to the function you're running in parallel.
- for i in array: and this is the loop through the data you want to process in parallel.

All the same pieces are there as in multiprocessing!
joblib just streamlines the process by assuming loops are your primary source of repetition (a good assumption), so it bakes its machinery into the loop structure.
Anytime you do parameter scans, data point preprocessing, anything that is "embarrassingly parallel" or that would use multiprocessing.Pool, use joblib instead.
Just a quick look at one of the more cutting-edge Python packages: numba
Let's say you're trying to compute the Frobenius norm on an alignment matrix from a molecular dynamics simulation.
(that's just a fancy way of saying "element-wise Euclidean distance")
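In symbols (just restating the definition), for a matrix $A$ with entries $a_{ij}$:

$$\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$$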
def frob(matrix):
    rows = matrix.shape[0]
    cols = matrix.shape[1]
    frob_norm = 0.0
    for i in range(rows):
        for j in range(cols):
            frob_norm += matrix[i, j] ** 2
    return np.sqrt(frob_norm)
Let's see how it works.
import numpy as np
x1 = np.random.random((10, 10)) # A 10x10 random matrix
f1 = frob(x1)
print(f1)
5.71937796683
Cool. Seems to have worked reasonably well. Out of sheer curiosity, how long did that take to run?
%timeit frob(x1)
10000 loops, best of 3: 46 µs per loop
Not bad: on the order of $10^{-5}$ seconds per run.
How well does it scale if the matrix is an order of magnitude larger?
x2 = np.random.random((100, 100)) # A 100x100 random matrix!
f2 = frob(x2)
print(f2)
57.7465642187
%timeit frob(x2)
100 loops, best of 3: 4.12 ms per loop
Yikes--an order of magnitude in data size, but two orders of magnitude in runtime increase.
Let's try one more data size increase. I have a bad feeling about this...
x3 = np.random.random((1000, 1000)) # Yikes
f3 = frob(x3)
print(f3)
577.384362088
%timeit frob(x3)
1 loop, best of 3: 397 ms per loop
Another order of magnitude on the data, another two orders of magnitude on the runtime. Clearly not a good trend. Maybe a quadratic trend, in fact?
Point being, this code doesn't scale. At all.
Of course, the problem lies in the fact that you could be using NumPy array broadcasting. But let's say you didn't know about it.
Or, much more likely, it's a very small part of a much larger scientific program--complete with subprocesses and multiprocessing--and it's going to be tough to isolate a single part and optimize it.
"Just-in-time" compilation to the rescue!
from numba import jit
@jit
def frob2(matrix):
    rows = matrix.shape[0]
    cols = matrix.shape[1]
    frob_norm = 0.0
    for i in range(rows):
        for j in range(cols):
            frob_norm += matrix[i, j] ** 2
    return np.sqrt(frob_norm)
I promise--other than the @jit decorator on top of the function definition, the code for frob2 is identical to that of frob.
Let's test this out on the third and largest test dataset!
%timeit frob(x3)
1 loop, best of 3: 410 ms per loop
%timeit frob2(x3)
1000 loops, best of 3: 854 µs per loop
Woo! We got nearly three orders of magnitude back!
For the sake of completeness, let's see how this compares to a full NumPy array broadcasting version.
def frob3(matrix):
    s = (matrix ** 2).sum()
    return np.sqrt(s)
%timeit frob3(x3)
1000 loops, best of 3: 1.06 ms per loop
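For completeness (an aside, not part of the original comparison): NumPy also has this built in. For a 2-D array, np.linalg.norm computes the Frobenius norm by default:

print(np.linalg.norm(x3))  # should match frob(x3) and frob3(x3)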
Yes, ladies and gentlemen: numba-optimized Python code is faster than what you can get from doing everything right using NumPy magic.
numba works its magic by selectively compiling portions of Python code so they run really, really fast.
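A quick aside (an assumption about usage, not covered in the lecture): numba compiles most effectively when it can avoid falling back on generic Python objects, which you can require with nopython mode:

from numba import njit  # shorthand for @jit(nopython=True)

@njit
def frob_nopython(matrix):
    # Same loop as frob2; in nopython mode, numba raises an error
    # instead of silently falling back to slower object mode.
    rows = matrix.shape[0]
    cols = matrix.shape[1]
    frob_norm = 0.0
    for i in range(rows):
        for j in range(cols):
            frob_norm += matrix[i, j] ** 2
    return np.sqrt(frob_norm)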
"Interpreted" languages versus "compiled" languages was one of the first concepts we discussed early on.
The takeaway back then was that interpreted languages tend to be slower than compiled ones. The key phrase is tend to: packages like numba are blurring this demarcation.
- Use subprocess to call external programs (I use it to interface with our class' Slack chat through Python!).
- Use multiprocessing or joblib for magically fast speedups through parallelism.
- numba's just-in-time compiler can add even more speed.