CBIO (CSCI) 4835/6835: Introduction to Computational Biology
In this lecture, we'll eschew all things Python and Biology, and focus entirely on the step before either of these: becoming familiar with the command line (or command prompt). By the end of this lecture, you should be able to:
If you've never used a command-line before... Don't be intimidated!
Other command prompts include
csh
(some would say the original: the "C-shell"bash
("bourne-again" shell; tends to be default on most Linux and macOS systems)ksh
(Korn shell)zsh
(Z shell)If you're on a Windows machine, you can either:
As of Windows 10, there is now a "bash subsystem" you can enable which is a fully-functional bash command prompt!
Follow these instructions: https://docs.microsoft.com/en-us/windows/wsl/install-win10
Once the subsystem is installed, you can configure it following these instructions: https://docs.microsoft.com/en-us/windows/wsl/initialize-distro
I'd highly recommend this! You'll be able to tinker with the command line through JupyterHub, but it's really nice to have on your own machine.
I have a macOS laptop, an Ubuntu workstation, a bunch of RedHat servers, and a Windows 10 home desktop.
I'm most at home with either macOS or Ubuntu.
It's like learning another language: you'll only get better at it if you immerse yourself in it, even when you don't want to.
You've fired up the command prompt (or Terminal
in macOS). How do you see what's in the current folder?
Last login: Mon Jan 9 18:36:07 on ttys006 example1:~ squinn$ ls Applications Dropbox Music SpiderOak Hive Desktop Google Drive Pictures metastore_db Documents Library Programming nltk_data Downloads Movies Public rodeo.log example1:~ squinn$
ls
¶Allows you to view the contents of the current directory--folders and files.
But how do we tell the difference between the two? Use an optional -l
flag.
example1:~ squinn$ ls -l total 264 drwx------ 7 squinn staff 238 Oct 23 2015 Applications drwx------+ 59 squinn staff 2006 Jan 9 17:49 Desktop drwx------+ 20 squinn staff 680 Dec 23 09:35 Documents drwx------+ 5 squinn staff 170 Jan 9 18:27 Downloads drwx------@ 17 squinn staff 578 Jan 8 18:03 Dropbox drwx------@ 49 squinn staff 1666 Jan 4 15:47 Google Drive drwx------+ 74 squinn staff 2516 Nov 17 15:06 Library drwx------+ 6 squinn staff 204 May 20 2015 Movies drwx------+ 5 squinn staff 170 Oct 22 2014 Music drwx------+ 18 squinn staff 612 Jul 29 11:31 Pictures drwxr-xr-x 37 squinn staff 1258 Jan 4 15:57 Programming drwxr-xr-x+ 5 squinn staff 170 Oct 21 2014 Public drwx------@ 8 squinn staff 272 Jun 30 2015 SpiderOak Hive drwxr-xr-x 9 squinn staff 306 Sep 17 2015 metastore_db drwxr-xr-x 4 squinn staff 136 Apr 27 2016 nltk_data -rw-r--r-- 1 squinn staff 131269 Jan 9 18:32 rodeo.log example1:~ squinn$
Anything that starts with a d
on the left is a folder (or directory), otherwise it's a file.
Ok, that's cool. I can tell what is what where I currently am. ...but wait, how do I even know where I am?
example1:~ squinn$ pwd /home/squinn example1:~ squinn$
pwd
¶Pretty straightforward--stands for Print Working Directory. Gives you the full path to where you are currently working. Not really any other needed optional flags.
Great! Now I know where I am, and what is what where I am. How do I move somewhere else?
This would be the same thing as double-clicking a folder to move into it.
example1:~ squinn$ cd Music/ example1:Music squinn$ ls iTunes example1:Music squinn$
You'll notice the output of the ls
command has now changed, which hopefully isn't surprising.
Since we've Changed Directories with the cd
command--you essentially double-clicked the "Music" folder--now we're in a different folder with different contents; in this case, a lone "iTunes" folder.
Folders within folders represent a recursive hierarchy. We won't delve too much into this concept, except to say that, unless you're in the root directory (/
on Linux, C:\
on Windows), there is always a parent directory--the enclosing folder around the folder you are currently in.
This gives rise to a hierarchical structure of folders.
Therefore, while you can always change to a very specific directory by supplying the full path--
example1:~ squinn$ cd /home/squinn/Dropbox example1:Dropbox squinn$ ls Cilia_Papers Imaging_Papers OdorAnalysis Public Computer Case LandUseChange OrNet cilia movies Icon? NSF_BigData_2015 OrNet Videos example1:Dropbox squinn$
--I can also navigate to the parent folder of my current location, irrespective of my specific location, using the special ..
notation.
cd ..
¶Takes you up one level to the parent directory of where you currently are.
example1:Dropbox squinn$ pwd /home/squinn/Dropbox example1:Dropbox squinn$ cd .. example1:~ squinn$ pwd /home/squinn example1:~ squinn$
It's important to distinguish relative versus absolute paths when you're changing directories.
..
" path is a good example: it means "go to the parent directory of wherever I am now". That "wherever I am now" is the relative part, so typing "cd ..
" will put you in a different parent directory depending on where you start.cd /home/squinn
" will always always take me to the "/home/squinn
" folder, no matter where I was when I typed the command...
/tmp/file.txt
some/folder/file.txt
Found the pattern yet?
Anything with a preceding /
is an absolute path; otherwise, it is considered relative.
Let's see some other examples!
example1: squinn$ ls Lecture1.ipynb example1: squinn$ ls -l total 40 -rw-r--r-- 1 squinn staff 18620 Jan 5 19:54 Lecture1.ipynb example1: squinn$ pwd /home/squinn/teaching/4835/lectures example1: squinn$ cd .. example1: squinn$ pwd
What prints out?
~/
/home/squinn
/home/squinn/teaching
/home/squinn/teaching/4835
$ ls -l total 8 -rw-rw-r-- 1 squinn staff 19 Sep 3 09:08 hello.txt drwxrwxr-x 2 squinn staff 4096 Sep 3 09:08 lecture $ ls *.txt
What prints out?
hello.txt
*.txt
hello.txt lecture
du - disk usage of files/directores
[squinn tmp]$ du -s 146564 . [squinn tmp]$ du -sh 144M . [squinn tmp]$ du -sh intro 4.0K intro
df - usage of full disk
[squinn tmp]$ df -h . Filesystem Size Used Avail Use% Mounted on pulsar:/home 37T 28T 9.3T 75% /net/pulsar/home
From time to time, you'll probably be wondering where you can find certain things.
updatedb
first, and every time you install something new. But it's super efficient.python
) and want to know where it's installed on your computer. You can saw which python
and it will tell you.man <command>
to learn more than you ever cared to know about how to use it!"Environment variables" are ways of storing little bits of information for you to re-use in your commands. We'll see a version of this in Python, too!
NAME=value
: set NAME equal to value No spaces around equals
export NAME=value
: set NAME equal to value and make it stick
\$
: dereference variable. This is a fancy way of saying "use this variable in something useful." As an example:
$> X=3 $> echo $X 3 $> X=hello $> echo $X hello $> echo X X
Which does not print the value of X?
echo $X
echo ${X}
echo '$X'
echo "$X"
It's a little tricky! The key is to notice the little "backtick" operators--these things "`"--that enclose the command you want to run.
Then mentally substitute that command in for its variable, and hopefully you kinda see how it works.
env list all set environment variables (the variables we defined from before)
PATH where shell searches for commands. It is VERY IMPORTANT that you leave this variable alone unless you know what you're doing. Although it's kind of inevitable that everyone who uses the command line nukes this variable accidentally at least once...
LD_LIBRARY_PATH library search path (don't worry about this too much)
PYTHONPATH where python searches for modules (also don't worry about this, especially if you just use JupyterHub for everything)
.bashrc initialization file for bash - set PATH etc. This is like a "startup" file that's run every time you open the command prompt, so if there are any variables you want to take on specific values every single time you open up a terminal, this is the place to set them.
After you've been using the terminal for awhile, you may want to look at all the commands you've run! Or perhaps it took you awhile to figure out exactly how to run a command, and a few days/weeks/months later you find you need to run that command again. There are ways of looking into your command history:
history show commands previously issued
up arrow cycle through previous commands
Ctrl-R search through history for command AWESOME
.bash_history file that stores the history
HISTCONTROL environment variable that sets history options: ignoredups
HISTSIZE size of history buffer
Some clever shortcuts on the command line. By definition, they're not necessary, but they can make your work more efficient.
Tab autocomplete
Ctrl-D EOF/logout/exit
Ctrl-A go to beginning of line
Ctrl-E go to end of line
alias new=cmd. This is a fun one: if you don't like how certain commands are named, you can make up your own!
make a nickname for a command $ alias l='ls -l' $ alias $ l total 40 -rw-r--r-- 1 squinn staff 18620 Jan 5 19:54 Lecture1.ipynb
which is the output of ls -l
that we saw earlier!
The first word you type is the program you want to run. bash will search PATH for an appropriately named executable and run it with the specified arguments.
Some example commands we'll be using a lot:
First, a quick review: some of the commands we just covered that will be coming back here.
"I/O" is a bit of technical slang for "input/output". It refers to the things that go into a program, and the things that come back out.
For example, the command echo "Hello, world!"
can be considered a program--"echo
"--with input "Hello, world!"
. It's just a type of program where the input and output are the same.
With the command line, we therefore refer to specific concepts known as standard input and standard output. Don't worry, this is pretty straightforward:
So when you type echo "Hello, world!"
, the input is provided through standard input (since you typed it), and the output is sent through standard output (since it appeared in the command prompt right below the command you typed).
We can, if we want, redirect those inputs and outputs. For example: if we have a long sequence of programs, perhaps we want the standard output of one program to be the input to the next one.
> send standard output to file
$ echo Hello > h.txt
>> append to file
$ echo World >> h.txt
A quick exercise!
$ echo Hello > h.txt $ echo World >> h.txt $ cat h.txt
What prints out?
A slightly different example:
$ echo Hello > h.txt $ echo World > h.txt $ cat h.txt
What prints out?
A pipe (|) redirects the standard output of one program to the standard input of another. It's like you typed the output of the first program into the second. This allows us to chain several simple programs together to do something more complicated.
wc
is a very useful little command for "word counting".
$ wc h.txt 1 2 14 h.txt
This means the text in h.txt
has 1 line, 2 words, and 14 letters total.
But rather than feed the text into a file first, we could echo it and pipe it directly into the wc
command as input:
$ echo hello, world! | wc 1 2 14
Same output! But no filename, because we skipped that step.
A frequent part of computational biology is, sadly, futzing with data in text files. But with a little bit of knowledge of text-processing command line programs, you can do a lot of this without even using something like Python.
Some programs we've seen:
cat
: dump file to stdoutmore
: paginated output (a nicer version of cat
, basically)head
: show first 10 linestail
: show last 10 lineswc
: count lines/words/charactersSome new ones:
sort
: sort file by line and print out (-n for numerical sort)uniq
: remove adjacent duplicates (-c to count occurances)cut
: extract fixed width columns from fileHere's a fun example that ties some of these commands up using pipes:
$ cat text a b a b b $ cat text | uniq | wc
What is the first number to print out?
A slight wrinkle:
$ cat text a b a b b $ cat text | sort | uniq | wc
What is the first number to print out?
It's great to be able to dump out text using cat
, count words and letters with wc
, and sort
and uniq
-ify the outputs.
But what if I wanted to search a text file for a specific word? Or extract the third word of every line in a file? Or pull out every line that started with the word "The"?
These are all a bit beyond the basic utilities we've described, so we have to explore some more advanced programs.
In particular, we're looking at these three commands:
grep
: search contents of file for expressionsed
: stream editor - perform substitutionsawk
: pattern scanning and processing, great for dealing with data in columnsWe'll spend a slide on each one.
This searches a file's contents for whatever pattern you specify. The command is organized this way:
pattern
can be any collection of letters that you want to match. The lines where the matches are found (if any) are printed to standard out if grep
finds any.
grep
also has a few flags to change how it works:
-r
: recursive search-I
: skip over binary files-s
: suppress error messages-n
: show line numbers-A
: show N lines after match-B
: show N lines before matchYay exercise!
$ cat text a b a b b $ grep a text | wc
What is the first number to print out?
sed
is a great search and replace command. If you've ever used "Find and Replace" in Microsoft Word or something like that, this is the command-line version.
It works with this pattern:
sed 's/pattern/replacement/' file
-i
: replace in-place (overwrites input file; the default behavior without this flag is to print the changes to standard output and leave the original file as-is)Basically, I'm searching the file for letters or words that exactly match pattern
, and then replacing those with whatever I put in replacement
.
So! How about an exercise?
$ sed 's/a/b/' text | uniq | wc
What is the first number to print out?
This is the single most advanced utility we'll cover for the command line. I'm not overstating to say knowledge and especially mastery of awk
gets you jobs.
awk
is pretty much a fully-functioning language unto itself, built for scanning, processing, and transforming text data. It's largely used to extract rows or columns of data according to potentially-complex conditions, but it can do just about anything you'd like, really.
It processes a file line-by-line, and if a condition holds true, it runs a simple program on the text in that line.
awk 'optional condition {awk program}' file
Before we get to exercises, let's try a few things out.
$ cat names id last,first 1 Smith,Alice 2 Jones,Bob 3 Smith,Charlie
Try these:
$ awk '{print $1}' names $ awk -F, '{print $2}' names $ awk 'NR > 1 {print $2}' names $ awk '$1 > 1 {print $0}' names $ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c
Hint: $1
, $2
, and similar refer to columns in the text (0-indexed!)
Let's walk through a couple of examples with some real files. You can download these off the course website. In fact, you can even do it from the command line with wget
!
mkdir intro cd intro wget https://eds-uga.github.io/cbio4835-fa18/files/Spellman.csv wget https://eds-uga.github.io/cbio4835-fa18/files/1shs.pdb
wc Spellman.csv (gives number of lines, because of header this is off by one) grep YA Spellman.csv |wc grep ^YA Spellman.csv |wc (this is a bit better, ^ matches begining of line) grep ^YA -c Spellman.csv (grep can provide the count itself) awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc awk -F, 'NR > 1 {print $1,$2}' Spellman.csv | sort -k2,2 -n | tail awk -F, 'NR > 1 {print $1,$2}' Spellman.csv | sort -k2,2 -n -r | tail awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv |wc awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $4-$2,$0}' Spellman.csv | sort -n -k1,1
grep ^ATOM 1shs.pdb > newpdb.pdb (^matches beginning of line) grep ^ATOM 1shs.pdb | awk '$5 == "A" {print $0}' #this is UNSAFE with pdb files since there is no guarantee that fields #will be whitespace seperated, safer is: grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' > newpdb.pdb grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' | cut -b 78- | sort | uniq -c