Lecture 2: Command Line Basics

CBIO (CSCI) 4835/6835: Introduction to Computational Biology

Overview and Objectives

In this lecture, we'll eschew all things Python and Biology, and focus entirely on the step before either of these: becoming familiar with the command line (or command prompt). By the end of this lecture, you should be able to:

  • Name the different kinds of command line "shells"
  • Navigate through the folders of a filesystem
  • Perform basic text parsing using bash commands

Part 1: BASH Basics

If you've never used a command-line before... Don't be intimidated!

Bash is to command prompts as Windows is to operating systems

Other command prompts include

  • csh (some would say the original: the "C-shell")
  • bash ("Bourne-again" shell; the default on most Linux systems, and on macOS before Catalina, which switched to zsh)
  • ksh (Korn shell)
  • zsh (Z shell)

Think of the fancy point-and-click user-interfaces as running commands on a prompt behind-the-scenes whenever you click something

I highly recommend either Linux (Ubuntu, Mint, RedHat) or macOS. The Windows MS-DOS prompt is something else entirely.

If you're on a Windows machine, you can either:

I have a macOS laptop, an Ubuntu workstation, a bunch of RedHat servers, and a Windows 10 home desktop.

I'm most at home with either macOS or Ubuntu.

It's like learning another language: you'll only get better at it if you immerse yourself in it, even when you don't want to.

Diving in!

You've fired up the command prompt (or Terminal in macOS). How do you see what's in the current folder?

Last login: Mon Jan  9 18:36:07 on ttys006
example1:~ squinn$ ls
Applications   Dropbox        Music          SpiderOak Hive
Desktop        Google Drive   Pictures       metastore_db
Documents      Library        Programming    nltk_data
Downloads      Movies         Public         rodeo.log
example1:~ squinn$ 

ls

Allows you to view the contents of the current directory--folders and files.

But how do we tell the difference between the two? Use an optional -l flag.

(aside: "flags" are options to commands that slightly tweak their behavior to account for different user intentions--like "quit" versus "force quit")

example1:~ squinn$ ls -l
total 264
drwx------   7 squinn  staff     238 Oct 23  2015 Applications
drwx------+ 59 squinn  staff    2006 Jan  9 17:49 Desktop
drwx------+ 20 squinn  staff     680 Dec 23 09:35 Documents
drwx------+  5 squinn  staff     170 Jan  9 18:27 Downloads
drwx------@ 17 squinn  staff     578 Jan  8 18:03 Dropbox
drwx------@ 49 squinn  staff    1666 Jan  4 15:47 Google Drive
drwx------+ 74 squinn  staff    2516 Nov 17 15:06 Library
drwx------+  6 squinn  staff     204 May 20  2015 Movies
drwx------+  5 squinn  staff     170 Oct 22  2014 Music
drwx------+ 18 squinn  staff     612 Jul 29 11:31 Pictures
drwxr-xr-x  37 squinn  staff    1258 Jan  4 15:57 Programming
drwxr-xr-x+  5 squinn  staff     170 Oct 21  2014 Public
drwx------@  8 squinn  staff     272 Jun 30  2015 SpiderOak Hive
drwxr-xr-x   9 squinn  staff     306 Sep 17  2015 metastore_db
drwxr-xr-x   4 squinn  staff     136 Apr 27  2016 nltk_data
-rw-r--r--   1 squinn  staff  131269 Jan  9 18:32 rodeo.log
example1:~ squinn$

Anything that starts with a d on the left is a folder (or directory), otherwise it's a file.
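
If you want to try this without touching your own files, here is a minimal sketch in a throwaway directory. The names lecture and hello.txt are just placeholders, and it borrows cut and $(...) from later in this lecture to pull out the first character of each ls -l line.

```shell
# Work in a temporary directory so nothing in your home folder is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

mkdir lecture            # a directory
echo "hi" > hello.txt    # a regular file

# Full listing: the first character of each line tells the two apart.
ls -l

# Keep only that first character: 'd' for a directory, '-' for a regular file.
lecture_type=$(ls -ld lecture | cut -c 1)
hello_type=$(ls -ld hello.txt | cut -c 1)
echo "lecture:   $lecture_type"
echo "hello.txt: $hello_type"
```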

Ok, that's cool. I can tell what is what where I currently am. ...but wait, how do I even know where I am?

example1:~ squinn$ pwd
/home/squinn
example1:~ squinn$

pwd

Pretty straightforward--stands for Print Working Directory. Gives you the full path to where you are currently working. It doesn't really need any optional flags.

Great! Now I know where I am, and what is what where I am. How do I move somewhere else?

example1:~ squinn$ cd Music/
example1:Music squinn$ ls
iTunes
example1:Music squinn$ 

You'll notice the output of the ls command has now changed, which hopefully isn't surprising.

Since we've Changed Directories with the cd command--you essentially double-clicked the "Music" folder--now we're in a different folder with different contents; in this case, a lone "iTunes" folder.

Folders within folders represent a recursive hierarchy. We won't delve too much into this concept, except to say that, unless you're in the root directory (/ on Linux, C:\ on Windows), there is always a parent directory--the enclosing folder around the folder you are currently in.

Therefore, while you can always change to a very specific directory by supplying the full path--

example1:~ squinn$ cd /home/squinn/Dropbox
example1:Dropbox squinn$ ls
Cilia_Papers     Imaging_Papers   OdorAnalysis     Public
Computer Case    LandUseChange    OrNet            cilia movies
Icon?            NSF_BigData_2015 OrNet Videos
example1:Dropbox squinn$

--I can also navigate to the parent folder of my current location, irrespective of my specific location, using the special .. notation.

cd ..

Takes you up one level to the parent directory of where you currently are.

example1:Dropbox squinn$ pwd
/home/squinn/Dropbox
example1:Dropbox squinn$ cd ..
example1:~ squinn$ pwd
/home/squinn
example1:~ squinn$
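
Here is a small sketch that strings these moves together, assuming a made-up folder layout (teaching/lectures and teaching/homework) inside a temporary directory. It also uses cd - (jump back to the previous directory), which the slides above don't cover.

```shell
# Build a small, hypothetical folder tree to walk around in.
base=$(mktemp -d)
mkdir -p "$base/teaching/lectures" "$base/teaching/homework"

cd "$base/teaching/lectures"   # full path: works from anywhere
pwd

cd ..                          # up one level, into teaching
pwd

cd lectures                    # relative path: down into a child folder
cd ../homework                 # up one level, then into a sibling folder
here=$(basename "$(pwd)")
echo "now in: $here"

cd -                           # back to the previous directory (prints it)
back=$(basename "$(pwd)")
echo "now in: $back"
```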

Let's see some other examples!

example1: squinn$ ls
Lecture1.ipynb
example1: squinn$ ls -l
total 40
-rw-r--r--  1 squinn  staff  18620 Jan  5 19:54 Lecture1.ipynb
example1: squinn$ pwd
/home/squinn/teaching/4835/lectures
example1: squinn$ cd ..
example1: squinn$ pwd

What prints out?

  • ~/
  • /home/squinn
  • /home/squinn/teaching
  • /home/squinn/teaching/4835
  • An Error

$ ls -l
total 8
-rw-rw-r-- 1 squinn staff   19 Sep  3 09:08 hello.txt
drwxrwxr-x 2 squinn staff 4096 Sep  3 09:08 lecture
$ ls *.txt

What prints out?

  • hello.txt
  • *.txt
  • hello.txt lecture
  • An Error

Spacing Out

du - disk usage of files/directories

[squinn tmp]$ du -s
146564    .
[squinn tmp]$ du -sh
144M    .
[squinn tmp]$ du -sh intro
4.0K    intro

df - usage of full disk

[squinn tmp]$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
pulsar:/home     37T   28T  9.3T  75% /net/pulsar/home
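
A quick sketch you can run anywhere, using a temporary directory and a placeholder file. The sizes and filesystem names you see will differ from the ones shown above.

```shell
# Make a small directory with one file in it.
work=$(mktemp -d)
cd "$work"
mkdir intro
echo "some text" > intro/notes.txt

# -s: one summary line per argument; -h: human-readable units (K, M, G).
du -sh intro
du -sh intro/*

# Usage of the whole disk that holds the current directory.
disk_report=$(df -h .)
echo "$disk_report"
```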

Dude, where's my stuff?

locate - find a file system-wide

find - search a directory tree

which - print location of a command

man - print manual page of a command
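
Of these, find is the one with the most options, so here is a minimal sketch of two common ones (-name and -type). The folder and file names are invented for the example.

```shell
# A hypothetical directory tree to search.
work=$(mktemp -d)
mkdir -p "$work/papers/2017"
touch "$work/papers/2017/notes.txt" "$work/papers/readme.md"

# -name matches on the file name; quote the pattern so the shell
# passes the * through to find instead of expanding it itself.
txt_files=$(find "$work" -name '*.txt')
echo "$txt_files"

# -type d keeps only directories (the starting directory counts too).
dirs=$(find "$work" -type d | wc -l)
echo "directories:" $dirs
```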

Save the Environment

NAME=value - set NAME equal to value (no spaces around the equals sign)

export NAME=value - set NAME equal to value and export it, so programs started from this shell can see it too

$ - dereference a variable (substitute its value)

$ X=3
$ echo $X
3
$ X=hello
$ echo $X
hello
$ echo X
X
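
To see what export actually changes, here is a minimal sketch that starts a child bash process and asks it for the value of X, before and after exporting. It assumes bash is installed and on your PATH.

```shell
unset X      # start clean in case X is already set
X=3          # a plain shell variable: visible in this shell only

# The single quotes stop *this* shell from expanding $X,
# so the child process has to look it up for itself.
before=$(bash -c 'echo "X is [$X]"')
echo "$before"     # X is []

export X     # now X is handed to every program started from here
after=$(bash -c 'echo "X is [$X]"')
echo "$after"      # X is [3]
```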

Getting at your variables

Which does not print the value of X?

  • echo $X
  • echo ${X}
  • echo '$X'
  • echo "$X"

Capturing Output

`cmd` - evaluates to the output of cmd

$ FILES=`ls`
$ echo $FILES 
hello.txt lecture
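
Bash also accepts $(cmd) as another way to write `cmd`; it does the same thing and is easier to nest. A small sketch, recreating the two entries from the example above in a temporary directory:

```shell
work=$(mktemp -d)
cd "$work"
touch hello.txt
mkdir lecture

# Backticks and $(...) both capture the output of a command.
FILES=`ls`
FILES2=$(ls)
echo $FILES      # hello.txt lecture

# The captured text can come from a whole pipeline.
COUNT=$(ls | wc -l)
echo "entries:" $COUNT
```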

Your Environment

env - list all set environment variables

PATH - where shell searches for commands

LD_LIBRARY_PATH - library search path

PYTHONPATH - where python searches for modules

.bashrc - initialization file for bash; set PATH etc. here
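
As a sketch of how PATH works, the snippet below writes a tiny script into a temporary directory, puts that directory at the front of PATH, and then runs the script by name alone. The command name hello-demo is made up for this example; to make a change like this permanent you would put the export line in .bashrc.

```shell
# A directory to hold our own commands (temporary, for the demo).
bindir=$(mktemp -d)

# Write a tiny script called hello-demo (a made-up name) and make it executable.
cat > "$bindir/hello-demo" <<'EOF'
#!/bin/sh
echo "hello from my own command"
EOF
chmod +x "$bindir/hello-demo"

# PATH is a colon-separated list of directories, searched left to right.
echo "$PATH"
export PATH="$bindir:$PATH"

# The shell can now find hello-demo without being given the full path.
greeting=$(hello-demo)
echo "$greeting"
```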

History

history - show commands previously issued

up arrow - cycle through previous commands

Ctrl-R - search through history for a command (AWESOME)

.bash_history - file that stores the history

HISTCONTROL - environment variable that sets history options, e.g. ignoredups

HISTSIZE - size of history buffer
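
For example, these two lines could go in your .bashrc. The specific values are only a suggestion, not something this course requires.

```shell
# Possible additions to ~/.bashrc (values are just an example).
HISTCONTROL=ignoredups   # don't record a command identical to the one before it
HISTSIZE=5000            # remember up to 5000 commands in the current session
```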

Shortcuts

Tab - autocomplete

Ctrl-D - EOF/logout/exit

Ctrl-A - go to beginning of line

Ctrl-E - go to end of line

alias new=cmd - make a nickname for a command

$ alias l='ls -l'
$ alias
$ l
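
A short sketch of the alias life cycle: define it, look at its definition, remove it. An alias typed at the prompt only lasts until you close the terminal; putting the alias line in .bashrc makes it available in every new shell.

```shell
# Define the nickname.
alias l='ls -l'

# "alias NAME" prints the definition of one alias;
# plain "alias" (as above) lists all of them.
definition=$(alias l)
echo "$definition"

# Remove the nickname again.
unalias l
```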

Commands

The first word you type is the program you want to run. bash will search PATH for an appropriately named executable and run it with the specified arguments.

  • ipython - start interactive python shell (more later)
  • ssh hostname - connect to hostname
  • passwd - change your password
  • nano - a user-friendly text editor

ssh into jupyterhub.cs.uga.edu and change your password

Part 2: Text Manipulation

Review

ls - list files

cd - change directory

pwd - print working (current) directory

.. - special file that refers to parent directory

. - the current directory

cat file - print out contents of file

more file - print contents of file with pagination

I/O Redirection

> - send standard output to file

$ echo Hello > h.txt

>> - append to file

$ echo World >> h.txt

< - send file to standard input of command

2> - send standard error to file

>& - send output and error to file
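
The error-related operators are easier to understand with a command that writes to both streams. Below is a sketch using a made-up helper function, noisy, that prints one line to standard output and one to standard error. It uses > file 2>&1, the portable spelling of what bash's >& shorthand does.

```shell
work=$(mktemp -d)
cd "$work"

# Hypothetical helper: one line to standard output, one to standard error.
noisy() {
  echo "normal output"
  echo "error output" >&2
}

# Send the two streams to two different files.
noisy > out.txt 2> err.txt
cat out.txt      # normal output
cat err.txt      # error output

# Send both streams to the same file.
noisy > both.txt 2>&1

# < feeds a file to a command's standard input.
lines=$(wc -l < both.txt)
echo "both.txt has" $lines "lines"
```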

$ echo Hello > h.txt
$ echo World >> h.txt
$ cat h.txt

What prints out?

  • Hello
  • World
  • HelloWorld
  • Hello
    World
  • An Error

$ echo Hello > h.txt
$ echo World > h.txt
$ cat h.txt

What prints out?

  • Hello
  • World
  • HelloWorld
  • Hello
    World
  • An Error

Pipes

A pipe (|) redirects the standard output of one program to the standard input of another. It's like you typed the output of the first program into the second. This allows us to chain several simple programs together to do something more complicated.

$ echo Hello World | wc
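
Pipes can be chained as many times as you like. A minimal sketch with made-up input, using wc -w (count words only), sort, and head from the next section:

```shell
# Two programs: echo produces text, wc -w counts the words in it.
words=$(echo Hello World | wc -w)
echo $words       # 2

# Three programs: printf produces three lines, sort orders them
# alphabetically, head -n 1 keeps only the first line.
first=$(printf 'banana\napple\ncherry\n' | sort | head -n 1)
echo "$first"     # apple
```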

Simple Text Manipulation

cat - dump file to stdout

more - paginated output

head - show first 10 lines

tail - show last 10 lines

wc - count lines/words/characters

sort - sort file by line and print out (-n for numerical sort)

uniq - remove adjacent duplicates (-c to count occurrences)

cut - extract fixed-width columns from file
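
Here is a sketch of how these combine, on a small made-up file (fruit.txt). The sort | uniq -c | sort -n -r chain is a common way to build a "most frequent first" table.

```shell
work=$(mktemp -d)
cd "$work"

# A made-up file with one word per line.
printf 'pear\napple\npear\nfig\napple\npear\n' > fruit.txt

head -n 2 fruit.txt     # first two lines
tail -n 1 fruit.txt     # last line
cut -c 1-2 fruit.txt    # first two characters of every line

# Count how often each word occurs, most frequent first.
sort fruit.txt | uniq -c | sort -n -r

# Keep only the most frequent entry.
top=$(sort fruit.txt | uniq -c | sort -n -r | head -n 1)
echo "$top"
```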

$ cat text
a
b
a
b
b
$ cat text | uniq | wc

What is the first number to print out?

  • 1
  • 2
  • 3
  • 4
  • 5
  • None of the above

$ cat text
a
b
a
b
b
$ cat text | sort | uniq | wc

What is the first number to print out?

  • 1
  • 2
  • 3
  • 4
  • 5
  • None of the above

Advanced Text Manipulation

grep - search contents of file for expression

sed - stream editor; perform substitutions

awk - pattern scanning and processing, great for dealing with data in columns

grep

Search file contents for a pattern.

grep pattern file(s)

  • -r recursive search
  • -I skip over binary files
  • -s suppress error messages
  • -n show line numbers
  • -A N show N lines after match
  • -B N show N lines before match
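
A few of these flags in action, as a sketch on a made-up file (greek.txt). It also uses -c (count matching lines) and the ^ anchor, both of which show up in the exercise solutions at the end of this lecture.

```shell
work=$(mktemp -d)
cd "$work"
printf 'alpha\nbeta\ngamma\ndelta\n' > greek.txt

# -n prefixes each match with its line number.
numbered=$(grep -n 'mm' greek.txt)
echo "$numbered"        # 3:gamma

# -c prints how many lines matched instead of the lines themselves.
count_l=$(grep -c 'l' greek.txt)
echo "$count_l"         # 2

# -A 1 also shows one line after each match.
after=$(grep -A 1 'beta' greek.txt)
echo "$after"

# ^ anchors the pattern to the beginning of the line.
starts_with_g=$(grep '^g' greek.txt)
echo "$starts_with_g"   # gamma
```
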
$ grep a text | wc

What is the first number to print out?

  • 1
  • 2
  • 3
  • 4
  • 5
  • None of the above

sed

Search and replace

sed 's/pattern/replacement/' file
  • -i replace in-place (overwrites input file)
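
One detail worth seeing for yourself: by default the substitution only touches the first match on each line. Adding a g after the last slash replaces every match. A minimal sketch on made-up input, reading from a pipe so no file gets overwritten:

```shell
# Without g: only the first "a" on the line is replaced.
first_only=$(echo "banana" | sed 's/a/o/')
echo "$first_only"      # bonana

# With g: every "a" on the line is replaced.
every_match=$(echo "banana" | sed 's/a/o/g')
echo "$every_match"     # bonono

# Each input line is processed separately.
per_line=$(printf 'cat\nhat\n' | sed 's/at/og/')
echo "$per_line"
```
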
$ sed 's/a/b/' text | uniq | wc

What is the first number to print out?

  • 1
  • 2
  • 3
  • 4
  • 5
  • None of the above

awk

Pattern scanning and processing language. We'll mostly use it to extract columns/fields. It processes a file line by line and, if a condition holds, runs a simple program on the line.

awk 'optional condition {awk program}' file

  • -Fx make x the field delimiter (default whitespace)
  • NF number of fields on current line
  • NR current record number
  • $0 full line
  • $N Nth field
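
To make those pieces concrete, here is a sketch on a small made-up table (expr.txt: a name followed by two numbers). The gene names and values are invented for illustration.

```shell
work=$(mktemp -d)
cd "$work"

# Hypothetical whitespace-separated table: name, then two measurements.
printf 'gene1 0.5 1.2\ngene2 -0.3 0.8\ngene3 1.1 0.9\n' > expr.txt

# No condition: the program runs on every line.
# Prints the record number, the number of fields, and the first field.
awk '{print NR, NF, $1}' expr.txt

# With a condition: only lines whose second field is positive.
positive=$(awk '$2 > 0 {print $1}' expr.txt)
echo "$positive"

# -F, switches the field separator from whitespace to a comma.
second=$(echo 'a,b,c' | awk -F, '{print $2}')
echo "$second"     # b
```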

awk

$ cat names
id last,first 
1 Smith,Alice
2 Jones,Bob
3 Smith,Charlie

Try these:

$ awk '{print $1}' names
$ awk -F, '{print $2}' names
$ awk 'NR > 1 {print $2}' names 
$ awk '$1 > 1 {print $0}' names
$ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c

Exercises

mkdir intro
cd intro
wget https://eds-uga.github.io/cbio4835-sp17/files/Spellman.csv
wget https://eds-uga.github.io/cbio4835-sp17/files/1shs.pdb

  • How many data points are in Spellman.csv?
  • The first three letters of the systematic open reading frames are: 'Y' for yeast, the chromosome number, then the chromosome arm. In the dataset, how many ORFs from chromosome A are there?
  • How many are there from each chromosome?
    • each chromosome arm?
  • How many data points start with a positive expression value?
  • What are the 10 data points with the highest initial expression values?
    • Lowest?
  • How many lines are there where expression values are continuously increasing for the first 3 time steps?
  • Sorted by biggest increase?

wc Spellman.csv   (gives the number of lines; because of the header line this is one more than the number of data points)
grep YA Spellman.csv |wc
grep ^YA Spellman.csv |wc  (this is a bit better, ^ matches beginning of line)
grep ^YA -c Spellman.csv  (grep can provide the count itself)
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c
awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n | tail
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n -r | tail
awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv  |wc
awk -F, 'NR > 1 && $3 > $2 && $4 > $3  {print $4-$2,$0}' Spellman.csv   | sort -n -k1,1

More Exercises

  • Create a pdb file from 1shs that consists of only ATOM records.
  • Create a pdb with only ATOM records from chain A.
  • How many carbon atoms are in this file?

grep ^ATOM 1shs.pdb > newpdb.pdb (^ matches beginning of line)
grep ^ATOM 1shs.pdb | awk '$5 == "A" {print $0}'
#this is UNSAFE with pdb files since there is no guarantee that fields
#will be whitespace separated, safer is:
grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' > newpdb.pdb

grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' | cut -b 78- | sort | uniq -c
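
To see why the whitespace version is unsafe, here is a sketch with two made-up ATOM lines laid out in PDB's fixed columns (chain ID in column 22, residue number in columns 23-26, element in columns 77-78). The helper atom_line and all coordinate values are invented for illustration and are not taken from 1shs.pdb. When the residue number has four digits it runs straight into the chain ID, so awk's fifth field becomes "A1000" rather than "A".

```shell
# Hypothetical helper: print one ATOM record in fixed PDB columns.
# Arguments: atom serial number, residue number. Everything else is made up.
atom_line() {
  printf '%-6s%5s %-4s%1s%3s %1s%4s%1s   %8s%8s%8s%6s%6s          %2s\n' \
    ATOM "$1" ' CA ' '' ALA A "$2" '' 1.000 2.000 3.000 1.00 20.00 C
}

work=$(mktemp -d)
atom_line 1 12   >  "$work/demo.pdb"   # chain and residue number separated by spaces
atom_line 2 1000 >> "$work/demo.pdb"   # 4-digit residue number touches the chain ID
cat "$work/demo.pdb"

# Whitespace fields: the second line's 5th field is "A1000", so it is missed.
by_field=$(awk '$5 == "A"' "$work/demo.pdb" | wc -l)

# Fixed columns: character 22 is the chain ID on both lines.
by_column=$(awk 'substr($0,22,1) == "A"' "$work/demo.pdb" | wc -l)

echo "matched by whitespace field:" $by_field
echo "matched by fixed column:    " $by_column

# Same idea for the element symbol: count by column, not by field.
elements=$(cut -b 78- "$work/demo.pdb" | sort | uniq -c)
echo "$elements"
```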

Administrivia