CBIO (CSCI) 4835/6835: Introduction to Computational Biology
BioPython is a versatile Python package for computational biology, particularly if your interest is in sequence analysis. In addition to performing multiple alignments, it can download sequences from the internet, construct phylogenetic trees, and even run population genetics simulations. By the end of this lecture, you should be able to:
Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.
Other modules that might be of interest:
Recall from lectures gone by: http://www.ncbi.nlm.nih.gov/gene/5216
First line is description of sequence and starts with >
All lines up to the next > are part of the same sequence
Usually fewer than 80 characters per line
>gi|568815581:c4949086-4945650 Homo sapiens chromosome 17, GRCh38.p2 Primary Assembly
CCCGCAGGGTCCACACGGGTCGGGCCGGGCGCGCTCCCGTGCAGCCGGCTCCGGCCCCGACCGCCCCATG
CACTCCCGGCCCCGGCGCAGGCGCAGGCGCGGGCACACGCGCCGCCGCCCGCCGGTCCTTCCCTTCGGCG
GAGGTGGGGGAAGGAGGAGTCATCCCGTTTAACCCTGGGCTCCCCGAACTCTCCTTAATTTGCTAAATTT
GCAGCTTGCTAATTCCTCCTGCTTTCTCCTTCCTTCCTTCTTCTGGCTCACTCCCTGCCCCGATACCAAA
GTCTGGTTTATATTCAGTGCAAATTGGAGCAAACCCTACCCTTCACCTCTCTCCCGCCACCCCCCATCCT
TCTGCATTGCTTTCCATCGAACTCTGCAAATTTTGCAATAGGGGGAGGGATTTTTAAAATTGCATTTGCA
Annotated format. Starts with LOCUS
field. Can have several other annotation (e.g. KEYWORDS
, SOURCE
, REFERENCE
, FEATURES
).
ORIGIN
record indicates start of sequence
'\\
' indicates the end of sequence
LOCUS CAG28598 140 aa linear PRI 16-OCT-2008
DEFINITION PFN1, partial [Homo sapiens].
ACCESSION CAG28598
VERSION CAG28598.1 GI:47115277
DBSOURCE embl accession CR407670.1
KEYWORDS .
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
ORIGIN
1 magwnayidn lmadgtcqda aivgykdsps vwaavpgktf vnitpaevgv lvgkdrssfy
61 vngltlggqk csvirdsllq dgefsmdlrt kstggaptfn vtvtktdktl vllmgkegvh
121 gglinkkcye mashlrrsqy
//
Let's dive into BioPython, shall we?
from Bio.Seq import Seq # the submodule structure of biopython is a little awkward
s = Seq("GATTACA")
print(s)
GATTACA
Sequences act a lot like strings, but have an associated alphabet and additional methods.
Methods shared with str
: count
, endswith
, find
, lower
, lstrip
, rfind
, split
, startswith
, strip
,upper
Seq
-specific methods: back_transcribe
, complement
, reverse_complement
, tomutable
, tostring
, transcribe
, translate
, ungap
s.alphabet
Alphabet()
The alphabet of a sequence defines what the characters mean
from Bio.Alphabet import IUPAC
# Who is IUPAC?
The sequence objects in BioPython, or Seq
objects, act a lot like Python strings.
s[0] # This returns a string
'G'
s[2:4] # This returns sequence
Seq('TT', Alphabet())
s.lower()
Seq('gattaca', Alphabet())
s + s
Seq('GATTACAGATTACA', Alphabet())
Another blast from the past. But now we can do this using Python.
DNA coding strand (aka Crick strand, strand +1)
5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’
|||||||||||||||||||||||||||||||||||||||
3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’
DNA template strand (aka Watson strand, strand −1)
|
Transcription
↓
5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’
Single stranded messenger RNA
|
Translation
↓
MAIVMGR*KGAR*
amino acid sequence (w/stop codons)
dna = Seq('GATTACAGATTACAGATTACA')
print(dna.complement(), dna.reverse_complement())
CTAATGTCTAATGTCTAATGT TGTAATCTGTAATCTGTAATC
Seq
objects may largely behave like strings, but they have all sorts of additional methods that make them extremely useful and convenient for sequence analysis.
print(dna)
GATTACAGATTACAGATTACA
rna = dna.transcribe() # No need for a transcriptase!
print(rna)
GAUUACAGAUUACAGAUUACA
protein = rna.translate()
print(protein)
DYRLQIT
print(dna.translate()) # Can go straight from the DNA to amino acid sequences.
DYRLQIT
Yes, plural. Did you think there was only one codon table?
By default the standard translation table is used, but others can be provided to the translate
method.
from Bio.Data import CodonTable
print(sorted(CodonTable.unambiguous_dna_by_name.keys()))
['Alternative Flatworm Mitochondrial', 'Alternative Yeast Nuclear', 'Archaeal', 'Ascidian Mitochondrial', 'Bacterial', 'Blepharisma Macronuclear', 'Candidate Division SR1', 'Chlorophycean Mitochondrial', 'Ciliate Nuclear', 'Coelenterate Mitochondrial', 'Dasycladacean Nuclear', 'Echinoderm Mitochondrial', 'Euplotid Nuclear', 'Flatworm Mitochondrial', 'Gracilibacteria', 'Hexamita Nuclear', 'Invertebrate Mitochondrial', 'Mold Mitochondrial', 'Mycoplasma', 'Pachysolen tannophilus Nuclear Code', 'Plant Plastid', 'Protozoan Mitochondrial', 'Pterobranchia Mitochondrial', 'SGC0', 'SGC1', 'SGC2', 'SGC3', 'SGC4', 'SGC5', 'SGC8', 'SGC9', 'Scenedesmus obliquus Mitochondrial', 'Spiroplasma', 'Standard', 'Thraustochytrium Mitochondrial', 'Trematode Mitochondrial', 'Vertebrate Mitochondrial', 'Yeast Mitochondrial']
print(CodonTable.unambiguous_dna_by_name['Standard']) # What does this structure look like?
Table 1 Standard, SGC0 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA Stop| A T | TTG L(s)| TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L(s)| CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I | ACT T | AAT N | AGT S | T A | ATC I | ACC T | AAC N | AGC S | C A | ATA I | ACA T | AAA K | AGA R | A A | ATG M(s)| ACG T | AAG K | AGG R | G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V | GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+--
Sequence data is read/written as SeqRecord
objects.
These objects store additional information about the sequence (name, id, description, features).
SeqIO
reads sequence records:
read
to read a file with a single recordparse
to iterate over file with mulitple recordsPython file I/O!
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
seq = SeqIO.read('p53.gb','genbank')
seqs = []
# http://eds-uga.github.io/cbio4835-sp17/files/p53.gb
# http://eds-uga.github.io/cbio4835-sp17/files/hydra.fasta
for s in SeqIO.parse('hydra.fasta','fasta'):
seqs.append(s)
print(len(seqs))
40
print(seq)
ID: NG_017013.2 Name: NG_017013 Description: Homo sapiens tumor protein p53 (TP53), RefSeqGene (LRG_321) on chromosome 17. Number of features: 37 /topology=linear /comment=REVIEWED REFSEQ: This record has been curated by NCBI staff in collaboration with Magali Olivier, Thierry Soussi, Jean-Christophe Bourdon, Nazneen Rahman. The reference sequence was derived from AC087388.9 and AC007421.13. This sequence is a reference standard in the RefSeqGene project. On Apr 3, 2012 this sequence version replaced gi:293651587. Summary: This gene encodes a tumor suppressor protein containing transcriptional activation, DNA binding, and oligomerization domains. The encoded protein responds to diverse cellular stresses to regulate expression of target genes, thereby inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. Mutations in this gene are associated with a variety of human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing of this gene and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons (PMIDs: 12032546, 20937277). [provided by RefSeq, Feb 2013]. Publication Note: This RefSeq record includes a subset of the publications that are available for this gene. Please see the Gene record to access additional publications. COMPLETENESS: complete on the 3' end. /source=Homo sapiens (human) /sequence_version=2 /data_file_division=PRI /date=18-MAY-2014 /gi=383209646 /accessions=['NG_017013', 'REGION:', '5001..24149'] /references=[Reference(title='G-quadruplex structures in TP53 intron 3: role in alternative splicing and in production of p53 mRNA isoforms', ...), Reference(title='Delta160p53 is a novel N-terminal p53 isoform encoded by Delta133p53 transcript', ...), Reference(title='Does the nonsense-mediated mRNA decay mechanism prevent the synthesis of truncated BRCA1, CHK2, and p53 proteins?', ...), Reference(title='p53 Family isoforms', ...), Reference(title='Li-Fraumeni Syndrome', ...), Reference(title='Characterization of the human p53 gene', ...), Reference(title='Localization of gene for human p53 tumour antigen to band 17p13', ...), Reference(title='Transformation associated p53 protein is encoded by a gene on human chromosome 17', ...), Reference(title='Molecular cloning and in vitro expression of a cDNA clone for human cellular tumor antigen p53', ...), Reference(title='Human p53 cellular tumor antigen: cDNA sequence and expression in COS cells', ...)] /keywords=['RefSeq', 'RefSeqGene'] /organism=Homo sapiens /taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo'] Seq('GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTT...GTG', IUPACAmbiguousDNA())
These files are available through the course website and repository if you'd like to use them.
However, BioPython lets you fetch sequences automatically directly off the internet, through an interface to the NCBI "Entrez" search engine.
The results of these queries are returned as file-like objects.
Just import the Entrez
interface:
from Bio import Entrez
Entrez.email = "squinn@cs.uga.edu" # BioPython forces you to provide your email
res = Entrez.read(Entrez.einfo()) # The names of all available databases
print(res)
{'DbList': ['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'unigene', 'gencoll', 'gtr']}
print(sorted(res['DbList']))
['annotinfo', 'assembly', 'bioproject', 'biosample', 'biosystems', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'dbvar', 'gap', 'gapplus', 'gds', 'gencoll', 'gene', 'genome', 'geoprofiles', 'grasp', 'gtr', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'nuccore', 'nucest', 'nucgss', 'nucleotide', 'omim', 'orgtrack', 'pcassay', 'pccompound', 'pcsubstance', 'pmc', 'popset', 'probe', 'protein', 'proteinclusters', 'pubmed', 'pubmedhealth', 'seqannot', 'snp', 'sparcle', 'sra', 'structure', 'taxonomy', 'unigene']
You can search any of the available databases for a given term, and it will return the IDs of all the relevant records.
This is done through the esearch
method in the Entrez
submodule.
result = Entrez.esearch(db = 'nucleotide', term = 'tp53')
# The result is a file-like object of the raw XML data
records = Entrez.read(result)
# This puts the object into a more palatable form (dictionary)
print(records)
{'RetMax': '20', 'Count': '7457', 'TranslationStack': [{'Term': 'tp53[All Fields]', 'Field': 'All Fields', 'Explode': 'N', 'Count': '7457'}, 'GROUP'], 'TranslationSet': [], 'IdList': ['123959767', '54792141', '1147690151', '1147390868', '1147371848', '1147371063', '1147370158', '1147370155', '1147370152', '1147284217', '1147284026', '1147281628', '1147255345', '1147255343', '1147255170', '1143017308', '1143016088', '1143015864', '1143014988', '1143014939'], 'QueryTranslation': 'tp53[All Fields]', 'RetStart': '0'}
print(records['IdList'])
print(len(records['IdList']))
['123959767', '54792141', '1147690151', '1147390868', '1147371848', '1147371063', '1147370158', '1147370155', '1147370152', '1147284217', '1147284026', '1147281628', '1147255345', '1147255343', '1147255170', '1143017308', '1143016088', '1143015864', '1143014988', '1143014939'] 20
In the previous slide, only 20 results were actually returned. However, there were considerably more hits than that--2,923 to be exact.
They weren't returned [immediately] because of the online nature of the query. Luckily, this is something we can tweak.
records = Entrez.read(Entrez.esearch(db = 'nucleotide', term = 'tp53', retmax = 50))
print(records)
print(len(records['IdList']))
{'RetMax': '50', 'Count': '7457', 'TranslationStack': [{'Term': 'tp53[All Fields]', 'Field': 'All Fields', 'Explode': 'N', 'Count': '7457'}, 'GROUP'], 'TranslationSet': [], 'IdList': ['123959767', '54792141', '1147690151', '1147390868', '1147371848', '1147371063', '1147370158', '1147370155', '1147370152', '1147284217', '1147284026', '1147281628', '1147255345', '1147255343', '1147255170', '1143017308', '1143016088', '1143015864', '1143014988', '1143014939', '1143014901', '1143012217', '1142984843', '908661396', '1143421917', '1143420384', '1143409309', '1143406388', '1143406386', '1143382759', '1137562549', '1137562520', '1137562503', '1137561523', '1137561198', '685156931', '685156929', '571026673', '571026671', '205361188', '1133563937', '1133563800', '1139437910', '1139435148', '1139424333', '761388684', '1137549668', '1137549667', '1137549666', '1137547598'], 'QueryTranslation': 'tp53[All Fields]', 'RetStart': '0'} 50
To get the full record for a given ID, we'll use the efetch
function.
To use this function, we have to provide rettype
(available options include FASTA and gb).
We also have to specify retmode
, which can be plan text or XML.
# Fetch the genbank file for the 10th ID from our search
result = Entrez.efetch(db = "nucleotide", id = records['IdList'][10], rettype = "gb", retmode = 'text')
# Parse into a seqrecord
p53 = SeqIO.read(result,'gb')
print(p53)
ID: XM_020168006.1 Name: XM_020168006 Description: PREDICTED: Castor canadensis TP53 target 5 (Tp53tg5), mRNA. Database cross-references: BioProject:PRJNA371604 Number of features: 3 /topology=linear /structured_comment=defaultdict(<class 'dict'>, {'Genome-Annotation-Data': {'Annotation Software Version': '7.3', 'Annotation Method': 'Best-placed RefSeq; Gnomon', 'Annotation Provider': 'NCBI', 'Annotation Version': 'Castor canadensis Annotation Release', 'Annotation Status': 'Full annotation', 'Features Annotated': 'Gene; mRNA; CDS; ncRNA', 'Annotation Pipeline': 'NCBI eukaryotic genome annotation'}, 'RefSeq-Attributes': {'internal stop codons': 'corrected 1 genomic stop codon', 'ab initio': '8% of CDS bases', 'frameshifts': 'corrected 1 indel'}}) /comment=MODEL REFSEQ: This record is predicted by automated computational analysis. This record is derived from a genomic sequence (NW_017870857.1) annotated using gene prediction method: Gnomon. Also see: Documentation of NCBI's Annotation Process 100 pipeline /accessions=['XM_020168006'] /source=Castor canadensis (American beaver) /sequence_version=1 /keywords=['RefSeq', 'corrected model', 'includes ab initio'] /taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Glires', 'Rodentia', 'Sciurognathi', 'Castoridae', 'Castor'] /organism=Castor canadensis /data_file_division=ROD /date=15-FEB-2017 Seq('TTAAGAATTTCTTTATTCCATCACTTAGAAAAGAACCTGGCATGCAGCAATACA...AAA', IUPACAmbiguousDNA())
Genbank files are typically annotated with a ton of additional features that refer to portions of the full sequence.
SeqRecord
objects track these features, allowing you to extract the corresponding subsequence.
p53.features
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1806), strand=1), type='source'), SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1806), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(778), ExactPosition(1645), strand=1), type='CDS')]
(CDS is a coding sequence!)
You can then extract these subsequences like you would an element from a list:
cdsfeature = p53.features[2]
print(cdsfeature)
type: CDS location: [778:1645](+) qualifiers: Key: codon_start, Value: ['1'] Key: db_xref, Value: ['GeneID:109689248'] Key: gene, Value: ['Tp53tg5'] Key: note, Value: ['The sequence of the model RefSeq protein was modified relative to its source genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon; substituted 1 base at 1 genomic stop codon'] Key: product, Value: ['LOW QUALITY PROTEIN: TP53-target gene 5 protein'] Key: protein_id, Value: ['XP_020023595.1'] Key: transl_except, Value: ['(pos:1547..1549,aa:OTHER)'] Key: translation, Value: ['MQDQKLRDNIKEPASKLIELSRLKMVLRNLSLLKLFKSSNGRIQELHHLARKCWNSILRFPKILEISSGLYSIIMRIMGLTQLSWDQGLNPALLTVFHLSVRDNNICNKVKENNELQEARSLEKQMESTREPEETRPKNLRPKVGTKAKRSPVAEPWEEKQMQPEDPRTSSTGAHRRQLHTGGSHVIFLKTSHHRTPMGDMKLLDVTDKSIWFEGLPTRIHLPGPRIMCRSSNLRWVKRCCTRFCSASLELSMCHPXVDVTVPLPGPGWRSXNQLNGRDGRENSRVYK']
Using the feature, you can extract the subsequence from the original full sequence.
coding = cdsfeature.extract(p53) # Pass the full record (p53) to the feature
print(coding)
ID: XM_020168006.1 Name: XM_020168006 Description: PREDICTED: Castor canadensis TP53 target 5 (Tp53tg5), mRNA. Number of features: 1 Seq('ATGCAAGACCAGAAACTGCGGGACAACATAAAGGAGCCTGCCAGCAAATTAATT...TAA', IUPACAmbiguousDNA())
BioPython also allows you to use NCBI's BLAST to search for similar sequences with the qblast
function.
This function has three required arguments:
A couple of examples:
from Bio.Blast import NCBIWWW
result = NCBIWWW.qblast("blastn", "nt", coding.seq)
# The variable "result" is a file-like object with XML in it
# BE WARNED: it can take a while to get results!
# Note whether there is a NUMBER or a STAR next to your JupyterHub cell!
from Bio.Blast import NCBIXML # For parsing xmls
blast_records = NCBIXML.read(result)
print(len(blast_records.alignments), len(blast_records.descriptions))
50 50
alignment = blast_records.alignments[1]
print(len(alignment.hsps))
1
hsp = alignment.hsps[0] #high scoring segment pairs
print('****Alignment****')
print('sequence:', alignment.title)
print('length:', alignment.length)
print('e value:', hsp.expect)
print(hsp.query[0:75] + '...') # what we searched with
print(hsp.match[0:75] + '...')
print(hsp.sbjct[0:75] + '...') # what we matched to
****Alignment**** sequence: gi|744560427|ref|XM_010978666.1| PREDICTED: Camelus dromedarius TP53 target 5 (TP53TG5), mRNA length: 1297 e value: 3.59825e-115 CAAGACCAGAAACTGCGGGACAACATAAAGGAGCCTGCCAGCAAATTAATTGAACTGAGCCGACTTAAAATGGTG... ||||| || ||| | |||||| ||| || |||||| ||||||| ||||||| | || ||||||||||||||||... CAAGATGAGGAACCACAGGACAAGATACAGCAGCCTGTCAGCAAAGTAATTGAGCGGAACCGACTTAAAATGGTG...
Biopython can use lots of different alignment algorithms, but it includes a pre-packaged that works decently well.
from Bio import pairwise2
alignments = pairwise2.align.globalxx("ACCGT", "ACG")
This will return the list of alignments for the two sequences.
print(pairwise2.format_alignment(*alignments[0]))
print(pairwise2.format_alignment(*alignments[1]))
ACCGT ||||| A-CG- Score=3 ACCGT ||||| AC-G- Score=3
BioPython has a bunch of other alignment tools built-in, but they rely on external programs (some are HMM-based!).
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities—their phylogeny—based upon similarities and differences in their physical or genetic characteristics. --Wikipedia
Biopython can read a variety of tree formats: Newick (clustal), NEXUS, phyloXML, NeXML, and CDAO.
from Bio import Phylo
tree = Phylo.read('hydra179.dnd','newick') # Must specify format.
tree.rooted = True
tree
Tree(rooted=True, weight=1.0)
Displaying the full phylogenetic trees can get a little interesting.
First, we can visualize them as pure ASCII, as the early internet would have intended:
Phylo.draw_ascii(tree)
______ gi|302171738|gb|ADK97770.1| ____| | |____ gi|302171740|gb|ADK97771.1| _____| | | , gi|313105485|gb|ADR32101.1| | |_______________________________| | | gi|313105490|gb|ADR32105.1| | | , gi|225423246|gb|ACN91129.1| | | | | gi|407380197|gb|AFU11414.1| | | | |, gi|302171754|gb|ADK97778.1| | || | || gi|407380047|gb|AFU11341.1| | | | ,, gi|407380097|gb|AFU11366.1| | || | || gi|407380101|gb|AFU11368.1| | || | || gi|407380017|gb|AFU11326.1| | || | || gi|407380135|gb|AFU11385.1| | || | || gi|407380117|gb|AFU11376.1| | | | |, gi|407380023|gb|AFU11329.1| | || | || gi|407380045|gb|AFU11340.1| | || | || gi|302171762|gb|ADK97782.1| | | | |_ gi|407380093|gb|AFU11364.1| | | | | , gi|302171792|gb|ADK97797.1| | | | | |,| gi|302171794|gb|ADK97798.1| | ||| | ||| gi|225423258|gb|ACN91135.1| | || | ,| gi|407380183|gb|AFU11407.1| | || | || gi|302171760|gb|ADK97781.1| | | | |__ gi|225423252|gb|ACN91132.1| | | | | , gi|407380005|gb|AFU11320.1| | | | | |_| gi|407380157|gb|AFU11396.1| | | | | | | gi|407380057|gb|AFU11346.1| | | | | , gi|302171782|gb|ADK97792.1| | | | | | | gi|407380151|gb|AFU11393.1| | | | __| |,| gi|407380113|gb|AFU11374.1| | | ,||| | | |||| gi|407380085|gb|AFU11360.1| | | ||| | | |||, gi|407380139|gb|AFU11387.1| | | |||| | | |||| gi|407380181|gb|AFU11406.1| | | ||| | | ||| gi|407380035|gb|AFU11335.1| | | ||| | | |||_ gi|407380123|gb|AFU11379.1| | | ||| | | ||| gi|407380187|gb|AFU11409.1| | | || | | ||, gi|225423250|gb|ACN91131.1| | | ||| | | |,| gi|407380176|gb|AFU11404.1| | | ||| | | ||| gi|407380103|gb|AFU11369.1| | | || | | || gi|407379993|gb|AFU11314.1| | | | | | | , gi|225423248|gb|ACN91130.1| | | | | | | | | gi|225423296|gb|ACN91154.1| | | | | | | | | gi|225423298|gb|ACN91155.1| | | | | | | | ,| gi|302171786|gb|ADK97794.1| | | | || | | |,|| gi|302171788|gb|ADK97795.1| | | ||| | | ||| gi|407380051|gb|AFU11343.1| | | || | | || , gi|225423280|gb|ACN91146.1| | | || | | | ||,| gi|302171758|gb|ADK97780.1| | | |||| | | |||| gi|407380185|gb|AFU11408.1| | | | | | | | |, gi|407380007|gb|AFU11321.1| | | | || | | | | gi|407380105|gb|AFU11370.1| | | | | | |_ gi|407380133|gb|AFU11384.1| | | | ,| | | , gi|225423276|gb|ACN91144.1| || | | | || | | | gi|407380001|gb|AFU11318.1| || | | | || | | | gi|302171752|gb|ADK97777.1| || | | | || | | | gi|313105489|gb|ADR32104.1| || | | | || |___| | gi|407380003|gb|AFU11319.1| || | | || | , gi|225423300|gb|ACN91156.1| || | | || | | gi|302171790|gb|ADK97796.1| || | | || | ,| gi|407380031|gb|AFU11333.1| || | || || | ||, gi|407380077|gb|AFU11356.1| || | |,| || | ||| gi|407380079|gb|AFU11357.1| || | || || | || gi|407380015|gb|AFU11325.1| || | | || | |, gi|302171764|gb|ADK97783.1| || |,|| || |||| gi|407380171|gb|AFU11402.1| || ||,| || |||, gi|407380033|gb|AFU11334.1| || |||| || |||| gi|407380083|gb|AFU11359.1| || ||| || |||_ gi|407380013|gb|AFU11324.1| || ||| || ||| , gi|407380037|gb|AFU11336.1| || |||,| ,|| |||| gi|407380039|gb|AFU11337.1| ||| |,| ||| ||, gi|407380173|gb|AFU11403.1| ||| ||| ||| ||| gi|407380210|gb|AFU11420.1| ||| || ||| ||_ gi|407380075|gb|AFU11355.1| ||| || ||| || gi|407380189|gb|AFU11410.1| ||| | ||| |_ gi|407380143|gb|AFU11389.1| ||| |||__ gi|302171734|gb|ADK97768.1| ||| |||___ gi|302171736|gb|ADK97769.1| || || , gi|225423290|gb|ACN91151.1| || | || | gi|407380025|gb|AFU11330.1| || | || | gi|302171774|gb|ADK97788.1| || | ||,| gi|225423244|gb|ACN91128.1| |||| ||||, gi|407380155|gb|AFU11395.1| |,||| ||| | gi|407380179|gb|AFU11405.1| ||| ||| gi|225423272|gb|ACN91142.1| || || , gi|407380061|gb|AFU11348.1| ||,| |||| gi|407380205|gb|AFU11418.1| ||| |||, gi|407380069|gb|AFU11352.1| |||| || | gi|407380195|gb|AFU11413.1| || ||_ gi|407380019|gb|AFU11327.1| || ||_ gi|407380053|gb|AFU11344.1| | | , gi|225423284|gb|ACN91148.1| | | |_| gi|407380109|gb|AFU11372.1| | | | | gi|302171750|gb|ADK97776.1| | | , gi|302171784|gb|ADK97793.1| | | | | gi|407380089|gb|AFU11362.1| | | |,| gi|225423254|gb|ACN91133.1| ||| ,|| gi|225423294|gb|ACN91153.1| || || gi|407379991|gb|AFU11313.1| | |, gi|225423260|gb|ACN91136.1| || || gi|407380131|gb|AFU11383.1| || ||, gi|407380027|gb|AFU11331.1| ||| ||| gi|407380029|gb|AFU11332.1| ||| |,| gi|302171778|gb|ADK97790.1| ||| ||| gi|225423266|gb|ACN91139.1| || ||, gi|225423292|gb|ACN91152.1| ||| ||| gi|302171776|gb|ADK97789.1| || |, gi|407380055|gb|AFU11345.1| || || gi|407380214|gb|AFU11422.1| || ||, gi|407380041|gb|AFU11338.1| ||| ||| gi|407380149|gb|AFU11392.1| ||| ||| gi|407380169|gb|AFU11401.1| ||| ,|| gi|225423268|gb|ACN91140.1| || ||, gi|302171770|gb|ADK97786.1| |,| ||| gi|407380203|gb|AFU11417.1| || ||, gi|407379997|gb|AFU11316.1| ||| ||| gi|407380009|gb|AFU11322.1| ||| ||| gi|407380141|gb|AFU11388.1| ||| ||| gi|407380011|gb|AFU11323.1| || |, gi|407380021|gb|AFU11328.1| || || gi|407380191|gb|AFU11411.1| || || gi|407379989|gb|AFU11312.1| | |_ gi|407380095|gb|AFU11365.1| | | , gi|225423270|gb|ACN91141.1| | ,| |,|| gi|407380193|gb|AFU11412.1| ||| ||| gi|407380147|gb|AFU11391.1| || ||, gi|225423288|gb|ACN91150.1| ||| ||| gi|302171768|gb|ADK97785.1| ||| ||, gi|407380043|gb|AFU11339.1| |,| ||| gi|407380111|gb|AFU11373.1| ||| ||| gi|407380212|gb|AFU11421.1| || || gi|407380067|gb|AFU11351.1| || || gi|302171756|gb|ADK97779.1| || ||, gi|407380063|gb|AFU11349.1| |,| ||| gi|407380145|gb|AFU11390.1| || || gi|407380160|gb|AFU11397.1| | |, gi|302171746|gb|ADK97774.1| || || gi|407380137|gb|AFU11386.1| | |, gi|225423286|gb|ACN91149.1| || || gi|302171766|gb|ADK97784.1| || || gi|225423256|gb|ACN91134.1| || || gi|407380164|gb|AFU11399.1| || || gi|407380199|gb|AFU11415.1| || |, gi|225423282|gb|ACN91147.1| || || gi|302171748|gb|ADK97775.1| || ,| gi|407379995|gb|AFU11315.1| || || gi|407379987|gb|AFU11311.1| || || gi|407380065|gb|AFU11350.1| | |, gi|302171772|gb|ADK97787.1| || || gi|407380127|gb|AFU11381.1| ,| |, gi|407380099|gb|AFU11367.1| || || gi|407380129|gb|AFU11382.1| | |, gi|407380119|gb|AFU11377.1| || || gi|407380121|gb|AFU11378.1| || || gi|407380081|gb|AFU11358.1| | , gi|302171780|gb|ADK97791.1| | | gi|407380073|gb|AFU11354.1| | , gi|407380107|gb|AFU11371.1| | | gi|407380153|gb|AFU11394.1| | , gi|407380115|gb|AFU11375.1| | | gi|407380201|gb|AFU11416.1| | | ___ model1_F.pdb _| _| |,| | gi|407380166|gb|AFU11400.1| ||| ||| gi|407380162|gb|AFU11398.1| || || gi|407379985|gb|AFU11310.1| || ,| gi|407380059|gb|AFU11347.1| || || gi|164609123|gb|ABY62783.1| | | gi|225423262|gb|ACN91137.1| | | gi|407380049|gb|AFU11342.1| | | gi|407380208|gb|AFU11419.1| | , gi|407380071|gb|AFU11353.1| | | gi|407380091|gb|AFU11363.1| | |, gi|225423274|gb|ACN91143.1| || || gi|225423278|gb|ACN91145.1| || || gi|302171742|gb|ADK97772.1| || ,| gi|302171744|gb|ADK97773.1| || || gi|313105486|gb|ADR32102.1| | | gi|225423264|gb|ACN91138.1| | , gi|407379999|gb|AFU11317.1| | | gi|407380087|gb|AFU11361.1| | | gi|407380125|gb|AFU11380.1|
With some plotting tools, however, we can make wonderful image visualizations of the tree.
%matplotlib inline
Phylo.draw(tree)
...maybe more "wonderful" than wonderful at first, given how much information there is.
Let's re-draw the previous tree, this time without tree labels, just to see how the branching looks.
Phylo.draw(tree, label_func = lambda x: None) # Don't worry about what this does
Now we're getting somewhere! We can clearly see some structure in the respective evolutionary offshoots.
Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF). --The Internet
from Bio import motifs # Lowercase for some reason...
m = motifs.create(["TACAA","CATGC","TACTA","CCCAA"])
Remember the "profile" from profile HMMs? It looked something like this:
print(m.counts)
0 1 2 3 4 A: 0.00 3.00 0.00 2.00 3.00 C: 2.00 1.00 3.00 0.00 1.00 G: 0.00 0.00 0.00 1.00 0.00 T: 2.00 0.00 1.00 1.00 0.00
We can visualize this information in different ways.
For instance, we can look at the consensus sequence, which is just the sequence formed by picking the nucleotide at each position that is most frequent.
print(m.counts)
print(m.consensus)
0 1 2 3 4 A: 0.00 3.00 0.00 2.00 3.00 C: 2.00 1.00 3.00 0.00 1.00 G: 0.00 0.00 0.00 1.00 0.00 T: 2.00 0.00 1.00 1.00 0.00 TACAA
We can also generate logos!
m.weblogo('logo.png', stack_width = 'large')
from IPython.display import Image
Image(filename = 'logo.png')
Remember this visualization from the last lecture? The relative heights of each letter are a measure of how frequently those nucleotides show up at those positions in the aligned sequence.
Biopython supports a number of motif formats: JASPAR, MEME, TRANSFAC.
These formats are associated with databases (JASPAR, TRANSFAC) and tools (MEME).
Of particular interest are sequence motifs for transcription factor binding.
f = open('MA0004.1.sites') # Unlike other parts of Biopython, can't just provide filename to open.
arnt = motifs.read(f,'sites') # JASPAR sites
print(arnt)
TF name None Matrix ID None Matrix: 0 1 2 3 4 5 A: 4.00 19.00 0.00 0.00 0.00 0.00 C: 16.00 0.00 20.00 0.00 0.00 0.00 G: 0.00 1.00 0.00 20.00 0.00 20.00 T: 0.00 0.00 0.00 0.00 20.00 0.00
Now we can examine all sorts of properties associated with the motif.
First, what motifs are even in this file?
print(arnt.instances)
CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG CACGTG AACGTG AACGTG AACGTG AACGTG CGCGTG
Next, what is the profile matrix of the motifs?
print(arnt.counts)
0 1 2 3 4 5 A: 4.00 19.00 0.00 0.00 0.00 0.00 C: 16.00 0.00 20.00 0.00 0.00 0.00 G: 0.00 1.00 0.00 20.00 0.00 20.00 T: 0.00 0.00 0.00 0.00 20.00 0.00
And the consensus sequence?
print(arnt.consensus)
CACGTG
We've touched on this idea in previous lectures--the idea that not all mismatches in sequence alignments are created equal--but BioPython has this concept built-in.
For one, we can normalize the counts to be probabilities instead, allowing you to compare multiple motifs:
print(arnt.counts.normalize())
0 1 2 3 4 5 A: 0.20 0.95 0.00 0.00 0.00 0.00 C: 0.80 0.00 1.00 0.00 0.00 0.00 G: 0.00 0.05 0.00 1.00 0.00 1.00 T: 0.00 0.00 0.00 0.00 1.00 0.00
An interesting question with normalized counts: what happens if the count is 0? Is the probability of seeing that nucleotide at that position also 0?
(Hint as to where this is going: how do you compute the probabilities of a bunch of statistically independent events? what happens if one of those events has a probability of 0?)
It probably doesn't have a probability of 0, but it may be close to 0. In machine learning, there's a concept known as "pseudocounting" in which a very small number of counts are given to variables whose probabilities we want to keep close to 0, but not exactly 0.
We can do that in BioPython, too:
print(arnt.counts.normalize(pseudocounts = 0.8))
0 1 2 3 4 5 A: 0.21 0.85 0.03 0.03 0.03 0.03 C: 0.72 0.03 0.90 0.03 0.03 0.03 G: 0.03 0.08 0.03 0.90 0.03 0.90 T: 0.03 0.03 0.03 0.03 0.90 0.03
Finally, we can use BioPython to search for motifs in larger sequences.
from Bio import SeqIO
from Bio.Seq import Seq
largeseq = SeqIO.read('bnip3.fasta', 'fasta', arnt.alphabet).seq # Load with same alphabet as motif
smallseq = Seq('AAACCCACGTGACTATATA')
We can search for exact matches:
exact_matches = arnt.instances.search(smallseq)
for match in exact_matches:
position = match[0]
sequence = match[1]
print(position, sequence)
5 CACGTG
exact_matches = arnt.instances.search(largeseq)
for match in exact_matches:
position = match[0]
sequence = match[1]
print(position, sequence)
3452 CACGTG 4058 CACGTG 6181 AACGTG 8591 CGCGTG 10719 CACGTG 10998 CACGTG
We can also use a PSSM to search for "fuzzy" matches--instances where we don't find an exact match of the thing we're looking for, but rather something that's "close" (for some definition of "close").
Basically, this makes use of the counts matrix we saw before.
If we're looking for a specific motif that isn't found exactly, we can specify a count threshold at which mismatches in certain positions are allowed to count toward search hits.
Here's an example:
# First, normalize to probabilities and use pseudocounts.
pwm = arnt.counts.normalize(pseudocounts = 0.8)
# Next, take the log of all the numbers.
pssm = pwm.log_odds()
# Now, run a search for the motif, using the scoring matrix and a threshold.
results = []
threshold = 4
fuzzy_matches = pssm.search(largeseq, threshold)
for match in fuzzy_matches:
results.append(match)
print(len(results), "fuzzy matches found with threshold of", threshold)
167 fuzzy matches found with threshold of 4
The score is a $log_2$ likelihood, so a score of 4 is $2^4=16$ times more likely to occur as part of the motif than as part of the (assumed uniform random) background.