CBIO (CSCI) 4835/6835: Introduction to Computational Biology
Today, we'll cover our first true computational modeling algorithm: Hidden Markov Models (HMMs). These are very sophisticated techniques for learning and predicting information that comes to you in some kind of sequence, whether it's an amino acid primary structure, a time-lapse video of molecules being trafficked through cells, or the construction schedule of certain buildings on the UGA campus. By the end of this lecture, you should be able to:
CG
-islands and Casinos¶CG
is typically underrepresented, clocking in at a frequency considerably less than the "expected" $\frac{1}{16}$CG-islands
¶CG
is the least frequent dinucleotide (why?).
C
is easily methylated, after which it has a tendency to mutate into a T
.CG
will appear at relatively high frequencies within these CG
-islands.Finding CG
-islands, therefore, is an important biological problem!
Let's say you've wandered into a casino, hoping to make enough money to keep you from ever having to write another sequence alignment function using dynamic programming.
This is the Fair Bet Casino: the game is to flip two coins: a fair coin (F) and a biased coin (B).
How can we tell whether the dealer is using F or B?
Given: A sequence of coin flips, such as HHHHHHHHHH
or HTHTHTHTHT
.
We need to be able to "grade" the sequence of coin flips and assign the "most likely" corresponding sequence of Fair or Biased coins making each flip.
We are, in effect, decoding the observed sequence (Heads or Tails) to recover the underlying state (Fair or Biased).
Let's start with an easy case: the dealer never switches coins. Our problem can then be formalized as follows:
We can compute this directly!
Let's say we have
$P(x | F) = \prod_{i = 1}^n \frac{1}{2} = \left(\frac{1}{2}\right)^k$
So what would the probability using a Biased coin be?
$P(x | B) = \prod_{i = 1}^n \left(\frac{3}{4}\right)^k \left(\frac{1}{2}\right)^{n - k} = \frac{3^k}{4^n}$
So if we made a couple of examples...
Example 1: HTHTHT
has a length of 6, with 3 Heads and 3 Tails.
Which is more likely?
Example 2: HHHHHH
has a length of 6, with 6 Heads and 0 Tails.
Which is more likely?
Pause: any questions?
HMMs can be viewed as "machines" with $k$ hidden states.
Depending on what state the HMM is in, it can "emit" different outputs from an alphabet of all possible outputs--we'll call this alphabet $\Sigma$.
While in a certain state, the HMM asks itself two questions:
The hidden states (which give HMMs their name, Hidden Markov Models) are the key component and the part of HMMs that make building them tricky.
We don't know whether the dealer is using a Fair or Biased coin; we don't know whether this portion of an amino acid sequence is an $\alpha$-helix or a $\beta$-sheet; we don't know whether this portion of the genome is a CG
island.
The goal of decoding with HMMs is to make the best possible guess as to the sequence of hidden states, given the sequence of emitted symbols.
Its parameters are to an HMM what our genomes are to us: they're essentially the genetic makeup that distinguishes one HMM from another.
A
matrix¶This encodes the probability of switching between hidden states. Using our Fair Bet Casino example, there are only two hidden states, so the matrix will be $2 \times 2$.
$$ A = \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix} $$
E
matrix¶This matrix encodes the probabilities of emitting certain symbols, given that the HMM is in a certain hidden state. Using our Fair Bet Casino example, with two hidden states and two possible output symbols, this matrix is also $2 \times 2$.
$$ E = \begin{pmatrix} 0.5 & 0.5 \\ 0.25 & 0.75 \end{pmatrix} $$
Taken together, the parameters of our Fair Bet Casino HMM would look something like this:
Another perhaps more intuitive perspective would be to view an HMM as a "state machine": each possible state of the HMM is represented as a node in a graph, with arrows connecting the nodes that are weighted based on their probabilities.
The observable sequence of emitted symbols (coin flips, amino acids, nucleotides, etc) is one sequence. However, the corresponding sequence of hidden states is known as the hidden paths.
This is usually represented as $\vec{\pi} = \pi_1, ..., \pi_n$.
Consider the sequence $x =$ 01011101001
and the hidden path $\vec{\pi} =$ FFFBBBBBFFF
. This is how we would list out the probabilities:
This is the question you'll most often ask of HMMs: given a set of observed sequences, find the optimal hidden path that would have generated the observed sequences.
Formally, this is defined as the following:
Any ideas?
Andrew Viterbi used the the Manhattan edit graph model to solve the decoding problem!
Yep, dynamic programming is back!
Use the same matrix abstraction as before, except instead we only have one sequence of length $n$, and then the other axis is used for the $k$ hidden states.
It looks something like this:
So, instead of having one of 3 decisions at each vertex--insertion, deletion, or match/mutation--we have one of $k$ decisions:
Each possible path through the graph has probability $P(x | \pi)$.
The Viterbi algorithm finds the path that gives the maximal possible $P(x | \pi)$, over all possible paths through the graph.
This is the question you'll ask of HMMs second in frequency only to the decoding problem: given a set of observed sequences, compute the probability the system was in a certain state at a certain time.
Formally, this is defined as the following:
Any ideas?
Dunno, but this question is answered using a combination of two algorithms: the forward algorithm, and the backward algorithm.
First, let's define the forward probability as $f_{k, i}$: the probability of emitting a prefix $x_1, ..., x_i$ and reaching state $\pi = k$.
Put another way, this means
[Hopefully] Easy way to remember: we've fixed the first $i$ symbols, but we have to move forward to see the rest.
Next, we'll define the backward probability as $b_{k, i}$: the probability of being in state $\pi_i = k$, and emitting the suffix $x_{i + 1}, ..., x_n$.
Put another way, we're now looking at
[Hopefully] Easy way to remember: we've fixed the last $n - i$ symbols, but have to move backward to get the probability of state $k$.
Asking at any given coin flip whether the casino dealer is using a fair or biased coin is, in fact, a function of both the forward and the backward probabilities.
Formally:
(does this look familiar???)
Let's say we're interested in finding a distant cousin of functionally related protein sequences in a family.
This family has diverged so much that pairwise (e.g. Hamming) comparisons are weak, and may therefore fail a statistical significance test.
However, these weak similarities may show up across many members of the family, indicating a correlation.
The goal, then, is to align all members of the family at once.
A family of related proteins can be represented by their multiple alignment and the corresponding profile (hence, the name).
Aligned DNA sequences can be represented as a $4 \times n$ profile matrix, reflecting the frequencies of nucleotides in every aligned position.
Similarly, a protein family can be represented using a $20 \times n$ profile representing amino acid frequencies.
Multiple alignment of a protein family can show variations in conservation along the length of the protein:
A profile HMM, then, is a probabilistic representation of a multiple alignment.
This model can then be used to find and score less obvious potential matches of new protein sequences to find distant family members.
Consists of three states (any guesses?):
Our walk through the profile HMM is pretty much identical to a walk along the Manhattan edit graph:
In four easy steps!
1: Use BLAST to separate a protein database into families of related proteins.
2: Construct a multiple alignment for each protein family.
3: Construct a profile HMM and optimize the parameters of the model (transition and emission probabilities).
4: Align the target sequence (the one you're not sure about) against each profile HMM to find the best fit between the target sequence and a particular HMM
What problem is #4 solving?
Globins represent a large collection of protein sequences.
400 globin sequences were randomly selected from all globins and used to construct a multiple alignment.
Multiple alignment was used to train a profile HMM.
Another 625 globin proteins (not used to train the profile HMM) were selected, along with a few hundred other non-globin proteins chosen randomly.
Guess which group--the globins or non-globins--scored better against the profile HMM?
Great way to separate and identify families of proteins; in this case, separate (automatically!) globins and non-globins.