Understanding Global Sequence Alignment and the Needleman-Wunsch Algorithm

Bagavan Marakathalingasivam
15 min readNov 15, 2022

--

Most of the examples used in this article were taken from this Computational Biology Textbook, but formatted/further expanded by me.

Source

A lot of us know that we humans have evolved from apes, but although we’re quite different from them, our DNA is almost identical to its other descendants. Take chimpanzees, for example, they don’t look anything like us humans yet their DNA is almost 99% the same as ours! What’s crazy is that 1% different accounts for over 30 million different genetic variations. Computational biologists were curious to see if they could figure out a way to model how the evolution of chimps led to humans through genomic information. But there was a major problem…

You see, DNA stores the genetic information of an organism, it’s composed of just four nucleotides: Adenine, Cytosine, Thymine and Guanine (ACTG), and they are the basis of every living organism. These sets of genetic “letters” are known as base pairs, and in humans, around 3.2 billion base pairs make up the human genome.

And with the millions of possible ways, our DNA could’ve evolved from a Chimp’s DNA, scaling any method would be next to impossible. That’s why computational biologists discovered a new method to be able to effectively compare gene sequences at an extremely faster rate. This method was known as sequence alignment, more specifically the Needleman-Wunsch Algorithm.

Understanding Sequence Alignment

Source

Sequence alignment is a method that is used to compare two or more genetic strands of DNA or RNA to help in finding differences and commonalities in the genome from strand evolution. More specifically, we can take these insights and look at them from an evolutionary perspective (Orghologist) or by finding similar functions within genes (Parlogs).

After sequencing genetic information, you can perform two different types of alignment — Global alignment or Local Alignment

  • Global Alignment: When you use the whole strand to find overall commonalities and differences within the strands
  • Local Alignment: When you use a select region of sequences that contain similar patterns for a larger context

In this article, we’re going to focus on global alignment and see how we can implement it.

Since genetic information changes significantly over time, the lack of ancient genomes makes it nearly impossible to compare the genomes of living species with their ancestors. That’s why we would only be limited to comparing genomes of living descendants and finding changes that affect a genome.

Since genetic sequences tend to be fairly large, it’ll be a pain to compare two or more by hand, which is why we use different computation techniques to perform this task. To approach this from a computational lens, however, we must make certain assumptions to successfully perform sequence alignment. For example, sequence evolution mainly occurs due to three distinct operations:

  1. Nucleotide Mutation — when nucleotides in a sequence change to a completely different one (ACTGT → ACTAT)
  2. Nucleotide Deletion — when a nucleotide in the sequence is removed from it (ACTGT → ACTT)
  3. Nucleotide Insertion — when a nucleotide is added to the sequence from evolution (ACTGT → ACTGTA)

Since these changes in our sequence occur over long periods, there are millions of possibilities that could’ve resulted in the new sequence, which is why we need to find a way to pick (evaluate for) the “best’ series of events that can describe the change our genome went through. That’s why we would need to apply a specific optimality criterion — a set of conditions our alignment must satisfy — to find the most optimal series of events.

Additionally, we also have to take into consideration the probability of each operation to occur as mutations are more likely to happen than deletions and insertions. We can do this by implementing a substitution matrix and a gap penalty, but before we get into that let’s break down how we’re going to go about solving this problem by breaking it down.

Problem Formulations

Longest Common Substring

If we treat our sequences as strings consisting of letters that represent our nucleotides (ACTG), we can align them by finding the longest common substring between these sequences. For example, if we take these two sequences:

  • Sequence 1: ACGTCGATC
  • Sequence 2: ACGTCTTC

The longest common substring between these two sequences would be “ACGTC”.

The issue with this method is that we will only find constant similarities between two sequences, and won’t be able to determine how the rest of the sequence changed (which is our main intention).

Longest Common Subsequence (LCS)

The longest common subsequence is similar to the previous formulation, except this time we can allow for gaps in the subsequence we want to align. In this problem, we’re given two sequences x and y and want to find the maximum length of the common subsequences z:

Sequence 1: ACGTCATCA

Sequence 2: TAGTGTCA

The LCS for these two sequences would be “AGTTCA”. Introducing gaps gets us closer to sequence alignment, but we still aren't fully there yet.

Sequence Alignment as Edit Distance

If we go back to when we were talking about evaluating our alignment, we talked about how the three different edit operations have different probabilities of occurring. To add, the nucleotide changes also have probabilities of which nucleotide they would switch with. Since Adenine and Glutamine are both purine and Cytosine and Thymine are pyrimidines, the DNA polymerase is more likely to change between two purines or two pyrimidines since they have similar structures. To support this, we would need to develop a cost function that has more bias towards certain operations.

To help visualize this, we can create a cost matrix that adds a penalty for each possible mutation to find the most likely mutations that occur.

Varying Gap Cost Models

The energy required to create a gap is a lot more than the energy of extending an already existing gap, this means that our model would have to account for this variation.

We can do this with any of the following:

  • A linear gap penalty: has a fixed cost for all gaps
  • An affine gap penalty: consists of a high initial cost for opening a gap, then smaller costs for each extension
  • A general gap penalty: Allows for any cost function
  • Frame-aware gap penalty: Takes into account disruption to the coded frame (e.g., frameshifts which cause important modifications)

But… there’s still a problem

Even though we have all the necessary steps to perform sequence alignment, writing the program to compute it is going to be challenging. If you have some experience with programming, your first thought might be to enumerate/list all possible alignments, then evaluate each one to find the best answer. Although this works, it’s extremely computationally expensive as the number of possibilities increases.

An example is if you have two sequences with a length of 1000 nucleotides, the number of possible alignments exceeds the number of atoms in our observable universe….

Simply put, we can’t use a brute-force algorithm to go through every possibility, we need to find a new approach.

Needleman-Wunsch Algorithm

The Needleman-Wunsch Algorithm tackles this problem by using something known as dynamic programming. In essence, this concept involves breaking a problem down into different subproblems, solving each subproblem, and then storing the results to avoid re-computing (which is a big reason for larger runtimes in classical computing).

The whole idea is given two sequences: we’d want to find the longest common subsequence. Instead of maximizing the length of our common subsequence, we would need to find the most common subsequence that optimizes the score based on what we talked about earlier.

Before we go deeper into complicated jargon, let’s break down how this algorithm would work at a high-level, by doing it ourselves.

The first thing we need to do is set up scoring parameters so that we can have more biologically accurate reflections of the transition between our sequences. To do this, we need to find a gap penalty, a match bonus and a mismatch penalty, for us we’ll have them as -1, 1 and -1 respectively.

To better visualize what’s going on, we’ll draw a table where the first sequence is the row labels and the second sequence is the column labels.

Step 1: Initial Gap Penalty Values

The first step is to fill out the first row/columns of our table. We can start by adding a 0 in the upper left of the table (where there are no nucleotides) and then fill the rows and columns with our gap penalty. This means that we would subtract 2 from the previous number as we go down/to the right of our table.

Step 2: Scoring our Cells

Now we need to fill out the rest of the cells, and to do that we start by going from left to right following a couple of rules. To calculate the score, we need to find the 3 different possible moves that can lead to that cell (either down, right or diagonal).

Let’s start by moving to the right, which would represent an insertion. Since we are going to the right, we take the previous score (from the cell to the left)and add our gap penalty (-1) to it — this is because an insertion occurring will create a gap in the sequence.

We would then do the same thing moving down as it also represents an insertion. The only difference is that we will use the value from the cell directly above our current cell, and then add our penalty.

Finally, when moving diagonally down, we need to first see if the two nucleotides we are scoring are a match or a mismatch. If they are a match, we would add our match bonus (1), and if it’s a mismatch we would add our mismatch bonus (-1).

Now that we have our 3 possibilities, we take the max value of these three options and fill our cell in with that value

How our table would look like after completing the first row

We then continue to repeat this process for each row to fill out our table.

Step 3: Traceback Array

Ok so we have our table filled out but… what now? Well, since we want to find the optimal alignment we would trace back the values from our cells and choose the best ones for our alignment. We start from the last cell (bottom right) and draw arrows to represent the best “path” to get back to the original cell. If the nucleotide from the two cells you are looking at is matching, you would go diagonally up and to the right. If they don’t match, you would choose the highest value from the three surrounding cells (so essentially, we’re doing the reverse of what we just did to fill out the table).

These arrows now tell us the longest common subsequence, you see, if the arrow is pointing diagonally to a larger number, there is a mismatch, if it is a smaller number there is a match, and if it is pointing either to the left or up it’s a gap. By taking these arrows, we can create our optimal aligned sequence, and then do whatever we want with it.

Implementing our Algorithm

Now that we have a better understanding of how it works, let’s go through how we can implement our NW algorithm in python.

Feel free to check out the full source code here

Setting Everything Up

Before we get into our actual algorithm, we need to first do a couple of things to make it a lot easier. This includes importing the necessary libraries and visualizing our tables.

Importing Libraries

import numpy as np
from IPython.display import HTML, display
import pandas as pd

The first step is to import the necessary libraries to build this out. Numpy is the main library we’re going to be using, as it can easily create and work with arrays and perform mathematical functions fairly easily. The IPython library is only used to help visualize the tables a lot easier, and pandas are used to set our data frame (the array) and import any external data.

Setting up our display function

def nice_table(data_array, row_labels, col_labels):
df = pd.DataFrame(data_array, index=row_labels, columns=col_labels)
table_html = df.to_html()
return HTML(table_html)

# Testing our Function
seqA = 'GCATGCT'
seqB = 'GATACCA'


row_labels = [label for label in "-"+seqA]
column_labels = [label for label in "-"+seqB]

display(nice_table(scoring_array, row_labels, column_labels))

The first part of this code sets up our nice_table function, to help us make the arrays that we display look pretty (which makes it a lot easier to understand what’s going on). The next part is used to just test our function by setting up two random sequences and displaying their name as row and column labels. Here’s what was outputted:

Now that we’ve set that up, we can go into our actual algorithm

Building our Scoring Array

num_rows = len(seqA) + 1
num_cols = len(seqB) + 1
row_labels = [label for label in "-"+seqA]
col_labels = [label for label in "-"+seqB]

scoring_array = np.full([num_rows, num_cols], 0)

# Setting up our scoring parameters
gap_penalty = -1
match_bonus = 1
mismatch_penalty = -1

# Iterating over each cell to apply our score
for row in range(n_rows):
for col in range(n_columns):
if row == 0 and col == 0:
# We're in the upper right corner
score = 0
elif row == 0:
# First row
#Look up the score of the previous cell (to the left) in the score array
previous_score = scoring_array[row,col - 1]
# add the gap penalty to it's score
score = previous_score + gap_penalty
elif col == 0:
# First column
previous_score = scoring_array[row -1,col]
score = previous_score + gap_penalty

This part of the for loops gives us the initial scoring of the first column and row, we can then use that for the rest of our cells. This next block of code is part of the for loop and will follow the calculations we discussed earlier to determine what to score each cell

         else: 
# Finding scores for the adjacent cells
left_cell = scoring_array[row, col-1]
up_cell = scoring_array[row-1, col]
diag_cell = scoring_array[row-1, col-1]

score_left = left_cell + gap_penalty
score_up = up_cell + gap_penalty
# Checking if there is a match/mismatch
if seqA[row-1] == seqB[col-1]:
score_diag = diag_cell + match_bonus
else:
score_diag = diag_cell + mismatch_penalty
# Choose the highest score as ours
score = max([score_left, score_up, score_diag

scoring_array[row, col] = score

Now that we’ve iterated through each cell, we can display it using our nice_table function

print("Scoring Array:")
display(nice_table(scoring_array, row_labels, column_labels))

Building a Traceback Array

Making our traceback array will be very simple, all we need to do is pick which way we want our arrow to point based on the max score.

# Using arrow emojis for traceback array
up_arrow = "⬆"
down_arrow = "⬇"
right_arrow = "➡"
left_arrow = "⬅"
right_down_arrow = "↘"
left_up_arrow = "↖"
arrow = "-" # placeholder

# Continue from our previous code
for row in range(num_rows):
....
score = max([score_left, score_up, score_diag])

if score == score_left:
arrow = left_arrow
elif score == score_up:
arrow = up_arrow
elif score == score_diag:
arrow = left_up_arrow

traceback_array[row, col] = arrow
scoring_array[row, col] = score

print("Traceback Array: ")
display(nice_table(traceback_array, row_labels, column_labels))

Here’s the output of our traceback array. We can see that if we follow the arrow from the last cell, it finds the most optimal path from this code, and so now we need to take this information and write out our optimal alignment.

def traceback_alignment(traceback_array, seqA, seqB, up_arrow=up_arrow, left_arrow=left_arrow, left_up_arrow=left_up_arrow, stop="-"):
num_rows = len(seqA) + 1
num_cols = len(seqB) + 1

row = len(seqA)
col = len(seqB)
arrow = traceback_array[row, col]
aligned_seqA = ""
aligned_seqB = ""
alignment_indicator = ""

We can start by creating a function traceback_alignment which has our traceback array and sequences as its parameters. After doing this, we move on to initializing a couple of variables such as aligned_seqA, aligned_seqB which will have our optimal alignment.

Now, we need to go through our traceback array and append the necessary nucleotides and gaps that represent our alignment.

while arrow != "-":
# print("Current Row:", row)
# print("Current Column:", col)
arrow = traceback_array[row, col]
# print("Arrow:", arrow)

# Insert indel into top sequence
if arrow == up_arrow:
# print("Insert indel into top sequence")
aligned_seqB = "-"+aligned_seqB
aligned_seqA = seqA[row-1] + aligned_seqA
alignment_indicator = " " + alignment_indicator
row -= 1
# Determines if there is a mismatch or match
elif arrow == left_up_arrow:
seqA_char = seqA[row-1]
seqB_char = seqB[col-1]
aligned_seqA = seqA[row-1] + aligned_seqA
aligned_seqB = seqB[col-1] + aligned_seqB
if seqA_char == seqB_char:
alignment_indicator = "|" + alignment_indicator
else:
alignment_indicator = " " + alignment_indicator
row -= 1
col -= 1
# Insert indel into sequence
elif arrow == left_arrow:
aligned_seqA = "-" + aligned_seqA
aligned_seqB = seqB[col-1] + aligned_seqB
alignment_indicator = " " + alignment_indicator
col -= 1

elif arrow == stop:
break
else:
raise ValueError(f"Traceback array entry at", row, ",", col, " isn't recognized as an arrow")

print(aligned_seqA)
print(alignment_indicator)
print(aligned_seqB)

return aligned_seqA, aligned_seqB

traceback_alignment(traceback_array, seqA, seqB)

Although this block might seem long, all we’re doing is iterating through each arrow and determining what operation took place based on it. If the arrow is a diagonal one, that means it is either a mismatch or a match. To determine which one it is, we simply compare the two with an = operation and if statements. If it is equal, then we add it to our alignment, otherwise, we leave a space.

To represent gaps, we see if the traceback arrow is pointing upwards or to the left, and add a to our aligned sequence.

Note: It’s also important to raise ValueError if there isn’t an arrow in our traceback array, as it could help in preventing potential mistakes that could occur.

The output of our program — the lines represent the matching nucleotides, whereas the dash represents gaps

Adding our Scoring Matrix

As I mentioned earlier, not all substitutions are equally common, which is why we need to add our scoring function for mismatches to develop a more accurate alignment algorithm. Transversions are when the purines mutate into pyrimidines, and transitions are when they switch with their class (purines with purines, pyrimidines with pyrimidines).

nucleotides = "AGCT"

nucleotide_indices = {nucleotide:i for i, nucleotide in enumerate(nucleotides)}

match_score = 1
transversion_score = -2
transition_score = -0.5

scoring_matrix = np.full([len(nucleotides), len(nucleotides)], transition_score)

chemical_class = {"A":"Purine", "T":"Pyrimidine", "C":"Pyrimidine","G":"Purine"}

We start by creating a dictionary of our nucleotides as well as giving scores for each potential indel.

for n1 in nucleotides:
for n2 in nucleotides:
n1_index = nucleotide_indices[n1]
n2_index = nucleotide_indices[n2]
if n1 == n2:
scoring_matrix[n1_index][n2_index] = match_score
continue

n1_class = chemical_class[n1]
n2_class = chemical_class[n2]

if n1_class == n2_class:
scoring_matrix[n1_index][n2_index] = transition_score
else:
scoring_matrix[n1_index][n2_index] = transversion_score

display(nice_table(scoring_matrix, row_labels=[n for n in nucleotides], col_labels=[n for n in nucleotides]))

This block iterates through our dictionary and creates our scoring matrix based on each class and the indel scores.

The output of our scoring matrix

We can then create a function to call in our algorithm.

def score_match(n1, n2, scoring_matrix, scoring_matrix_indices={"A":0, "G":1, "C":2, "T":3}):
# return score for substitution based on scoring matrix
return scoring_matrix[scoring_matrix_indices[n1], scoring_matrix_indices[n2]]

To integrate it into our algorithm, we just add the returned score value to our diagonal score (since that represents mutations)

diag_score = diag_cell + \
score_match(current_seqA, current_seqB, scoring_matrix)

Once you put everything together, your outputs should look something like this:

Where the aligned sequences should be: (‘GCATGCT-’, ‘G-ATACCA’)

And We’re Done

This algorithm was only tested on two randomly generated sequences, so the next steps you could take would be to try testing it out on already existing data and see if the alignment checks out. (For example, in this notebook I aligned a BRCA1 mutated gene sequence with its original gene).

This is only one of the many ways we can go about sequence alignment. The NW algorithm still comes with its limitations, for example, when handling multiple alignment (when there is more than 2 sequences) the amount of time it takes to compute the optimal length will increase 2ⁿ, where n represents the numbers of sequences. However, this isn’t a new space and we’ve been able to overcome a lot of challenges. This is also only one of the many things we can do with sequenced DNA information, and with new advances in genome sequencing starting to grow, the possibilities are going to be endless.

I can’t believe we’re already at the end of this article! I hope learned and built something new! If you have any questions, feel free to reach out to me on LinkedIn or Twitter :)

--

--