CSE 382 Algorithms and Data Structures

Barbara Nostrand, Ph.D.

Electrical Engineering and Computer Science


Laboratory 10 - Bioinformatics

Matching Gene Sequences

DNA can be represented digitally by long strings of A, C, G, and T.  One of the most simplistic tests to determine genetic similarity is the longest subsequence of nucleotides common to both species.  Determining this similarity value is of great use to bioinformatic scientists to determine position on an evolutionary tree.  Global String Alignment Given two strings,  find their best alignment.  Suppose our strings are "AGCGT" and "CAGT".  An alignment consists of writing the strings,  in order,  so that the letters line up in some way.  If the strings have different lengths,  as is the case here,  then we will need to insert some "gap symbol" in the shorter string.  We use "-" as our gap symbol. 

AGCGT
-CAGT
AGCGT
C-AGT
AGCGT
CA-GT
AGCGT
CAG-T
AGCGT
CAGT-
AGCG-T
C--AGT
-AGCGT
C--AGT
AGC-GT
--CAGT

Sometimes we will insert gap symbols just to make things line up better,  regardless of the lengths.  Which alignment is the best?  A scoring criterion is used to evaluate the quality of an alignment. 

It is a common problem in computer science to determine the longest common substring of two strings, particularly useful in web search engines.  In bioinformatics, the largest subsequence of two strings is far more useful (and difficult) to determine.  It is simple to write a brute-force program to solve the problem.  However, DNA strings can be millions of characters long, and thus a brute-force method could take hours (or even years) to give results for real-world data.  This assignment implements the dynamic programming method of efficiently solving the longest common subsequence problem in O(n2) time.

Introduction

The term "Dynamic programming" refers to the method of finding optimal solutions to a large problem by solving several smaller problems and keeping track of those smaller solutions, usually in order to reuse them. 

This assignment is a review of your skills as a programmer using the string and two-dimensional array data structures.

                  functions.h header file for functions

                  functions.cpp implementation of the functions

                  main.cpp main program implementation that calls the functions

Implementation

For this assignment only,  you may choose between C,  C++,  and Java as your implementation language.  Please discuss why you chose your implementation language in your documentation for this project. 

Grading

You will be graded on the following criteria:

Extra credit will be given for including the following features:

Algorithm Pseudocode

Step 1:  initialize variables

      string s = ' ' + string1

      string t = ' ' + string2

      Twidth = s.length

      Theight = t.length

      int T[Twidth][Theight]

      for (i=0; i<Twidth; i++)

            T[i,0] = 0

      end-for

      for (j=0; j<Theight; j++)

            T[0,j] = 0

      end-for

Step 2:  Run LCS algorithm

      for (j=1; j<Theight; j++)

            for (i=1; i<Twidth; i++)

                  if (s[i] == t[j])   // match in subsequence

                        T[i,j] = T[i-1,j-1] + 1

                  else                // no match, use previous value

                        T[i,j] = max (T[i-1,j], T[i,j-1])

                  end-if

            end-for

      end-for

      subsequenceLength = T[Twidth][Theight] 

 

Testing

Test your program on the following sets of input data.

 

Input Set #1

String1: ATCCGACAAC

String2: ATCGCATCTT

Output: 7 (ATCGCAC) 

Input Set #2

String1: AACGTTCOGMA

String2: GGATACCASAT

Output: errors in String1 (aacgttcOgMa)

      error in String 2 (ggataccaSat) 

Input Set #3

String1: AAAATTTT

String2: TAAATG

Output: 4 (AAAT) 

Input Set #4

String1: TAGTAGTAGTAGTAGTAG

String2: CATCATCATCATCA

Output: 8 (ATATATAT)  or  8 (CACACACA)

Example

Rule for T[i,j]:

      if (string1[i] == string2[j])

      otherwise

   string1 = "AAAATTTT"

   string2 = "TAAATG" 
 
 

string1 = "ATCCGACAAC"

string2 = "ATCGCATCTT" 
 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1 
 

+1

Phrases you should now understand:

Dynamic Programming. 

CollaborationYou will complete this project with your lab partner. 
ImplementationYou must implement your system as a Java application.  You may use Jgrasp.  Do not use netbeans or produce a Java applet. 
EnvironmentYour program must execute on the podium computer in our classroom.
Turn inYou must turn in both a report containing complete program sources and an executable electronic version of your program.

Last modified: 2008 MAR 31
bnostran@syr.edu