# 2.4: Dynamic Programming


Before proceeding to a solution of the sequence alignment problem, we first discuss dynamic programming, a general and powerful method for solving problems with certain types of structure.

## Theory of Dynamic Programming

Dynamic programming may be used to solve problems with:

1. Optimal Substructure: The optimal solution to an instance of the problem contains optimal solutions to subproblems.
2. Overlapping Subproblems: There are a limited number of subproblems, many/most of which are repeated many times.

Dynamic programming is usually, but not always, used to solve optimization problems, similar to greedy algorithms. Unlike greedy algorithms, which require a greedy choice property to be valid, dynamic programming works on a range of problems in which locally optimal choices do not produce globally optimal results. Appendix 2.11.3 discusses the distinction between greedy algorithms and dynamic programming in more detail; generally speaking, greedy algorithms solve a smaller class of problems than dynamic programming.

In practice, solving a problem using dynamic programming involves two main parts: Setting up dynamic programming and then performing computation. Setting up dynamic programming usually requires the following 5 steps:

1. Find a ’matrix’ parameterization of the problem. Determine the number of dimensions (variables).
2. Ensure the subproblem space is polynomial (not exponential). Note that if a small portion of subproblems are used, then memoization may be better; similarly, if subproblem reuse is not extensive, dynamic programming may not be the best solution for the problem.
3. Determine an effective traversal order. Subproblems must be ready (solved) when they are needed, so the order of computation matters.

4. Determine a recursive formula: A larger problem is typically solved as a function of its subparts.

5. Remember choices: Typically, the recursive formula involves a minimization or maximization step. Moreover, a representation for storing traceback pointers is often needed, and the representation should be polynomial.

Once dynamic programming is set up, computation is typically straightforward:
1. Systematically fill in the table of results (and usually traceback pointers) and find an optimal score.

2. Traceback from the optimal score through the pointers to determine an optimal solution.

## Fibonacci Numbers

The Fibonacci numbers provide an instructive example of the benefits of dynamic programming. The Fibonacci sequence is recursively defined as F(0) = F(1) = 1 and F(n) = F(n−1) + F(n−2) for n ≥ 2. We develop an algorithm to compute the nth Fibonacci number, and then refine it first using memoization and later using dynamic programming to illustrate key concepts.

### The Naïve Solution

The simple top-down approach is to just apply the recursive definition. Listing 1 shows a simple Python implementation.
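The listing itself is not reproduced in this excerpt; a minimal sketch of such a naïve implementation (the function name `fib` is illustrative) might be:

```python
def fib(n):
    """Naive top-down computation of the nth Fibonacci number.

    Directly applies the recursive definition, recomputing each
    subproblem many times (exponential running time).
    """
    if n < 2:
        return 1  # base cases: F(0) = F(1) = 1
    return fib(n - 1) + fib(n - 2)
```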

But this top-down algorithm runs in exponential time. That is, if T(n) is how long it takes to compute the nth Fibonacci number, we have \(T(n) = T(n-1) + T(n-2) + O(1)\), so \(T(n) = O(\varphi^{n})\)9. The problem is that we are repeating work by solving the same subproblem many times.

### The Memoization Solution

A better solution that still utilizes the top-down approach is to memoize the answers to the subproblems. Listing 2 gives a Python implementation that uses memoization.
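Listing 2 is likewise not reproduced in this excerpt; one possible memoized implementation (names are illustrative) is sketched below:

```python
def fib_memo(n, memo=None):
    """Top-down Fibonacci with memoization.

    Answers to subproblems are cached in `memo`, so each F(k) is
    computed at most once, giving O(n) running time.
    """
    if memo is None:
        memo = {}
    if n < 2:
        return 1  # base cases: F(0) = F(1) = 1
    if n not in memo:
        memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
    return memo[n]
```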

Note that this implementation now runs in T(n) = O(n) time because each subproblem is computed at most once.

### The Dynamic Programming Solution

For calculating the nth Fibonacci number, instead of beginning with F(n) and using recursion, we can start computation from the bottom since we know we are going to need all of the subproblems anyway. In this way, we will omit much of the repeated work that would be done by the naïve top-down approach, and we will be able to compute the nth Fibonacci number in O(n) time.

As a formal exercise, we can apply the steps outlined in section 2.4.1:

1. Find a ’matrix’ parameterization: In this case, the matrix is one-dimensional; there is only one parameter to any subproblem F(x).
2. Ensure the subproblem space is polynomial: Since there are only n − 1 subproblems, the space is polynomial (in fact, linear).
3. Determine an effective traversal order: As mentioned above, we will apply a bottom-up traversal order (that is, compute the subproblems in ascending order).
4. Determine a recursive formula: This is simply the well-known recurrence F(n) = F(n−1) + F(n−2).
5. Remember choices: In this case there is nothing to remember, as no choices are made in the recursive formula.
Listing 3 shows a Python implementation of this approach.
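Listing 3 is absent from this excerpt; a constant-space, bottom-up sketch consistent with the description that follows (names are illustrative) might be:

```python
def fib_dp(n):
    """Bottom-up Fibonacci in O(n) time and O(1) space.

    Subproblems are computed in ascending order; since each answer
    is needed only once, only the two most recent values are kept.
    """
    prev, curr = 1, 1  # F(0) and F(1)
    for _ in range(n - 1):
        prev, curr = curr, prev + curr
    return curr
```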

This method is optimized to only use constant space instead of an entire table since we only need the answer to each subproblem once. But in general dynamic programming solutions, we want to store the solutions to subproblems in a table since we may need to use them multiple times without recomputing their answers. Such solutions would look somewhat like the memoization solution in Listing 2, but they will generally be bottom-up instead of top-down. In this particular example, the distinction between the memoization solution and the dynamic programming solution is minimal as both approaches compute all subproblem solutions and use them the same number of times. In general, memoization is useful when not all subproblems will be computed, while dynamic programming saves the overhead of recursive function calls, and is thus preferable when all subproblem solutions must be calculated10. Additional dynamic programming examples may be found online [7].

## Sequence Alignment using Dynamic Programming

We are now ready to solve the more difficult problem of sequence alignment using dynamic programming, which is presented in depth in the next section. Note that the key insight in solving the sequence alignment problem is that alignment scores are additive. This allows us to create a matrix M indexed by i and j, which are positions in two sequences S and T to be aligned. The best alignment of S and T corresponds to the best path through the matrix M after it is filled in using a recursive formula.

By using dynamic programming to solve the sequence alignment problem, we achieve a provably optimal solution that is far more efficient than brute-force enumeration.
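As an illustration of the matrix formulation, the sketch below fills such a matrix M for a simple global alignment; the scoring values (match = 1, mismatch = gap = −1) and the function name are illustrative assumptions, not taken from the text:

```python
def align_score(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the dynamic programming matrix M for global alignment.

    M[i][j] is the best score for aligning s[:i] with t[:j].
    The scoring scheme (match/mismatch/gap) is illustrative only.
    """
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):   # aligning a prefix of s against nothing
        M[i][0] = i * gap
    for j in range(1, n + 1):   # aligning nothing against a prefix of t
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = M[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # best of: match/mismatch, gap in t, gap in s
            M[i][j] = max(diag, M[i - 1][j] + gap, M[i][j - 1] + gap)
    return M[m][n]
```

Here M[i][j] holds the best score for aligning the first i characters of S with the first j characters of T; a traceback through stored pointers (omitted in this sketch) would recover the alignment itself.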

9 \(\varphi\) is the golden ratio, i.e. \(\frac{1+\sqrt{5}}{2}\).

10 In some cases dynamic programming is virtually the only acceptable solution; this is the case in particular when dependency chains between subproblems are long: the memoization-based solution then recurses too deeply and causes a stack overflow.

## Dynamic programming

Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.

In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. While some decision problems cannot be taken apart this way, decisions that span several points in time do often break apart recursively. Likewise, in computer science, if a problem can be solved optimally by breaking it into sub-problems and then recursively finding the optimal solutions to the sub-problems, then it is said to have optimal substructure.

If sub-problems can be nested recursively inside larger problems, so that dynamic programming methods are applicable, then there is a relation between the value of the larger problem and the values of the sub-problems [1]. In the optimization literature this relationship is called the Bellman equation.

## DLS Theory

The theory of DLS can be introduced using a model system of spherical particles in solution. According to Rayleigh scattering theory (Figure 1), when the particles in a sample have diameters smaller than the wavelength of the incident light, each particle diffracts the incident light in all directions, with an intensity \(I\) given by Equation \ref{1}, where \(I_0\) and \(\lambda\) are the intensity and wavelength of the unpolarized incident light, \(R\) is the distance to the particle, \(\theta\) is the scattering angle, \(n\) is the refractive index of the particle, and \(r\) is the radius of the particle.

\[ I = I_0 \frac{1 + \cos^2 \theta}{2 R^2} \left( \frac{2\pi}{\lambda} \right)^4 \left( \frac{n^2 - 1}{n^2 + 2} \right)^2 r^6 \label{1} \]

If that diffracted light is projected as an image onto a screen, it will generate a "speckle" pattern (Figure 2): the dark areas represent regions where the diffracted light from the particles arrives out of phase and interferes destructively, and the bright areas represent regions where the diffracted light arrives in phase and interferes constructively.

Figure 2: Typical speckle pattern. A photograph of an objective speckle pattern: the light field formed when a laser beam was scattered from a plastic surface onto a wall. (Public domain; Epzcaw.)

In practice, particle samples are normally not stationary but move randomly due to collisions with solvent molecules, as described by Brownian motion (Equation \ref{2}), where \(\overline{(\Delta x)^2}\) is the mean squared displacement in time \(t\) and \(D\) is the diffusion constant, which is related to the hydrodynamic radius \(a\) of the particle according to the Stokes-Einstein equation (Equation \ref{3}), where \(k_B\) is the Boltzmann constant, \(T\) is the temperature, and \(\mu\) is the viscosity of the solution. Importantly, for a system undergoing Brownian motion, small particles diffuse faster than large ones.

\[ \overline{(\Delta x)^2} = 2 D t \label{2} \]

\[ D = \frac{k_B T}{6 \pi \mu a} \label{3} \]

As a result of the Brownian motion, the distance between particles is constantly changing, and this results in a Doppler shift between the frequency of the incident light and the frequency of the scattered light. Since the distance between particles also affects the phase overlap/interference of the diffracted light, the bright and dark spots of the "speckle" pattern fluctuate in intensity as a function of time as the particles change position with respect to each other. Because the rate of these intensity fluctuations depends on how fast the particles are moving (smaller particles diffuse faster), information about the size distribution of particles in the solution can be acquired by processing the fluctuations of the intensity of the scattered light. Figure 3 shows the hypothetical fluctuation of scattering intensity for larger and smaller particles.

Figure 3: Hypothetical fluctuation of scattering intensity of larger particles and smaller particles.

In order to mathematically process the fluctuation of intensity, several principles and terms must be understood. First, the intensity correlation function is used to describe the rate of change in scattering intensity by comparing the intensity \(I(t)\) at time \(t\) to the intensity \(I(t+\tau)\) at a later time \(t+\tau\); it is quantified and normalized by Equations \ref{4} and \ref{5}, where the angle brackets indicate averaging over \(t\).

\[ G_2(\tau) = \langle I(t)\, I(t+\tau) \rangle \label{4} \]

\[ g_2(\tau) = \frac{\langle I(t)\, I(t+\tau) \rangle}{\langle I(t) \rangle^2} \label{5} \]

Second, since it is not possible to know how each particle moves from the fluctuation, the electric field correlation function is instead used to correlate the motion of the particles relative to each other; it is defined by Equations \ref{6} and \ref{7}, where \(E(t)\) and \(E(t+\tau)\) are the scattered electric fields at times \(t\) and \(t+\tau\).

\[ G_1(\tau) = \langle E(t)\, E(t+\tau) \rangle \label{6} \]

\[ g_1(\tau) = \frac{\langle E(t)\, E(t+\tau) \rangle}{\langle E(t)\, E(t) \rangle} \label{7} \]

For a monodisperse system undergoing Brownian motion, \(g_1(\tau)\) will decay exponentially with a decay rate \(\Gamma\), which is related by Brownian motion to the diffusivity by Equations \ref{8}, \ref{9}, and \ref{10}, where \(q\) is the magnitude of the scattering wave vector (\(q^2\) reflects the distance the particle travels), \(n\) is the refractive index of the solution, and \(\theta\) is the angle at which the detector is located.

\[ g_1(\tau) = e^{-\Gamma \tau} \label{8} \]

\[ \Gamma = D q^2 \label{9} \]

\[ q = \frac{4 \pi n}{\lambda} \sin \frac{\theta}{2} \label{10} \]

For a polydisperse system, however, \(g_1(\tau)\) can no longer be represented as a single exponential decay and must instead be represented as an intensity-weighted integral over a distribution of decay rates \(G(\Gamma)\) by Equation \ref{11}, where \(G(\Gamma)\) is normalized as in Equation \ref{12}.

\[ g_1(\tau) = \int_0^\infty G(\Gamma)\, e^{-\Gamma \tau} \, d\Gamma \label{11} \]

\[ \int_0^\infty G(\Gamma) \, d\Gamma = 1 \label{12} \]

Third, the two correlation functions above can be equated using the Siegert relation, which rests on the assumption that the scattered light is a Gaussian random process (which it usually is); it can be expressed as Equation \ref{13}, where \(\beta\) is a factor that depends on the experimental geometry and \(B\) is the long-time value of \(g_2(\tau)\), referred to as the baseline and normally equal to 1. Figure 4 shows the decay of \(g_2(\tau)\) for a small-size sample and a large-size sample.

\[ g_2(\tau) = B + \beta \, g_1(\tau)^2 \label{13} \]

Figure 4: Decay of \(g_2(\tau)\) for a small-size sample and a large-size sample. Malvern Instruments Ltd., Zetasizer Nano Series User Manual, 2004. Copyright: Malvern Instruments Ltd. (2004).

When determining the size of particles in solution using DLS, \(g_2(\tau)\) is calculated from the time-dependent scattering intensity and converted through the Siegert relation to \(g_1(\tau)\), which is usually an exponential decay or a sum of exponential decays. The decay rate \(\Gamma\) is then determined mathematically from the \(g_1(\tau)\) curve (discussed in a later section), after which the diffusion constant \(D\) and the hydrodynamic radius \(a\) are easily calculated.
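As a rough sketch of this final conversion step (all names and numerical inputs are illustrative assumptions), a decay rate can be turned into a hydrodynamic radius using \(q = (4\pi n/\lambda)\sin(\theta/2)\), \(D = \Gamma/q^2\), and the Stokes-Einstein relation:

```python
import math

def hydrodynamic_radius(gamma, wavelength, theta, n, temperature, viscosity):
    """Convert a measured decay rate Gamma (1/s) to a hydrodynamic radius (m).

    Uses q = (4*pi*n/lambda) * sin(theta/2), D = Gamma / q**2, and the
    Stokes-Einstein relation a = kB*T / (6*pi*mu*D). All quantities in SI units.
    """
    k_b = 1.380649e-23  # Boltzmann constant, J/K
    q = (4 * math.pi * n / wavelength) * math.sin(theta / 2)
    diffusion = gamma / q ** 2
    return k_b * temperature / (6 * math.pi * viscosity * diffusion)

# Illustrative inputs: 633 nm laser, 90-degree detector, water at 25 C
radius = hydrodynamic_radius(gamma=5.0e3, wavelength=633e-9,
                             theta=math.pi / 2, n=1.33,
                             temperature=298.0, viscosity=8.9e-4)
```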

## Week 10: Cloud computing and course wrap up

• Overview video: Launching an AWS EC2 instance.
• Section 1 walk-through: Starting your own computer in the cloud.
• Section 2 walk-through: Accessing and using your AWS instance.
• AWS Console URL: https://awsed.ucsd.edu/.
• Video live-stream interview with leading bioinformatics and genomics scientists from industry, including Dr. Ali Crawford (Associate Director, Scientific Research, Illumina Inc.), Dr. Bjoern Peters (Full Professor and Principal Investigator, La Jolla Institute), and Dr. Ana Grant (Director of Research Informatics, Synthetic Genomics Inc.).
• Biological network analysis
• Cancer genomics
• Unix for Bioinformatics
• Structural Bioinformatics and computational drug design
• Introduction to the tidyverse
• Bioinformatics and genomics in industry and beyond