Find the length of DNA sequence?

Given $N=5\times10^3$ and a mutation rate of $\mu=10^{-5}$ per site, find the length of a DNA sequence such that the probability $M$ of a mutation occurring is greater than or equal to 0.95.

Is there a method or a formula for this type of calculation?

The mutation rate per haplotype per site is $\mu = 10^{-5}$. Assuming diploidy and a population size of $N=5000$, the population-wide mutation rate per site is $10^{-5} \times 5000 \times 2 = 0.1$.

$0.1$ is hence the probability that a mutation occurs at a given site (in the whole population). For 10 sites, the probability that a mutation occurs at one or more sites is $1 - (1-0.1)^{10} \approx 0.65$.

The probability we are aiming for is 0.95. So let's write the equation

$$1 - (1-0.1)^{x} = 0.95$$

where $x$ is the number of sites we are looking for. Solve for $x$ and round up to the next integer: taking logarithms gives $x = \ln(0.05)/\ln(0.9) \approx 28.4$, so 29 sites are needed.
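
The calculation can also be checked numerically. The following is a minimal Python sketch, assuming (as the answer does) that sites mutate independently and that the per-site probability is $2N\mu$. The function name `sites_needed` and its parameters are illustrative only.

```python
import math

def sites_needed(pop_size, mu, target, ploidy=2):
    """Smallest number of sites x with 1 - (1 - p)**x >= target,
    where p = ploidy * pop_size * mu is the per-site probability."""
    p = ploidy * pop_size * mu
    if not 0 < p < 1 or not 0 < target < 1:
        raise ValueError("p and target must lie strictly between 0 and 1")
    # (1 - p)**x <= 1 - target  <=>  x >= log(1 - target) / log(1 - p)
    return math.ceil(math.log(1 - target) / math.log(1 - p))

x = sites_needed(pop_size=5000, mu=1e-5, target=0.95)
print(x)                    # 29
print(1 - (1 - 0.1) ** x)   # slightly above 0.95
```

With 28 sites the probability is still just under 0.95, which is why the result has to be rounded up rather than to the nearest integer.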

  • Genome sequencing will greatly advance our understanding of genetic biology and has vast potential for medical diagnosis and treatment.
  • DNA sequencing technologies have gone through at least three “generations”: Sanger sequencing and Gilbert sequencing were first-generation, pyrosequencing was second-generation, and Illumina sequencing is next-generation.
  • Sanger sequencing is based on the use of chain terminators, ddNTPs, that are added to growing DNA strands and terminate synthesis at different points.
  • Illumina sequencing involves running up to 500,000,000 different sequencing reactions simultaneously on a single small slide. It makes use of a modified replication reaction and uses fluorescently-tagged nucleotides.
  • Shotgun sequencing is a technique for determining the sequence of entire chromosomes and entire genomes based on producing random fragments of DNA that are then assembled by computers which order fragments by finding overlapping ends.
  • DNA sequencing: a technique used in molecular biology that determines the sequence of nucleotides (A, C, G, and T) in a particular region of DNA
  • dideoxynucleotide: any nucleotide formed from a deoxynucleotide by loss of a second hydroxyl group from the deoxyribose group
  • in vitro: any biochemical process done outside of its natural biological environment, such as in a test tube, petri dish, etc. (from the Latin for “in glass”)

Dideoxy sequencing

Recall that DNA polymerases incorporate nucleotides (dNTPs) into a growing strand of DNA, based on the sequence of a template strand. DNA polymerases add a new base only to the 3’-OH group of an existing strand of DNA; this is why primers are required in natural DNA synthesis and in techniques such as PCR. Most of the currently used DNA sequencing techniques rely on the random incorporation of modified nucleotides called terminators. Examples of terminators are the dideoxy nucleotides (ddNTPs), which lack a 3’-OH group and therefore cannot serve as an attachment site for the addition of new bases to a growing strand of DNA (Figure 1). After a ddNTP is incorporated into a strand of DNA, no further elongation can occur. Terminators are labeled with one of four fluorescent dyes, each specific for one of the four nucleotide bases.

Figure 1: ddNTPs (Original-Deyholos-CC:AN)

To sequence a DNA fragment, you need many copies of that fragment (Figure 2). Unlike PCR, DNA sequencing does not amplify the target sequence and only one primer is used. This primer is hybridized to the denatured template DNA, and determines where on the template strand the sequencing reaction will begin. A mixture of dNTPs, fluorescently labeled terminators, and DNA polymerase is added to a tube containing the primer-template hybrid. The DNA polymerase will then synthesize a new strand of DNA until a fluorescently labeled nucleotide is incorporated, at which point extension is terminated. Because the reaction contains millions of template molecules, a corresponding number of shorter molecules is synthesized, each ending in a fluorescent label that corresponds to the last base incorporated.

Figure 2: A sequencing reaction begins with many identical copies of a template DNA fragment. The template is denatured, then primers are annealed to the template. Following the addition of polymerase, regular dNTPs, and fluorescently labeled terminators, extension begins at the primer site. Elongation proceeds until a fluorescently labeled terminator (shown here in color) is incorporated. (Original-Deyholos-CC:AN)

The newly synthesized strands can be denatured from the template, and then separated electrophoretically based on their length (Figure 3). Since each band differs in length by one nucleotide, and the identity of that nucleotide is known from its fluorescence, the DNA sequence can be read simply from the order of the colors in successive bands. In practice, the maximum length of sequence that can be read from a single sequencing reaction is about 700 bp.
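
The read-out logic can be illustrated with a small Python sketch. It is a simplification under stated assumptions: every possible stop position is represented by exactly one fragment, fragment length is counted from the end of the primer, and the strand `GATTACAGGC` and both function names are made up for illustration.

```python
import random

def terminated_fragments(new_strand):
    """One fragment per possible stop position: (length, terminal base)."""
    return [(n, new_strand[n - 1]) for n in range(1, len(new_strand) + 1)]

def read_sequence(fragments):
    """Sort fragments by length (shortest first) and read the terminal labels."""
    return "".join(label for _, label in sorted(fragments))

new_strand = "GATTACAGGC"              # made-up strand synthesized from the primer
fragments = terminated_fragments(new_strand)
random.Random(0).shuffle(fragments)    # in the tube, the fragments are all mixed together
print(read_sequence(fragments))        # GATTACAGGC
```

Sorting by length plays the role of electrophoresis here, and the terminal base stands in for the color of the fluorescent label.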

Figure 3: Fluorescently labeled products can be separated electrophoretically based on their length. (Original-Deyholos-CC:AN)

A particularly sensitive electrophoresis method used in the analysis of DNA sequencing reactions is called capillary electrophoresis (Figure 4). In this method, a current pulls the sequencing products through a gel-like matrix that is encased in a fine tube of clear plastic. As in conventional electrophoresis, the smallest fragments move through the capillary the fastest. As they pass through a point near the end of the capillary, the fluorescent intensity of each dye is read. This produces a graph called a chromatogram. The sequence is determined by identifying the highest peak (i.e. the dye with the most intense fluorescent signal) at each position.
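
As a rough illustration of reading a chromatogram, the sketch below picks the strongest dye signal at each position. The intensity values and the `call_bases` name are invented for the example; real base-calling software also has to deal with noise, peak spacing and overlapping signals, none of which is modelled here.

```python
def call_bases(trace):
    """Pick, at each position, the dye channel with the highest intensity."""
    return "".join(max(position, key=position.get) for position in trace)

# Hypothetical intensities for four consecutive peaks, one dict per position
trace = [
    {"A": 910, "C": 40, "G": 25, "T": 30},
    {"A": 35, "C": 20, "G": 15, "T": 870},
    {"A": 50, "C": 30, "G": 940, "T": 45},
    {"A": 20, "C": 800, "G": 60, "T": 35},
]
print(call_bases(trace))  # ATGC
```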

Figure 4: Fluorescently labeled products can be separated by capillary electrophoresis, generating a chromatogram from which the sequence can be read. (Wikipedia-Abizar Lakdawalla-PD)

Structure of DNA by Watson and Crick

Watson and Crick arrived at the structure of DNA after studying a manuscript by Linus Pauling and Robert Corey. In 1953, Pauling and Corey proposed a three-dimensional structure for nucleic acids, which turned out to be incorrect. Then, in early 1953, Watson and Crick combined the available data on the physical and chemical properties of DNA and proposed a double-helical structure. The main characteristics of the Watson and Crick model of DNA include:

Physical Properties of DNA

  • According to the Watson and Crick model, DNA is a double-stranded helix consisting of two polynucleotide chains. The two polynucleotide chains are helically twisted around each other, which gives the molecule a twisted ladder-like look.
  • The two polynucleotide strands of DNA have opposite polarities, which means that they run antiparallel, i.e. one in the 5’-3’ and the other in the 3’-5’ direction.
  • The diameter of the double-stranded DNA helix is 20Å.
  • The distance between two adjacent base pairs along the helix is 3.4Å. One full turn of the helix is 34Å long and contains 10 base pairs.
  • The DNA is twisted in a “Right-handed direction”, or we can say in a “Clockwise direction”.
  • The twisting of the DNA produces a wide indentation, the “Major groove”, and a narrow indentation, the “Minor groove”. Both grooves result from the coiling of the DNA, and they also act as binding sites for DNA-binding proteins.
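
These dimensions give a quick way to estimate the physical length of a DNA molecule from its number of base pairs. The sketch below simply applies the figures listed above (3.4Å per base pair, 10 base pairs per turn); the function names and the 1000 bp example are illustrative assumptions.

```python
RISE_PER_BP_ANGSTROM = 3.4   # distance between adjacent base pairs
BP_PER_TURN = 10             # base pairs per full helical turn

def helix_length_angstrom(n_bp):
    """Contour length of an idealized double helix with n_bp base pairs."""
    return n_bp * RISE_PER_BP_ANGSTROM

def helix_turns(n_bp):
    """Number of full helical turns in n_bp base pairs."""
    return n_bp / BP_PER_TURN

n_bp = 1000  # hypothetical fragment size
print(helix_length_angstrom(n_bp))  # about 3400 angstroms, i.e. 340 nm
print(helix_turns(n_bp))            # 100.0
```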

Chemical Properties of DNA

  • There are four nucleotide bases present in the polynucleotide chain: adenine, guanine, cytosine and thymine. Adenine and guanine are the two purine bases, which have a double-ring structure. Cytosine and thymine are the two pyrimidine bases, which have a single-ring structure.
  • The two strands are joined together by the “Complementary base pairing” of the nitrogenous bases. A purine base always pairs with a pyrimidine base: ‘Adenine’ pairs with ‘Thymine’ and ‘Guanine’ pairs with ‘Cytosine’.
  • The paired bases on the two polynucleotide strands are held together by hydrogen bonds.
  • Adenine pairs with thymine through two hydrogen bonds, whereas guanine pairs with cytosine by means of three hydrogen bonds.
  • The nucleotide base composition of DNA follows Chargaff’s rule, according to which the number of purines equals the number of pyrimidines. The base composition A + G = T + C obeys Chargaff’s rule, but A + T is not necessarily equal to G + C.
  • Polynucleotide strands of DNA consist of three major components, namely nitrogenous bases, deoxyribose sugar and a phosphate group.
  • The backbone of DNA is a sugar-phosphate backbone, in which neighbouring nucleotides within each strand are linked by “Phosphodiester bonds”. The bonding between sugars and phosphates (phosphodiester bonds) and the bonding between nitrogenous bases (hydrogen bonds) together contribute to “DNA stability”.
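
The base-pairing rules and Chargaff’s rule can be checked on a toy example. The following Python sketch builds the antiparallel partner of a made-up strand and counts bases over both strands; the strand and all function names are assumptions for illustration only.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ATGC", "TACG")   # A pairs with T, G pairs with C

def reverse_complement(strand):
    """Antiparallel partner strand, written 5'->3'."""
    return strand.translate(COMPLEMENT)[::-1]

def duplex_counts(strand):
    """Base counts over both strands of the double helix."""
    return Counter(strand + reverse_complement(strand))

def hydrogen_bonds(strand):
    """Two bonds per A-T pair, three per G-C pair."""
    return sum(2 if base in "AT" else 3 for base in strand)

strand = "ATGCGGATTA"   # made-up example strand
counts = duplex_counts(strand)
print(reverse_complement(strand))                               # TAATCCGCAT
print(counts["A"] + counts["G"] == counts["T"] + counts["C"])   # True
print(counts["A"] + counts["T"] == counts["G"] + counts["C"])   # False
print(hydrogen_bonds(strand))                                   # 24
```

For this strand, purines equal pyrimidines across the duplex (10 each), while A + T (12) differs from G + C (8).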


The double-helix model of DNA was proposed by Watson and Crick in 1953. The discovery of the double helix would not have been possible without the work of Maurice Wilkins and Rosalind Franklin, who obtained images of DNA through X-ray crystallography. These X-ray diffraction images helped Watson and Crick to further study the structure and components of DNA, and led them to propose the model now known as Watson and Crick’s model of double-helical DNA.

DNA is a large biomolecule that contains the genetic information needed to build an organism. Studying its double-helical structure helps us to understand the chemical and physical properties of DNA, in addition to its role as the “Genetic material”.

Additional Online Resources

Human Genome Project
This site, entitled “DNA Forensics”, is presented by the Human Genome Project. It provides a comprehensive overview of the topic covered in this BLOSSOMS lesson.

Learn.Genetics: Gel Electrophoresis
This presentation on DNA forensics is provided by “Learn.Genetics” of the Genetics Science Learning Center at the University of Utah.

This is the Learn.Genetics main site, providing links to a wide range of resources for learning about genetics.

MIT BLOSSOMS video: Visit to Police Identification Lab
Watch optional video: Visit to Police Identification Lab in Cambridge, MA to see how DNA is extracted from evidence at crime scenes.

How to count non-DNA bases in a sequence using Python

I noticed recently that two particular questions are popping up quite regularly in my search logs: "how to count non-DNA bases in a sequence" and "how to tell if a sequence contains DNA" (presumably as opposed to protein). It struck me that the second question is really a special case of the first – once we have a way to count the number of DNA bases in a sequence, we can simply apply a rule that if more than 80% (or any other number we choose) of bases in a sequence are A,T,G or C, then it is probably DNA.

Let's start with the simplest thing that we think will work – we'll simply count the number of A, T, G and C characters in a sequence, then divide by the length and multiply by 100 to get a percentage. For this example I'm using a DNA sequence that has three non-ATGC characters: one each of N, Y and R. I've included the division fix at the start of the code in case you want to run this on Python 2:
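
A minimal sketch of this first approach might look like the following. The sequence is a made-up example with one each of N, Y and R, and the variable names are assumptions.

```python
# Python 2 only: put "from __future__ import division" at the very top of the
# file so that "/" performs true division. Python 3 needs no such fix.

dna = "ATGCATGCNATGCYATGCRA"   # made-up sequence: 20 characters, 3 of them non-ATGC

# count each of the four standard bases and add the counts together
dna_count = dna.count("A") + dna.count("T") + dna.count("G") + dna.count("C")

# divide by the length and multiply by 100 to get a percentage
dna_percent = dna_count / len(dna) * 100
print(dna_percent)   # 85.0
```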

The output from this bit of code shows that it's working as expected:

However, in some circumstances, we might want to allow characters other than A,T,G and C in our DNA sequences. Take a look at this table showing the set of standard IUPAC ambiguity codes:
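
For reference, the standard IUPAC nucleotide codes can be written down as a Python dictionary that maps each symbol to the bases it stands for. This listing covers the four bases plus eleven ambiguity symbols; tables of this kind sometimes also include U (for RNA) or a gap character, which are left out here. The name `IUPAC_CODES` is an assumption.

```python
# Each IUPAC symbol mapped to the set of bases it may represent
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong (three hydrogen bonds)
    "W": "AT",    # weak (two hydrogen bonds)
    "K": "GT",    # keto
    "M": "AC",    # amino
    "B": "CGT",   # not A
    "D": "AGT",   # not C
    "H": "ACT",   # not G
    "V": "ACG",   # not T
    "N": "ACGT",  # any base
}

for symbol, bases in IUPAC_CODES.items():
    print(symbol, "=", " or ".join(bases))
```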

Depending on which subset of these we want to allow, we might want to count as many as sixteen different characters. Rather than cram sixteen different calls to count() into one line, it's probably better to loop through the allowed characters and build up the count one at a time. Here's a bit of code to do that, using a list to define the set of allowed characters. For this example I'm allowing the four standard bases plus purines (R) and pyrimidines (Y):
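
A sketch of the loop-based version, again using a made-up sequence and assumed variable names:

```python
dna = "ATGCATGCNATGCYATGCRA"   # made-up sequence containing one N, one Y and one R

# the four standard bases plus purine (R) and pyrimidine (Y)
allowed_bases = ["A", "T", "G", "C", "R", "Y"]

total_dna_bases = 0
for base in allowed_bases:
    # add the number of occurrences of this base to the running total
    total_dna_bases = total_dna_bases + dna.count(base)

dna_fraction = total_dna_bases / len(dna) * 100
print(dna_fraction)   # 95.0
```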

As expected, the answer is higher than in our first example because we are now counting the R and Y as DNA bases:

This seems like a perfect bit of code to turn into a function. We'll make the DNA sequence and the list of allowed bases into function arguments, and use a sensible default of counting just ATGC characters.
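
One way to write such a function is sketched below. The name `count_dna()` is the one used later in the text; the parameter names are assumptions, and the default is written as a tuple rather than a list to avoid a mutable default argument.

```python
def count_dna(seq, allowed_bases=("A", "T", "G", "C")):
    """Return the percentage of characters in seq that are allowed bases."""
    seq = seq.upper()                      # make the input case-insensitive
    total_dna_bases = 0
    for base in allowed_bases:
        # upper-case each allowed base too, so ["a", "t"] works as expected
        total_dna_bases = total_dna_bases + seq.count(base.upper())
    return total_dna_bases / len(seq) * 100

print(count_dna("ATNN"))   # 50.0
print(count_dna("atgc"))   # 100.0
print(count_dna("ATGCATGCNATGCYATGCRA", ["a", "t", "g", "c", "r", "y"]))   # 95.0
```

Note that this sketch assumes a non-empty sequence and a list of allowed bases without duplicates; an empty string would raise a `ZeroDivisionError`.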

Notice how we've changed both the input sequence and the allowed bases to upper case, to make sure that the function will work regardless of the case of the inputs. Here are a few quick tests:

Having written this function, it's pretty straightforward to define a function to test if a sequence is DNA. To make the function as flexible as possible, we'll assign sensible defaults to both the allowed bases and the minimum percentage of bases that must match. We'll pass the input sequence and the list of allowed bases through to the count_dna() function, and then compare the result of that call to the minimum. Here's the function along with a couple of lines to test it:
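
A sketch of what this could look like follows. The name `is_dna`, the parameter names and the default minimum of 80 (taken from the rule of thumb mentioned earlier) are assumptions, and `count_dna()` is repeated in compact form so that the example runs on its own.

```python
def count_dna(seq, allowed_bases=("A", "T", "G", "C")):
    """Return the percentage of characters in seq that are allowed bases."""
    seq = seq.upper()
    total = sum(seq.count(base.upper()) for base in allowed_bases)
    return total / len(seq) * 100

def is_dna(seq, allowed_bases=("A", "T", "G", "C"), minimum=80):
    """True if more than `minimum` percent of seq consists of allowed bases."""
    return count_dna(seq, allowed_bases) > minimum

seq = "ATGCATGCNATGCYATGCRA"   # made-up sequence, 85% ATGC
print(is_dna(seq))                                               # True
print(is_dna(seq, minimum=90))                                   # False
print(is_dna(seq, ["A", "T", "G", "C", "R", "Y"], minimum=90))   # True
```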

As you can see, the function is very concise – we simply ask whether the percentage of DNA bases returned by our earlier function is greater than the minimum, and return the result. As the output shows, we can make the test more stringent by increasing the minimum, or more lenient by allowing some ambiguous bases:

Another, much more concise way to write the counting function would be to use a list comprehension to select just the characters that are in some group:
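
A sketch of the list-comprehension version, with the same assumed function and parameter names as before:

```python
def count_dna(seq, allowed_bases=("A", "T", "G", "C")):
    """Return the percentage of characters in seq that are allowed bases."""
    seq = seq.upper()
    allowed = {base.upper() for base in allowed_bases}
    # keep only the characters that belong to the allowed group
    matches = [base for base in seq if base in allowed]
    return len(matches) / len(seq) * 100

print(count_dna("ATNN"))   # 50.0
print(count_dna("atgc"))   # 100.0
```

Because each character of the sequence is tested once against a set, this version is also unaffected by duplicate entries in the list of allowed bases.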

Sequence Assembly Problem

The sequence assembly problem can be described as follows.

Given a set of sequences, find the minimal length string containing all members of the set as substrings.

This problem is further complicated due to the existence of repetitive sequences in the genome as well as substitutions or mutations within them.

The sequence assembly problem can be compared to a real-life scenario as follows.

Assume that you take many copies of a book, pass each of them through a shredder with a different cutter, and then try to put the text of the book back together just by gluing together the shredded pieces. It is obvious that this task is pretty difficult. Furthermore, there are some extra practical issues as well. The original copy may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Parts from another book may have also been added in, and some shreds may be completely unrecognizable.

It sounds very confusing and nearly impossible to carry out. This problem is known to be NP-complete. NP-complete problems are problems for which it is unknown whether an efficient (polynomial-time) solution exists: no polynomial-time algorithm has yet been discovered for any NP-complete problem, nor has anybody yet been able to prove that no polynomial-time algorithm exists for any of them. However, there are greedy algorithms for the sequence assembly problem that experiments have shown to perform fairly well in practice.
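
To make the greedy idea concrete, here is a minimal Python sketch of one common heuristic: repeatedly merge the two fragments with the longest suffix-prefix overlap. It assumes error-free reads from a single strand, does not handle repeats, and is not guaranteed to find the shortest superstring. The function names and the three example reads are made up.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Greedily merge the pair of reads with the largest overlap until one string is left."""
    reads = list(dict.fromkeys(reads))   # drop exact duplicates, keep order
    # drop reads that are already contained in another read
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_a, best_b = -1, None, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])   # glue b onto a, sharing the overlap
    return reads[0] if reads else ""

reads = ["ATGCG", "GCGTA", "GTACC"]   # made-up overlapping fragments
print(greedy_superstring(reads))      # ATGCGTACC
```

Each pass compares every ordered pair of fragments, so this sketch is only practical for small inputs; real assemblers use much more efficient data structures.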

A common method used to solve the sequence assembly problem and perform sequence data analysis is sequence alignment.

In the exercise below you will be given an unknown DNA sequence and asked to use a web tool to translate the sequence into an amino acid sequence and hopefully identify the proper reading frame. You will then save this amino acid sequence to a word processing program (or e-mail it to yourself) if you want to use it in the next exercise.

Obtaining your sequence
In the lab, this might be obtained by sequencing a clone from a cDNA library or by isolating an amplified DNA fragment from a PCR amplification. Often, when we sequence such a product we find we have an unexpected fragment of DNA which we need to analyze. Here we will provide a partial sequence at random from our database of sequences. A partial nucleotide sequence will appear in the window below after you click on the Get Gene Sequence button.

Translating the Sequence
Several sites on the web perform a translation of an input sequence. Clicking on the Expasy link below will open a new window giving you access to a translation tool. Translating the DNA sequence is done by reading the nucleotide sequence three bases at a time and then looking at a table of the genetic code to arrive at an amino acid sequence. This program examines the input sequence in all six possible frames (i.e. reading the given strand and its reverse complement, each starting with nt 1, nt 2 and nt 3). What we typically look for in identifying the proper translation is the frame that gives the longest amino acid sequence before a stop codon is encountered. (Since there are 64 codons and three of them code for nonsense, we expect a stop codon to appear on average about once every 21 codons if we simply read a sequence "out of frame". However, "on average" is just that, and it is possible for an incorrect reading frame to give an extended sequence with no stop codons. The next exercise will address that problem.)
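
The six-frame translation described here can be sketched in a few lines of Python. This is not the code behind the Expasy tool; it is a minimal illustration that uses the standard genetic code, marks stop codons with `*` and unrecognized codons with `X`. All function names and the short example sequence are assumptions.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# standard genetic code: codons enumerated in TCAG order, "*" marks a stop codon
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Return the opposite strand, written 5'->3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Translate frames 1-3 of the given strand (+) and of its reverse complement (-)."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(s[offset:])
    return frames

def longest_open_stretch(protein):
    """Longest run of amino acids not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)

frames = six_frames("ATGGCCTAA")   # made-up sequence
print(frames["+1"])                # MA*
print(frames["-1"])                # LGH
```

Comparing `longest_open_stretch()` across the six frames gives the candidate reading frame, subject to the caveat above that a wrong frame can occasionally be free of stop codons too.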

We will use Expasy tools for translation. Clicking on it will open a new window so you can return to this window for instructions and to copy your sequence.


DNA Sequencing and Genomics

DNA sequencing determines the order of DNA nucleotides, or bases, in a genome: the order of adenines (A), cytosines (C), guanines (G), and thymines (T) that make up an organism’s DNA. DNA sequencing can be performed using different methods. The Sanger (chain-termination) method was the most widely used method for DNA sequencing until pyrosequencing came into the picture. Sanger’s method is based on the use of dideoxynucleotides. Structurally, dideoxynucleotides are essentially the same as nucleotides except that they carry a hydrogen atom (–H) on the 3′-carbon instead of a hydroxyl group (–OH). Because they cannot form a phosphodiester bond with the next incoming nucleotide, dideoxynucleotides prevent the addition of further nucleotides and terminate DNA chain formation. Pyrosequencing, on the other hand, is based on the ‘sequencing by synthesis’ principle. It relies on the detection of pyrophosphate release on nucleotide incorporation, rather than on chain termination with dideoxynucleotides. Pyrosequencing paved the way for high-throughput sequencing, which parallelizes the sequencing process and produces millions of sequences at once. Second-generation platforms such as 454 pyrosequencing, Illumina (Solexa) sequencing, and SOLiD sequencing all work in this massively parallel fashion, although their underlying chemistries differ. With the advent of new-generation sequencing technologies, the cost of sequencing and the time required to perform it have dropped significantly. This has opened up many opportunities in the field of genomics. Many projects have been undertaken to understand the genomics of many species including viruses, bacteria, fungi, plants, and animals.

Developing RFLP probes

  • Total DNA is digested with a methylation-sensitive enzyme (for example, PstI), thereby enriching the library for single- or low-copy expressed sequences (PstI clones are based on the suggestion that expressed genes are not methylated).
  • The digested DNA is size-fractionated on a preparative agarose gel, and fragments ranging from 500 to 2000 bp are excised, eluted and cloned into a plasmid vector (for example, pUC18).
  • Digests of the plasmids are screened to check for inserts.
  • Southern blots of the inserts can be probed with total sheared DNA to select clones that hybridize to single- and low-copy sequences.
  • The probes are screened for RFLPs using genomic DNA of different genotypes digested with restriction endonucleases. Typically, in species with moderate to high polymorphism rates, two to four restriction endonucleases are used such as EcoRI