How hard is it to determine the 3D structure of a protein?

I see tens of thousands of PDB files on the internet. I really want to determine the 3D structure of my protein of interest. I've heard that 3D structure determination is a complex, expensive, and specialized procedure that can take months or years and is hard to perform routinely or even purchase commercially.

Could you please explain why and how hard it is? What are the requirements and how big of a budget is typically needed to perform this kind of experiment?

Experimental protein structure determination is hard: the most common method is X-ray crystallography, which can be done in a few months if you are lucky and can take years if you're not. The problem with X-ray crystallography is that you need good protein crystals, and in most cases, proteins don't crystallize very well, so it takes a lot of time (and a lot of purified protein) to get the right crystal. (In the comments section, Cody added some good links with more in-depth material about X-ray crystallography for structure determination, relating to ATP synthase.)

The cost is mainly based on the materials & manpower you need for this, and you need to test a lot of different solutions to crystallize your protein in, which aren't very cheap. If you don't have access to specialised equipment for making the crystals (mostly pipetting robots) or the measurement (X-ray beamline), this is going to be super expensive.

Crystallography requires protein crystals, which are sometimes too troublesome to obtain. Other methods that do not require crystals, such as NMR and cryoEM, are even more difficult to perform though less luck dependent. These techniques rely on very expensive equipment and often cannot resolve the structure as precisely as protein crystallography. So unless you can find one of the few people who have experience with these methods, you'll run into a lot of additional problems.

There is also computational structure prediction, which only needs a powerful computer. However, unless your protein is very similar to one with a known structure, it will probably not work reliably. As mentioned in the comments, there are web servers for the established methods and progress is constantly being made on new(er) algorithms, so depending on your needs it's definitely worth a shot to try computational structure prediction.

I'll address NMR for structure determination. It is the less common method: only ~10% of protein structures are determined this way, though it has advantages for, e.g., nucleic acids, more than a third of which were solved by NMR. Take any numbers here as very rough estimates; there are a lot of factors that influence the difficulty and cost.

For NMR, the single most important factor is size. A stable, well-behaved protein of 10-20 kD is pretty much routine. Large proteins are either very difficult to measure or outright impossible by NMR.

You need large quantities of protein for X-ray and NMR, but in the case of NMR you also need it isotope-labeled. Uniformly labelling proteins 15N/13C isn't all that expensive if you can express your protein in E. coli in minimal medium (around $100 per liter medium or so, mostly in 13C glucose), but it can get really expensive very fast if you need full medium or anything fancy.

Your protein needs to be stable in solution for a few days at least, since a single 3D experiment can take multiple days to measure. If your protein is barely stable enough, you also have to produce much more protein because you need to use many samples to perform all necessary experiments. A complication here is that you can't add a lot of salt to an NMR sample without drastically reducing the sensitivity of the NMR experiments, but some proteins are hard to keep stable in a low-salt environment.

You need something like a few weeks of measurement time on a high-field NMR spectrometer, though this depends a lot on the size of the protein, the concentration you can achieve in your sample, and the general difficulty of assigning the protein resonances. This kind of spectrometer can cost a few million dollars, and you need to supply liquid nitrogen and liquid helium to keep it working. Something like $1000 per day of measurement time is a number I've heard for the cost, but there are a lot of factors that go into this and I can't say how accurate this number is.

Then you need to assign the resonances in your protein, which takes a single person several months. This again varies a lot depending on the difficulty.

Then you need to calculate the structure, and typically this requires several rounds of refining the assignment and analysis and redoing the calculations. It's not really expensive in terms of computer power, but it takes quite some work by the person running the calculations and checking them.

G4. Prediction of Membrane Protein Structure

  • Contributed by Henry Jakubowski
  • Professor (Chemistry) at College of St. Benedict/St. John's University

So far we have discussed predominantly globular proteins that are soluble in water. Proteins are also found associated with membranes. Two major classes of membrane proteins are found in nature.

  • peripheral membrane proteins: water soluble proteins bound reversibly and non-covalently to the membrane through electrostatic attractions between charged polar head groups of the phospholipids and the protein. These proteins can often be released from the membrane by addition of high salt, since they are often attracted to the bilayer by electrostatic interactions between charged phospholipid head groups and polar/charged groups on the protein surface.
  • integral membrane proteins: actually insert into the bilayer. These can be released from the membrane and effectively solubilized by the addition of single chain amphiphiles (detergents) which form a mixed micelle with the integral membrane protein. Nonionic detergents (Triton X-100, octylglucoside, etc.) are often used in the purification of membrane proteins. Ionic detergents (like SDS) not only solubilize the integral membrane proteins, but also denature them.

Figure: Types of membrane proteins

In some of these integral membrane proteins, large extracellular and intracellular domains of the protein are present, connected by the intramembrane regions. The intramembrane spanning region often consists of either a single alpha helix, or 7 different helical regions which zig-zag through the membrane. These transmembrane sequences can readily be determined through hydropathy calculations. For example, consider the integral membrane bovine protein rhodopsin. Its 348 amino acid sequence (in single letter code) is shown below:


Rhodopsin hydropathy plot calculations show that it contains seven transmembrane helices which wind through the membrane in a serpentine fashion.

Figure: Rhodopsin hydropathy plot

Figure: seven transmembrane helices

Rhodopsin Hydropathy Results

No. N terminal transmembrane region C terminal type length

In summary, hydropathy plots are hence useful in finding buried regions in water soluble proteins, transmembrane helices in integral membrane proteins as well as short stretches of polar/charged amino acids that might form surface loops recognizable by immune system antibodies. The window size used in hydropathy plots would obviously affect the calculated results. Windows of 20 amino acids are useful to determine transmembrane helices while windows of 5-7 amino acids are used to find surface-exposed hydrophilic sites.
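The sliding-window calculation behind a hydropathy plot can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used to produce the rhodopsin results: the text does not name a hydropathy scale, so the Kyte-Doolittle scale is assumed here, and the threshold of 1.6 and the toy sequence are illustrative choices only.

```python
# Assumption: Kyte-Doolittle hydropathy values; the text does not name a scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=20):
    """Return the mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if window > len(scores):
        return []
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(sequence, window=20, threshold=1.6):
    """Return 0-based start indices of windows whose mean exceeds the threshold.

    The threshold is an assumed, illustrative cutoff.
    """
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score > threshold]


# Hypothetical toy sequence: charged flanks around a 20-residue hydrophobic stretch.
toy_sequence = "KKDEKR" + "LIVALLVIAGLLIVAFLLVA" + "RKDEEK"
toy_profile = hydropathy_profile(toy_sequence, window=20)
print("most hydrophobic window starts at index", toy_profile.index(max(toy_profile)))
```

Calling the same function with a window of 5 to 7 and looking for the lowest scores instead of the highest would correspond to the search for surface-exposed hydrophilic sites mentioned above.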

Membrane proteins can be solubilized by addition of single chain amphiphiles (detergents). The nonpolar tails of the detergents interact with the hydrophobic transmembrane domain of the membrane protein, forming a "mixed" micelle-like structure. Nonionic detergents like Triton X-100 and octyl-glucoside are often used to solubilize membrane proteins in their near native state. In contrast, ionic detergents like sodium dodecyl sulfate (with a negatively charged head group) denature proteins during the solubilization process. To study membrane proteins in a more native-like environment, proteins solubilized by nonionic detergent can be reconstituted into bilayer liposome structures using methods similar to those from Lab 1 in which you prepared dye-encapsulated large unilamellar vesicles (LUVs). However, it can be difficult to study the intra- and extracellular domains of membrane proteins in liposomes, given that one of those domains is hidden inside the liposome. A novel technique that removes this barrier was recently developed by Sligar. He created an amphiphilic protein disc with an opening in the center. The inner opening is lined with nonpolar residues, while the outer surface of the disc is polar. When the discs were added to phospholipids, small bilayers formed inside the disc. Membrane proteins like the β2-adrenergic receptor could be reconstituted in the nanodisc bilayers, allowing solvent exposure of both the intracellular and extracellular domains of the receptor protein.

Debora Marks

Associate Professor of Systems Biology

One million human genomes, will it make a difference? The large and growing volume of genome information, from all forms of life, presents unprecedented opportunities for computational biologists. The challenge for our scientific generation is to turn an avalanche of sequence information into meaningful discovery of biological principles, predictive methods, or strategies for molecular manipulation for therapeutic and biofuel discovery. The Marks lab is a new interdisciplinary lab dedicated to developing rigorous computational approaches to critical challenges in biomedical research, particularly on the interpretation of genetic variation and its impact on basic science and clinical medicine. To address this we develop algorithmic approaches to biological data aimed at teasing out causality from correlative observations, an approach that has been surprisingly successful to date on notoriously hard problems. In particular, we developed methods adapted from statistical physics and graphical modeling to disentangle true contacts from observed evolutionary correlations of residues in protein sequences. Remarkably, these evolutionary couplings, identified from sequence alone, supplied enough information to fold a protein sequence into 3D. The software and methods we developed are available to the biological community on a public server that is quick and easy for non-experts to use. Using this evolutionary approach, we have accurately predicted the 3D structure of hundreds of proteins and large pharmaceutically relevant membrane proteins. Many of these were previously of unknown structure and had no homology to known sequences; two of the large membrane proteins have now been experimentally validated. We have now applied this approach genome wide to determine the 3D structure of all protein interactions that have sufficient sequences and can demonstrate the evolutionary signature of alternative conformations.

The vision for the Marks lab is to build computational methods that address three critical challenges (i) protein conformational plasticity in health and disease, (ii) genome-wide evaluation of mutations on disease likelihood, antibiotic resistance and personal drug response, and (iii) synthetic protein design.

About Dr. Marks: I am a computational biologist interested in how to read the genome and interpret its variation. Recently, we have used evolutionary couplings determined from genomic sequencing to accurately predict protein 3D structure from sequences alone, including the experimentally challenging transmembrane proteins. Continuing from this, my lab aims to predict alternative conformations and plasticity of proteins, and the consequences of protein genetic variation on pharmacological intervention. In a complementary approach, we are examining the effect of drugs on patients and cell lines by bringing together large bodies of data from multiple perturbations and thousands of cancer patient tissues.

Faster structures

An AlphaFold prediction helped to determine the structure of a bacterial protein that Lupas’s lab has been trying to crack for years. Lupas’s team had previously collected raw X-ray diffraction data, but transforming these Rorschach-like patterns into a structure requires some information about the shape of the protein. Tricks for getting this information, as well as other prediction tools, had failed. “The model from group 427 gave us our structure in half an hour, after we had spent a decade trying everything,” Lupas says.

Demis Hassabis, DeepMind’s co-founder and chief executive, says that the company plans to make AlphaFold useful so other scientists can employ it. (It previously published enough details about the first version of AlphaFold for other scientists to replicate the approach.) It can take AlphaFold days to come up with a predicted structure, which includes estimates on the reliability of different regions of the protein. “We’re just starting to understand what biologists would want,” adds Hassabis, who sees drug discovery and protein design as potential applications.

In early 2020, the company released predictions of the structures of a handful of SARS-CoV-2 proteins that hadn’t yet been determined experimentally. DeepMind’s predictions for a protein called Orf3a ended up being very similar to one later determined through cryo-EM, says Stephen Brohawn, a molecular neurobiologist at the University of California, Berkeley, whose team released the structure in June. “What they have been able to do is very impressive,” he adds.

Using Protein Sequences to Predict Structure

Proteins are typically cited as the molecules that enable life; the word protein stems from the Greek proteios meaning “primary,” “in the lead,” or “standing in front.” Living systems are made up of a vast array of different proteins. There are around 50,000 different proteins encoded in the human genome, and in a single cell there may be as many as 20,000,000 copies of a single protein. 1

Each protein provides a fascinating example of a self-organizing system. The molecule is assembled as a chain of amino acid building blocks, which are bonded together by peptide bonds to form a linear polymer. Once synthesized, this polymer spontaneously self-assembles into the correct and highly ordered three-dimensional structure required for function. This ability to self-assemble is remarkable: each linear polypeptide chain is highly disorganized, and has the potential to adopt an array of conformations so vast that we cannot enumerate them, yet within less than a second a typical protein spontaneously assumes the correct, highly ordered three-dimensional structure required for function. The identity and order of the amino acids that make up this polypeptide, that is the protein sequence, typically contain all the information necessary to specify the folded functional molecule. 2

Figure 1: A) The amino acids (letters, second row of table) specified at each sequence position (numbered, top row of table) for a particular protein are synthesized into a polypeptide chain. B) The polymer chain spontaneously self-assembles into the complex three-dimensional structure specific to that protein that is required for the molecular function. C) Once folded, the protein is described as a monomer, and often different monomers or multiple copies of the same monomer self-assemble into protein complexes that form functional molecules.

We currently live in a hugely exciting time for the biological sciences, for the simple reason that technological advances have greatly increased our ability to accurately collect large amounts of data. In particular, over the last twenty years our ability to cheaply and precisely determine the sequences of proteins has vastly increased, leading to the assembly of large, freely accessible collections of protein sequences from different species. However, experimentally determining the three-dimensional structure of a protein is expensive and difficult, leading us to ask if we can use the sequence data available for each protein to predict its three-dimensional structure. The crucial point is that the sequence of a particular protein varies between different species. Hemoglobin (figure 1), the protein in our blood that binds and transports oxygen, provides a good example. Versions of hemoglobin from different species are very similar, both in their three-dimensional structure and in the function they carry out. However, there are differences between the hemoglobin amino acid sequences that occur in different species. An exciting current direction of research is to exploit this evolutionary sequence variation and crack the code that relates amino acid sequence to protein structure and function. 3,4

Model based on sequence data
The basic idea is to use the abundance of protein sequence data that is now available to build a probability model for the amino acid sequence that codes for each protein of interest. The probability distribution for the sequence, P(A_1, A_2, ..., A_n), describes the probability that each of the twenty amino acids occurs at each of the positions 1, ..., n in the sequence of n amino acids that makes up the protein of interest. If we are able to collect enough distinct data samples for a protein of interest, and we make certain assumptions about the mathematical form of the probability distribution, then we can use the data to infer the parameters of the model. For many proteins, upwards of 10,000 sequences are now available, a body of data that constitutes a set of samples from this probability distribution, though the elements of this set are not sampled independently of one another.

What form should the probability model take? The space of models that would generate the data observed for a particular protein is unbounded. However, we can use the knowledge about proteins collected by biologists over the last century to restrict our attention to particular classes of models. A process of selection on standing variation in different populations produces sequences of a particular protein across many species. Through the evolutionary process, mutations (amino acid changes) at different sequence positions within a protein are randomly generated. Some of these mutations lead to an improved version of the protein, increasing the fitness of the organism, and will therefore be selected. There is a high-dimensional space of possible sequences; the sequences corresponding to a protein of interest occupy some subset of this space. Sequences collected in the database that code for a particular protein record the outcomes of millions of evolutionary experiments that probe the boundaries of this subset. The boundaries are imposed by the requirement for a protein to be functional; the idea is to infer the boundaries, or constraints, on which sequences are allowed, and thus learn about the relationship between amino acid sequence and protein function.

A key point is that mutations at different sites do not necessarily have independent effects on the protein. In particular, it is often observed that while a single mutation at one position within a sequence results in a protein that is no longer functional, perhaps because it does not fold correctly, this disability can be rescued by a compensatory mutation that occurs elsewhere in the protein sequence. This suggests that we need to include interactions between different sequence positions in the probability model. Though the number of sequences available for many proteins is large, the space of possible parameters for a model that considers interactions of different orders is much larger, and so we restrict our attention to models that consider pairwise interactions between different sequence positions. We hypothesize that if two amino acids are in close proximity in the three-dimensional protein structure (if they pack against each other, for example, or interact via a hydrogen bond), then their mutation pattern across different species may contain correlations, as in the toy model illustrated in figure 2. If this model is accurate, it suggests that it may be possible to use pairwise interactions to predict the spatial proximity of amino acids in the three-dimensional protein structure from sequence data.

Figure 2: A) The toy model illustrates an example where a mutation at one sequence position results in a protein that no longer folds or functions correctly. This loss of protein function is rescued by a compensatory mutation elsewhere in the sequence, restoring the ability of the molecule to fold and function. B) If this model is correct, it will mean that our sequence database only contains sequences with i) neither of the mutations, or with ii) both of the mutations, and hence the mutation pattern of the two columns that correspond to these two sequence positions will be correlated.
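The correlated-columns idea of the toy model can be illustrated with a small Python sketch. It measures the mutual information between two columns of a sequence alignment, which is a simpler, local statistic and not the global maximum entropy model described later in the article; the six-sequence alignment and the helper names are hypothetical.

```python
import math
from collections import Counter


def column(alignment, index):
    """Return the residues found at one alignment position, one per sequence."""
    return [seq[index] for seq in alignment]


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: positions 0 and 1 always mutate together
# (A with D, G with E), while position 2 varies independently of position 0.
toy_alignment = ["ADKL", "ADKL", "GEKL", "GEKI", "ADRL", "GERL"]

mi_coupled = mutual_information(column(toy_alignment, 0), column(toy_alignment, 1))
mi_uncoupled = mutual_information(column(toy_alignment, 0), column(toy_alignment, 2))
print(f"coupled columns:   {mi_coupled:.3f} nats")
print(f"uncoupled columns: {mi_uncoupled:.3f} nats")
```

In this toy case the coupled pair carries the maximum possible information for two equally frequent states, and the uncoupled pair carries none. Real alignments are noisier, and pairwise statistics of this kind also pick up indirect correlations, which is the motivation for a global model.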

Experimentally determining the three-dimensional structure of a protein of interest is an expensive and time-consuming process, and for many transmembrane proteins (insoluble because they span the hydrophobic lipid bilayer that bounds a cell), it is not yet possible. The question of predicting the three-dimensional structure of a protein from its amino acid sequence has occupied scientists for at least the last fifty years. Part of the reason that this problem is rather intractable is the sheer number of possible conformations that each protein chain could in theory adopt. Each protein chain typically contains hundreds or in some cases thousands of amino acids.

To compute a rough estimate of the order of magnitude of the conformational search space, consider that each amino acid has two independent bond angles that describe its conformation within the context of a polypeptide chain, and add to this at least one extra degree of freedom describing the orientation of the amino acid’s side chain. Assuming, conservatively, that each degree of freedom can only take a restricted set of say 10 values, this provides a minimum of 10^3 = 1000 different structural conformations per amino acid, that is, (10^3)^150 = 10^450 possible configurations for a protein consisting of just 150 amino acids. Even if there is flexibility in the native structure (if amino acid side chains are able to rotate to some extent, for example), the native folded structure of a protein occupies just a tiny fraction of this enormous space, making it computationally intractable for any sort of brute force search approach. If the chain were able to sample conformations at a nanosecond or picosecond rate, it would still take a time longer than the age of the universe to find the correct native conformation (Levinthal’s paradox). 5 The fact that proteins manage to fold on biologically relevant timescales suggests that protein sequences are optimized by the evolutionary process to enable fast and reliable folding.
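The arithmetic of this estimate can be checked with a short Python sketch. It works in base-10 logarithms because the numbers are far too large for ordinary floating point. The sampling rate of one conformation per picosecond follows the text; the age of the universe (about 13.8 billion years) is a value supplied here, not taken from the article.

```python
import math

# Numbers taken from the estimate in the text.
states_per_degree_of_freedom = 10
degrees_of_freedom_per_residue = 3
n_residues = 150

# log10 of the number of conformations: 150 residues * 3 degrees * log10(10).
log10_conformations = (
    n_residues * degrees_of_freedom_per_residue
    * math.log10(states_per_degree_of_freedom)
)

# One conformation sampled per picosecond (the faster rate mentioned in the text).
samples_per_second = 1e12
log10_search_seconds = log10_conformations - math.log10(samples_per_second)

# Assumption: age of the universe of about 13.8 billion years, in seconds.
age_of_universe_seconds = 13.8e9 * 365.25 * 24 * 3600
log10_universe_ages = log10_search_seconds - math.log10(age_of_universe_seconds)

print(f"conformations: about 10^{log10_conformations:.0f}")
print(f"exhaustive search: about 10^{log10_search_seconds:.0f} seconds")
print(f"that is about 10^{log10_universe_ages:.0f} times the age of the universe")
```

Even under these generous assumptions the exhaustive search exceeds the age of the universe by hundreds of orders of magnitude, which is the point of the paradox.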

The shape of the energy landscape that enables the protein to spontaneously self-assemble into the correct structure in a matter of seconds is dictated by the physical interactions between different amino acids. Attempts have been made to use approximations of the physical interactions both between atoms of the protein and with atoms of the surrounding solvent to computationally simulate the protein-folding problem. While progress has been made, allowing the structures of some small proteins to be accurately predicted, the problem remains computationally intractable even with the use of coarse-grained approximations. Currently, we are unable to simulate more than a millisecond of protein dynamics, which prevents the simulation of folding trajectories for larger proteins.

The probability model
The question is how we can best use the available sequence data to infer the constraints or bounds on the space of amino acid sequences that result in the particular protein of interest. This is a typical inverse problem: we wish to use the data to infer the model constraints, i.e., to parameterize the probability distribution. This raises the crucial question of what form the probability distribution should have. While we require that the probability model reproduce the statistics of the observed data, there are many models that will do this. We wish to choose a single model from among these, so we choose the maximum entropy model, i.e., the least constrained model that reproduces the observed data. Specifically, we ask for the single site marginals and the marginals for each pair of sites to match the empirical frequency counts for the single sites, and each pair of sites, in the sequence alignment available for the protein of interest. The resulting Potts model, known from statistical physics, defines a global probability model on the space of protein sequences:

$$P(A_1, \ldots, A_n) = \frac{1}{Z} \exp\Big( \sum_{i<j} e_{ij}(A_i, A_j) + \sum_{i} h_i(A_i) \Big)$$

where $e_{ij}(a, b)$ are called the couplings and $h_i(a)$ the fields. Here $Z$ is the partition function, which ensures that the probabilities are properly normalized.

A different model is built from data for each protein, and gives the probability that any sequence of interest will specify the protein for which the model is built. In short, the set of sequences available for a protein of interest is used to infer the parameters for this model.
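To make the roles of the couplings, the fields, and the partition function concrete, here is a minimal Python sketch of a Potts-type model on a tiny, hypothetical system: a two-letter alphabet and three sequence positions. Because the system is so small, the partition function can be computed by brute-force enumeration, which is not feasible for real proteins with twenty amino acids and hundreds of positions. The parameter values are invented for illustration and are not inferred from any data.

```python
import itertools
import math


def potts_log_weight(seq, fields, couplings):
    """Sum of fields h_i(a) and pairwise couplings e_ij(a, b) for one sequence."""
    total = sum(fields[i][a] for i, a in enumerate(seq))
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            # Pairs that are missing from the dictionary contribute nothing.
            total += couplings.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return total


def potts_probabilities(alphabet, length, fields, couplings):
    """Enumerate every sequence and normalize by the partition function Z."""
    sequences = ["".join(s) for s in itertools.product(alphabet, repeat=length)]
    weights = {
        s: math.exp(potts_log_weight(s, fields, couplings)) for s in sequences
    }
    z = sum(weights.values())  # partition function
    return {s: w / z for s, w in weights.items()}


# Hypothetical parameters: no single-site preference, and one coupling that
# favors positions 0 and 2 carrying the same letter.
toy_fields = [{"A": 0.0, "B": 0.0} for _ in range(3)]
toy_couplings = {(0, 2): {("A", "A"): 1.0, ("B", "B"): 1.0}}

toy_probs = potts_probabilities("AB", 3, toy_fields, toy_couplings)
print(sorted(toy_probs.items(), key=lambda item: -item[1])[:4])
```

In this sketch the sequences in which positions 0 and 2 agree come out more probable than those in which they differ, even though no single position has a preferred letter. Inferring such parameters from an alignment is the inverse problem described above and is not attempted here.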

One prediction is that those pairs of sequence positions that have the highest interaction score will be in close structural proximity in the three-dimensional protein structure. To test the predictive power of our inference procedure, we compare the highest scoring pairs of sequence positions to the experimentally solved crystal structure of an example protein. In figure 3, the match between the model predictions (red stars) and the crystal structure data (grey points) is shown to be excellent.

Figure 3: In grey is a two-dimensional map of the three-dimensional crystal structure data—this is a binary matrix where a ‘1’ represents two amino acids between which the distance in the crystal structure is less than 5 angstroms. The highest scoring amino acid pairs from the maximum entropy model inferred from sequence data are plotted in red. The fact that many of the red points coincide with grey points demonstrates that we can predict information about three-dimensional protein structure from sequence data.
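The comparison shown in the figure can be sketched in Python as follows. The sketch builds a binary contact map from three-dimensional coordinates using the 5 angstrom cutoff from the caption, then reports the fraction of predicted pairs that are true contacts. It assumes one representative coordinate per amino acid, since the caption does not say which atoms define the distance, and the coordinates and predicted pairs are invented for illustration.

```python
import math


def contact_map(coords, cutoff=5.0):
    """Binary matrix: 1 where two different residues are closer than the cutoff."""
    n = len(coords)
    return [
        [
            1 if i != j and math.dist(coords[i], coords[j]) < cutoff else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


def contact_precision(predicted_pairs, cmap):
    """Fraction of predicted (i, j) pairs that are contacts in the map."""
    if not predicted_pairs:
        return 0.0
    hits = sum(cmap[i][j] for i, j in predicted_pairs)
    return hits / len(predicted_pairs)


# Hypothetical coordinates in angstroms, one point per residue.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.0, 0.0)]
toy_map = contact_map(toy_coords)

# Hypothetical high-scoring pairs from a sequence model.
toy_predictions = [(0, 3), (0, 2)]
print("precision:", contact_precision(toy_predictions, toy_map))
```

In practice, pairs that are adjacent in the sequence are usually excluded from such a comparison, since they are trivially close in space; that filtering is left out of this sketch.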

This is a highly surprising result, which immediately raises the question of whether the information inferred from sequence data is sufficient to predict the three-dimensional protein structure. To test this hypothesis, we start with the unfolded polypeptide chain for the protein of interest, and use an algorithm called distance geometry to enforce that the two amino acids in each high-scoring pair are within 7 angstroms of each other in our structural model. Distance constraints that reflect the secondary structure predicted from protein sequence are also included.

Figure 4: Comparison of our predicted three-dimensional protein structure for the protein RAS with the crystal structure (grey). The root mean square deviation between the Cα atoms of our predicted structure and the experimentally solved crystal structure is 3.5 angstroms, comparable to a low-resolution crystal structure.
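The root mean square deviation quoted in the caption can be computed as sketched below. This is a simplified sketch: it assumes the two sets of coordinates have already been superposed and list the same atoms in the same order. A full comparison would first find the optimal rigid-body superposition, which is omitted here, and the coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates in angstroms: every atom is displaced by 3 along z.
predicted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reference = [(0.0, 0.0, 3.0), (3.8, 0.0, 3.0), (7.6, 0.0, 3.0)]
print(f"RMSD: {rmsd(predicted, reference):.2f} angstroms")
```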

We find that there is indeed sufficient information in these large sequence alignments to accurately predict protein three-dimensional structure. 6 This statistical method makes progress on the protein-folding problem by predicting, from sets of protein sequences, structures for a range of globular and transmembrane proteins. In addition, three-dimensional structures were predicted using these methods for a set of transmembrane proteins for which no experimental structure had yet been solved. Many of these proteins are important in human diseases, and the existence of predicted structures will allow models of their function to be constructed and potentially validated. While the ability to use evolutionary variation to shed light on protein structure is exciting, the fundamental question of the relationship between amino acid sequence and protein function requires further work. In particular, it will be important to understand how the concerted actions of groups of amino acids within a protein result in different protein phenotypes, and furthermore how these can be predicted from large collections of protein sequences. 3,4

Lucy Colwell, Visitor (2013) and Member (2012–13) in the School of Natural Sciences, is Assistant Professor at the University of Cambridge. She is interested in using and developing mathematical techniques to better understand the relationship between biological sequence and phenotype, in particular at the level of proteins and protein complexes.

1. Beck, Martin, et al. “The Quantitative Proteome of a Human Cell Line,” Molecular Systems Biology 7.1 (2011).
2. Anfinsen, Christian B., et al. “The Kinetics of Formation of Native Ribonuclease during Oxidation of the Reduced Polypeptide Chain,” Proceedings of the National Academy of Sciences, 47.9 1309 (1961).
3. Skerker J. M., Perchuk B. S., Siryaporn A., Lubin E. A., Ashenberg O., et al. “Rewiring the Specificity of Two-Component Signal Transduction Systems,” Cell 133: 1043–54 (2008).
4. Halabi N., Rivoire O., Leibler S., Ranganathan R. “Protein Sectors: Evolutionary Units of Three-Dimensional Structure,” Cell 138: 774–86 (2009).
5. Levinthal, Cyrus. “How to Fold Graciously,” Mossbauer Spectroscopy in Biological Systems 22–24 (1969).
6. Marks Debora S., Colwell Lucy J., et al. “Protein 3D Structure Computed from Evolutionary Sequence Variation,” PLOS One 6.12: e28766 (2011).

Denaturation and Protein Folding

Each protein has its own unique sequence and shape that are held together by chemical interactions. If the protein is subject to changes in temperature, pH, or exposure to chemicals, the protein structure may change. This could result in the protein losing its shape without losing its primary sequence. We refer to this phenomenon as denaturation.

Reversible and Irreversible Denaturation

Denaturation is often reversible because the primary structure of the polypeptide is conserved in the process; if the denaturing agent is removed, the protein can refold and resume its function. Sometimes denaturation is irreversible, leading to loss of function. One example of irreversible protein denaturation is when an egg is fried. The albumin protein in the liquid egg white is denatured when placed in a hot pan (see image below).

(Top) albumin in raw and cooked egg white (Bottom) an analogy to help visualize the process of protein denaturation. Image Attribution: Wikimedia Commons (CC BY-SA 3.0)

Not all proteins are denatured at high temperatures. For instance, bacteria that survive in hot springs have proteins that function at temperatures close to boiling. The stomach is very acidic (low pH) and denatures proteins as part of the digestion process. However, the digestive enzymes of the stomach retain their activity under these conditions.

Is denatured protein still good?

Denaturation disrupts the interactions that hold a protein in its folded shape, but the peptide bonds of the primary sequence remain intact. Despite the change in structure, denatured protein still contains all of the amino acids that are found in other forms of the protein. As a result, denatured proteins are still nutritionally beneficial.

Protein Folding

Protein before and after folding. Image Attribution: Wikimedia Commons (public domain)

Protein folding is critical to its function. It was originally thought that the proteins themselves were responsible for the folding process. Only recently was it found that they often receive assistance from protein helpers that associate with the target protein during the folding process. We refer to these protein helpers as chaperones (or chaperonins). They act by preventing aggregation of the polypeptides that make up the complete protein structure, and they dissociate from the protein once the target protein is folded.


  • As a polypeptide is being synthesized, it emerges (N-terminal first) from the ribosome and the folding process begins.
  • However, the emerging polypeptide finds itself surrounded by the watery cytosol and many other proteins.
  • As hydrophobic amino acids appear, they must find other hydrophobic amino acids to associate with. Ideally, these should be their own, but there is the danger that they could associate with nearby proteins instead, leading to aggregation and a failure to form the proper tertiary structure.

To avoid this problem, the cells of all organisms contain molecular chaperones that stabilize newly-formed polypeptides while they fold into their proper structure. The chaperones use the energy of ATP to do this work.


Some proteins are so complex that a subset of molecular chaperones, called chaperonins, is needed.

Chaperonins are hollow cylinders into which the newly-synthesized protein fits while it folds.

Chaperonins also use ATP as the energy source to drive the folding process.

As mentioned above, high temperatures can denature proteins, and when a cell is exposed to high temperatures, several types of molecular chaperones swing into action. For this reason, these chaperones are also called heat-shock proteins (HSPs).

Not only do molecular chaperones assist in the folding of newly-synthesized proteins, but some of them can also unfold aggregated proteins and then refold the protein properly. Protein aggregation is the cause of disorders such as Alzheimer's disease, Huntington's disease, and prion diseases (e.g., "mad-cow" disease). Perhaps some day ways will be found to treat these diseases by increasing the efficiency of disaggregating chaperones.

Despite the importance of chaperones, the rule still holds: the final shape of a protein is determined by only one thing, the precise sequence of amino acids in the protein.

And the sequence of amino acids in every protein is dictated by the sequence of nucleotides in the gene encoding that protein. So the function of each of the thousands of proteins in an organism is specified by one or more genes.
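
The link from nucleotide sequence to amino acid sequence can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses the standard genetic code, reads the coding strand of DNA directly (skipping the mRNA intermediate), and the names `CODON_TABLE` and `translate`, as well as the example sequence, are made up for this sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA sequence into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon;
    trailing bases that do not form a full codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical 12-base sequence: ATG GCC AAG TAA
print(translate("ATGGCCAAGTAA"))
```

Changing a single nucleotide in the input can change an amino acid in the output, which is how a mutation in a gene can alter the protein it encodes.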

AI has almost solved one of biology’s greatest challenges: how proteins fold

A simple chain of amino acids folds into a complex three-dimensional structure | Marc Zimmer

Solving what biologists call “the protein-folding problem” is a big deal. Proteins are the workhorses of cells and are present in all living organisms. They are made up of long chains of amino acids and are vital for the structure of cells and communication between them as well as regulating all of the chemistry in the body.

This week, the Google-owned artificial intelligence company DeepMind demonstrated a deep-learning program called AlphaFold2, which experts are calling a breakthrough toward solving the grand challenge of protein folding.

Proteins are long chains of amino acids linked together like beads on a string. But for a protein to do its job in the cell, it must “fold” – a process of twisting and bending that transforms the molecule into a complex three-dimensional structure that can interact with its target in the cell. If the folding is disrupted, then the protein won’t form the correct shape – and it won’t be able to perform its job inside the body. This can lead to disease – as is the case in a common disease like Alzheimer’s, and rare ones like cystic fibrosis.

Deep learning is a computational technique that uses the often hidden information contained in vast datasets to solve questions of interest. It’s been used widely in fields such as games, speech and voice recognition, autonomous cars, science and medicine.

I believe that tools like AlphaFold2 will help scientists to design new types of proteins, ones that may, for example, help break down plastics and fight future viral pandemics and disease.

I am a computational chemist and author of the book The State of Science. My students and I study the structure and properties of fluorescent proteins using protein-folding computer programs based on classical physics.

After decades of study by thousands of research groups, these protein-folding prediction programs are very good at calculating structural changes that occur when we make small alterations to known molecules.

But they haven’t adequately managed to predict how proteins fold from scratch. Before deep learning came along, the protein-folding problem seemed impossibly hard, and it seemed poised to frustrate computational chemists for many decades to come.

A chain of amino acids goes through several folding steps, which occur through hydrogen bonds between amino acids in different regions of the protein, before arriving at the final structure. The example shown here is hemoglobin, a protein in red blood cells that transports oxygen to body tissues.
Anatomy & Physiology, Connexions website, CC BY

Protein folding

The sequence of the amino acids – which is encoded in DNA – defines the protein’s 3D shape. The shape determines its function. If the structure of the protein changes, it is unable to perform its function. Correctly predicting protein folds based on the amino acid sequence could revolutionize drug design, and explain the causes of new and old diseases.

All proteins with the same sequence of amino acid building blocks fold into the same three-dimensional form, which optimizes the interactions between the amino acids. They do this within milliseconds, although they have an astronomical number of possible configurations available to them – about 10 to the power of 300. This massive number is what makes it hard to predict how a protein folds even when scientists know the full sequence of amino acids that go into making it. Previously, predicting the structure of a protein from its amino acid sequence was impossible. Protein structures had to be determined experimentally, a time-consuming and expensive endeavor.

Once researchers can better predict how proteins fold, they’ll be able to better understand how cells function and how misfolded proteins cause disease. Better protein prediction tools will also help us design drugs that can target a particular topological region of a protein where chemical reactions take place.

AlphaFold is born from deep-learning chess, Go and poker games

The success of DeepMind’s protein-folding prediction program, called AlphaFold, is not unexpected. Other deep-learning programs written by DeepMind have demolished the world’s best chess, Go and poker players.

In 2016 Stockfish-8, an open-source chess engine, was the world’s computer chess champion. It evaluated 70 million chess positions per second and had centuries of accumulated human chess strategies and decades of computer experience to draw upon. It played efficiently and brutally, mercilessly beating all its human challengers without an ounce of finesse. Enter deep learning.

On Dec. 7, 2017, Google’s deep-learning chess program AlphaZero thrashed Stockfish-8. The chess engines played 100 games, with AlphaZero winning 28 and tying 72. It didn’t lose a single game. AlphaZero did only 80,000 calculations per second, as opposed to Stockfish-8’s 70 million calculations, and it took just four hours to learn chess from scratch by playing against itself a few million times and optimizing its neural networks as it learned from its experience.

AlphaZero didn’t learn anything from humans or chess games played by humans. It taught itself and, in the process, derived strategies never seen before. In a commentary in Science magazine, former world chess champion Garry Kasparov wrote that by learning from playing itself, AlphaZero developed strategies that “reflect the truth” of chess rather than reflecting “the priorities and prejudices” of the programmers. “It’s the embodiment of the cliché ‘work smarter, not harder.’”

CASP – the Olympics for molecular modelers

Every two years, the world’s top computational chemists test the abilities of their programs to predict the folding of proteins and compete in the Critical Assessment of Structure Prediction (CASP) competition.

In the competition, teams are given the linear sequence of amino acids for about 100 proteins for which the 3D shape is known but hasn’t yet been published; they then have to compute how these sequences would fold. In 2018 AlphaFold, the deep-learning rookie at the competition, beat all the traditional programs – but barely.

Two years later, on Monday, it was announced that AlphaFold2 had won the 2020 competition by a healthy margin. It whipped its competitors, and its predictions were comparable to the existing experimental results determined through gold-standard techniques like X-ray diffraction crystallography and cryo-electron microscopy. Soon I expect AlphaFold2 and its progeny will be the methods of choice to determine protein structures before resorting to experimental techniques that require painstaking, laborious work on expensive instrumentation.

One of the reasons for AlphaFold2’s success is that it could use the Protein Data Bank, which has over 170,000 experimentally determined 3D structures, to train itself to calculate the correctly folded structures of proteins.

The potential impact of AlphaFold can be appreciated if one compares the number of all published protein structures – approximately 170,000 – with the 180 million DNA and protein sequences deposited in the Universal Protein Resource (UniProt). AlphaFold will help us sort through treasure troves of DNA sequences hunting for new proteins with unique structures and functions.
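
The size of that gap is easy to work out from the two figures quoted above. The short Python sketch below does the arithmetic; the variable names are made up for illustration, and both counts are the approximate values given in the text rather than current database totals.

```python
# Approximate counts quoted in the text above.
known_structures = 170_000
known_sequences = 180_000_000

# Fraction of deposited sequences that have an experimental structure,
# treating both counts as directly comparable (a simplification).
coverage = known_structures / known_sequences
sequences_per_structure = known_sequences / known_structures

print(f"Structures cover about {coverage:.2%} of the sequences")
print(f"Roughly {sequences_per_structure:.0f} sequences per known structure")
```

In other words, fewer than one sequence in a thousand has an experimentally determined structure, which is the gap that prediction tools are expected to help close.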

Has AlphaFold made me, a molecular modeler, redundant?

As with the chess and Go programs – AlphaZero and AlphaGo – we don’t exactly know what the AlphaFold2 algorithm is doing and why it uses certain correlations, but we do know that it works.

Besides helping us predict the structures of important proteins, understanding AlphaFold’s “thinking” will also help us gain new insights into the mechanism of protein folding.

One of the most common fears expressed about AI is that it will lead to large-scale unemployment. AlphaFold still has a significant way to go before it can consistently and successfully predict protein folding.

However, once it has matured and the program can simulate protein folding, computational chemists will be integrally involved in improving the programs, trying to understand the underlying correlations used, and applying the program to solve important problems such as the protein misfolding associated with diseases like Alzheimer’s, Parkinson’s, cystic fibrosis and Huntington’s disease.

AlphaFold and its offspring will certainly change the way computational chemists work, but they won’t make them redundant. Other areas won’t be as fortunate. In the past, robots were able to replace humans doing manual labor; with AI, our cognitive skills are also being challenged.

This article is republished from The Conversation under a Creative Commons license. Read the original article.
