Are there any intersections between the biochemical pathways of uracil biosynthesis (or metabolism) and methionine degradation in eukaryotes?

Is there any way in which uracil biosynthesis or metabolism could help cell in degradation of toxic levels of methionine?

No, I don't think there are any intersections, and I can't see how uracil metabolism would have any influence upon methionine degradation.

Methionine degradation is shown in Figure 1 below. The end product is succinyl CoA which is fed into the TCA cycle. The only obvious by-product is cysteine which doesn't have any connection with uracil biosynthesis.

Uracil biosynthesis (as UMP) is shown in Figure 2: the ring is built from CO2, NH3 and aspartate. Uracil degradation produces CO2, NH3 and β-alanine.

There are certainly no shared intermediates in these pathways, and I can't come up with any convincing indirect connections either.
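The "no shared intermediates" check can be made explicit with a set intersection. This is only a sketch: the intermediate lists below are assumptions taken from standard textbook versions of the two pathways (they are not read off the figures), and cofactors such as ATP are deliberately left out.

```python
# Intermediates as usually drawn in textbook pathway diagrams.
# These lists are an assumption of this sketch, not data from Figures 1 and 2.
METHIONINE_DEGRADATION = {
    "methionine", "S-adenosylmethionine", "S-adenosylhomocysteine",
    "homocysteine", "cystathionine", "cysteine", "alpha-ketobutyrate",
    "propionyl-CoA", "methylmalonyl-CoA", "succinyl-CoA",
}
UMP_BIOSYNTHESIS = {
    "carbamoyl phosphate", "aspartate", "carbamoyl aspartate",
    "dihydroorotate", "orotate", "orotidine 5'-monophosphate", "UMP",
}


def shared_intermediates(pathway_a, pathway_b):
    """Return the metabolites that appear in both pathways, sorted by name."""
    return sorted(pathway_a & pathway_b)


shared = shared_intermediates(METHIONINE_DEGRADATION, UMP_BIOSYNTHESIS)
print(shared)
```

An empty result only says that the two lists have no metabolite in common; it does not rule out indirect links through regulation or shared cofactors.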

Figure 1. Methionine degradation

Figure 2. Uracil biosynthesis


Systems metabolic engineering, which integrates systems biology, synthetic biology, and evolutionary engineering with traditional metabolic engineering, is facilitating the development of high-performance strains.

More diverse microorganisms are being used as production host strains, supported by the new genetic tools and strategies.

Recent advances in biosynthetic/semisynthetic design strategies are expanding the portfolio of products that can be produced biologically.

Evolutionary engineering tools and strategies are facilitating the improvement of strain and enzyme performances.

Advances in tools and strategies of omics, in silico metabolic simulation, genetic and genomic engineering, and high-throughput screening are accelerating optimization of metabolic fluxes for the enhanced production of target bioproducts.

Metabolic engineering allows development of microbial strains efficiently producing chemicals and materials, but it requires much time, effort, and cost to make the strains industrially competitive. Systems metabolic engineering, which integrates tools and strategies of systems biology, synthetic biology, and evolutionary engineering with traditional metabolic engineering, has recently been used to facilitate development of high-performance strains. The past decade has witnessed this interdisciplinary strategy continuously being improved toward the development of industrially competitive overproducer strains. In this article, current trends in systems metabolic engineering including tools and strategies are reviewed, focusing on recent developments in selection of host strains, metabolic pathway reconstruction, tolerance enhancement, and metabolic flux optimization. Also, future challenges and prospects are discussed.


In its most comprehensive definition, biochemistry can be seen as a study of the components and composition of living things and how they come together to become life. In this sense, the history of biochemistry may go back as far as the ancient Greeks. [10] However, biochemistry as a specific scientific discipline began sometime in the 19th century, or a little earlier, depending on which aspect of biochemistry is being focused on. Some argue that the beginning of biochemistry may have been the discovery of the first enzyme, diastase (now called amylase), in 1833 by Anselme Payen, [11] while others consider Eduard Buchner's first demonstration of a complex biochemical process, alcoholic fermentation in cell-free extracts, in 1897 to be the birth of biochemistry. [12] [13] [14] Some might also point as its beginning to the influential 1842 work by Justus von Liebig, Animal chemistry, or, Organic chemistry in its applications to physiology and pathology, which presented a chemical theory of metabolism, [10] or even earlier to the 18th century studies on fermentation and respiration by Antoine Lavoisier. [15] [16] Many other pioneers in the field who helped to uncover the layers of complexity of biochemistry have been proclaimed founders of modern biochemistry. Emil Fischer, who studied the chemistry of proteins, [17] and F. Gowland Hopkins, who studied enzymes and the dynamic nature of biochemistry, represent two examples of early biochemists. [18]

The term "biochemistry" itself is derived from a combination of biology and chemistry. In 1877, Felix Hoppe-Seyler used the term (biochemie in German) as a synonym for physiological chemistry in the foreword to the first issue of Zeitschrift für Physiologische Chemie (Journal of Physiological Chemistry) where he argued for the setting up of institutes dedicated to this field of study. [19] [20] The German chemist Carl Neuberg however is often cited to have coined the word in 1903, [21] [22] [23] while some credited it to Franz Hofmeister. [24]

It was once generally believed that life and its materials had some essential property or substance (often referred to as the "vital principle") distinct from any found in non-living matter, and it was thought that only living beings could produce the molecules of life. [26] In 1828, Friedrich Wöhler published a paper on his serendipitous urea synthesis from potassium cyanate and ammonium sulfate; some regarded that as a direct overthrow of vitalism and the establishment of organic chemistry. [27] [28] However, the Wöhler synthesis has sparked controversy, as some reject the death of vitalism at his hands. [29] Since then, biochemistry has advanced, especially since the mid-20th century, with the development of new techniques such as chromatography, X-ray diffraction, dual polarisation interferometry, NMR spectroscopy, radioisotopic labeling, electron microscopy and molecular dynamics simulations. These techniques allowed for the discovery and detailed analysis of many molecules and metabolic pathways of the cell, such as glycolysis and the Krebs cycle (citric acid cycle), and led to an understanding of biochemistry on a molecular level.

Another significant historic event in biochemistry is the discovery of the gene, and its role in the transfer of information in the cell. In the 1950s, James D. Watson, Francis Crick, Rosalind Franklin and Maurice Wilkins were instrumental in solving DNA structure and suggesting its relationship with the genetic transfer of information. [30] In 1958, George Beadle and Edward Tatum received the Nobel Prize for work in fungi showing that one gene produces one enzyme. [31] In 1988, Colin Pitchfork was the first person convicted of murder with DNA evidence, which led to the growth of forensic science. [32] More recently, Andrew Z. Fire and Craig C. Mello received the 2006 Nobel Prize for discovering the role of RNA interference (RNAi), in the silencing of gene expression. [33]

Around two dozen chemical elements are essential to various kinds of biological life. Most rare elements on Earth are not needed by life (exceptions being selenium and iodine), [34] while a few common ones (aluminum and titanium) are not used. Most organisms share element needs, but there are a few differences between plants and animals. For example, ocean algae use bromine, but land plants and animals seem to need none. All animals require sodium, but some plants do not. Plants need boron and silicon, but animals may not (or may need ultra-small amounts).

Just six elements—carbon, hydrogen, nitrogen, oxygen, calcium and phosphorus—make up almost 99% of the mass of living cells, including those in the human body (see composition of the human body for a complete list). In addition to the six major elements that compose most of the human body, humans require smaller amounts of possibly 18 more. [35]

The four main classes of molecules in biochemistry (often called biomolecules) are carbohydrates, lipids, proteins, and nucleic acids. [36] Many biological molecules are polymers: in this terminology, monomers are relatively small molecules that are linked together to create large macromolecules known as polymers. When monomers are linked together to synthesize a biological polymer, they undergo a process called dehydration synthesis. Different macromolecules can assemble in larger complexes, often needed for biological activity.

Carbohydrates

Two of the main functions of carbohydrates are energy storage and providing structure. One of the common sugars, glucose, is a carbohydrate, but not all carbohydrates are sugars. There are more carbohydrates on Earth than any other known type of biomolecule; they are used to store energy and genetic information, and they play important roles in cell-to-cell interactions and communications.

The simplest type of carbohydrate is a monosaccharide, which among other properties contains carbon, hydrogen, and oxygen, mostly in a ratio of 1:2:1 (generalized formula CnH2nOn, where n is at least 3). Glucose (C6H12O6) is one of the most important carbohydrates; others include fructose (C6H12O6), the sugar commonly associated with the sweet taste of fruits, [37] [a] and deoxyribose (C5H10O4), a component of DNA. A monosaccharide can switch between an acyclic (open-chain) form and a cyclic form. The open-chain form can be turned into a ring of carbon atoms bridged by an oxygen atom created from the carbonyl group of one end and the hydroxyl group of another. The cyclic molecule has a hemiacetal or hemiketal group, depending on whether the linear form was an aldose or a ketose. [38]
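The generalized formula CnH2nOn can be turned into a small calculation. The sketch below is illustrative only; the function names and the rounded atomic masses are assumptions introduced here.

```python
# Rounded standard atomic masses in g/mol (assumed values for illustration).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def monosaccharide_formula(n):
    """Element counts for the generalized formula CnH2nOn (n must be >= 3)."""
    if n < 3:
        raise ValueError("a monosaccharide has at least 3 carbon atoms")
    return {"C": n, "H": 2 * n, "O": n}


def formula_string(formula):
    """Render element counts as a plain-text formula such as C6H12O6."""
    return "".join(f"{element}{formula[element]}" for element in ("C", "H", "O"))


def molar_mass(formula):
    """Sum of atomic masses weighted by element counts, in g/mol."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())


hexose = monosaccharide_formula(6)
print(formula_string(hexose), round(molar_mass(hexose), 3))
```

Glucose and fructose both match the n = 6 case. Deoxyribose (C5H10O4) does not fit the formula because it lacks one oxygen, which is why the ratio is described as holding "mostly".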

In these cyclic forms, the ring usually has 5 or 6 atoms. These forms are called furanoses and pyranoses, respectively, by analogy with furan and pyran, the simplest compounds with the same carbon-oxygen ring (although they lack the carbon-carbon double bonds of these two molecules). For example, the aldohexose glucose may form a hemiacetal linkage between the hydroxyl on carbon 1 and the oxygen on carbon 4, yielding a molecule with a 5-membered ring, called glucofuranose. The same reaction can take place between carbons 1 and 5 to form a molecule with a 6-membered ring, called glucopyranose. Cyclic forms with a 7-atom ring, called septanoses, are rare.

Two monosaccharides can be joined together by a glycosidic or ether bond into a disaccharide through a dehydration reaction during which a molecule of water is released. The reverse reaction in which the glycosidic bond of a disaccharide is broken into two monosaccharides is termed hydrolysis. The best-known disaccharide is sucrose or ordinary sugar, which consists of a glucose molecule and a fructose molecule joined together. Another important disaccharide is lactose found in milk, consisting of a glucose molecule and a galactose molecule. Lactose may be hydrolysed by lactase, and deficiency in this enzyme results in lactose intolerance.
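The bookkeeping of a dehydration reaction can be sketched as arithmetic on molecular formulas: joining k monomers in a chain releases k − 1 water molecules, and hydrolysis adds them back. The helper name and the dictionary representation are assumptions made for this example.

```python
from collections import Counter

WATER = {"H": 2, "O": 1}
HEXOSE = {"C": 6, "H": 12, "O": 6}  # glucose, fructose and galactose share this formula


def condense(*monomers):
    """Formula of a linear chain: sum of the monomers minus one H2O per bond."""
    total = Counter()
    for monomer in monomers:
        total.update(monomer)       # add the atoms of each monomer
    for _ in range(len(monomers) - 1):
        total.subtract(WATER)       # one water released per glycosidic bond
    return dict(total)


sucrose = condense(HEXOSE, HEXOSE)  # glucose + fructose
print(sucrose)
```

Both sucrose and lactose come out as C12H22O11 under this accounting, since each is built from two hexoses with the loss of one water molecule.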

When a few (around three to six) monosaccharides are joined, the product is called an oligosaccharide (oligo- meaning "few"). These molecules tend to be used as markers and signals, as well as having some other uses. [39] Many monosaccharides joined together form a polysaccharide. They can be joined together in one long linear chain, or they may be branched. Two of the most common polysaccharides are cellulose and glycogen, both consisting of repeating glucose monomers. Cellulose is an important structural component of plant cell walls, and glycogen is used as a form of energy storage in animals.

Sugar can be characterized by having reducing or non-reducing ends. A reducing end of a carbohydrate is a carbon atom that can be in equilibrium with the open-chain aldehyde (aldose) or keto form (ketose). If the joining of monomers takes place at such a carbon atom, the free hydroxy group of the pyranose or furanose form is exchanged with an OH-side-chain of another sugar, yielding a full acetal. This prevents opening of the chain to the aldehyde or keto form and renders the modified residue non-reducing. Lactose contains a reducing end at its glucose moiety, whereas the galactose moiety forms a full acetal with the C4-OH group of glucose. Saccharose does not have a reducing end because of full acetal formation between the aldehyde carbon of glucose (C1) and the keto carbon of fructose (C2).

Lipids

Lipids comprise a diverse range of molecules, and the term is to some extent a catchall for relatively water-insoluble or nonpolar compounds of biological origin, including waxes, fatty acids, fatty-acid derived phospholipids, sphingolipids, glycolipids, and terpenoids (e.g., retinoids and steroids). Some lipids are linear, open-chain aliphatic molecules, while others have ring structures. Some are aromatic (with a cyclic [ring] and planar [flat] structure) while others are not. Some are flexible, while others are rigid.

Lipids are usually made from one molecule of glycerol combined with other molecules. In triglycerides, the main group of bulk lipids, there is one molecule of glycerol and three fatty acids. Fatty acids are considered the monomer in that case, and may be saturated (no double bonds in the carbon chain) or unsaturated (one or more double bonds in the carbon chain).

Most lipids have some polar character in addition to being largely nonpolar. In general, the bulk of their structure is nonpolar or hydrophobic ("water-fearing"), meaning that it does not interact well with polar solvents like water. Another part of their structure is polar or hydrophilic ("water-loving") and will tend to associate with polar solvents like water. This makes them amphiphilic molecules (having both hydrophobic and hydrophilic portions). In the case of cholesterol, the polar group is a mere –OH (hydroxyl or alcohol). In the case of phospholipids, the polar groups are considerably larger and more polar, as described below.

Lipids are an integral part of our daily diet. Most oils and milk products that we use for cooking and eating like butter, cheese, ghee etc., are composed of fats. Vegetable oils are rich in various polyunsaturated fatty acids (PUFA). Lipid-containing foods undergo digestion within the body and are broken into fatty acids and glycerol, which are the final degradation products of fats and lipids. Lipids, especially phospholipids, are also used in various pharmaceutical products, either as co-solubilisers (e.g., in parenteral infusions) or else as drug carrier components (e.g., in a liposome or transfersome).

Proteins

Proteins are very large molecules (macro-biopolymers) made from monomers called amino acids. An amino acid consists of an alpha carbon atom attached to an amino group, –NH2, a carboxylic acid group, –COOH (although these exist as –NH3+ and –COO− under physiologic conditions), a simple hydrogen atom, and a side chain commonly denoted as "–R". The side chain "R" is different for each amino acid, of which there are 20 standard ones. It is this "R" group that makes each amino acid different, and the properties of the side chains greatly influence the overall three-dimensional conformation of a protein. Some amino acids have functions by themselves or in a modified form; for instance, glutamate functions as an important neurotransmitter. Amino acids can be joined via a peptide bond. In this dehydration synthesis, a water molecule is removed and the peptide bond connects the nitrogen of one amino acid's amino group to the carbon of the other's carboxylic acid group. The resulting molecule is called a dipeptide, and short stretches of amino acids (usually, fewer than thirty) are called peptides or polypeptides. Longer stretches merit the title proteins. As an example, the important blood serum protein albumin contains 585 amino acid residues. [42]

Proteins can have structural and/or functional roles. For instance, movements of the proteins actin and myosin ultimately are responsible for the contraction of skeletal muscle. One property many proteins have is that they specifically bind to a certain molecule or class of molecules—they may be extremely selective in what they bind. Antibodies are an example of proteins that attach to one specific type of molecule. Antibodies are composed of heavy and light chains. Two heavy chains would be linked to two light chains through disulfide linkages between their amino acids. Antibodies are specific through variation based on differences in the N-terminal domain. [43]

The enzyme-linked immunosorbent assay (ELISA), which uses antibodies, is one of the most sensitive tests modern medicine uses to detect various biomolecules. Probably the most important proteins, however, are the enzymes. Virtually every reaction in a living cell requires an enzyme to lower the activation energy of the reaction. [12] These molecules recognize specific reactant molecules called substrates; they then catalyze the reaction between them. By lowering the activation energy, the enzyme speeds up that reaction by a factor of 10^11 or more; [12] a reaction that would normally take over 3,000 years to complete spontaneously might take less than a second with an enzyme. The enzyme itself is not used up in the process and is free to catalyze the same reaction with a new set of substrates. Using various modifiers, the activity of the enzyme can be regulated, enabling control of the biochemistry of the cell as a whole. [12]
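The "3,000 years versus under a second" comparison follows directly from the 10^11-fold rate enhancement. A minimal worked calculation, assuming a year of 365.25 days and treating the enhancement as a simple divisor of the reaction time:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length for this estimate


def catalyzed_time_seconds(uncatalyzed_years, rate_enhancement):
    """Time the catalyzed reaction takes if the rate rises by the given factor."""
    return uncatalyzed_years * SECONDS_PER_YEAR / rate_enhancement


t = catalyzed_time_seconds(3000, 1e11)
print(f"{t:.2f} s")
```

Three thousand years is roughly 9.5 × 10^10 seconds, so dividing by 10^11 gives a little under one second, consistent with the figure quoted in the text.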

The structure of proteins is traditionally described in a hierarchy of four levels. The primary structure of a protein consists of its linear sequence of amino acids; for instance, "alanine-glycine-tryptophan-serine-glutamate-asparagine-glycine-lysine-…". Secondary structure is concerned with local morphology (morphology being the study of structure). Some combinations of amino acids will tend to curl up in a coil called an α-helix or into a sheet called a β-sheet; some α-helixes can be seen in the hemoglobin schematic above. Tertiary structure is the entire three-dimensional shape of the protein. This shape is determined by the sequence of amino acids. In fact, a single change can change the entire structure. The beta chain of hemoglobin contains 146 amino acid residues; substitution of the glutamate residue at position 6 with a valine residue changes the behavior of hemoglobin so much that it results in sickle-cell disease. Finally, quaternary structure is concerned with the structure of a protein with multiple peptide subunits, like hemoglobin with its four subunits. Not all proteins have more than one subunit. [44]

Ingested proteins are usually broken up into single amino acids or dipeptides in the small intestine and then absorbed. They can then be joined to form new proteins. Intermediate products of glycolysis, the citric acid cycle, and the pentose phosphate pathway can be used to form all twenty amino acids, and most bacteria and plants possess all the necessary enzymes to synthesize them. Humans and other mammals, however, can synthesize only half of them. They cannot synthesize isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, and valine. Because they must be ingested, these are the essential amino acids. Mammals do possess the enzymes to synthesize alanine, asparagine, aspartate, cysteine, glutamate, glutamine, glycine, proline, serine, and tyrosine, the nonessential amino acids. While they can synthesize arginine and histidine, they cannot produce them in sufficient amounts for young, growing animals, and so these are often considered essential amino acids.

If the amino group is removed from an amino acid, it leaves behind a carbon skeleton called an α-keto acid. Enzymes called transaminases can easily transfer the amino group from one amino acid (making it an α-keto acid) to another α-keto acid (making it an amino acid). This is important in the biosynthesis of amino acids, as for many of the pathways, intermediates from other biochemical pathways are converted to the α-keto acid skeleton, and then an amino group is added, often via transamination. The amino acids may then be linked together to form a protein.

A similar process is used to break down proteins. A protein is first hydrolyzed into its component amino acids. Free ammonia (NH3), existing as the ammonium ion (NH4+) in blood, is toxic to life forms. A suitable method for excreting it must therefore exist. Different tactics have evolved in different animals, depending on the animals' needs. Unicellular organisms simply release the ammonia into the environment. Likewise, bony fish can release the ammonia into the water, where it is quickly diluted. In general, mammals convert the ammonia into urea via the urea cycle.

In order to determine whether two proteins are related, or in other words to decide whether they are homologous or not, scientists use sequence-comparison methods. Methods like sequence alignments and structural alignments are powerful tools that help scientists identify homologies between related molecules. The relevance of finding homologies among proteins goes beyond forming an evolutionary pattern of protein families. By finding how similar two protein sequences are, we acquire knowledge about their structure and therefore their function.

Nucleic acids

Nucleic acids, so-called because of their prevalence in cellular nuclei, is the generic name of the family of biopolymers. They are complex, high-molecular-weight biochemical macromolecules that can convey genetic information in all living cells and viruses. [2] The monomers are called nucleotides, and each consists of three components: a nitrogenous heterocyclic base (either a purine or a pyrimidine), a pentose sugar, and a phosphate group. [45]

The most common nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). The phosphate group and the sugar of each nucleotide bond with each other to form the backbone of the nucleic acid, while the sequence of nitrogenous bases stores the information. The most common nitrogenous bases are adenine, cytosine, guanine, thymine, and uracil. The nitrogenous bases of each strand of a nucleic acid will form hydrogen bonds with certain other nitrogenous bases in a complementary strand of nucleic acid (similar to a zipper). Adenine binds with thymine and uracil, thymine binds only with adenine, and cytosine and guanine can bind only with one another. Adenine-thymine and adenine-uracil pairs each contain two hydrogen bonds, while cytosine-guanine pairs contain three.
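The pairing rules and hydrogen-bond counts above can be written as two lookup tables. This is a minimal sketch for DNA strands; the function names are introduced here for illustration, and the complement is returned position by position without reversing the strand.

```python
# Watson-Crick partners in DNA, and hydrogen bonds contributed by each base
# when paired (A-T and A-U pairs: 2 bonds, G-C pairs: 3 bonds).
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "U": 2, "G": 3, "C": 3}


def complement(strand):
    """Base-by-base complement of a DNA strand (not reversed)."""
    return "".join(DNA_COMPLEMENT[base] for base in strand)


def hydrogen_bonds(strand):
    """Total hydrogen bonds formed when the strand is fully base-paired."""
    return sum(HYDROGEN_BONDS[base] for base in strand)


print(complement("ATGC"), hydrogen_bonds("ATGC"))
```

Because G-C pairs carry three bonds rather than two, a duplex with more G and C bases has more hydrogen bonds for the same length.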

Aside from the genetic material of the cell, nucleic acids often play a role as second messengers, as well as forming the base molecule for adenosine triphosphate (ATP), the primary energy-carrier molecule found in all living organisms. Also, the nitrogenous bases possible in the two nucleic acids are different: adenine, cytosine, and guanine occur in both RNA and DNA, while thymine occurs only in DNA and uracil occurs in RNA.

Carbohydrates as energy source

Glucose is an energy source in most life forms. For instance, polysaccharides are broken down into their monomers by enzymes (glycogen phosphorylase removes glucose residues from glycogen, a polysaccharide). Disaccharides like lactose or sucrose are cleaved into their two component monosaccharides.


Cysteine (Cys), the primary sulfur-containing amino acid (SAA), is a semiessential amino acid (AA) because it can be obtained from the diet or produced from methionine degradation via the transsulfuration pathway. In the mammalian diet, cysteine is considered representative of SAAs (Bin, Huang, & Zhou, 2017). Cysteine belongs to a group of amino acids (AAs) with a polar, uncharged R group, which is more hydrophilic than the nonpolar side chains of other AAs. Cysteine undergoes oxidation at its thiol group (–SH), which can form a covalent bond by reacting with free radicals and other groups; for example, two cysteines can be linked by a disulfide bridge. This bridge is stronger than hydrogen bonds (H-bonds), Van der Waals forces, and salt bridges (bonds between electrically charged acidic and basic groups, especially on a protein) but weaker than peptide bonds. The most abundant form of cysteine in our body is L-cysteine. Cysteine is synthesized in our body from methionine (a sulfur-containing essential amino acid), which is abundant in cheese, yogurt, meat, chicken, turkey, wheat gums, beef, and nuts (Sameem, Khan, & Niaz, 2019).

Homocysteine (Hcy) is also a sulfur-containing amino acid, like cysteine and methionine. Hcy is an essential AAS with a molar mass of 135.18 g/mol and is formed during the conversion of methionine to cysteine. In humans, the only pathway for the biosynthesis of Hcy is from methionine (Ntaios, 2015). Hcy was discovered in 1932 by Butz and du Vigneaud when they heated methionine in sulfuric acid and obtained a substance with features similar to cysteine; they named it "homocysteine (Hcy)" because it was a homolog of cysteine (Tsiami & Obersby, 2017). Hcy obtained through the methionine cycle as an intermediate product is catabolized through the transsulfuration pathway into cysteine (Ostrakhovitch & Tabibzadeh, 2015). Hcy exists in protein-bound and free forms, and the sum of these two is referred to as total Hcy (tHcy). Hcy cannot be obtained from the diet, since it is produced in the body from methionine, which acts as its precursor (Tsiami & Obersby, 2017).

1.1 Metabolism of cysteine and HCY

Within the body, cysteine is synthesized in the liver from Hcy, which is produced by transmethylation of methionine. First, Hcy is condensed with serine by cystathionine β-synthase (CBS) to form cystathionine, and cleavage of cystathionine then produces cysteine. During transsulfuration, serine gives its carbon chain to cysteine, and the sulfur atom of cysteine comes from methionine. Within the body, cysteine catabolic pathways are sources for the synthesis of coenzyme A, glutathione, taurine, and oxidized and reduced inorganic sulfur. In the liver, two catabolic pathways of cysteine take place: the oxidative pathway and the desulfuration pathway. Briefly, in the oxidative pathway, cysteine sulfinate (an intermediate in cysteine metabolism) is either transaminated to produce sulfite and pyruvate or decarboxylated to form taurine. The desulfuration pathway ends up with hydrogen sulfide and pyruvate. If the supply of cysteine is high, the oxidative pathway predominates over the desulfuration pathway, and the desulfuration pathway increases when cysteine supply is low (Papet et al., 2019).

Metabolism of Hcy involves two main pathways, remethylation and transsulfuration. Remethylation requires a methyl group for the conversion of Hcy into methionine and is carried out by betaine–homocysteine methyltransferase (BHMT) in the kidney and liver. Transsulfuration involves the condensation of Hcy with serine to form cystathionine (a sulfur metabolite produced from Hcy) with the help of the enzyme CBS and vitamin B6, which acts as a coenzyme, to synthesize cysteine (Hannibal & Blom, 2017; Ntaios, 2015). Cystathionine is hydrolyzed by the enzyme cystathionine γ-lyase (CL) to form α-ketobutyrate and cysteine. Remethylation of Hcy can also occur through the folate cycle, in which the enzyme methionine synthase, with vitamin B12 as a cofactor, recycles Hcy into methionine. The above two pathways are controlled by S-adenosylmethionine (SAM), which acts as an activator of CBS. If the diet is rich in methionine, dietary methionine is converted into SAM; as a result, CBS activation increases and transsulfuration dominates over remethylation. On the other hand, if the diet is low in methionine, the SAM concentration is not enough to activate CBS, and remethylation of Hcy is promoted over transsulfuration (Tsiami & Obersby, 2017).

1.2 Circulation of HCY

Homocysteine (Hcy) is metabolized in the kidneys and liver, whereas transsulfuration takes place in the pancreas and small intestine. Of the low level of Hcy produced in the human body, almost 3% circulates freely, with the majority present in bound form with other molecules or in disulfide form (Rizzo & Sciorsci, 2018). Total plasma homocysteine (tHcy) is the sum of the circulating Hcy molecules in both reduced and oxidized forms. The majority of tHcy (about 98%–99%) is in disulfide form, oxidized rapidly by reacting with other molecules that contain a free thiol group, such as albumin (a protein containing a free cysteine), and the remainder exists in the reduced form. Circulation of Hcy in our body is regulated by the transsulfuration and remethylation pathways discussed above and by reabsorption in the kidney (Barroso, Handy, & Castro, 2017).

1.3 Normal concentration in the body

In human plasma, Hcy concentration is typically below 12–15 μM and the cysteine concentration is 240–360 μM (Wang et al., 2019). The Hcy level is higher in males than in females; this may be due to gender differences in Hcy metabolism and lower concentrations of vitamin B12 and folate in males. According to a population-based cross-sectional study, average Hcy concentrations were 12.6 μmol/L in men and 9.6 μmol/L in women, and they increase with age: 4.6–8.1 µmol/L at 0–30 years, 6.3–11.2 µmol/L in males and 4.5–7.9 µmol/L in females at 30–59 years, and 5.8–11.9 µmol/L above 59 years (Cohen, Margalit, Shochat, Goldberg, & Krause, 2019).

4. Metabolism of Trp

4.1. Trp Degradation Pathways and the Gateway Enzymes

70% identical in sequence, differing mainly in the regulatory domains, and are expressed in a tissue-specific manner [22,23]. While TPH1 predominates in peripheral tissues that express serotonin (a neurotransmitter; see later), such as the gastric system and skin, TPH2 is mostly expressed in neuronal cell types such as the central nervous system (CNS), specifically the brain. The difference in the regulatory domains likely allows them to have tissue-specific regulations. TPH is in fact a member of the amino acid hydroxylase superfamily that also comprises phenylalanine hydroxylase (PAH) and tyrosine hydroxylase (TH), all of which possess similar active sites and use the same cofactors, and thus, there is substantial overlap in their substrates along with preference. TPH, for example, hydroxylates both Trp and Phe with comparable kinetics; however, it hydroxylates Tyr at a 5000-fold slower rate [23]. The full implication and molecular mechanism of the substrate overlap may shed important light on the distribution and evolution of these enzymes, which have remained unresolved.

340-fold higher Km). It appears to have an accessory role in IDO1-mediated immune regulation and in inflammation [27,28]. In most literature, and in this review, the term IDO is to be considered synonymous to IDO1.

4.2. Secondary Metabolites of Trp

4.2.1. Metabolites of the Serotonin Pathway

16 h of daytime activity and 8 h of nightly sleep [39]. It is generally safe to use and is used to treat insomnia, jet lag and various sleep disorders, thrombocytopenia (chemo-induced), ‘winter blues’ and seasonal affective disorder (SAD), and tardive dyskinesia [40,41,42,43]. Melatonin also has some immune-regulatory and anticancer effects, but these effects need further studies and validation [44,45]. Melatonin is also produced synthetically and is freely available as an OTC dietary supplement.

4.2.2. Metabolites of the Kynurenine Pathway

4.2.3. Tryptophol and Related Indole Derivatives

4.2.4. Inhibition of Gluconeogenesis by Trp

Materials and methods

RNA isolation and transcriptome sequencing

Axenic cultures of Rhynchopus humris strain YPF1608 and Sulcionema specki strain YPF1618 were recently generated [18]. Hemistasia phaeocysticola strain YPF1303 was provided by Akinori Yabuki (JAMSTEC, Yokosuka, Japan). An axenic culture of Trypanoplasma borreli strain Tt-JH was isolated from a tench (Tinca tinca) [176] and kindly provided by Hanka Pecková (Institute of Parasitology). The RNA from three diplonemid species was isolated using the Nucleospin RNA isolation kit (Macherey Nagel). The transcriptomic libraries of the diplonemids H. phaeocysticola (Hemistasiidae), R. humris, and S. specki (Diplonemidae) and the kinetoplastid T. borreli (Parabodonida) were prepared and sequenced on the Illumina HiSeq 4000 platform using the standard TruSeq protocol, resulting in 51 million paired-end unprocessed reads of 100 nt in length, respectively.

Clonal cultures of the free-living eukaryovorous Prokinetoplastina strains PhM-4 and PhF-6 were isolated from brackish waters of Turkey and freshwaters of Vietnam, respectively. Total RNA was extracted using an RNAqueous-Micro Kit (Invitrogen, Cat. No. AM1931) and converted into cDNA using the Smart-Seq2 protocol [177]. Transcriptome sequencing was performed on the Illumina HiSeq 2500 platform with read lengths of 100 bp using the KAPA stranded RNA-seq kit (Roche) to construct paired-end libraries.

Assembling the collection of transcriptomes and genomes

Transcriptomic reads of H. phaeocysticola, R. humris, S. specki, and T. borreli were subjected to adapter and quality trimming using Trimmomatic v.0.36 [178] with the following settings: maximal mismatch count, 2; palindrome clip threshold, 20; simple clip threshold, 10; minimal quality required to keep a base, 3; window size, 4; required quality, 15; and minimal length of reads to be kept, 75 nt. Transcriptome assemblies were generated using Trinity v.2.2.0 with minimal contig length set to 200 nt, with the “normalize_max_read_cov” option set to 50 for R. humris, and with the other parameters set at the default values [179].
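The trimming settings listed above can be written as a list of Trimmomatic steps. The following is a minimal sketch only: the mapping of each setting to a step name (in particular, applying the minimal base quality of 3 to both LEADING and TRAILING) is our assumption, and the adapter file name `adapters.fa` is hypothetical.

```python
def trimmomatic_steps(adapters="adapters.fa"):
    """Return Trimmomatic trimming steps matching the settings in the text.

    Assumptions: the step names chosen for each setting, and the
    hypothetical adapter FASTA file name passed as `adapters`.
    """
    return [
        # seed mismatches : palindrome clip threshold : simple clip threshold
        f"ILLUMINACLIP:{adapters}:2:20:10",
        "LEADING:3",           # minimal quality required to keep a leading base
        "TRAILING:3",          # minimal quality required to keep a trailing base
        "SLIDINGWINDOW:4:15",  # window size : required average quality
        "MINLEN:75",           # minimal length of reads to be kept (nt)
    ]


if __name__ == "__main__":
    print(" ".join(trimmomatic_steps()))
```

The returned strings would be appended to a Trimmomatic paired-end command line after the input and output file arguments.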

Transcriptomic reads of PhM-4 and PhF-6 were quality trimmed with Trimmomatic-0.32 [178] with a maximum of two mismatches allowed, a sliding window size of 4 and minimum quality of 20, and a minimum length of 35. Trinity version 2.0.6 was used to assemble the dataset, using default values [179]. Transcriptome assembly steps were done in conjunction with an extensive prey sequence decontamination process (below).

The transcriptome libraries of Rhabdomonas costata strain PANT2 (Euglenida) were prepared from 4 μg of total RNA according to the standard TruSeq Stranded mRNA Sample Preparation Guide. Libraries were sequenced on an Illumina MiSeq instrument (Illumina, San Diego, CA, USA) using 150 base-length read chemistry in a paired-end mode. Reads were assembled by Trinity v2.0.6 into 93,852 contigs.

The assembled transcriptomes of Neobodo designis (Kinetoplastea, Neobodonida) and Eutreptiella gymnastica (Euglenida) were downloaded from the Marine Microbial Eukaryote Transcriptome Sequencing Project database (MMETSP) [11]. We used the transcriptome assembly of Euglena gracilis strain Z generated by Ebenezer et al. and that of Azumiobodo hoyamushi generated by Yazaki and colleagues [15, 180]. Redundant transcripts were filtered out from all the transcriptome assemblies using the CD-HIT-EST software v.4.6.7 [181] with a sequence identity threshold of 90%. Prediction of coding regions within transcripts was performed using Transdecoder v.3.0.0 [182] under the default settings, and the resulting files with protein sequences were used for further analyses. Completeness of the transcriptome and genome assemblies was assessed using the BUSCO v.3 software [53] and the “eukaryota_odb9” database containing a set of 303 universal eukaryotic single-copy orthologs.

Reference genome and transcriptome assemblies and sets of annotated proteins were downloaded from publicly available sources listed in Additional file 1: Table S1. For bodonids (i.e., Prokinetoplastina, Neo-, Para-, and Eubodonida), all genomes and transcriptomes publicly available at the time of the manuscript preparation were used. For trypanosomatids, five representative genome sequences were selected, two belonging to distantly related monoxenous (i.e., one-host) species (P. confusum and L. pyrrhocoris) and three to dixenous organisms (T. brucei, T. grayi, and L. major), which switch between two hosts in their life cycles. Recently, T. grayi from crocodiles and P. confusum parasitizing mosquitoes were demonstrated to be slowly evolving trypanosomatids, preserving the highest number of ancestral genes [48]. L. major and L. pyrrhocoris, belonging to the subfamily Leishmaniinae, are characterized by different lifestyles [183]. T. brucei and L. major are among the most extensively studied trypanosomatids and have high-quality genome assemblies and annotations available. The latter is also true for L. pyrrhocoris [51].

Decontamination of the R. costata, N. designis, and Prokinetoplastina spp. transcriptomes

The culture of R. costata was non-axenic, and accordingly, the presence of transcripts belonging to contaminating species was detected using a BLASTN search against the SILVA database with an E value cut-off of 10⁻²⁰ [184]. The best-scoring contaminants represented β- and γ-proteobacterial small-subunit (SSU) rRNA sequences. The following decontamination procedure was applied to remove the bacterial sequences: (i) a BLASTX search against the NCBI nr database was run using R. costata transcripts as queries with an E value cut-off of 10⁻²⁰; (ii) the BLAST results were sorted according to the bitscore and only the 20 best hits were retained for each R. costata query sequence; (iii) the best-scoring hits were annotated as “bacterial”, “eukaryotic”, or “other”; and (iv) transcript sequences were considered to be of bacterial origin and excluded from further analyses if more than 60% of the best hits were bacterial according to the classification at the previous step. This decontamination procedure, followed by prediction of coding regions within the transcripts of non-bacterial origin, produced a dataset of 36,019 protein sequences, with 3679 proteins removed as bacterial contaminants.
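The decision rule in steps (ii) to (iv) can be sketched as follows. This is an illustration under stated assumptions, not the script used in the study: the function name and input format are hypothetical, and the handling of transcripts without any hits (kept) is our assumption.

```python
def is_bacterial(hit_labels, max_hits=20, threshold=0.60):
    """Decide whether a transcript is treated as a bacterial contaminant.

    hit_labels: labels ("bacterial", "eukaryotic", "other") of the BLASTX
    hits for one transcript, already sorted by descending bitscore.
    """
    top = hit_labels[:max_hits]  # keep only the 20 best hits
    if not top:
        return False  # assumption: transcripts without hits are retained
    fraction = sum(1 for label in top if label == "bacterial") / len(top)
    return fraction > threshold  # "more than 60%" of best hits bacterial


if __name__ == "__main__":
    example = ["bacterial"] * 13 + ["eukaryotic"] * 7
    print(is_bacterial(example))
```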

A BLASTN search against the SILVA database using N. designis transcripts as queries with an E value cut-off of 10⁻²⁰ revealed the presence of SSU rRNA sequences belonging only to a γ-proteobacterium of the genus Alteromonas. Since no other contaminants were identified, we downloaded all available genomes of Alteromonas spp. from the NCBI database and used them as a database for filtering out putative bacterial sequences from the N. designis transcriptome using BLASTN with an E value cut-off of 10⁻⁵. The contamination level was low, and this procedure resulted in removal of just 22 putative bacterial contigs from the transcriptome assembly.

As PhM-4 and PhF-6 are grown with the bodonids Procryptobia sorokini and Parabodo caudatus as prey, respectively, we minimized contamination of the PhM-4 and PhF-6 datasets through an extensive bioinformatic decontamination procedure. This included decontamination steps both before and after assembly of the PhM-4 and PhF-6 datasets. Before assembly of PhM-4 and PhF-6, we assembled 2 × 300 bp PE transcriptome reads from monoeukaryotic P. sorokini and P. caudatus prey cultures, along with 100 bp PE HiSeq 2000/2500 datasets derived from previously published datasets [185] in which other species preyed upon either P. sorokini or P. caudatus (i.e., cultures that were heavily contaminated by the same prey species). RNA-seq reads from PhM-4 and PhF-6 datasets were mapped to the assemblies containing P. sorokini or P. caudatus contigs, respectively, using Bowtie2 version 2.1.0 [186]. Reads that mapped to the prey assemblies (along with their mates, if only one read mapped) were discarded. The resulting unmapped reads were used to generate crude PhF-6 and PhM-4 transcriptome assemblies. To identify further prey-derived contamination, we used crude PhF-6 and PhM-4 assemblies to query the assembled transcriptomes of either P. caudatus or P. sorokini via megablast version 2.2.30 [187]. We considered a contig as a putative contaminant if it was ≥ 95% identical to sequences in the prey assemblies over a span of at least 75 bp. In the case of PhF-6, which was more extensively contaminated by prey than PhM-4, we added an additional step of mapping raw Illumina HiSeq2000 and MiSeq reads containing P. caudatus to the PhF-6 assembly; contigs with mapped reads were discarded. Potential cross-contamination from species multiplexed on the same HiSeq 2500 run was removed using the script from the BBMap package [188], with the options minc = 3, minp = 20, minr = 15, and minl = 350.
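The post-assembly contaminant criterion (≥ 95% identity over at least 75 bp) can be illustrated with a short filter over megablast hits. This is a sketch only; the function name and the representation of hits as (percent identity, alignment length) pairs are assumptions made for illustration.

```python
def is_prey_contaminant(hits, min_identity=95.0, min_span=75):
    """Flag a contig as a putative prey-derived contaminant.

    hits: iterable of (percent_identity, alignment_length_bp) tuples, one
    per megablast hit of the contig against the prey assemblies.
    """
    return any(
        identity >= min_identity and span >= min_span
        for identity, span in hits
    )


if __name__ == "__main__":
    # One short high-identity hit and one long lower-identity hit:
    # neither satisfies both conditions at once.
    print(is_prey_contaminant([(99.0, 60), (93.5, 400)]))
```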

Gene family inference and phylogenetic tree construction

Orthologous groups (OGs) containing proteins from 19 species (Additional file 1: Table S1) were inferred using OrthoFinder v.1.1.8 [189] under default settings. The heterolobosean Naegleria gruberi was used as an outgroup. For phylogenetic tree construction, OGs containing only one protein in each species were analyzed (52 OGs in total). Protein sequences of R. costata were additionally compared against the NCBI nr database with a relaxed E value cut-off of 10⁻¹⁰ in order to exclude any sequences of potential bacterial origin that were not filtered out as described in the previous section with a more stringent E value cut-off of 10⁻²⁰, but no contaminating sequences were identified. Inferred amino acid sequences of each gene were aligned using the L-INS-i algorithm in MAFFT v.7.310 [190]. The average percent identity within each OG was calculated using the alistat script from the HMMER package v.3.1 [77]. Twenty OGs demonstrating average percent identity within the group of > 50% were used for the phylogenomic analysis. The percent identity threshold was applied since our previous experience with euglenozoan phylogenomics [51, 191] shows that excluding highly divergent sequences improves the resolution of both maximum-likelihood and Bayesian trees. The protein alignments were trimmed using Gblocks v.0.91b with relaxed parameters (-b3 = 8, -b4 = 2, -b5 = h) and then concatenated, producing an alignment containing 6371 characters. A maximum-likelihood tree was inferred using IQ-TREE v.1.5.3 with the LG+F+I+G4 model and 1000 bootstrap replicates [192, 193]. A Bayesian phylogenetic tree was constructed using PhyloBayes-MPI v.1.7b [194] under the GTR-CAT model with four discrete gamma categories. Four independent Markov Chain Monte Carlo chains were run for 8000 cycles, and all chains converged on the topology shown in Fig. 1. The initial 20% of cycles were discarded as a burn-in, and sampling every 5 cycles was used for inference of the final consensus tree visualized using FigTree v.1.4.3 [195].
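As a worked illustration of the burn-in and thinning settings, the number of samples contributing to the consensus tree can be estimated per chain. This sketch assumes that exactly 8000 cycles were run and that sampling starts at the first cycle after the burn-in; both are our assumptions.

```python
def retained_samples(n_cycles=8000, burn_in_percent=20, every=5):
    """Count the cycles kept per chain after burn-in removal and thinning."""
    burn_in = n_cycles * burn_in_percent // 100  # cycles discarded as burn-in
    return len(range(burn_in, n_cycles, every))  # every 5th remaining cycle


if __name__ == "__main__":
    per_chain = retained_samples()
    print(per_chain, 4 * per_chain)  # per chain, and summed over four chains
```

Under these assumptions, 1600 cycles are discarded and 1280 samples per chain are retained.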

Analysis of metabolic pathways

For the analysis of metabolic capacities, an automatic assignment of KEGG Orthology (KO) identifiers to the proteins of the species of interest (Additional file 1: Table S1) was conducted using BlastKOALA v.2.1 [55]. The search was performed against a non-redundant pangenomic database of prokaryotes at the genus level and eukaryotes at the family level. KEGG Mapper v.2.8 was used for reconstruction of metabolic pathways and their comparison [196]. An enzyme was considered to be present in a particular group (diplonemids, euglenids, or kinetoplastids) if it was identified in at least two organisms belonging to that group (or in one species in the case of Prokinetoplastina). In certain cases, to verify the original functional annotations, additional BLAST and/or hidden Markov model-based (HMM) searches were performed with E value cut-offs of 10⁻²⁰ and 10⁻⁵, respectively, unless other parameters are specified. The number of metabolic proteins reported for a species is equal to the number of unique KO identifiers falling into the KEGG category “metabolism” assigned to the proteins encoded in the genome/transcriptome of that species. The term “metabolic proteins” is used herein to refer to the proteins belonging to the KEGG category “metabolism.” The analysis of protein sharing was performed using the UpSetR package [197]. The unpaired t test was applied when necessary to test the statistical significance of the observed differences in the average number of unique KEGG identifiers across species groups.
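The presence rule for an enzyme within a group can be sketched as below. The data structures (a mapping from species to sets of KO identifiers, and from groups to member species), the function name, and the KO identifiers in the example are hypothetical and serve only as an illustration.

```python
def enzyme_present(ko, group, species_kos, group_members):
    """Apply the group-level presence rule to one KO identifier.

    species_kos: dict mapping species name -> set of KO identifiers.
    group_members: dict mapping group name -> list of species names.
    """
    # One species suffices for Prokinetoplastina; two are required otherwise.
    required = 1 if group == "Prokinetoplastina" else 2
    carriers = sum(
        1 for species in group_members[group]
        if ko in species_kos.get(species, set())
    )
    return carriers >= required


if __name__ == "__main__":
    members = {"diplonemids": ["sp_A", "sp_B", "sp_C"]}  # hypothetical names
    kos = {"sp_A": {"K00001"}, "sp_B": {"K00001", "K00002"}, "sp_C": set()}
    print(enzyme_present("K00001", "diplonemids", kos, members))
```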

For the comparison of metabolic capabilities of euglenozoans with those of other protists, high-quality genome assemblies of 16 free-living heterotrophic and 17 parasitic/symbiotic organisms were downloaded from the NCBI Genomes database (Additional file 1: Table S2). Assemblies demonstrating BUSCO coverage more than 75% for free-living species and 45% for parasites and symbionts were considered of high quality and analyzed using BlastKOALA v.2.1 as described for euglenozoans. A shared loss of a metabolic protein in kinetoplastids and ciliates was inferred if a protein was absent in both groups, while being present in at least three species of the free-living heterotrophic protists from other groups listed in Additional file 1: Table S2.

Species clustering using the Uniform Manifold Approximation and Projection algorithm

Uniform Manifold Approximation and Projection (UMAP) is a novel general-purpose non-linear algorithm for dimensionality reduction [60]. The UMAP algorithm implemented in the uwot v0.1.3 R package [60] was applied to pairwise distances between 2181-dimensional vectors (presence/absence data for metabolic KO identifiers) for 19 species. First, we tried to find optimal values of key UMAP parameters that are suitable for recovering both local and global structure. The following setting combinations were tested: (1) the Euclidean or Hamming distance metrics, (2) number of nearest neighbors from 2 to 18, and (3) for each number of nearest neighbors, minimal distance between points in the 2D embedding was varied from 0 to 0.9 in 0.1 increments. The Euclidean and Hamming distance metrics yielded similar results, and the latter was selected as more appropriate for binary data. After inspecting all the resulting 2D embeddings, 3 was selected as the optimal number of nearest neighbors and 0 as the optimal minimal distance. Next, we ran 20 iterations of the algorithm with different random seeds generating both 2D and 3D embeddings of the multidimensional data structure. This was done to check whether the clustering remains stable across iterations. Results of 10 iterations are shown for both 2D (Additional file 6: Fig. S5) and 3D embeddings (Additional file 7: Fig. S6). The latter embeddings were visualized using the plot3D R package.
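The parameter sweep and the distance metric described above can be illustrated without calling any UMAP implementation. The sketch below enumerates the tested setting combinations, assuming the three settings were fully crossed, and computes a normalized Hamming distance between two presence/absence vectors; the function names are ours.

```python
from itertools import product


def hamming(u, v):
    """Fraction of positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def parameter_grid():
    """Enumerate (metric, n_neighbors, min_dist) combinations from the text."""
    metrics = ["euclidean", "hamming"]
    n_neighbors = range(2, 19)               # 2 to 18 nearest neighbors
    min_dists = [i / 10 for i in range(10)]  # 0 to 0.9 in 0.1 increments
    return list(product(metrics, n_neighbors, min_dists))


if __name__ == "__main__":
    print(len(parameter_grid()))
    print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))
```

Under the full-crossing assumption, this gives 2 × 17 × 10 = 340 embeddings to inspect.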

Fatty acid biosynthesis

For the analysis of the elongase repertoire, four proteins of T. brucei (TbELO1–4) described by Lee et al. [106] were used as queries in a BLASTP search with an E value cut-off of 10⁻²⁰ against the euglenozoan protein database. Phylogenetic trees were reconstructed using IQ-TREE with automatic model selection and 1000 bootstrap replicates for two datasets: (i) euglenozoan proteins only and (ii) euglenozoan sequences along with functionally characterized elongases from several other organisms (Additional file 14: File S1; Additional file 15: File S2) [109, 198,199,200,201]. For the identification of fatty acid synthase (FAS) I and II, proteins of Saccharomyces cerevisiae and Homo sapiens were used as queries with an E value cut-off of 10⁻¹⁰ [202, 203]. The FAS I enzyme was considered to be present if at least three functional domains were identified on the same transcript.

Analysis of trypanothione metabolism

Genes encoding the enzymes of the trypanothione biosynthetic pathway were considered to be present in a genome or transcriptome when the following conditions were fulfilled: (i) a protein could be identified by BLAST with an E value cut-off of 10⁻²⁰ and/or a corresponding KEGG ID was assigned to a protein and (ii) p-distances between a reference protein and a putative hit calculated using MEGA v.7 did not exceed 0.7 or a different threshold specified in Additional file 13: Tables S41-S51 [204]. Additionally, the presence of a splice leader (SL) sequence was checked in the case of transcriptomic data, requiring a match with a minimal length of 12 nt. When a protein of interest could not be identified among predicted proteins, additional BLAST searches with raw transcriptome/genome sequences as a database were performed using an E value threshold of 10⁻¹⁰. For glutathionylspermidine (GspS) and trypanothione synthetases (TryS), as well as trypanothione (TR), glutathione (GR), and thioredoxin (TrxR) reductases, HMM-based searches using the HMMER package v.3.1 [77] were performed in addition to BLAST searches. An HMM model for GspS was generated using the Pfam seed alignment PF03738, and HMM models for other enzymes were obtained based on alignments of annotated sequences from the KEGG database. Two groups of proteins, GspS + TryS and TR, represent related proteins that share a certain degree of sequence similarity and could be aligned (Additional file 13: Tables S50 and S51). For the identification of GspS/TryS homologues outside Euglenozoa, TryS of T. brucei was used as a query in a BLASTP search against the NCBI nr database (E value 10⁻²⁰), and 1000 best hits for each of two groups, prokaryotes (group I) and other organisms (excluding Euglenozoa; group II), were obtained and combined into one file. Then, the sequences were filtered using CD-HIT-EST software v.4.6.7 [181] with a 98% protein identity threshold.
For the TR/GR/TrxR phylogeny, the corresponding protein sequences of Emiliania huxleyi, Homo sapiens, and the trypanosomatids Blechomonas ayalai, Endotrypanum monterogeii, and T. cruzi were used as a reference. Sequences were aligned using Muscle v.3.8.31 with default parameters [205]. The resulting alignments were trimmed using trimAl v.1.4.rev22 with the “-strict” option [206]. Maximum-likelihood trees for both protein groups were built using IQ-TREE v.1.5.3 with 1000 and 100 bootstrap replicates for reductases and synthetases, respectively, and the LG+I+G4 model (automatically selected). Bayesian trees were inferred using MrBayes v.3.2.6 with the models of rate heterogeneity across sites chosen based on IQ-TREE results, while models of amino acid substitutions were assessed during the analysis (mixed amino acid model prior). The resulting model was WAG+I+G4 for both synthetases and reductases. The analysis was run for one million generations, with sampling every 100th generation and discarding the first 25% of samples as a burn-in.
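The p-distance criterion used in this section (the proportion of differing sites between a reference protein and a putative hit) can be sketched for a pair of aligned sequences. This is not the MEGA implementation; skipping positions with a gap in either sequence (pairwise deletion) is an assumption, and the example sequences are made up.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped positions are ignored
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differing / compared


if __name__ == "__main__":
    d = p_distance("MKT-LV", "MRTALV")  # made-up aligned fragments
    print(d, d <= 0.7)
```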

Identification of the DNA pre-replication complex subunits

Identification of the pre-replication complex (pre-RC) subunits was a multi-step procedure. Initially, BLAST searches were performed with the reference sequences listed in Additional file 16: Table S52 as queries and an E value threshold of 10⁻⁵ against databases of annotated transcripts/genomes of the euglenozoans and protists belonging to other groups (Additional file 1: Table S2). If a target protein could not be identified, an HMM-based method was employed. Pre-computed models for the proteins of interest were downloaded from the Pfam database when available (Additional file 16: Table S52), or a new model was generated based on a protein alignment constructed using Muscle v.3.8.31 [205, 207]. When no or just a few euglenozoan proteins were identified, another round of HMM-based searches was performed. For that purpose, full-length reference sequences present in the seed alignment were downloaded from the Pfam database, and, when possible, high-scoring hits in euglenozoans and reference protists were added to the seed alignment (E value < 10⁻²⁰, preferably only full-length sequences with predicted domains). For HMM model construction, both trimmed and untrimmed alignments were used, and the search results were compared. Alignment trimming was accomplished in trimAl v.1.4.rev22 with the “-gappyout” option [206]. Visual inspection of phylogenetic trees constructed using IQ-TREE with automatic model selection and 1000 fast bootstrap replicates was performed to facilitate annotation of related sequences [192, 193].

Maximum-likelihood and Bayesian trees for the minichromosome maintenance (MCM) complex subunits 2–9 were inferred as described for the TR/GR/TrxR proteins, with the LG+F+I+G4 and WAG+I+G4 models, respectively. Only BLAST hits with p-distances ≤ 0.75 were considered. The trees were rooted using archaeal MCM sequences belonging to Haloferax volcanii (ADE04992), Methanoculleus sp. MAB1 (CVK32523.1), Nanoarchaeum equitans (NP_963571.1), and Sulfolobus acidocaldarius (WP_011277765.1).

Putative homologues of the winged-helix initiator protein were searched for using an HMM model built from an alignment of 35 archaeal sequences downloaded from the NCBI Protein database.

Analysis of putative lateral gene transfer (LGT) events

For the analysis of putative LGT events, the protein sequences encoded by the genes of interest were used as queries in a BLASTP search against the NCBI nr database (E value 10⁻²⁰), and 1000 best hits for each of two groups, prokaryotes and other organisms (excluding Euglenozoa), were obtained. The resulting sequences were filtered using CD-HIT-EST software v.4.6.7 [181] with a 90–98% protein identity threshold (depending on the protein identity levels). Sequences were aligned using Muscle v.3.8.31 with default parameters [205], and the resulting alignment was trimmed with trimAl v.1.4.rev22 [206] and used for phylogenetic analyses. Maximum-likelihood and Bayesian trees were inferred as described for trypanothione biosynthetic enzymes with the automatically selected LG+I+G4 model and 100 standard bootstrap replicates (for maximum-likelihood analysis). The trees were visualized in FigTree v.1.4.3 [195].

Identification of the kinetochore machinery elements

For the identification of putative centromeric histones H3 (cenH3), all available sequences of the canonical histone H3 (caH3) and its variants were downloaded from HistoneDB v.2.0 [208] and used as a BLAST query against transcripts, genomes, and predicted proteins of Euglenozoa with an E value threshold of 10⁻⁵. A hit was considered a cenH3 candidate if it satisfied the following criteria: (i) at least one amino acid insertion in loop 1 of the histone fold domain, (ii) a divergent N-terminal tail, (iii) absence of the conserved glutamine residue in the α1 helix of the histone fold domain, and (iv) presence of a divergent histone fold domain [160]. Trypanosomatid-specific histone H3 variant (H3V) sequences were identified based on the presence of all of the following features: (i) a divergent N-terminal tail, (ii) absence of the conserved glutamine residue in the α1 helix of the histone fold domain, and (iii) absence of insertions in loop 1 of the histone fold domain [209]. Distinguishing between putative caH3 and the replication-independent histone variant H3.3, which differ by only a few amino acids in opisthokonts [210], was outside the scope of the current study, and the corresponding sequences were annotated as caH3/H3.3 (Additional file 9: Table S40).
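The two sets of criteria can be read as a simple decision rule over four sequence features. The sketch below assumes those features have already been scored as booleans for each hit (how they are scored is not specified here), and the fallback label "other" is our placeholder rather than an annotation used in the study.

```python
def classify_h3(loop1_insertion, divergent_n_tail,
                conserved_gln_alpha1, divergent_fold_domain):
    """Classify a histone H3-like hit from four pre-scored boolean features."""
    # cenH3 candidate: all four criteria (i)-(iv) satisfied.
    if (loop1_insertion and divergent_n_tail
            and not conserved_gln_alpha1 and divergent_fold_domain):
        return "cenH3 candidate"
    # H3V: divergent tail, no conserved glutamine, no loop 1 insertion.
    if divergent_n_tail and not conserved_gln_alpha1 and not loop1_insertion:
        return "H3V"
    return "other"  # placeholder label (assumption)


if __name__ == "__main__":
    print(classify_h3(True, True, False, True))
    print(classify_h3(False, True, False, False))
```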

Pre-computed HMMs for other conventional kinetochore components with the IDs specified in Additional file 16: Table S53 were downloaded from the Pfam database, and several rounds of HMM-based searches were performed as described for the DNA pre-replication complex subunits. Additionally, sequences of conventional kinetochore proteins identified by van Hooff and colleagues [158] in multiple eukaryotic lineages were used for building new HMMs, thus overcoming the bias towards overrepresentation of opisthokont sequences in the Pfam database. Only the most conserved components of the conventional kinetochore machinery were considered in our analyses, including the Ndc80 complex (Ndc80, Nuf2, Spc24, and Spc25 subunits), Knl1, the Mis12 complex (Mis12, Nnf1, Dsn1, and Nsl1), and CenpC.

For the identification of the kinetoplastid kinetochore proteins (KKTs), sequences annotated as KKTs were downloaded from the TriTryp database release 41, combined with the homologues identified in the eubodonid Bodo saltans [38], aligned using Muscle v.3.8.31 with default parameters [205], and used for HMM building and subsequent searches. Hits were annotated as putative KKTs when they met all of the following criteria: (i) HMM hit E value ≤ 10⁻⁵, (ii) p-distances calculated using MEGA v.7 did not exceed 0.8 or a different threshold specified in Additional file 13: Tables S13-S31 [204], and (iii) hit coordinates extending beyond predicted borders of highly conserved domains known to be present in proteins with unrelated functions. In the case of KKT2, 3, 10, and 19, HMM-based searches returned many hits due to the presence of widespread kinase domains [38, 162], and in order to facilitate the annotation process, only the two best hits for each species were taken for phylogenetic tree inference in IQ-TREE v.1.5.3 with 1000 fast bootstrap replicates (Additional file 17: File S3; Additional file 18: File S4; Additional file 19: File S5). Distinguishing between KKT10 and KKT19 proved to be a complicated task due to a very high degree of sequence similarity, and therefore, tentative annotation was performed based on the p-distances to the corresponding sequences in B. saltans.

Kinetoplastid kinetochore-interacting proteins (KKIPs) of T. brucei [163] were used as a BLAST query against the TriTryp database release 41 with an E value threshold of 10⁻²⁰. Retrieved sequences were aligned and p-distances were calculated as described above. Hits with p-distances ≤ 0.8 to the homologues in T. brucei were aligned and used for HMM-based searches. The hits were filtered as described for the KKT proteins. For the phosphatase domain-containing KKIP7, only the hits with an E value ≤ 10⁻¹⁰⁰ and p-distances ≤ 0.65 to the reference trypanosomatid sequences (Additional file 13: Tables S32-S38) were subjected to the phylogenetic analysis using IQ-TREE v.1.5.3 with 1000 fast bootstrap replicates (Additional file 20: File S6).

Materials and Methods

Data Preparation

Data set of reference coronaviruses: viral genomes were downloaded from GenBank (last accessed November 9, 2020) and GISAID (last accessed November 9, 2020). Representative coronaviruses of different species were selected from complete genomes, with reference genomes recommended by the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (last accessed November 9, 2020) and NCBI retained preferentially. For viruses containing isolates from different hosts, at least one representative strain from each host was kept. The genomes were aligned using MAFFT v7.427 (Katoh et al. 2002) and manually checked with BioEdit. Alignment of full-genomic sequences was used for phylogeny reconstruction, whereas the coding regions for ORF1ab were extracted for codon usage analysis.

Data set of SARS-CoV-2: a total of 17,037 SARS-CoV-2-related sequences were available from GISAID on May 6, 2020 (Elbe and Buckland-Merrett 2017). Only SARS-CoV-2 genomes isolated from humans, with a full length over 27,000 bp, no ambiguous sites, and detailed collection date information were used for alignment. For duplicate sequences, only the earliest isolate was kept. Sequences for 26 coding regions, including Nsp1, Nsp2, Nsp3, Nsp4, Nsp5, Nsp6, Nsp7, Nsp8, Nsp9, Nsp10, Nsp11, Nsp12, Nsp13, Nsp14, Nsp15, Nsp16, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF10 were extracted for each strain, using NC_045512 as reference. The coding sequences were checked manually to exclude those with abnormal mutations and early stop codons. A total of 4,110 strains with all 26 coding regions of complete ORF length were retained. After further deduplication based on the concatenated sequences comprising the 26 ORFs, the final data set contained a total of 2,574 unique SARS-CoV-2 isolates.
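The genome-level filtering and deduplication steps can be sketched as follows. This is an illustration under assumptions: the record layout (dictionaries with `name`, `host`, `seq`, and `collected` keys) and the function name are hypothetical, and "no ambiguous sites" is interpreted as containing only A, C, G, and T.

```python
from datetime import date


def filter_and_deduplicate(records, min_length=27000):
    """Keep qualifying human isolates; for identical genomes keep the earliest.

    records: iterable of dicts with keys 'name', 'host', 'seq', and
    'collected' (a datetime.date, or None if the date is missing).
    """
    earliest = {}
    for rec in records:
        seq = rec["seq"].upper()
        if rec["host"].lower() != "human":
            continue
        if len(seq) <= min_length:
            continue  # full length must exceed the minimum
        if set(seq) - set("ACGT"):
            continue  # ambiguous sites present
        if rec["collected"] is None:
            continue  # no detailed collection date
        kept = earliest.get(seq)
        if kept is None or rec["collected"] < kept["collected"]:
            earliest[seq] = rec
    return list(earliest.values())


if __name__ == "__main__":
    toy = [
        {"name": "a", "host": "Human", "seq": "ACGTAC", "collected": date(2020, 3, 1)},
        {"name": "b", "host": "Human", "seq": "ACGTAC", "collected": date(2020, 1, 5)},
    ]
    print([r["name"] for r in filter_and_deduplicate(toy, min_length=5)])
```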

Phylogeny Reconstruction

The phylogenetic tree of the 89 representative coronaviruses was inferred using the maximum likelihood method implemented in IQ-TREE v1.6.12 with the GTR + F + I + G4 substitution model determined by ModelFinder (Nguyen et al. 2015; Kalyaanamoorthy et al. 2017; Hoang et al. 2018). Ultrafast bootstrap support values were calculated from 1,000 pseudoreplicate trees (Kalyaanamoorthy et al. 2017). Visualization of phylogenies was conducted with the ggtree package (Yu 2020).

Gap-Based Alignment

The full alignment of the 89 reference strains was used to generate a tree, using FastTree 2.1.10 (Price et al. 2010) (with gamma distribution and the nucleotide option on, namely with the command options -gamma -nt), on the server (Lemoine et al. 2019). The Jukes–Cantor model with balanced support Shimodaira–Hasegawa test was selected (Shimodaira and Hasegawa 1999). Total branch length was 14.267.

Furthermore, a gap-based alignment was created, using gaps as follows: all dinucleotides were replaced with the “undefined” symbol “x” and the “dummy” symbols (W for 3, Y for 6, and F for 9 consecutive gaps and the V symbol for all single gaps), leaving only single-nucleotides in-between gaps as anchor points (7% of total). The encoding in gaps of 3/6/9 is used to emulate the importance of potential codon gaps (reflected in the BLOSUM45 matrix). Total branch length was: 1.673.

Gap-based phylogenetic reconstruction from genomes for this group is motivated by the fact that, as also mentioned recently elsewhere (Li et al. 2020), these viruses undergo significant recombination and a large number of nucleotide positions reach saturation, thus confounding the phylogenetic signal. Tree visualization was facilitated by IcyTree (Vaughan 2017).

Base Content Calculation

Base content was calculated by dividing the occurrence of each base by the total length of the sequence. Genomic base contents of representative coronaviruses were calculated with the full viral genome sequences. For the base content dynamic analysis of SARS-CoV-2, base compositions were calculated using the 2,754 unique sequences concatenated by 26 ORFs.
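The base content calculation can be written in a few lines. This is a minimal sketch; the function name is ours, and the sequence is assumed to be given in the DNA alphabet (A, C, G, T), as in database records of these genomes.

```python
from collections import Counter


def base_content(sequence):
    """Return the fraction of each base relative to the total sequence length."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Occurrence of each base divided by the total length of the sequence.
    return {base: counts[base] / len(seq) for base in "ACGT"}


if __name__ == "__main__":
    print(base_content("AACGT"))
```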

Codon Usage Analysis

Codon usage analysis was conducted based on the ORF1ab region of representative coronaviruses and the 26 individual ORFs of SARS-CoV-2 strains. The relative synonymous codon usage (RSCU) value was defined as the ratio of the observed codon usage to the expected value (Sharp and Li 1986). Codons with an RSCU value of 0, 0–0.6, 0.6–1.6, or >1.6 were regarded as not-used, underrepresented, normally used, or overrepresented, respectively (Uddin 2017). RSCUs for the 120,426 human-coding regions were determined based on the Homo sapiens codon usage table retrieved on June 14, 2020 from TissueCoCoPUTs (Kames et al. 2020).
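The RSCU definition can be illustrated as follows: for each amino acid, the expected count of a codon is the total count of its synonymous codons divided by the number of synonymous codons. The sketch below uses the standard genetic code; excluding stop codons and assigning the interval boundaries (0.6 and 1.6) to the "normally used" class are our assumptions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for amino acids observed in a coding sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa in set(AMINO) - {"*"}:  # assumption: stop codons are excluded
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(family)
        for c in family:
            values[c] = counts[c] / expected
    return values


def rscu_class(value):
    """Map an RSCU value to the usage classes given in the text."""
    if value == 0:
        return "not-used"
    if value < 0.6:
        return "underrepresented"
    if value <= 1.6:
        return "normally used"
    return "overrepresented"


if __name__ == "__main__":
    for codon, value in sorted(rscu("GCTGCTGCTGCC").items()):
        print(codon, value, rscu_class(value))
```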

Statistical Analysis and Plots

Statistical tests, linear regression, and data visualization were all conducted in R. The Kruskal–Wallis test by rank and the Wilcoxon rank-sum test for pairwise comparisons were applied as appropriate. P values are labeled as follows: <0.0001, ****; 0.0001–0.001, ***; 0.001–0.01, **; 0.01–0.05, *; ≥0.05, not labeled. P < 0.05 was considered significant.
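The labeling scheme for P values can be expressed as a small lookup. The analysis itself was conducted in R; the Python sketch below is only an illustration, and assigning each interval's lower bound to the less significant label is our assumption.

```python
def significance_label(p):
    """Return the asterisk label for a P value, or '' if not significant."""
    if p < 0.0001:
        return "****"
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""  # P >= 0.05: not labeled


if __name__ == "__main__":
    for p in (0.00005, 0.0005, 0.005, 0.03, 0.2):
        print(p, repr(significance_label(p)))
```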


Department of Biotechnology, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, 721302, India

School of Bioscience, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, 721302, India

School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, 721302, India

Pradipta Patra & Amit Ghosh

P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, 721302, India



P.N. and A.G. planned the project. P.N., P.P., M.D. and A.G. conducted the analysis. P.P., P.N., M.D. and A.G. wrote the paper.


Plant amino acid-derived vitamins: biosynthesis and function

Vitamins are essential organic compounds for humans, who have lost the ability to synthesize them de novo. Hence, they are dietary requirements, which are covered by plants as the main dietary source of most vitamins (through food or livestock feed). Most vitamins synthesized by plants have amino acids as precursors (B1, B2, B3, B5, B7, B9 and E) and are therefore linked to plant nitrogen metabolism. Amino acids play different roles in their biosynthesis and metabolism, either incorporated into the backbone of the vitamin or as amino, sulfur or one-carbon group donors. There is high natural variation in vitamin contents in crops, and its exploitation through breeding, metabolic engineering and agronomic practices can enhance their nutritional quality. While the underlying biochemical roles of vitamins as cosubstrates or cofactors are usually common to most eukaryotes, the impact of vitamins B and E on metabolism and physiology can be quite different in plants and animals. Here, we first aim to give an overview of the biosynthesis of amino acid-derived vitamins in plants, with a particular focus on how this knowledge can be exploited to increase vitamin contents in crops. Second, we focus on the functions of these vitamins in both plants and animals (and humans in particular), to unravel common and specific roles for vitamins in evolutionarily distant organisms, in which these amino acid-derived vitamins nevertheless play an essential role.


One-Carbon Metabolism: Linking Nutritional Biochemistry to Epigenetic Programming of Long-Term Development

One-carbon (1C) metabolism comprises a series of interlinking metabolic pathways that include the methionine and folate cycles that are central to cellular function, providing 1C units (methyl groups) for the synthesis of DNA, polyamines, amino acids, creatine, and phospholipids. S-adenosylmethionine is a potent aminopropyl and methyl donor within these cycles and serves as the principal substrate for methylation of DNA, associated proteins, and RNA. We propose that 1C metabolism functions as a key biochemical conduit between parental environment and epigenetic regulation of early development and that interindividual and ethnic variability in epigenetic-gene regulation arises because of genetic variants within 1C genes, associated epigenetic regulators, and differentially methylated target DNA sequences. We present evidence to support these propositions, drawing upon studies undertaken in humans and animals. We conclude that future studies should assess the epigenetic effects of cumulative (multigenerational) dietary imbalances contemporaneously in both parents, as this better represents the human experience.