Creating a phylogenetic tree from my selected publicly-available sequences (WGS) in NCBI

I'm currently writing a paper comparing virulence genes across a group of bacteria. I got my data from publicly available whole-genome sequences in NCBI. Now I want to create a phylogenetic tree for these species, but since I'm working from home, I'm using my old laptop, which makes it hard to process data. Can you recommend any software, site, tutorial, or paper that I can use to generate a phylogenetic tree easily? This is also the first time I will create a phylogenetic tree, and I'm not sure how to proceed without downloading and processing more data than my current laptop can handle.

Thank you!

Large scale automated phylogenomic analysis of bacterial isolates and the Evergreen Online platform

Public health authorities whole-genome sequence thousands of isolates each month for microbial diagnostics and surveillance of pathogenic bacteria. The computational methods have not kept up with the deluge of data and the need for real-time results. We have therefore created a bioinformatics pipeline for rapid subtyping and continuous phylogenomic analysis of bacterial samples, suited for large-scale surveillance. The data is divided into sets by mapping to reference genomes, then consensus sequences are generated. Nucleotide based genetic distance is calculated between the sequences in each set, and isolates are clustered together at 10 single-nucleotide polymorphisms. Phylogenetic trees are inferred from the non-redundant sequences and the clustered isolates are added back. The method is accurate at grouping outbreak strains together, while discriminating them from non-outbreak strains. The pipeline is applied in Evergreen Online, which processes publicly available sequencing data from foodborne bacterial pathogens on a daily basis, updating phylogenetic trees as needed.
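The clustering step described in this abstract can be illustrated with a small sketch. It is not the Evergreen pipeline itself: it assumes equal-length consensus sequences that were already mapped to the same reference, counts differences only at unambiguous A/C/G/T positions, and uses single-linkage grouping with a "10 or fewer SNPs" cutoff. The linkage rule, the inclusive cutoff, and the function names `snp_distance`, `cluster_isolates`, and `representatives` are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List


def snp_distance(a: str, b: str) -> int:
    """Count differing positions between two equal-length consensus sequences.

    Positions where either sequence has a non-ACGT symbol (N, gap) are skipped.
    """
    if len(a) != len(b):
        raise ValueError("consensus sequences must have the same length")
    valid = set("ACGT")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in valid and y in valid and x != y
    )


def cluster_isolates(seqs: Dict[str, str], threshold: int = 10) -> List[List[str]]:
    """Group isolates whose pairwise SNP distance is <= threshold (single linkage)."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def representatives(clusters: List[List[str]]) -> List[str]:
    """Pick the first member of each cluster as its non-redundant representative."""
    return [members[0] for members in clusters]


if __name__ == "__main__":
    ref = "A" * 100
    toy = {
        "iso1": ref,
        "iso2": "C" * 5 + "A" * 95,   # 5 SNPs from iso1
        "iso3": "A" * 50 + "G" * 50,  # 50 SNPs from iso1
    }
    groups = cluster_isolates(toy)
    print(groups)
    print(representatives(groups))
```

Only the representatives would go into tree inference; the remaining cluster members are attached to their representative afterwards, which keeps the tree-building input small. Note that the all-against-all comparison here is quadratic in the number of isolates, so it is a toy illustration rather than something suited to surveillance-scale data.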


On a global scale, human brucellosis is one of the most common bacterial zoonotic diseases [1]. However, its occurrence differs greatly between geographic areas throughout the world. The disease has its highest incidence and prevalence in countries of the Mediterranean basin, the Middle East, some parts of Central and South America, Africa and Asia [2]. The highest annual incidences of human brucellosis per million of the population are observed in Syria and in Mongolia [3]. In contrast, countries of Northern and Western Europe are considered free of autochthonous human brucellosis. However, each year a relatively small number of cases are reported in Germany, with most of them having a history of travelling to or immigrating from endemic regions like the Mediterranean basin. In total, 267 cases of human brucellosis with a median number of 24 cases per annum were reported during the last decade (2005–2015) to the national surveillance system at the Robert Koch Institute in Berlin (SurvStat@RKI 2.0, data as of 09/05/2016). Notably, within the last two years an increase of cases was observed. In 2014 and 2015, the number of annual cases nearly doubled to 47 and 44, respectively.

The most important route of transmission is the ingestion of contaminated raw milk and other unpasteurized dairy products, but the disease may also be acquired by handling infected animals, animal discharges, or cultures of the pathogen [4]. Human-to-human transmissions are rare events. Clinical manifestations of brucellosis are versatile and comprise acute systemic infections with undulating fever as cardinal symptom as well as localized inflammatory processes and site-specific manifestations in chronically infected patients. A 6-week regimen comprising rifampin and doxycycline is commonly applied in Germany to eradicate the pathogen [5]. Patients with complications often need longer treatment and additional antimicrobial substances like aminoglycosides [6].

Human brucellosis can be caused by various Brucella species. The genus currently comprises 12 validly published species which are genetically highly related to each other [7,8]. Brucella melitensis is by far the most frequently observed causative agent of human infection. Based on DNA-DNA hybridization studies that revealed DNA-DNA homologies of >80%, all classical species might be attributed to a single species with several biovars [9]. However, due to historic reasons, species-specific predilections for particular animal hosts, biochemical features etc. the Subcommittee on the taxonomy of Brucella agreed in 2003 on a return to the pre-1986 taxonomic opinion with a six species concept (B. abortus, B. canis, B. melitensis, B. neotomae, B. ovis, B. suis) [10].

Multiple Locus Variable Number Tandem-Repeat (VNTR) Analysis (MLVA) has become a major molecular typing method for characterizing several pathogenic bacterial species. The Brucella MLVA-16 scheme has proven to be a valuable tool with high discriminatory power in epidemiological trace-back investigations in several studies [11–17]. Recently, a full-genome SNP-based phylogenetic analysis was published and could provide another powerful tool for intraspecies discrimination of Brucella [18].

Increasing numbers of brucellosis cases in Germany in 2014 and 2015 prompted us to investigate the geographic origin of strains isolated in Germany. Using the National Consultant Laboratory's collection of clinical isolates of B. melitensis originating from 57 patients diagnosed as brucellosis cases in Germany since 2014, our study was designed (i) to trace back the strains to their geographic origin, (ii) to compare the resolution of WGS-based SNP typing with the standard MLVA-16 genotyping, and (iii) to determine genetic variation within the genomes of the strains that could affect molecular diagnostic or typing assays, using next-generation sequencing technology.

Data availability

The Estonian WGS data are available on demand through the Estonian Biobank. In accordance with the consent form signed by the customers of the Gene by Gene commercial genetic testing company, the sequencing data included in this study are used for the sole purpose of scientific inquiry and are reported here on an aggregate level in the form of phylogenetic trees. For both the Estonian Biobank and the Gene by Gene samples, summary-level data including variable positions and their frequency in the cohort population have been deposited to dbSNP with links to BioProject accession number PRJNA718714 in the NCBI BioProject database. The Swedish data from the SweGen Project are available upon request from the original authors of the project [23].


In this report, we have described the genomic and phylogenetic features of TPA strains detected in Japan compared to TPA strains detected in other countries, in particular in China. A significant feature of this study was that our analysis included information on the gender and sexual orientation of the syphilis patients from whom TPA strains had been detected.

The maximum likelihood phylogenetic analysis and the Bayesian temporal analysis of the genomes of global TPA strains in this study gave results similar to previous reports in terms of lineages and sub-lineages. In those reports, most of the SS14-lineage strains in American and European countries were classified in lineage SS14Ω-A [10,11].

The majority of TPA strains analyzed in this study in Japan (16 of 20) were classified in the SS14-lineage and formed an EAC, designated Sub-lineage 1B (in lineage SS14Ω-B) in a previous study [11], that included strains in China. In addition to these strains, there were 3 Nichols-lineage strains and a strain belonging to another SS14 sub-lineage, previously designated as Sub-lineage 8, which contained strains in the U.S. and European countries [11]. WGS indicated an ongoing concurrent circulation of Nichols- and SS14-lineage strains in Japan, as has also been observed in several American and European countries [10,11]. Recently, four TPE strains detected in Japan between 2014 and 2018 were reported [16]. However, the subspecies of those strains were defined by sequencing of the tp0548 and tp0856 genes, not by WGS analyses [16]. Although we could not detect any TPE strain among the strains that passed our criteria described above and in “Methods”, we have to keep monitoring the emergence and spread of TPE strains in Japan.

The EAC was composed of SS14-lineage TPA strains in China and Japan, with most TPA strains in Japan (16/17) being subtype 14d/f, which is the dominant subtype among both heterosexual and MSM cases in Japan [17]. We could not evaluate the difference between subtype 14d/f strains from MSM and heterosexual cases in our previous molecular typing studies [17,18]. The WGS study in this report elucidated the separate phylogeny of the strains from MSM and heterosexuals belonging to this same subtype (Table 1 and Fig. 2b).

These results underlined the importance of having information on gender and sexual orientation in analyzing WGS studies of TPA to comprehend the detailed aspect of the circulations of strains in the respective communities of MSM and heterosexuals.

From the geographical point of view, the experimental results for strains from heterosexual cases collected in Tokyo and Osaka prefectures were mixed, showing that genetically similar strains were circulating among heterosexuals in these two prefectures that have the largest populations in Japan.

Based on the phylogenetic analyses of the strains collected from 2011 to 2018 (Table S1, strains in China and Japan), the MRCA of the EAC strains and of strains in China appeared to emerge in 2006 (node III in Fig. 2b), followed by the MRCA of strains in Japan in 2007 (node V in Fig. 2b). Therefore, the EAC has been separated into Chinese and Japanese clusters since the mid-2000s. The estimated time of emergence of the MRCA of the Chinese cluster, determined in this study, was similar to that in a recent report [11]. The genomes of TPA strains in China and Japan then expanded their genetic diversity after the late 2000s and mid-2010s, respectively, which approximately corresponded to the time each country had an increasing number of primary and secondary syphilis cases (Fig. 3). This correspondence may be consistent with the fact that TPA is an obligate human pathogen and its accumulation of SNP sites increases with time. However, for the genomes of TPA strains in China, there was a discrepancy in that their genetic diversity did not increase from the mid-1990s to the early 2000s (Fig. 2b), although the onset of the increase in the number of cases was in the late 1990s (Fig. 3). This may be attributed to detection bias owing to the limited number of samples included in this study. The simplest explanation is that genomes of TPA in China during the early stage of the syphilis outbreak had not been collected or analyzed systematically enough.

Since contagious pathogens can cross borders, the EAC may be an example of cross-border propagation of TPA strains between Japan and China. Although the cause of the current large syphilis outbreak in Japan may be uncertain, an increase in the number of travelers from China to Japan has been noted. The Japan National Tourism Organization has reported that the number of Chinese travelers to Japan has increased rapidly since 2014, which corresponded with the onset of the recent syphilis outbreak (Fig. 3) mainly among heterosexuals in Japan. However, our results indicated that the MRCA of the TPA strains in the sub-cluster formed by the strains from heterosexuals in Japan was likely to be extant in 2013 (node VIII in Fig. 2b). In addition, based on their phylogeny, the MRCA of TPA strain 15A011MM and the strains from heterosexuals in Japan may have been extant in 2007 (node V in Fig. 2b). Therefore, we consider the hypothesis of a connection between Chinese tourists and the syphilis outbreak in Japan controversial, although our interpretation is based on a limited number of samples and on values with a wide (6–7 year) 95% HPD (Fig. 2b).

The results of this study indicated several features of the genetic variants of some TPA genes. First, macrolide resistance mutations in TPA might be reversible, although there have been no data indicating reversion to the wild-type allele. As all the strains in Japan in Fig. 2b branched from node V, it seems more likely that macrolide resistance was extant at the time node V formed than that most of the strains in Fig. 2b independently mutated to macrolide resistance during the relatively short time after node V formed. Macrolide resistance mutations have been noted previously to be highly stable [11,19], but the existence of 2 macrolide-sensitive TPA strains in Japan (Fig. 2b) implied that there might have been a reverse mutation because of a fitness cost associated with carrying those mutations. However, a fitness-cost hypothesis contradicts the fact that most of the strains in Japan depicted in Fig. 2b still carry the resistance mutations. We therefore could not exclude the possibility that the apparent ‘reverse mutations’ occurred independently of the putative fitness cost during diversification, as other general SNPs did, and have been maintained in the absence of the drug. This scenario would be consistent with the fact that azithromycin is no longer recommended for the treatment of syphilis in most countries.

Second, all the macrolide-resistant strains in Japan in this study carried the A2058G mutation in the 23S rRNA gene, but not the A2059G mutation. This was in agreement with the results of our previous study of over 100 strains in Japan [17,18].

Third, in this study, the mrcA A506V mutation, which was considered to be unique to strains in China [9,11], was also observed in EAC strains in Japan. However, mutations in the pbp2 gene (TPANIC_0760) were not identified in any of the strains in Japan, while most of the strains in China (9 of 11) harbored at least 1 mutation in this gene. Of the 3 SNPs (i.e., A366T, I415F, and I415M) in the TPANIC_0760 gene, only I415F has been suggested to have a deleterious effect on the protein’s structural flexibility or its binding constant for substrate stability [9]. However, the effect of these SNPs on the possible generation of penicillin resistance is not known at present, because no penicillin-resistant TPA strain has been documented so far, although penicillin has been used extensively to treat syphilis for more than 70 years.

Apart from that, in this study, the separation between the clusters of TPA strains in China and in Japan shown by the phylogenetic analyses was also found, for most of the strains, in the SNP analyses of the TPANIC_0760 gene, although there were 2 exceptional strains in China that had no mutation in this gene (SMUTp_04 and X-4, Fig. 2b). In this context, it is notable that strain X-4 was one of those 2 exceptional strains in China and was phylogenetically separated from the other strains in China (Fig. 2b). Rather, this strain was very closely related to a Japanese strain, 15A011MM, forming a small sub-cluster that branched from node VI (Fig. 2b). These lines of information implied that this small sub-cluster might reflect limited case(s) of direct international spread of TPA between China and Japan by the ‘Japanese type’ strain(s) (with the TPANIC_0705 A506V mutation and without any mutation in the TPANIC_0760 gene) in recent years.

Finally, this study confirmed a Nichols-lineage strain carrying the pbp1 P564L mutation. This SNP has been commonly observed in, and limited to, the genomes of SS14-lineage strains [9,11]. The study reported here strongly suggested that TPA strains in any lineage could carry mutations in pbp1.

In conclusion, most of the TPA strains in Japan in this study had a close relationship to TPA strains in China, forming the EAC. The MRCA of the EAC is likely to have become extant about 2006. The TPA strains in China and Japan subsequently formed a separate cluster in each country in about 2007. The genomes of the TPA strains in each country then expanded their phylogenetic diversity during the time that country had an increasing number of syphilis cases. In addition, phylogenetic analysis showed that TPA strains from MSM cases in Japan clustered separately from strains from heterosexual cases. These findings, within the context of the recent global resurgence of syphilis, provide a better understanding of the phylogenetic features and transmission networks of syphilis, both domestic and global.

5. Conclusion

This study highlighted the spatiotemporal distribution and R0 of COVID-19 in Bangladesh and predicted its future epidemic size. Moreover, we also evaluated the genomic epidemiology of SARS-CoV-2 circulating in Bangladesh. This information is crucial for controlling and mitigating the pandemic situation in any country or territory, including Bangladesh. This study can inform health policymakers so that they can take appropriate preventive measures and interventions to break the transmission chain and control the epidemic. Insight into the phylodynamics and the clade and lineage diversity of the virus will help in developing a potential vaccine for the Bangladesh context. Finally, we strongly recommend continuous genomic surveillance to understand the diversity of strains and to detect new variants of SARS-CoV-2, for proper control of the current pandemic and the design of effective vaccines globally.


Understanding the evolutionary history of bats is important not just for the study of Chiropteran zoology but also for the study of bats as reservoirs of deadly human viruses. Knowledge of an accurate phylogeny improves analysis of positive selection in bat genomes because dN/dS analysis requires both gene alignments and a phylogenetic relationship of the orthologs being analyzed. Also, the reliable identification of gene orthologs will allow molecular biologists to functionally test differences in these genes from one species to the next (62). Functional studies such as these will allow us to understand whether some bats have unique features of their immunity that allow them to harbor viruses that are dangerous to humans (63). Herein, we have curated multiple sequence alignments of thousands of bat genes. Using both genomic and transcriptomic data, we were able to find 11,677 orthologous gene families. To enhance these alignments, we provided transcriptome data for two of these bats, H. monstrosus and R. aegyptiacus, from which we annotated 7,858 and 9,682 genes, respectively. We furthermore developed a general data cleaning method for filtering exons with nonrandom structural errors, in this case observed to result from genomic vs. transcriptomic data. For this method, we developed the MIXR software package, which directly detects and removes alternate consensus run artifacts and is available online. The multiple sequence alignments that we have created for bat genes, both before and after exon filtering, are available for use by the wider bat and virology communities. Using these alignments, we examined the history both of speciation and of positive selection in 18 species of bats. This study hopefully sets the stage for continued and more in-depth study of the evolution and functional differentiation of bat genes relevant to immunity and beyond.

Using these orthologous gene families, we were able to reconstruct the phylogeny of the order Chiroptera using multiple methods. Due to the sheer scale of the data, we resolved each node in the tree with 100% reported posterior probability, although the topology differed slightly depending on the analysis method. Our results support the division of Chiroptera into the two suborders Yinpterochiroptera and Yangochiroptera, in disagreement with the traditional division into Megachiroptera and Microchiroptera. However, we acknowledge that rooting ancient clades continues to be a difficult phylogenetic problem, and further data may shed more light on this issue. We furthermore provide evidence for the placement of M. schreibersii, in which we agree with Hoofer and Bussche, supporting their proposal for the separation of Miniopteridae into its own family (53). We also provide evidence for the disruption of proposed subfamily Phyllostominae by D. rotundus. Most intriguingly, we saw M. leucogaster placed in the Myotis genus, which will require further investigation.

Finally, we have analyzed positive selection of genes during the speciation of bats, on a genome-wide basis. Previously, work aimed at describing adaptive evolution in bats primarily focused on their unique traits, selecting families of genes to study for selection. These studies can broadly be divided into two categories, those that dealt with specific life traits, such as echolocation or metabolism related to frugivory (24, 64–73), and those that were related to pathogens or immunity (74–80). There are three studies that used larger datasets, two of which used whole-genome data (13, 23, 24). Unlike our study, Tsagkogeorga et al. (13) used genome-wide data to ask which genes might explain the alternative subordinal topologies supported by their data. Similarly, the Shen et al. (24) study was specifically interested in energy metabolism in bats due to their energy-expensive mode of locomotion, flight. Finally, Zhang et al. (23) used whole-genome data available from two distantly related bats, along with orthologs from a number of mammals, and inferred adaptive evolution in the innate immune pathway and DNA damage checkpoint pathway. Our work, in comparison, includes more Chiropteran species and a holistic analysis of selection in the bat genome. The addition of more species, and the generation of a high-confidence tree for these species, gives us better resolution for detecting adaptive evolution specifically within Chiroptera (81).

We find that the positively selected genes in bats are dominated by genes involved with immunity. This could have something to do with the high pathogen load that bats are thought to carry (3), but on the other hand, this finding is not unusual. Bats now join many other species groups in the finding that immune processes stand out for the strength of positive natural selection that has shaped them. The same has been found in many diverse species groups including primates (82, 83), fish (84), and insects (85). It has been noted that bats have some unusual aspects of their immune systems (86–88), which could be consistent with the evolutionary signatures of rapid sequence evolution that we observe in many genes involved in immunity.

Less clear to us is why so many genes involved in collagen formation seem to be under positive selection. Collagens are a family of structural proteins that form connective tissues in the body, including tendons, ligaments, and skin (89). The walls of veins, arteries, and capillaries also contain collagen (90). Collectively, the different forms of collagen constitute the most abundant protein in mammalian bodies (89). Bat wings consist of a network of collagen (91), and bats often have injuries on their wings which need to heal quickly (92). Recently, Pseudogymnoascus destructans, a fungal pathogen that has killed more than 6 million bats, has been reported to damage collagen (93). It is less clear how pathogens would have placed pressure on collagen-formation pathways in the past, across many species. Alternatively, it seems possible that the demanding physical and physiological constraints inherent in muscle-powered flight put bats at an edge of what is evolutionarily achievable. In that case, one might expect to see, after speciation events, comparatively more positive selection in wing- and flight-related genes compared with genes involved in other adaptations, with collagen an indicator of the former set.

In summary, we have provided a way to combine genomic and transcriptomic data to build reliable multiple sequence alignments. We have created multiple sequence alignments for bat genes and made them publicly available. We have used these to produce a phylogeny and to assess positive selection in bat genes. Despite this progress, bats will continue to present challenges that push the limits of genomics and phylogenetics because of their high levels of sequence divergence.

A comprehensive manual on the NCBI C++ toolkit, including its design and development framework, a C++ library reference, software examples and demos, FAQs and release notes. The manual is searchable online and can be downloaded as a series of PDF documents.


BLAST executables for local use are provided for Solaris, LINUX, Windows, and MacOSX systems. See the README file in the ftp directory for more information. Pre-formatted databases for BLAST nucleotide, protein, and translated searches also are available for downloading under the db subdirectory.

Sequence databases for use with the stand-alone BLAST programs. The files in this directory are pre-formatted databases that are ready to use with BLAST.

This site provides full data records for CDD, along with individual Position Specific Scoring Matrices (PSSMs), mFASTA sequences and annotation data for each conserved domain. See the README file for full details.

This site provides full data extractions in XML and summary data in VCF format. It contains files with information about standard terms used in ClinVar, MedGen, and GTR.

Sequence databases in FASTA format for use with the stand-alone BLAST programs. These databases must be formatted using formatdb before they can be used with BLAST.

This site contains files for all sequence records in GenBank in the default flat file format. The files are organized by GenBank division, and the full contents are described in the README.genbank file.

The protein sequences corresponding to the translations of coding sequences (CDS) in GenBank are collected for each GenBank release. Please see the README file in the directory for more information.

This site contains three directories: DATA, GeneRIF and tools. The DATA directory contains files listing all data linked to GeneIDs along with subdirectories containing ASN.1 data for the Gene records. The GeneRIF (Gene References into Function) directory contains PubMed identifiers for articles describing the function of a single gene or interactions between products of two genes. Sample programs for manipulating gene data are provided in the tools directory. Please see the README file for details.

This site contains GEO data in two formats: SOFT (Simple Omnibus in Text Format) and MINiML (MIAME Notation in Markup Language). Summary text files and supplementary data are also available. Please see the README.TXT file for more information.

This site contains genome sequence and mapping data for organisms in Entrez Genome. The data are organized in directories for single species or groups of species. Mapping data are collected in the directory MapView and are organized by species. See the README file in the root directory and the README files in the species subdirectories for detailed information.

Contains directories for each genome that include available mapping data for current and previous builds of that genome.

This site contains the full taxonomy database along with files associating nucleotide and protein sequence records with their taxonomy IDs. See the taxdump_readme.txt and gi_taxid.readme files for more information.

This site provides data from the PubChem Substance, Compound and Bioassay databases for download via ftp. Full downloads of the databases are available along with daily, weekly and monthly updates for Substance and Compound. Substance and Compound data are provided in ASN.1, SDF and XML formats. See the README files for more information.

This site contains all nucleotide and protein sequence records in the Reference Sequence (RefSeq) collection. The "release" directory contains the most current release of the complete collection, while data for selected organisms (such as human, mouse and rat) are available in separate directories. Data are available in FASTA and flat file formats. See the README file for details.

This site contains SKY-CGH data in ASN.1, XML and EasySKYCGH formats. See the skycghreadme.txt file for more information.

Downloadable data for SNP.

This site contains next-generation sequencing data organized by the submitted sequencing project.

FTP download site for NCBI databases, tools, and utilities.

This site contains ASN.1 data for all records in MMDB along with VAST alignment data and the non-redundant PDB (nr-PDB) data sets. See the README file for more information.

This site contains the trace chromatogram data organized by species. Data include chromatogram, quality scores, FASTA sequences from automatic base calls, and other ancillary information in tab-delimited text as well as XML formats. See the README file for details.

This site contains the UniVec and UniVec_Core databases in FASTA format. See the README.uv file for details.

This site contains whole genome shotgun sequence data organized by the 4-digit project code. Data include GenBank and GenPept flat files, quality scores and summary statistics. See the README.genbank.wgs file for more information.

Open-access data generally include summaries of genotype/phenotype association studies, descriptions of the measured variables, and study documents, such as the protocol and questionnaires. Access to individual-level data, including phenotypic data tables and genotypes, requires varying levels of authorization.

Specifications for NCBI data in ASN.1 or DTD format are available on the Index of data_specs page. The "NCBI_data_conversion.html" links to the conversion tool.

A suite of tag sets for authoring and archiving journal articles as well as transferring journal articles from publishers to archives and between archives. There are four tag sets: Archiving and Interchange Tag Set - Created to enable an archive to capture as many of the structural and semantic components of existing printed and tagged journal material as conveniently as possible Journal Publishing Tag Set - Optimized for archives that wish to regularize and control their content, not to accept the sequence and arrangement presented to them by any particular publisher Article Authoring Tag Set - Designed for authoring new journal articles NCBI Book Tag Set - Written specifically to describe volumes for the NCBI online libraries.

This service allows users to download compound or substance records corresponding to a set of PubChem identifiers, which can be supplied manually or through a text file. Numerous download formats are available, including SDF, XML and SMILES.

Subscribe to Web/RSS feeds for updates about NCBI resources.


An online form that provides an interface for researchers, consortia and organizations to register their BioProjects. This serves as the starting point for the submission of genomic and genetic data for the study. The data does not need to be submitted at the time of BioProject registration.

A web-based sequence submission tool for one or a few submissions to the GenBank database, designed to make the submission process quick and easy.

Tool for submission to the GenBank database of Barcode short nucleotide sequences from a standard genetic locus for use in species identification.

A stand-alone software tool developed by the NCBI for submitting and updating entries to public sequence databases (GenBank, EMBL, or DDBJ). It is capable of handling simple submissions that contain a single short mRNA sequence, complex submissions containing long sequences, multiple annotations, segmented sets of DNA, as well as sequences from phylogenetic and population studies with alignments. For simple submission, use the online submission tool BankIt instead.

A command-line program that automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences.

Submit expression data, such as microarray, SAGE or mass spectrometry datasets to the NCBI Gene Expression Omnibus (GEO) database.

This site enables users to submit data to the PubChem Substance and BioAssay databases, including chemical structures, experimental biological activity results, annotations, siRNA data and more. It can also be used to update previously submitted records.

The SNP database tools page provides links to the general submission guidelines and to the submission handle request. The page also has two specific links for single or batch submissions of human variation data using Human Genome Variation Society nomenclature.

A single entry point for submitters to link to and find information about all of the data submission processes at NCBI. Currently, this serves as an interface for the registration of BioProjects and BioSamples and submission of data for WGS and GTR. Future additions to this site are planned.

This link describes how submitters of trace data can obtain a secure NCBI FTP site for their data, and also describes the allowed data formats and directory structures.


Performs a BLAST search for similar sequences from selected complete eukaryotic and prokaryotic genomes.

Performs a BLAST search of the genomic sequences in the RefSeqGene/LRG set. The default display provides ready navigation to review alignments in the Graphics display.

Finds regions of local similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

Allows you to retrieve records from many Entrez databases by uploading a file of GI or accession numbers from the Nucleotide or Protein databases, or a file of unique identifiers from other Entrez databases. Search results can be saved in various formats directly to a local file on your computer.

A stand-alone application for classifying protein sequences and investigating their evolutionary relationships. CDTree can import, analyze and update existing Conserved Domain (CDD) records and hierarchies, and also allows users to create their own. CDTree is tightly integrated with Entrez CDD and Cn3D, and allows users to create and update protein domain alignments.

COBALT is a protein multiple sequence alignment tool that finds a collection of pairwise constraints derived from the conserved domain database, the protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

A stand-alone application for viewing 3-dimensional structures from NCBI's Entrez retrieval service. Cn3D runs on Windows, Macintosh, and UNIX and can be configured to receive data from most popular web browsers. Cn3D simultaneously displays structure, sequence, and alignment, and has powerful annotation and alignment editing features.

Identifies the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (Reverse Position-Specific BLAST) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD).

Tools that provide access to data within NCBI's Entrez system outside of the regular web query interface. They provide a method of automating Entrez tasks within software applications. Each utility performs a specialized retrieval task, and can be used simply by writing a specially formatted URL.
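
As a rough illustration of what "a specially formatted URL" can look like, the sketch below builds an EFetch request URL for nucleotide records in FASTA format. The base URL and the db, id, rettype and retmode parameters follow the public E-utilities documentation; the helper name build_efetch_url and the example accession are assumptions made for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the E-utilities, as given in NCBI's public documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accessions, db="nuccore", rettype="fasta", retmode="text"):
    """Return an EFetch URL for a list of accessions (hypothetical helper).

    The function only builds the URL; fetching it is left to the caller.
    """
    if not accessions:
        raise ValueError("at least one accession is required")
    params = {
        "db": db,                    # Entrez database to query
        "id": ",".join(accessions),  # comma-separated identifiers
        "rettype": rettype,          # record format, e.g. FASTA
        "retmode": retmode,          # plain text rather than XML
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession chosen for illustration only.
url = build_efetch_url(["NC_000913.3"])
print(url)
```

Pasting the printed URL into a browser, or passing it to urllib.request.urlopen, should return the record as text. For a laptop with limited resources this avoids downloading anything beyond the records that are actually needed.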

Tool for aligning a query sequence (nucleotide or protein) to GenBank sequences included on microarray or SAGE platforms in the GEO database.

This tool compares nucleotide or protein sequences to genomic sequence databases and calculates the statistical significance of matches using the Basic Local Alignment Search Tool (BLAST) algorithm.

NCBI's Remap tool allows users to project annotation data and convert locations of features from one genomic assembly to another or to RefSeqGene sequences through a base by base analysis. Options are provided to adjust the stringency of remapping, and summary results are displayed on the web page. Full results can be downloaded for viewing in NCBI's Genome Workbench graphical viewer, and annotation data for the remapped features, as well as summary data, is also available for download.

An integrated application for viewing and analyzing sequence data. With Genome Workbench, you can view data in publicly available sequence databases at NCBI, and mix these data with your own data.

An interactive web application that enables users to visualize multiple alignments created by database search results or other software applications. The MSA Viewer allows users to upload an alignment and set a master sequence, and to explore the data using features such as zooming and changing of coloration.

A set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. The software in the Toolbox is primarily designed to read records in Abstract Syntax Notation 1 (ASN.1) format, an International Standards Organization (ISO) data representation format.

A public domain quality assurance software package that facilitates the assessment of multiplex short tandem repeat (STR) DNA profiles based on laboratory-specific protocols. OSIRIS evaluates the raw electrophoresis data using an independently derived mathematically-based sizing algorithm. It offers two new peak quality measures - fit level and sizing residual. It can be customized to accommodate laboratory-specific signatures such as background noise settings, customized naming conventions and additional internal laboratory controls.

A graphical analysis tool that finds all open reading frames in a user's sequence or in a sequence already in the database. Sixteen different genetic codes can be used. The deduced amino acid sequence can be saved in various formats and searched against protein databases using BLAST.
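
The idea of scanning a sequence for open reading frames can be sketched in a few lines. The example below is a minimal illustration, not NCBI's implementation: it assumes the standard genetic code only (ATG start; TAA, TAG and TGA stops), whereas the tool described above supports sixteen codes. The names find_orfs and min_codons are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) tuples for ATG...stop ORFs.

    start and end are 0-based, end-exclusive offsets on the strand that was
    scanned; min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk the strand codon by codon in this reading frame.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

ORFs that run off the end of the sequence without a stop codon are ignored in this sketch.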

The Primer-BLAST tool uses Primer3 to design PCR primers for a sequence template. The potential products are then automatically analyzed with a BLAST search against user-specified databases to check their specificity for the intended target.

A utility for computing alignment of proteins to genomic nucleotide sequence. It is based on a variation of the Needleman Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, ProSplign is accurate in determining splice sites and tolerant to sequencing errors.

PUG provides access to PubChem services via a programmatic interface. PUG allows users to download data, initiate chemical structure searches, standardize chemical structures and interact with the E-utilities. PUG can be accessed using either standard URLs or via SOAP.

Standardization, in PubChem terminology, is the processing of chemical structures in the same way used to create PubChem Compound records from contributors' original structures. This service lets users see how PubChem would handle any structure they would like to submit.

PubChem Structure Search allows the PubChem Compound Database to be queried by chemical structure or chemical structure pattern. The PubChem Sketcher allows a query to be drawn manually. Users may also specify the structural query input by PubChem Compound Identifier (CID), SMILES, SMARTS, InChI, Molecular Formula, or by upload of a supported structure file format.

A variety of tools are available for searching the SNP database, allowing search by genotype, method, population, submitter, markers and sequence similarity using BLAST. These are linked under "Search" on the left side bar of the dbSNP main page.

Provides a configurable graphical display of a nucleotide or protein sequence and features that have been annotated on that sequence. In addition to use on NCBI sequence database pages, this viewer is available as an embeddable webpage component. Detailed documentation including an API Reference guide is available for developers wishing to embed the viewer in their own pages.

A utility for computing cDNA-to-Genomic sequence alignments. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, Splign is accurate in determining splice sites and tolerant to sequencing errors.
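
For readers unfamiliar with the Needleman-Wunsch algorithm mentioned above, here is a minimal sketch of the plain global alignment. It uses a simple match/mismatch score and a linear gap penalty, and it does not model introns or splice signals, so it is the textbook algorithm rather than the variation used by Splign or ProSplign. The scoring values and the function name are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, so this form is only practical for short sequences such as single genes, not whole genomes.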

A tool for creating and displaying phylogenetic tree data. Tree Viewer enables analysis of your own sequence data, produces printable vector images as PDFs, and can be embedded in a webpage.

A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. VecScreen searches a query sequence for segments that match any sequence in a specialized non-redundant vector database (UniVec).

A computer algorithm that identifies similar protein 3-dimensional structures. Structure neighbors for every structure in MMDB are pre-computed and accessible via links on the MMDB Structure Summary pages. These neighbors can be used to identify distant homologs that cannot be recognized by sequence comparison alone.

Genomic V exons from whole genome shotgun data in reptiles

Reptiles and mammals diverged over 300 million years ago, creating two parallel evolutionary lineages amongst terrestrial vertebrates. In reptiles, two main evolutionary lines emerged: one gave rise to Squamata, while the other gave rise to Testudines, Crocodylia, and Aves. In this study, we determined the genomic variable (V) exons from whole genome shotgun sequencing (WGS) data in reptiles corresponding to the three main immunoglobulin (IG) loci and the four main T cell receptor (TR) loci. We show that Squamata lack the TRG and TRD genes, and snakes lack the IGKV genes. In representative species of Testudines and Crocodylia, the seven major IG and TR loci are maintained. As in mammals, genes of the IG loci can be grouped into well-defined IMGT clans through a multi-species phylogenetic analysis. We show that the reptilian IGHV and IGLV genes are distributed amongst the established mammalian clans, while their IGKV genes are found within a single clan, nearly exclusive from the mammalian sequences. The reptilian and mammalian TRAV genes cluster into six common evolutionary clades (since IMGT clans have not been defined for TR). In contrast, the reptilian TRBV genes cluster into three clades, which have few mammalian members. In this locus, the V exon sequences from mammals appear to have undergone different evolutionary diversification processes that occurred outside these shared reptilian clans. These sequences can be obtained in a freely available public repository (

Insect pathogenicity in plant-beneficial pseudomonads: phylogenetic distribution and comparative genomics

Bacteria of the genus Pseudomonas occupy diverse environments. The Pseudomonas fluorescens group is particularly well-known for its plant-beneficial properties including pathogen suppression. Recent observations that some strains of this group also cause lethal infections in insect larvae, however, point to a more versatile ecology of these bacteria. We show that 26 P. fluorescens group strains, isolated from three continents and covering three phylogenetically distinct sub-clades, exhibited different activities toward lepidopteran larvae, ranging from lethal to avirulent. All strains of sub-clade 1, which includes Pseudomonas chlororaphis and Pseudomonas protegens, were highly insecticidal regardless of their origin (animals, plants). Comparative genomics revealed that strains in this sub-clade possess specific traits allowing a switch between plant- and insect-associated lifestyles. We identified 90 genes unique to all highly insecticidal strains (sub-clade 1) and 117 genes common to all strains of sub-clade 1 and present in some moderately insecticidal strains of sub-clade 3. Mutational analysis of selected genes revealed the importance of chitinase C and phospholipase C in insect pathogenicity. The study provides insight into the genetic basis and phylogenetic distribution of traits defining insecticidal activity in plant-beneficial pseudomonads. Strains with potent dual activity against plant pathogens and herbivorous insects have great potential for use in integrated pest management for crops.
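
The kind of comparison behind a statement such as "genes unique to all highly insecticidal strains" can be pictured as set operations on a gene presence/absence table. The sketch below is only an illustration under that assumption: the strain and gene names are made-up placeholders, not data from the study, and the authors' actual comparative genomics pipeline is not described here.

```python
def genes_shared_by_all(presence, strains):
    """Genes present in every strain of the given collection."""
    strains = list(strains)
    if not strains:
        raise ValueError("at least one strain is required")
    return set.intersection(*(set(presence[s]) for s in strains))


def genes_unique_to_group(presence, group):
    """Genes present in all strains of `group` and absent from all others."""
    group = set(group)
    core = genes_shared_by_all(presence, group)
    # Union of every gene seen in any strain outside the group.
    outside = set().union(*(presence[s] for s in presence if s not in group))
    return core - outside


# Hypothetical toy table: strain name -> set of gene identifiers.
presence = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g2", "g5"},
}

print(genes_unique_to_group(presence, ["strain_A", "strain_B"]))
```

In the toy table, g1 is the only gene found in both strains of the group and in no strain outside it.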


Phylogeny of the P. fluorescens group based on the core genome. Genomes sequenced…

Overview on insecticidal activity, pathogen suppression and presence of associated gene clusters in 26 strains of the P. fluorescens group. Colored boxes represent activity against insects and plant pathogens as assessed within this study: high activity, medium activity, no activity. Insecticidal activity was assessed in injection assays against G. mellonella larvae and feeding assays against P. xylostella and S. littoralis larvae, and depicted activities are based on the results presented in Figure 3, Table 2, Supplementary Figure S2 and Supplementary Table S4. Disease suppression was assessed in a cucumber-Pythium ultimum assay and activities are based on the data depicted in Supplementary Table S5. Strains indicated by an asterisk were reported to have biocontrol activity against plant diseases in earlier studies (Table 1). In vitro inhibition of mycelial growth was assessed on two media against P. ultimum and Fusarium oxysporum f. sp. radicis-lycopersici and activities are based on the results shown in Supplementary Figure S3. Gray boxes represent presence of selected genes/gene clusters that were found to be associated with insecticidal strains (this study) or that are required for the production of the indicated antifungal metabolites: present, partially present, absent. Exact loci, which were checked for presence/absence, are indicated in Supplementary Table S1. There, additional genes as well as all additional strains are presented. (a) Selected genes that were identified by comparative genomics to be specific for strains that show insecticidal activity. A complete list is presented in Supplementary Table S6. P. fluorescens insecticidal toxin-cluster (fit), chitinase C (chiC), phospholipase C (plcN), metallopeptidase AprX (aprX), rebB-cluster (rebB), psl-cluster (psl). (b) Genes that were shown to contribute to insecticidal activity in this study (chiC and plcN) or elsewhere (fit) (Péchy-Tarr et al., 2008; Ruffner et al., 2013). (c) Presence/absence of gene clusters required for the production of the indicated antifungal metabolites. DAPG, 2,4-diacetylphloroglucinol; Phz, phenazine; HCN, hydrogen cyanide; Prn, pyrrolnitrin; Plt, pyoluteorin; HPR, 2-hexyl-5-propyl-alkylresorcinol.

Oral and systemic insecticidal activity is restricted to strains of specific phylogenetic subgroups…

A derivative of P. protegens CHA0 deficient for a specific chitinase is reduced…