Available Protein sequence alignment dataset and HMM model

I am new to biology and think my algorithm could be applied to protein sequence alignment, since it is an enhanced HMM model. I found that people use HMMs to generate noisy copies of a consensus sequence, with varying lengths. This figure shows the process:

It seems Professor Richard Durbin may have released some datasets, but how can I find available datasets? I got lost in the biological vocabulary and failed to find any. I am also wondering whether this topic is a minor one in this community.

Update: Since I may be misusing the vocabulary, it may be better to post the original text I read.

DNA and protein sequences (both are reasonable to use in HMMs) are available at a variety of sources, such as EMBL, NCBI, and others.

To feed them into a model such as an HMM, which draws on comparisons between different sequences, you will most likely need to produce a sequence alignment. This is a data format in which the sequences are arranged as a matrix (generally not delimited in any way; each column is a single character) and every column is taken to represent the same position across the sequences. A position can also be a "gap", usually written as "-". The diagram at the top of the Durbin figure is such an alignment.

There are many tools for aligning sequences, both protein and DNA (DNA and RNA are both nucleotide sequences, but DNA is the one more commonly aligned in most applications).
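To make the matrix picture concrete, here is a minimal sketch in Python. The three sequences are invented toy data, and the helper names `columns` and `column_counts` are hypothetical, not part of any alignment library. It only shows how aligned rows map to columns and how per-column symbol counts (the raw material for a profile HMM) can be read off.

```python
from collections import Counter

# Toy alignment: three made-up sequences, already aligned; '-' marks a gap.
alignment = [
    "ACD-EF",
    "ACDGEF",
    "A-DGEY",
]

def columns(rows):
    """Return the alignment columns as strings, one per aligned position."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows of an alignment must have the same length")
    return ["".join(row[i] for row in rows) for i in range(width)]

def column_counts(rows):
    """Count the symbols (residues and gaps) observed in every column."""
    return [Counter(col) for col in columns(rows)]

cols = columns(alignment)
counts = column_counts(alignment)
print(cols)
print(counts[5])
```

Real alignments are normally read from FASTA, Stockholm or Clustal files produced by an aligner rather than typed in by hand.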

Some of the common tools for generating multiple sequence alignments are ClustalO and MAFFT. Notably, you are not as interested (I think) in tools such as BLAST which are mostly about searching single sequences against databases using local alignment, rather than for generating multiple sequence alignments for input to other programs.

I believe that with some searching you can find existing multiple sequence alignment databases. People do not usually save them and put them in databases or repositories because usually which data you use is very specific to the problem that you are interested in, and they are quite simple to generate. Most of the databases are likely to be quite old for this reason; I found one such old database (described here) with a bit of googling.

HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment

Sequence-based protein function and structure prediction depends crucially on sequence-search sensitivity and accuracy of the resulting sequence alignments. We present an open-source, general-purpose tool that represents both query and database sequences by profile hidden Markov models (HMMs): 'HMM-HMM–based lightning-fast iterative sequence search' (HHblits). Compared to the sequence-search tool PSI-BLAST, HHblits is faster owing to its discretized-profile prefilter, has 50–100% higher sensitivity and generates more accurate alignments.

Author Summary

Sequence-based protein homology detection has been extensively studied, but it remains very challenging for remote homologs with divergent sequences. So far the most sensitive methods employ HMM-HMM comparison, which models a protein family using an HMM (Hidden Markov Model) and then detects homologs using HMM-HMM alignment. An HMM cannot model long-range residue interaction patterns and thus carries very little information regarding the global 3D structure of a protein family. As such, HMM comparison is not sensitive enough for distantly-related homologs. In this paper, we present an MRF-MRF comparison method for homology detection. In particular, we model a protein family using Markov Random Fields (MRF) and then detect homologs by MRF-MRF alignment. Compared to HMMs, MRFs are able to model long-range residue interaction patterns and thus contain information on the overall 3D structure of a protein family. Consequently, MRF-MRF comparison is much more sensitive than HMM-HMM comparison. To implement MRF-MRF comparison, we have developed a new scoring function to measure the similarity of two MRFs and also an efficient ADMM algorithm to optimize the scoring function. Experiments confirm that MRF-MRF comparison indeed outperforms HMM-HMM comparison in terms of both alignment accuracy and remote homology detection, especially for mainly beta proteins.

Citation: Ma J, Wang S, Wang Z, Xu J (2014) MRFalign: Protein Homology Detection through Alignment of Markov Random Fields. PLoS Comput Biol 10(3): e1003500.

Editor: Thomas Lengauer, Max-Planck-Institut für Informatik, Germany

Received: October 27, 2013 Accepted: January 8, 2014 Published: March 27, 2014

Copyright: © 2014 Ma et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work is supported by National Institutes of Health grant R01GM089753, NSF CAREER award CCF-1149811 and Alfred P. Sloan Research Fellowship. The authors are also grateful to the University of Chicago Beagle team and TeraGrid for their support of computational resources. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

This Methods article is associated with RECOMB 2014.


Directed sequence design and search database

Natural linker sequences, which are intermediately related to two distantly related proteins, facilitate homology detection in routinely employed sequence search methods. As described in an earlier publication [22], the paucity of natural linkers in the protein sequence space renders homology detection methods ineffective. To overcome this limitation, an approach to populate gaps in the search space, by purposefully designing protein-like linker sequences between all known families of protein folds provided in the SCOP (Structural Classification of Proteins) database [32] was developed earlier [22]. Briefly, in this approach, each protein domain family, for every known fold in the SCOP database, was represented as a collection of profiles. HMM-HMM alignments were performed between related protein families to generate a combined model that captures the inherent preferences and frequencies of residues between the aligned families. A roulette-wheel based approach was then employed to select for preferred residues at each position in the alignment between every related protein family pair. When repeated along the length of the alignment, the approach generated an ‘artificial linker’ sequence that meaningfully incorporated the observed residue propensities between the aligned families. Using this directed design approach, 3611010 designed sequences were generated between 3901 families for 374 folds in the SCOP database [32]. They are individually available as stand-alone downloadable flat files in the NrichD server [29] for use in tandem with any sequence search procedure.
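The roulette-wheel step can be illustrated with a minimal Python sketch. It assumes each alignment position is summarized as a dictionary of residue frequencies; the `profile` values, the function names and the fixed seed are all hypothetical, and this is not the authors' implementation. Each residue is drawn with probability proportional to its frequency, and the draw is repeated along the alignment.

```python
import random

def roulette_pick(freqs, rng):
    """Pick one residue with probability proportional to its frequency."""
    total = sum(freqs.values())
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    threshold = rng.random() * total
    running = 0.0
    for residue, weight in freqs.items():
        running += weight
        if threshold < running:
            return residue
    return residue  # guard against floating-point round-off

def design_linker(profile, seed=0):
    """Draw one residue per position to build an artificial sequence."""
    rng = random.Random(seed)
    return "".join(roulette_pick(column, rng) for column in profile)

# Hypothetical combined residue frequencies for three aligned positions.
profile = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"K": 1.0},
]
linker = design_linker(profile)
print(linker)
```

Calling `design_linker` with different seeds yields different sequences that all respect the per-position preferences, which is the property the directed design approach relies on.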

Query dataset

The Pfam 30 database of sequence families [30] groups sequences by similarity into 16306 protein families, corresponding to 1293837 seed sequences. The domain of each protein family is represented by a multiple sequence alignment built from its seed sequences. To retain only a representative set, blastclust was applied to the members of each family at 60% sequence identity and 90% sequence length coverage, decreasing the number of sequences representing all the Pfam families to 234727.

Fold association for PFAM domains is not always direct since multiple SCOP domains may be associated with a single sequence domain and vice versa. To identify the SCOP domain associations for various PFAM families, we have pooled together PFAM-SCOP associations by integrating a number of datasets. Firstly, we have used the available SCOP domain definitions for each protein of known structure associated with a PFAM entry based on the PDB id(s), as provided in the SCOPe 2.06 [33] database. Secondly, the RCSB has developed a process, based on the HMMER web service, that takes the PDB-Pfam mappings from SIFTS [34] and adds additional mappings to them [35, 36]. This is provided on the RCSB resource as a downloadable file. Thirdly, academic resources such as PDBfam contain PFAM annotations for 99.4% of chains with more than 50 residues [37]. As shown in Fig. 1, the pooled associations of PFAM-PDB-SCOP from the three resources resulted in 4058 fold associations out of 7726 Pfam sequence families with known structure.

Schematic outline of the workflow: Protocol adopted for structure recognition of families of unknown structure. A consensus was drawn from the structural mapping for the sequence families provided by Xu and Dunbrack [34] and PDB to Pfam mapping available in Pfam [30]

Based on our association of Pfam domain families to the SCOP structural domains, our dataset was divided into two sets: “Assessment” set corresponding to Pfam families for which structural (and fold) association is available and “Application” set corresponding to families for which no structure association is currently available.

Assessment set

7726 sequence families were associated with structures and for 4058 families SCOP fold definitions were available for the assigned regions. We considered structural domain associations given in Pfam and PDBfam [34] with an additional condition of better than 60% length coverage of the SCOP domain in order to exclude indiscriminate or false structural associations (Additional file 1: Table S1). These formed the ‘known’ structural associations and were employed to test the strength of our approach. Clans group related protein families together, constituting sequentially divergent families that share common evolutionary ancestry. There are 595 Clans in Pfam 30. The deduction of structure for any one member of the clan translates to the structure and consequently fold association to the other families in the clan [30]. The number of families in each clan ranges from 2 to 254.

Application dataset

The remaining 8580 families that had no structure association available were examined for structure recognition at the fold level by extracting the seed sequences from the alignment. We took one representative query sequence per cluster (blastclust) from each family iteratively, until we found hits in our database using jackhmmer [24], at the parameters used for the assessment set.

Search method: Evaluation and assessment

The workflow is illustrated schematically in Fig. 1. We employed a sensitive homology detection program, strengthening it further by providing a sequence database comprising both natural and designed sequences [29]. This search database, which integrates 3611010 designed sequences with 4694921 natural sequences, is available as a resource on the NrichD database as SCOP(v1.75)-NrichD with a total of 8305931 sequences. The search algorithm employed, jackhmmer, is a profile-based iterative sequence search method that builds an HMM (Hidden Markov Model) [24] after the first search and uses it as the query in the successive iterations, re-encoding it after every round. We set an E-value filter of 10^-4 for the reported hits and a maximum of 5 iterations, while minimizing profile drift by making certain that the query protein is present in each iteration. The sequence domain may be associated with single or multiple structural domains corresponding to the same or different structural folds. We minimized cases wherein an equivalent stretch of a sequence domain was associated with different SCOP folds using strict sequence length coverage filters. For assessing the performance of our approach, the families in the “Assessment set” were considered. We quantified the significance of our approach by measuring precision, sensitivity and specificity and identifying criteria to maximize them. These are statistical measures of performance and are represented by the following equations:

For a given query Pfam family, the number of correct fold associations that qualify the imposed thresholds are quantified as TP (True positive) while those that fail are designated as FN (False negative). Similarly, for a given query Pfam family, the number of incorrect fold associations that qualify the imposed thresholds are designated as FP (False positive) while those which are not hits from folds other than the correct fold are considered as TN (True negative).
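Precision, sensitivity and specificity have standard textbook definitions in terms of the TP, FP, TN and FN counts described above. The Python sketch below uses those standard definitions; that they match the authors' exact equations is an assumption, and the function names and example counts are hypothetical.

```python
def precision(tp, fp):
    """Fraction of reported fold associations that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def sensitivity(tp, fn):
    """Fraction of correct associations that are recovered: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """Fraction of incorrect associations that are rejected: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0

# Made-up counts for one query family, for illustration only.
tp, fp, tn, fn = 8, 2, 18, 2
print(precision(tp, fp), sensitivity(tp, fn), specificity(tn, fp))
```

Returning 0.0 for an empty denominator is a design choice of this sketch; an analysis pipeline might prefer to report such cases as undefined.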

For each Pfam family, based on the folds of the hits obtained through jackhmmer searches, a SCOP fold is associated with the query sequence. To parse the results obtained for sequence families with no previously known structure, the criteria as determined from the assessment were query length coverage at better than 60% and E-value better than 10^-4. In addition, further constraints were added to exclude false positives. For the Assessment dataset, we observed that the correct fold was associated with the highest normalized frequency for a given query.

Normalized fold frequency is given by \( \frac{\mathrm{fold}(i)}{N},\ i \in \left[1, n\right]. \)

where n is the total number of folds associated with a query sequence and fold(i) represents the number of homologues identified from that fold in the profile search. N is the total number of associations across folds for the query.
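As a worked illustration of this definition, the following Python sketch computes fold(i)/N from a list of fold labels, one per hit. The function name and the example hit list are hypothetical; the labels only imitate SCOP fold identifiers.

```python
from collections import Counter

def normalized_fold_frequency(hit_folds):
    """Map each fold to fold(i) / N, where N is the total number of hits."""
    counts = Counter(hit_folds)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fold: count / total for fold, count in counts.items()}

# Hypothetical hits for one query: three from one fold, one from another.
freqs = normalized_fold_frequency(["c.37", "c.37", "c.37", "a.4"])
print(freqs)
```

The frequencies sum to 1 by construction, so the fold with the largest value is the one most often hit in the profile search.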

Based on the above observation, using normalized fold frequency, we could further rank the associations in our Application dataset as –

Confident* - If the fold with the highest frequency also had an association at greater than or equal to 95% query coverage.

Confident – If the fold with the highest frequency provides the best coverage between 60 and 95%.

Conflict – When the highest fold frequency did not give the best query coverage.

No ambiguity - If there is only a single structural fold associated with a query, we consider the association made at best query coverage.
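The four categories can be sketched as a small decision function in Python. Several points are assumptions of this sketch rather than statements from the text: all inputs are taken to have already passed the 60% coverage and E-value filters, the single-fold case is tested first, the 95% test is applied before the best-coverage test, and ties are broken by dictionary order. The function name and the example values are hypothetical.

```python
def rank_association(folds):
    """Rank a query's fold association.

    `folds` maps a fold id to (normalized frequency, best query coverage),
    with coverage expressed as a fraction between 0 and 1.
    """
    if not folds:
        raise ValueError("at least one fold association is required")
    if len(folds) == 1:
        return "No ambiguity"
    top_fold = max(folds, key=lambda f: folds[f][0])       # highest frequency
    best_cov_fold = max(folds, key=lambda f: folds[f][1])  # best coverage
    if folds[top_fold][1] >= 0.95:
        return "Confident*"
    if top_fold == best_cov_fold:
        return "Confident"
    return "Conflict"

print(rank_association({"c.37": (0.75, 0.80), "a.4": (0.25, 0.65)}))
```

A different precedence between the rules would change how borderline queries are labeled, so the ordering should be checked against the original workflow before reuse.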


Identification and analysis of tyrosine recombinases

Previous structural and sequence analyses indicated that YRs generally have two main functional domains: The core-binding (CB) domain binds the recombination DNA site, and the catalytic (CAT) domain catalyzes all DNA cleavage and joining reactions required for recombination (Esposito & Scocca, 1997; Nunes-Düby et al, 1998; Swalla et al, 2003). Some YRs have an additional N-terminal arm-binding (AB) domain that recognizes accessory DNA sequences, so-called arm sites, near the recombination sites. Crystal structures showed that the CAT domain has a similar fold in diverse YRs (Guo et al, 1997; Subramanya et al, 1997; Tirumalai et al, 1997; Skaar et al, 2015) and comparative sequence analyses revealed two highly conserved regions (referred to as boxes) and three patches with less significant conservation (Esposito & Scocca, 1997; Nunes-Düby et al, 1998). Conserved regions include the catalytic residues, i.e., the tyrosine nucleophile and the catalytic pentad RKHRH (Jayaram et al, 2015), as well as the hydrophobic protein core. The CB domain is much less conserved on the sequence level, but its structural architecture is also preserved (Swalla et al, 2003). In turn, the AB domain is highly variable with substantial structural and sequence diversity between YR family members (Clubb et al, 1999; Fadeev et al, 2009; Szwagierczak et al, 2009).

To analyze the diversity of YRs, we employed the following strategy. First, we performed an iterative jackhmmer search against the UniProt reference proteomes database using the prototypical XerD protein from Escherichia coli as an initial query. In every cycle of this search, the hit sequences were aligned and a profile hidden Markov model (profile HMM) was built. A profile HMM is a probabilistic model used to describe characteristic sequence features of the alignment. This profile HMM was then used as a new query in the next search cycle. This iterative procedure allows identification of distantly related homologues of the original query (Johnson et al, 2010; Potter et al, 2018). The resulting sequences were then clustered, and the representatives of the clusters were aligned. The alignment was truncated to contain only the CB and CAT regions, which are ubiquitously present in all YR proteins. This resulting alignment was then used to reconstruct the phylogenetic tree with the PhyML package (Fig 1A and Appendix Fig S1). The tree topology was supported by parametric aBayes and non-parametric SH-LRT tests (Anisimova et al, 2011). Based on phylogeny, we then divided YRs into subgroups with significant branch supports (over 0.98 and 0.85 for aBayes and SH-LRT, respectively; Appendix Table S1). For each subgroup, we created a distinctive profile HMM, which we then used to find all YR homologues in the UniProt reference proteomes collection. For the resulting sequences, we created sequence logos to visualize conserved regions within subgroups (Appendix Figs S2–S4) and analyzed the specific differences between subgroups (Fig 2). We mapped all YR proteins to their genomes of origin and tracked the taxonomic distribution of each subgroup (Fig 1B, Dataset EV1). Finally, we extracted the fifty most abundant YR proteins and characterized their distribution, classification, and putative function (Fig 1C, Dataset EV2).

Figure 1. Diversity and distribution of tyrosine recombinases (YRs)

  A. Maximum-likelihood phylogenetic tree of YRs. Two major groups of YRs, simple and arm-binding (AB) domain-containing YRs, are highlighted in blue and red, respectively. YR subgroups are shown as leaves in the tree. Statistical support for branching was evaluated by aBayes, and for all of the subgroups, its value is more than 0.98.
  B. Taxonomic distribution of YRs. On the top, a schematic tree of the YR phylogeny corresponding to panel (A) is shown (only nodes with statistical support of more than 0.98 are shown). Phylogeny of the bacterial taxa is shown on the left. The abundance of YRs from a specific subgroup in a particular taxon is indicated by different size dots in the plot (colored as in (A)). The exact numbers of genomes are provided in Dataset EV1.
  C. The fifty most abundant YR proteins found in the genomic sequences available from NCBI. The bars indicate YR abundance in different bacterial taxa with distinct colors. The YRs are named by the subgroup name (in bold) and functional classification. The names of simple and AB domain-containing YRs are colored like in (A). NCBI GI numbers for all the sequences are available in Dataset EV2.

Source data are available online for this figure.

Figure 2. Conservation analysis of tyrosine recombinase (YR) subgroups

For each of the subgroups, secondary structures of a representative family member were predicted using Jpred or retrieved from corresponding Protein Data Bank (PDB) entries. Helices and strands are shown as rectangles and arrows, respectively. The tyrosine nucleophile and the catalytic RKHRH pentad are marked. Characteristic structural variations of YRs that are conserved within distinct subgroups are highlighted in red. AB, arm-binding domain; CB, core-binding domain; CAT, catalytic domain; DUF3701, domain of unknown function (Pfam accession number PF12482).

This analysis showed that all YRs can be classified into two major phylogenetic groups: simple YRs, which consist of a CB and a CAT domain, and complex YRs, which contain an additional AB domain (Figs 1A and 2). Within these main groups, smaller subgroups were identified, which share a generally conserved domain architecture, but vary in specific structural and sequence features (Appendix Fig S1). Notably, YRs within subgroups have a characteristic taxonomic distribution and share similar predicted functions. In the following sections, we summarize the key sequence features and functional characteristics of all major groups and subgroups.

Simple YRs

The first major YR group revealed in our study includes simple YRs. Members of this group usually comprise only CB and CAT domains and can be further classified into fourteen subgroups (Figs 1A and 2, Appendix Fig S1).

The largest subgroup, Xer, mainly contains recombinases that are responsible for chromosome dimer resolution in bacteria and archaea, such as XerC/D, XerH, XerS, and XerA (Carnoy & Roten, 2009; Cortez et al, 2010; Nolivos et al, 2010; Debowski et al, 2012). Sequence comparisons revealed that proteins in this subgroup are highly conserved, with numerous residues conserved also outside of the active site pocket and the hydrophobic core (Appendix Figs S2–S4). The subgroup is widely distributed, and its members are present in almost all analyzed bacterial and archaeal classes (Fig 1B, Dataset EV1), which is consistent with the essential role of these proteins. In the remaining taxa, other class-specific simple YRs may compensate for Xer function. For example, in Halobacteria we found a specific type of simple YRs, named Arch1, which resemble Xer but contain a short distinct sequence insertion (Fig 2 and Appendix Fig S3). Similarly, Oscillatoriophycideae lack a Xer protein and instead contain members of the separate Cyan subgroup (named after Cyanobacteria, a phylum of the class). Furthermore, the Cand subgroup unites Xer-related YRs from unclassified “Candidate” phyla, a “microbial dark matter” (Rinke et al, 2013).

Arm-binding domain-containing tyrosine recombinases

The second large YR group unites proteins that contain an AB domain in addition to the CB and CAT domains (Appendix Fig S1). The best-characterized members of this group act as integrases of phages or ICEs. This AB domain-containing YR group consists of six major subgroups that are discussed in detail in the following sections.

IntTn916 subgroup

The largest subgroup of AB domain-containing YRs is the IntTn916 subgroup. It is the most diverse among the AB domain-containing YRs and contains integrases of numerous well-documented ICEs and phages. Its members are most highly represented in gram-positive bacteria, but we also found some examples in other taxa, such as Fusobacteria, Synergista, and Chlamydia (Fig 1B). This subgroup contains some of the most abundant AB domain-containing YRs, such as the mycobacterial phiRV2 prophage integrase (Cole et al, 1998 ) and the integrase of the tetracycline resistance-carrying Tn916 transposon (Franke & Clewell, 1981 ), each found in the genomes of about 4,000 bacterial strains (Fig 1C).

Generally, members of the subgroup contain an AB domain on their N-terminus, which features three beta-strands and one alpha helix (Figs 2 and 3), as seen in the NMR structure of the Tn916 integrase AB domain (Wojciak et al, 1999 ). In some cases, the AB domain was not directly predicted by Pfam (Appendix Fig S1), but our subsequent sequence analysis revealed that the AB domain is preserved throughout the subgroup (Fig 3). Another characteristic feature of the IntTn916 subgroup is a conserved beta-stranded insertion between the second and third beta-strands in the CAT domain (Fig 2 and Appendix Fig S3). Recent structural and biochemical work on the Tn1549 integrase showed that this protein segment is important for shaping the DNA substrate for recombination (Rubio-Cosials et al, 2018 ).

Figure 3. Sequence conservation of the arm-binding domains of tyrosine recombinases (YRs)

For each subgroup, web logos were produced after HMM search against the UniProt reference proteomes database and secondary structures were predicted using Jpred or retrieved from corresponding PDB entries (shown below the logos). The logos are colored by residue type, and the typical YR domain composition is shown above the logos as in Fig 2.

Notably, the phage- and ICE-related members of this subgroup do not form separate clusters; instead, most clusters contain integrases from both ICEs and phages (Appendix Fig S6). For example, many actinomycete ICE integrases cluster together with the integrases from actinobacterial phages (see cluster pSAM2 in Appendix Fig S6). Interestingly, many YRs within the clusters integrate their respective MGEs at specific genomic sites, with a recurring preference for the conserved flanks of essential genes, such as tRNA encoding genes (Appendix Fig S6). A notable exception is the specific cluster that includes the Tn916 and Tn1549 integrases, which insert into AT-rich regions without a strict sequence specificity (Trieu-Cuot et al, 1993; Scott et al, 1994; Wang et al, 2000; Lambertsen et al, 2018). This feature might have contributed to the success of the respective MGEs in spreading to a broad range of bacteria.

IntBPP-1 subgroup

IntBPP-1 is a smaller AB domain-containing YR subgroup, which is closely related to IntTn916. Its members are found in gammaproteobacteria, betaproteobacteria, and phages (Fig 1B). Examples of this subgroup include putative integrases of the Bordetella BPP-1 phage, the Stx2a phage, and the Salmonella Gifsy-2 phage (McClelland et al, 2001; Liu et al, 2004; Ogura et al, 2015), the latter being one of the most abundant proteins in this subgroup (Fig 1C). IntBPP-1 YRs feature an AB domain that is annotated as DUF3596 in Pfam (PF12167; Appendix Fig S1) and exhibits a canonical three beta-strand/one helix structure (Fig 3). Similar to IntTn916 members, the IntBPP-1 subgroup features a beta-stranded insertion between the second and third beta-strands in the CAT domain fold (Fig 2 and Appendix Fig S3). Members of the family also have weaker conservation of the first histidine in the catalytic RKHRH pentad (Appendix Fig S4).

IntCTnDOT subgroup

The second largest AB domain-containing YR subgroup is IntCTnDOT. It includes proteins from Bacteroidetes (Fig 1B), such as integrases of the ICE CTnDOT and mobilizable element NBU1 (Shoemaker et al, 1996; Whittle et al, 2002), as well as YRs from the Salmonella genomic island 1 (SG1) (Doublet et al, 2005; Douard et al, 2010) (Dataset EV3). Initial Pfam annotation suggested that YRs in this subgroup contain only CB and CAT domains, with a substantially larger predicted CB domain than the one found in simple YRs. However, secondary structure predictions previously proposed that the integrase of a prototype CTnDOT element from Bacteroides comprises a canonical AB domain (Kim et al, 2010) (Fig 3) and subsequent biochemical experiments confirmed its interaction with subterminal arm DNA sites in the transposon (DiChiara et al, 2007; Wood et al, 2010). In agreement, our comparative analysis revealed that the N-terminal segment of all IntCTnDOT members consists of two conserved domains: a canonical CB domain and an upstream AB domain (Fig 3 and Appendix Fig S1). Accordingly, we have updated the corresponding Pfam annotation, which is now available in the new version (Pfam 32.0).

Analyzing sequence logos, we further noted that YRs of the IntCTnDOT subgroup show a weaker conservation of the first arginine residue in the otherwise strictly preserved catalytic RKHRH pentad (Box I in Appendix Fig S2) in the CAT domain. Arginine is present in this position in NBU1, NBU2, and Tn4555 integrases, but it is absent in the integrases of CTnDOT, ERL (S), and Tn5520 elements (Cheng et al, 2000 ). Previous biochemical experiments showed that in the CTnDOT integrase, this residue is functionally substituted by another arginine located further downstream in the protein sequence (Kim et al, 2010 ). Consistently, we found that this alternative arginine is conserved in many integrases in the IntCTnDOT subgroup (see conserved R in IntCTnDOT logo in Appendix Fig S3). Thus, YRs of this subgroup carry the catalytic arginine in one of two alternative locations, resulting in a weaker overall conservation.

IntSXT subgroup

The next large subgroup of AB domain-containing YRs is IntSXT, which comprises integrases of several ICEs, genomic islands, and phages. A characteristic feature of this subgroup is the presence of an N-terminal DUF4102 domain (Appendix Fig S1). This was previously annotated as an AB domain of genomic island integrases (Szwagierczak et al, 2009) and contains an additional beta-strand and an alpha helix compared with AB domains of other YRs (Figs 2 and 3). Phylogenetic analysis revealed that two out of six clusters within the IntSXT subgroup contain integrases from both ICEs and phages (Appendix Fig S7). Members of major clusters share distinct genomic insertion profiles, integrating their MGEs near essential genes. For example, integrases of the P4 and Sf6 phages cluster together with various ICE YRs, all of which insert downstream of tRNA genes (P4 cluster, Appendix Fig S7) (Boyd et al, 2009; Van Houdt et al, 2012). Similarly, integrases of the epsilon15 phage, the CMGI-3 element, and related elements form a separate cluster, and all target the 3′ flank of the guaA gene involved in GMP biosynthesis (Kropinski et al, 2007; Bi et al, 2012) (epsilon15 cluster, Appendix Fig S7). The same pattern is seen for integrases of the Enterobacterial cdt1 phage, the SXT element, and closely related ICEs, all of which insert next to the prfC gene encoding a factor involved in termination of translation (Hochhut & Waldor, 1999; Asakura et al, 2007) (SXT cluster, Appendix Fig S7). Thus, members of each IntSXT cluster seem to drive their diverse MGEs into specific locations, perhaps owing to characteristic features in the integrase sequences. Their preference for the flanks of conserved genes might help promote their dissemination between species and explain their characteristic taxonomic distribution. In addition, the mixed distribution of ICE and phage integrases suggests that these elements frequently exchange their integrases. This is also supported by previous observations that ICEs with different conjugation machineries have closely related integrases (Cury et al, 2017).

IntP2 subgroup

The IntP2 subgroup of AB domain-containing YRs contains integrases from proteobacterial phages, such as HP1 and P2. Another interesting member of this subgroup is the plasmid-borne Rci recombinase, which regulates R64 plasmid conjugation by reshuffling distinct gene segments to generate diverse pili proteins (Komano et al, 1987; Gyohda & Komano, 2000; Roche et al, 2010). The CAT domains of YRs in this subgroup are highly similar to those of simple YRs, as also seen with previously determined crystal structures (Hickman et al, 1997; Skaar et al, 2015). Most YRs in this subgroup contain an AB domain with a classical fold (Fig 3), except the Rci recombinases that lack the AB domain. In agreement with previous sequence analyses (Boyd et al, 2009), our phylogenetic reconstructions suggest that IntP2 YRs are related to the lambda phage integrase; however, this clustering is not well supported by statistical analysis (Fig 1A and Appendix Fig S1). Although the well-studied lambda phage integrase is often used as a prototype for the tyrosine recombinase superfamily (Landy, 2015), our analysis revealed that it is quite different from other YRs. It contains substantial alterations even in the CAT domain, including an insertion of two beta-strands after the third beta-strand of the canonical fold, and the replacement of the C-terminal alpha helix with a beta-strand (Fig 2, Appendix Figs S3 and S4).

IntDes subgroup

Finally, IntDes is a small subgroup of AB domain-carrying YRs. Its members are found only in the genus Desulfovibrio of Deltaproteobacteria (Fig 1B). This subgroup features specific sequence perturbations in the catalytic core: Namely the first arginine residue of the RKHRH pentad is shifted in comparison with other YRs and the first histidine is substituted with a tyrosine (Appendix Figs S2 and S4). The biological function of these YRs has remained unknown to date.

Identification and classification of integrative and conjugative elements

The vast majority of the YRs that we analyzed remain unannotated in genomic databases. This particularly hinders identification and characterization of YR-carrying MGEs. To test whether our classification system can help predict YR function, we next checked whether the unannotated YRs found in ICE-related subgroups are indeed integrases of ICEs. For this, we examined the YRs’ genomic neighborhood to identify known conjugative machinery proteins (as in Guglielmini et al, 2014; Abby et al, 2016). If an integrase was found in proximity (± 100 kb) to known conjugation machinery proteins, then the corresponding region was considered to be a putative ICE (Fig 4A). ICEs retrieved from the ICEberg database were used for benchmarking. This analysis revealed a total of 59 previously unannotated ICEs (Appendix Fig S8, Dataset EV4). The putative ICEs were then further validated by manual identification of their terminal repeat sequences. We confidently identified terminal repeats in 50 out of 59 predicted ICEs. For 49 of these, the conjugation machinery was found within the predicted borders of the element, further confirming their identity. In one predicted element, the conjugation machinery was located outside of the borders (Dataset EV4), suggesting a coincidental co-occurrence of YR and conjugation genes in this instance.
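The proximity rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Gene` record, its `kind` labels and the example coordinates are hypothetical, and the sketch does not show how conjugation proteins are actually detected or how terminal repeats are validated.

```python
from dataclasses import dataclass

WINDOW_BP = 100_000  # +/- 100 kb, the window stated in the text


@dataclass
class Gene:
    name: str
    start: int  # start coordinate on the contig
    end: int    # inclusive end coordinate
    kind: str   # hypothetical labels: "integrase", "conjugation", "other"


def putative_ice_regions(genes, window=WINDOW_BP):
    """Return (integrase, nearby conjugation genes) for every integrase that
    has at least one conjugation gene within `window` bp on either side."""
    conj = [g for g in genes if g.kind == "conjugation"]
    hits = []
    for yr in (g for g in genes if g.kind == "integrase"):
        lo, hi = yr.start - window, yr.end + window
        # A conjugation gene counts if it overlaps the expanded interval.
        nearby = [c for c in conj if c.end >= lo and c.start <= hi]
        if nearby:
            hits.append((yr, nearby))
    return hits


# Made-up coordinates, for illustration only.
genes = [
    Gene("int1", 50_000, 51_200, "integrase"),
    Gene("traG", 120_000, 122_000, "conjugation"),
    Gene("int2", 900_000, 901_100, "integrase"),
    Gene("virB4", 1_200_000, 1_202_400, "conjugation"),
]
for yr, nearby in putative_ice_regions(genes):
    print(yr.name, [c.name for c in nearby])  # prints: int1 ['traG']
```

Here `int2` is not reported because its closest conjugation gene lies roughly 300 kb away, outside the window.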

Figure 4. Tyrosine recombinase-based ICE identification and characterization

  A. Overview of the computational pipeline for ICE identification. The genomic regions of the tyrosine recombinase (YR) genes were expanded 100 kb upstream and downstream and analyzed for the presence of conjugation-related genes and repeat sequences.
  B. Structural diversity of YR-carrying ICEs. All ICEs clustered into five subgroups based on their YR classification (left). The numbers of ICEs in each of the subgroups are displayed as bars with numbers (middle). Schematic representations of ICE architectures are shown, aligned by their integrase genes (red symbol, right). Protein open reading frames of various types of conjugation machineries are depicted with different colors as indicated at the bottom of the figure.

To further characterize the detected ICEs, we aimed to reconstruct the naive insertion site (i.e., the bacterial genomic sequence prior to integration) of the identified ICEs and look for such undisrupted sites in closely related genomes. As functional ICEs can move to new genomic sites, successful identification of naive sites can provide ultimate confirmation of their identity and mobile nature. However, identification of such naive sites requires recent mobility of the ICE and may also be challenged by a limited availability of complete genome sequence data for related species in public databases. Nevertheless, we found naive sites for 18 out of the 49 ICEs, which further validates these elements and indicates their recent activity (Dataset EV4, Appendix Fig S9).

YRs in the new ICEs belonged to five YR subgroups (Fig 4B, Dataset EV4), with most examples found in the IntTn916 (23), IntP2 (17) and IntSXT (14) subgroups. To further analyze the detected ICEs, we next reconstructed the phylogeny of their YRs and plotted the genetic structure of their respective conjugation machineries (Fig 4B and Appendix Fig S8). ICEs with closely related YRs were generally associated with closely related conjugation systems, but ICE groups with somewhat more distantly related YR proteins often contained unrelated types of conjugation modules (Fig 4B and Appendix Fig S8). For instance, ICE groups that carry YRs from the diverse IntTn916 and IntSXT subgroups revealed various conjugation systems. In turn, some clusters of the IntSXT YRs and the distinct IntKX YRs were associated with the same conjugation system, called MPFG (Fig 4B and Appendix Fig S8), located on different sides of the YR. Altogether, this suggests recurrent exchange of conjugation modules between distantly related ICEs, in accordance with previous reports (Cury et al, 2017).

Furthermore, to complete the characterization of the ICEs' mobilization machinery, we looked for excisionase (Xis) genes within newly identified and previously reported ICEs (Fig 4B and Appendix Fig S8). Xis regulates the directionality of the recombination reaction in some of the known YR-containing systems (Connolly et al, 2002; Wood & Gardner, 2015). We found that only AB-containing YRs are associated with Xis proteins, which may suggest potential cooperation between the AB domain and Xis. Consistent with this idea, a physical interaction was recently proposed for the integrase and Xis of the lambda phage (Cho et al, 2002; Laxmikanthan et al, 2016). We could not detect Xis in any of the 15 ICEs with simple YRs from the IntKX subgroup.

Taken together, successful identification of new ICEs confirms the predictive value of our classification system for automated annotation of YR function and demonstrates its utility to improve characterization of the bacterial mobilome.


Many MSA programs are freely available. However, choosing the most suitable program for each dataset is not trivial. The characteristics of the sequences to be aligned, such as their shared identity, number and length, must be assessed in every MSA-dependent project. The parameterization of each MSA program, such as the choice of substitution matrix and gap opening/extension penalties, when available, also strongly affects the final alignment [24]. Running MSA programs with default parameters is usually preferred when no information about the sequences to be aligned is available and/or for users without previous knowledge in this particular field of sequence analysis. With that in mind, we chose to benchmark a selection of programs mostly with their default options. Although the results presented herein are compatible with current low-cost hardware and the timelines of most research projects, they must be used only as guidelines, and we encourage users to carefully study each program’s parameters in order to obtain the best possible output. The BAliBASE suite is a reliable benchmarking dataset, but it might still be considered too small to represent certain MSA projects [21]. Thus, understanding each program’s own limitations is imperative in order to generate reliable results.

As stated in related papers [21, 22], no available MSA program outperformed all others in all test cases. For the first five reference sets, our results indicated that T-Coffee, Probcons, MAFFT and Probalign were definitely superior with regard to alignment accuracy in all BAliBASE datasets, consistent with similar publications [7, 8, 21, 22]. All four programs use a consistency-based approach in their algorithms, which has thus proved to be a successful improvement in sequence alignment. Despite meeting certain consistency criteria, DIALIGN-TX is based on local pairwise alignments and is known to be outperformed by global aligners [5]. Nevertheless, we observed that the consistency-based approach alone may not offer the highest alignment quality. CLUSTAL OMEGA, which has no consistency step, did well when aligning some datasets with long N/C terminal ends from full-length sequences (BB). The presence of these non-conserved residues at the terminal ends, on the other hand, reduced the scores of the alignments generated by T-Coffee and Probcons, which produced the highest SP/TC scores when aligning the truncated sequences (BBS). Despite having an iterative refinement step, which could improve results, Probcons is still a global alignment program and is therefore more prone to alignment errors induced by non-conserved residues at terminal ends [20]. MAFFT, Probalign and even CLUSTAL OMEGA may therefore be preferred over T-Coffee and Probcons when aligning sequences with these long terminal extensions. The combination of an iterative refinement strategy with consistency from local alignments in MAFFT (L-INS-i method) might have helped to prevent and correct errors in the alignment of the full-length sequences [22]. Similarly, the suboptimal alignments (determined by variations of the Temperature parameter) generated by the partition function of Probalign might also have improved the ability of this program to deal with sequences with non-conserved terminal extensions [8]. Apparently, the profile HMM of long sequences also improved the alignments produced by CLUSTAL OMEGA.

As for the remaining reference sets of BAliBASE (6, 7 and 9), we observed that the four consistency-based programs mentioned above still generated better alignments, although MUSCLE presented improved results. In some subsets of Reference 9, MUSCLE was either close to or better than some of the top four SP/TC scoring programs. In this reference set, the alignment of sequences with linear motifs generated by MUSCLE might be facilitated by Kimura’s distance, which is used in the second stage of this program’s progressive alignment. The Kimura distance states that only exact matches contribute to the match score. Although fast, the method has limitations, since it does not consider which amino acid changes occur between sequences. This limitation may turn into a benefit: by assuming the same penalty for any amino acid substitution in the early steps of progressive alignment, the program would avoid a distance increase between pairs of close sequences with errors or wildcard residues (any amino acid) at the linear motifs.
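To make the point about exact matches concrete, the following sketch computes a Kimura-style distance between two already aligned protein sequences. It assumes the commonly cited correction d = -ln(1 - D - D^2/5), where D is the fraction of mismatched ungapped columns. The function names are illustrative, and the sketch makes no claim to reproduce the exact procedure inside MUSCLE.

```python
import math


def fractional_identity(a, b):
    """Fraction of exactly matching residues over the columns in which
    neither aligned sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x == y for x, y in pairs) / len(pairs)


def kimura_distance(a, b):
    """Kimura correction for protein distances: d = -ln(1 - D - D^2 / 5),
    where D = 1 - fractional identity."""
    d_obs = 1.0 - fractional_identity(a, b)
    arg = 1.0 - d_obs - d_obs * d_obs / 5.0
    if arg <= 0:
        return float("inf")  # too divergent for the formula to apply
    return -math.log(arg)


# A conservative (L -> I) and a radical (L -> W) substitution give the same
# distance, because only exact matches are counted.
print(kimura_distance("MKVL", "MKVI") == kimura_distance("MKVL", "MKVW"))  # True
```

The equality in the last line is the limitation discussed above: the distance ignores which amino acids are exchanged.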

In the largest BAliBASE datasets, the use of the multi-core capability of T-Coffee was indispensable in order to evaluate alignment accuracy because, when running in single-core mode, its computational time far exceeded the pre-established threshold of 2.5 hours. In the biggest dataset (the last subset of Reference 9), T-Coffee took more than nine days to complete the alignment. The parallelization of T-Coffee should certainly be seen as a major improvement to an MSA program, as processing cores are growing in number even in home desktop computers, not to mention more and faster RAM modules. Interestingly, MAFFT was the only program among the top four SP/TC scoring programs able to align all reference sets in less than 2.5 hours with the pre-established settings described in the Methodology section. This is most likely due to the flexibility of the “auto” mode of MAFFT, which chooses the most appropriate alignment method according to dataset size, switching from a high-accuracy mode (L-INS-i) to a faster but less accurate mode (FFT-NS-2) [25]. Although not available in the version used in this work, recent improvements in parallelization were also achieved for MAFFT [26], indicating a tendency to make full use of available hardware and reduce the execution time of MSA programs. Besides parallelization, there is still much room for performance improvement in multiple sequence alignment. For example, CLUSTAL OMEGA implemented a modified version of mBed [27], which produced fast and accurate guide trees and reduced the computational time and memory required to align large datasets. Apart from performance, there is also much room for accuracy improvements, as some results presented in this study were still far from the BAliBASE reference alignments.

HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features

DNA-binding proteins often play an important role in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to predict them. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features for the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as a classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.

1. Introduction

DNA-binding proteins play a vital role in various cellular processes. They are essential in transcriptional regulation, recombination, genome rearrangements, replication, repair, and DNA modification [1]. Proteins that bind to DNA in both eukaryotes and prokaryotes while acting as activators or repressors are DNA-binding proteins. It has been observed that the percentages of prokaryotic and eukaryotic proteins that can bind to DNA are only 2-3% and 4-5%, respectively [2, 3]. A wide variety of experimental methods, such as in vitro methods [4, 5] like filter binding assays, chromatin immunoprecipitation on microarrays (ChIP-chip), genetic analysis, and X-ray crystallography, are used to identify DNA-binding proteins. However, these methods have proven to be expensive and time consuming. Therefore, there is a growing demand for a fast and cost-effective computational method to solve this problem.

Most of the computational methods used in the literature to predict DNA-binding proteins formulate the problem as a supervised learning problem. In practice, the number of known DNA-binding proteins is very small compared to the large number of non-DNA-binding and unknown proteins. DNA-binding protein prediction is often modeled as a binary classification problem where, given a protein sequence as input, the task is to predict whether the protein is DNA-binding or not. Note that the challenge here is to select a proper dataset for training and testing that accounts for this imbalance. Many supervised learning algorithms have been used in the literature to solve the problem. Among them, Artificial Neural Networks (ANN) [6], Support Vector Machines (SVM) [7, 8], ensemble methods [9], the Naive Bayes classifier [10], Random Forest [11], Convolutional Neural Networks [12], Logistic Regression [13], the AdaBoost classifier [5], and so on are well-regarded. Support Vector Machines (SVM) are among the best performing classifiers used for DNA-binding protein identification [7, 8, 14, 15].

A great number of web based tools and methods have been developed for DNA-binding protein prediction and are available for use. In this paper, we would like to mention several of them: DNABinder [7], DNA-Prot [16], iDNA-Prot [11], iDNA-Prot|dis [14], DBPPred [17], iDNAPro-PseAAC [8], PseDNA-Pro [18], Kmer1 + ACC [19], Local-DPP [20], SVM-PSSM-DT [21], PNImodeler [22], CNNsite [12], and BindUP [23]. Most of these methods have used sequence, profile, or structure based features. In the structural feature based methods in the literature, the features used were structural motifs, electrostatic potential, the dipole moment, and α-carbon only models [13, 24, 25]. On the other hand, sequence based methods often depended on PSSM profile based information or pseudo-amino-acid compositions [8, 14, 15, 17, 20, 26, 27]. In [28], HMM based profiles were used for generating features for protein fold recognition.

In this paper, we propose HMMBinder, a novel DNA-binding protein prediction tool using HMM profile based features of a protein sequence. Our method uses monogram and bigram features derived from the HMM profile which shows effectiveness compared to the PSSM or sequence based features. We also use SVM as the classifier and standard benchmark datasets to test our method. Using the standard evaluation metrics, our method significantly improves over the state-of-the-art methods and the features used in the literature. We also developed a web server that is publicly available at

The rest of the paper is organized following the general 5-step guideline suggested in [29] for protein attribute prediction. First, benchmark datasets selected for this problem are described followed by a description of the protein representation by extraction of features. Then we describe the classification algorithm that we selected for our approach followed by the performance evaluation techniques deployed in this paper. Lastly, we describe the web server that we developed for this problem. The results section presents the details of the experimental results followed by an analytical discussion. The paper concludes with a summary and indication of future work.

2. Methods and Materials

In this section, we provide the details of the materials and methods of this paper. Figure 1 provides a system diagram of our proposed method. For the training phase, all the protein sequences are fed to HHBlits [30], a sequence-to-sequence alignment software using the latest UniProt database. HHBlits produces an HMM file as output, which is then used by our feature extraction method to generate monogram and bigram features. Monogram and bigram features are concatenated and then used as the training feature set to train the classifier. We use SVM with a linear kernel as the classification algorithm, and the trained model is stored for the testing phase. The testing phase is similar to the training phase; however, the labels for the test dataset are not given to the classifier. This stored model is also used for the web server implementation of HMMBinder.

2.1. Datasets

Selection of benchmark datasets is essential in classification and prediction design. In this paper we use a popular benchmark dataset called benchmark1075 to train our model. Later we test the performance using cross validation and on a separate independent test set known as the independent186 dataset. This section provides a brief overview of these two datasets. Both datasets are widely used in the DNA-binding protein prediction literature [8, 14, 18, 20, 31].

2.1.1. Dataset Benchmark1075

This dataset was first introduced in [14]. It consists of 1075 protein sequences, of which 525 are DNA-binding and 550 are non-DNA-binding. All the protein sequences were taken from PDB [32]. This dataset is one of the largest DNA-binding protein prediction datasets and is thus suitable for training purposes.

2.1.2. Dataset Independent186

Lou et al. [17] constructed this independent dataset consisting of 93 DNA-binding and 93 non-DNA-binding protein sequences. They used BLASTCLUST [33] on the benchmark dataset to remove the sequences that have more than 25% similarity.

2.2. Feature Extraction

A training dataset $D$ used for a binary classification problem consists of two types of instances: positive and negative. Formally,

$$D = D^{+} \cup D^{-}.$$

Next, the task is to represent each protein instance as a feature vector suitable for training:

$$P = [f_1, f_2, \ldots, f_n].$$

Here, each protein, $P$, is shown as a feature vector with dimension $n$. Most of the methods in the literature of DNA-binding protein prediction use either sequence and PSSM profile based features or structure based features. To the best of our knowledge, there has been no application of features using HMM profiles. In this paper, we have used HHBlits [30] to generate HMM profiles. HMM profiles are comparatively more effective [30, 34] for remote homology detection. HMM profiles were generated using four iterations of HHBlits with a cutoff value set to 0.001 using the latest UniProt database [35]. HMM profiles are $L \times 20$ matrices produced by HHBlits, where $L$ is the length of the protein sequence. These 20 values are the substitution probabilities of each type of amino-acid residue at each position along the protein sequence. These values are first converted to linear probabilities using the following formula:

$$p = 2^{-N/1000},$$

where $N$ is the value stored in the profile.

We generated two types of features, monogram and bigram, using the generated HMM profile matrix, noted here as $H$. We provide a brief description of monogram and bigram features extracted from the HMM profile matrix.

2.2.1. Monogram Features

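Monogram features [36] are calculated by taking the normalized sum of the column-wise substitution probability values. The size of this feature group is 20, one value per amino acid type. The feature can be defined formally as follows:

$$M(j) = \frac{1}{L} \sum_{i=1}^{L} H_{i,j}, \quad j = 1, \ldots, 20,$$

where $H$ is the $L \times 20$ HMM profile matrix of linear probabilities and $L$ is the length of the protein sequence.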
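As a minimal sketch of the monogram computation, the code below converts raw profile entries to linear probabilities and averages each column over the sequence length. It assumes the HH-suite `.hhm` convention, in which entries are stored as -1000 * log2(p) and `*` stands for a probability of zero. The function names are illustrative, and the toy profile has three columns instead of 20 to keep the example short.

```python
def hhm_to_probability(value):
    """Convert one raw profile entry to a linear probability.
    Assumes the .hhm convention: -1000 * log2(p) as an integer, '*' for p = 0."""
    if value == "*":
        return 0.0
    return 2.0 ** (-int(value) / 1000.0)


def monogram_features(profile):
    """profile: list of L rows, each holding one linear probability per
    amino acid column. Returns one value per column: the column sum
    divided by the sequence length L."""
    length = len(profile)
    n_cols = len(profile[0])
    return [sum(row[j] for row in profile) / length for j in range(n_cols)]


# Toy profile with L = 2 positions and 3 columns (a real profile has 20).
raw = [["1000", "*", "0"],
       ["1000", "1000", "*"]]
profile = [[hhm_to_probability(v) for v in row] for row in raw]
print(monogram_features(profile))  # [0.5, 0.25, 0.5]
```

For a real profile the same function returns a 20-dimensional vector, independent of the protein length.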


Although the relative performance of MSA methods depended on the dataset, in most cases, UPP produced alignments with lower SP-error rates and higher TC scores than MAFFT, Muscle, and Clustal-Omega. ML trees computed with UPP alignments were also more accurate than ML trees for the other alignments. However, the comparison between UPP and PASTA is more interesting. Because UPP uses PASTA to compute its backbone alignment and tree, by design, UPP is identical to PASTA for fragment-free datasets containing at most 1000 sequences. With respect to alignment accuracy, UPP alignments tend to have lower SP-error rates than PASTA alignments but also lower TC scores, indicating that these two criteria are not that well correlated. However, ML trees based on PASTA alignments (for fragment-free datasets) are typically more accurate than ML trees based on UPP alignments. For datasets with fragmentary sequences, UPP has nearly the same SP-error rates that it achieves with the full-length sequences, while PASTA’s SP-error rates increase substantially with fragmentation; consequently, UPP’s ΔFN tree error rates do not tend to increase that much with fragmentation, although PASTA’s do. Thus, UPP is highly robust to fragmentary data whereas PASTA is not. Hence, while PASTA has an advantage over UPP for datasets without fragments, UPP presents advantages relative to PASTA for datasets with fragments.

To understand UPP’s performance, it is useful to consider the alignment strategy it uses. First, it computes a backbone alignment using PASTA for a relatively small (at most 1000-sequence) dataset; this allows it to begin with a highly accurate alignment. Then, instead of using a single profile HMM to represent its backbone alignment, UPP uses a collection of profile HMMs, each on a subset of the sequences. The subsets are obtained from local regions of the backbone tree, which is an ML tree estimated for the backbone sequences. Hence, the sequences in these subsets tend to be closely related. The induced subset alignments for these smaller localized regions are thus better suited for HMMs, especially when the full dataset displays overall substantial heterogeneity.

These observations help explain why using multiple HMMs, each for a region within the backbone tree, provides improved alignments compared to using a single HMM. However, UPP also restricts the backbone to the full-length sequences, and this algorithmic step is critical to improving robustness to fragmentary sequences. Hence, these two aspects of UPP’s algorithmic design (restricting the backbone to full-length sequences and using an ensemble of HMMs instead of a single HMM) increase sensitivity to remote homology, especially for fragmentary sequences, and reduce alignment SP-error and tree error, but each targets a different aspect of algorithmic performance.

UPP exhibits great scalability with respect to running time (which scales in a nearly linear manner), parallelism, and alignment accuracy. For example, our study showed the alignment SP-error for the backbone alignment is quite close to the alignment SP-error for the alignment returned by UPP. Thus, UPP enables large datasets to be aligned nearly as accurately as smaller datasets.

Overall, UPP is an MSA method that can provide very high accuracy for sequence datasets that have been considered too difficult to align, including datasets with high rates of evolution, fragmentary sequences, or many thousands of sequences, even up to one million sequences. UPP performs well for both phylogenetic and structural benchmarks (see [25] for further discussion of these related but different tasks). Finally, UPP is parallelized (for shared memory) and has a checkpointing feature, but does not require supercomputers to achieve excellent accuracy for ultra-large datasets in reasonable time frames.


In bioinformatics, multiple sequence alignment (MSA) is a fundamental concept. It aims to align more than two biomolecular sequences and is applied in various biological analysis tasks, for example, protein structure prediction and phylogenetic inference [1]. Using MSA to find sequence differences can assist in the construction and annotation of biological ontologies, for example, the largest ontology in the world, Gene Ontology [2], on which researchers have conducted a lot of work [3–7]. To extract and share alignment knowledge, researchers have established some ontologies based on multiple sequence alignment [8]. In addition, multiple sequence alignment can help to call SNPs and thus to find disease-related gene variants [9–13].

There are many types of methods for multiple sequence alignment, and most of them are progressive [1]. To align a set of sequences with a progressive method, we first perform a pairwise alignment for each pair of sequences and compute the distance of the pair. A distance matrix is built from the distances of all pairs. Subsequently, a guide tree is generated from the distance matrix. As the last step, profile-profile alignments are executed progressively in the order given by the guide tree.
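The step from distance matrix to guide tree can be sketched as follows. The text does not say which tree-building method is used; UPGMA (average linkage) is assumed here only because it is a common and compact choice. The distance values are made up, and the function names are illustrative.

```python
def upgma(names, dist):
    """Build a guide tree as nested tuples from a symmetric distance matrix
    using UPGMA. dist[i][j] is the distance between names[i] and names[j]."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted average distance to the merged cluster.
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


def merge_order(tree):
    """Internal nodes in the order a progressive aligner would process them
    (children before parents)."""
    if isinstance(tree, str):
        return []
    left, right = tree
    return merge_order(left) + merge_order(right) + [tree]


names = ["s1", "s2", "s3", "s4"]
dist = [[0.0, 0.1, 0.6, 0.7],
        [0.1, 0.0, 0.5, 0.8],
        [0.6, 0.5, 0.0, 0.2],
        [0.7, 0.8, 0.2, 0.0]]
tree = upgma(names, dist)
print(tree)  # (('s1', 's2'), ('s3', 's4'))
```

Each entry returned by `merge_order(tree)` corresponds to one sequence-sequence or profile-profile alignment in the final progressive stage.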

For two sequences, the pairwise alignment task simply applies dynamic programming. The scoring function for dynamic programming is usually based on a substitution matrix, for example, BLOSUM62 or PAM250 for protein sequences. In the multiple sequence alignment problem, the algorithms also apply dynamic programming to align two given sequences x and y; however, the scoring function is no longer based simply on a substitution matrix, since whether residue $x_i$ should be aligned with residue $y_j$ depends not only on sequences x and y but also on the other sequences. Numerous algorithms use the posterior probability $P(x_i \sim y_j \mid x, y)$ to compute the substitution scores. $P(x_i \sim y_j \mid x, y)$ represents the probability that the residue at position $x_i$ in sequence x and the residue at position $y_j$ in sequence y are matched in the “true” multiple sequence alignment [14].
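The pairwise dynamic programming step can be illustrated with a minimal global alignment (Needleman-Wunsch) sketch. It assumes a linear gap penalty and takes the substitution score as a function, so a real matrix such as BLOSUM62 could be plugged in. The +1/-1 `toy_score` used here is a stand-in, not a real substitution matrix.

```python
def needleman_wunsch(x, y, score, gap=-2):
    """Global pairwise alignment by dynamic programming with a linear gap
    penalty. score(a, b) gives the substitution score for residues a and b."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] + gap,                            # x_i against a gap
                F[i][j - 1] + gap,                            # y_j against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def toy_score(a, b):
    """Stand-in scoring: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


print(needleman_wunsch("ACGT", "AGT", toy_score))  # (1, 'ACGT', 'A-GT')
```

In posterior-based aligners, the same recurrence is reused with the substitution score replaced by a posterior-derived score.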

Different algorithms use a large number of approaches to calculate the posterior probability. Most progressive alignment algorithms apply a hidden Markov model (HMM) to calculate it, for example, ProbCons [15]. Other algorithms apply different probability consistency approaches, for instance, the partition function, which Probalign [16] uses to calculate the posterior probability.
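A partition-function posterior can be sketched with forward and backward sums over all alignments, each weighted by exp(score / T). The version below is a deliberately simplified illustration: it assumes a linear gap penalty and a +1/-1 toy score, so it does not reproduce the parameters or affine gap model of any published tool. All names and default values are hypothetical.

```python
import math


def match_posteriors(x, y, score, gap=-2.0, temperature=1.0):
    """Posterior probability that x[i] is aligned with y[j], computed from
    forward and backward partition functions over all global alignments.
    Simplification: linear gap penalty, one Boltzmann weight per column."""
    n, m = len(x), len(y)
    g = math.exp(gap / temperature)
    w = [[math.exp(score(a, b) / temperature) for b in y] for a in x]

    # Forward: zf[i][j] sums the weights of all alignments of x[:i] and y[:j].
    zf = [[0.0] * (m + 1) for _ in range(n + 1)]
    zf[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                zf[i][j] += zf[i - 1][j - 1] * w[i - 1][j - 1]
            if i > 0:
                zf[i][j] += zf[i - 1][j] * g
            if j > 0:
                zf[i][j] += zf[i][j - 1] * g

    # Backward: zb[i][j] sums the weights of all alignments of x[i:] and y[j:].
    zb = [[0.0] * (m + 1) for _ in range(n + 1)]
    zb[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                zb[i][j] += zb[i + 1][j + 1] * w[i][j]
            if i < n:
                zb[i][j] += zb[i + 1][j] * g
            if j < m:
                zb[i][j] += zb[i][j + 1] * g

    z = zf[n][m]  # total partition function
    return [[zf[i][j] * w[i][j] * zb[i + 1][j + 1] / z for j in range(m)]
            for i in range(n)]


def toy_score(a, b):
    """Stand-in scoring: +1 for identical residues, -1 otherwise."""
    return 1.0 if a == b else -1.0


post = match_posteriors("ACGT", "AGT", toy_score)
for row in post:
    print(" ".join(f"{p:.3f}" for p in row))
```

Each row sums to at most 1, with the remainder being the probability that the residue is aligned to a gap. Raising `temperature` flattens the distribution over alternative alignments.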

Howell et al. [17] and McCaskill et al. [18] used the partition function to predict RNA secondary structure. Song et al. [19] used the partition function to align RNA pseudoknot structures. The use of the partition function for alignment was pioneered by Miyazawa [20]. Wolfsheimer et al. [21] studied the parameters of the partition function for alignment. MSARC uses a residue clustering method based on the partition function to align multiple sequences [22]. Retzlaff et al. [23] used the partition function as part of the calculation for partially local multi-way alignments. The partition function is thus a useful model for alignment.

Some algorithms apply integrated approaches. For instance, MSAProbs [24] and QuickProbs [25] calculate the posterior probability from a combination of an HMM and the partition function, while GLProbs [26] calculates the posterior probability adaptively, based on the mean sequence identity of the set. These papers indicated that combining two or more types of posterior probability produces better results than using a single type.

To optimize the parameters of the HMM in the MSA problem, various optimization algorithms have been employed, such as Particle Swarm Optimization [27–30], Evolutionary Algorithms [31] and Simulated Annealing [32], in order to improve alignment accuracy.

Won et al. [33] used an evolutionary method to learn the HMM structure for prediction of protein secondary structure. Rasmussen et al. [27] used a hybrid particle swarm optimization and evolutionary algorithm method to train the hidden Markov model for multiple sequence alignment. Long et al. [28] and Sun et al. [29] used a quantum-behaved particle swarm optimization method to train the HMM for MSA. Sun et al. [30] also used a random drift particle swarm optimization method to train the HMM for MSA.

Nevertheless, the combination of the partition function and an optimized HMM was not explored in these studies. Therefore, a novel MSA algorithm called ProbPFP is presented in this paper. ProbPFP integrates the posterior probabilities yielded by a particle swarm optimized HMM with those yielded by the partition function.

We compared ProbPFP with 13 outstanding or classic approaches, namely Probalign [16], ProbCons [15], DIALIGN [34], Clustal Ω [35], PicXAA [36], KALIGN2 [37], COBALT [38], CONTRAlign [39], Align-m [40], MUSCLE [41], MAFFT [42], T-Coffee [43], and ClustalW [44], according to the total column score and the sum-of-pairs score. The results indicated that ProbPFP achieved the highest mean scores on the two benchmark datasets SABmark [40] and OXBench [45], along with the second highest mean score on the BAliBASE dataset [46].
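For readers unfamiliar with the two scores, the sketch below shows one simple way to compute them against a reference alignment. It assumes that the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and that the total column score is the fraction of reference columns reproduced exactly. Official benchmark scripts typically restrict scoring to annotated core regions, which this illustration does not do, and the function names and toy alignments are hypothetical.

```python
def column_maps(alignment):
    """For each column, record which residue index of each sequence sits
    there (None for a gap). alignment: list of equal-length gapped strings."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(cols):
    """All pairs of residues ((seq, index), (seq, index)) sharing a column."""
    pairs = set()
    for col in cols:
        placed = [(s, r) for s, r in enumerate(col) if r is not None]
        for a in range(len(placed)):
            for b in range(a + 1, len(placed)):
                pairs.add((placed[a], placed[b]))
    return pairs


def sp_and_tc(test, reference):
    """Return (sum-of-pairs score, total column score) of test vs reference."""
    ref_cols, test_cols = column_maps(reference), column_maps(test)
    ref_pairs = aligned_pairs(ref_cols)
    sp = len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs)
    # Only reference columns holding at least two residues are scored.
    scored = [c for c in ref_cols if sum(r is not None for r in c) >= 2]
    test_set = set(test_cols)
    tc = sum(c in test_set for c in scored) / len(scored)
    return sp, tc


reference = ["ACG-T", "A-GCT", "ACGCT"]
test = ["ACG-T", "AG-CT", "ACGCT"]
sp, tc = sp_and_tc(test, reference)
print(round(sp, 3), round(tc, 3))  # 0.818 0.6
```

In this toy case a single misplaced residue removes 2 of the 11 reference pairs but breaks 2 of the 5 columns, which shows why the total column score is the stricter of the two measures.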



Kato, M., Li, J., Chuang, J. L. & Chuang, D. T. Distinct structural mechanisms for inhibition of pyruvate dehydrogenase kinase isoforms by AZD7545, dichloroacetate, and radicicol. Structure 15, 992–1004, (2007).

Cheng, H. et al. ECOD: an evolutionary classification of protein domains. PLOS Comput. Biol. 10, e1003926 (2014).

Tsutakawa, S. E., Jingami, H. & Morikawa, K. Recognition of a TG mismatch: the crystal structure of very short patch repair endonuclease in complex with a DNA duplex. Cell 99, 615–623 (1999).

Braschi, B. et al. Genenames. org: the HGNC and VGNC resources in 2019. Nucleic Acids Res. 47, D786–D792 (2018).

Tai, C.-H., Vincent, J. J., Kim, C. & Lee, B. SE: an algorithm for deriving sequence alignment from a pair of superimposed structures. BMC Bioinformatics 10, S4 (2009).

Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).

Waterhouse, A. M., Procter, J. B., Martin, D. M., Clamp, M. & Barton, G. J. Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009).

Zhang, W. et al. Crystal structures of the Gon7/Pcc1 and Bud32/Cgi121 complexes provide a model for the complete yeast KEOPS complex. Nucleic Acids Res. 43, 3358–3372, (2015).

Padyana, A. K., Qiu, H., Roll-Mecak, A., Hinnebusch, A. G. & Burley, S. K. Structural basis for autoinhibition and mutational activation of eukaryotic initiation factor 2alpha protein kinase GCN2. J. Biol. Chem. 280, 29289–29299, (2005).

Kumar, A. et al. Structure of PINK1 and mechanisms of Parkinson’s disease-associated mutations. eLife 6, (2017).

Christie, M., Boland, A., Huntzinger, E., Weichenrieder, O. & Izaurralde, E. Structure of the PAN3 pseudokinase reveals the basis for interactions with the PAN2 deadenylase and the GW182 proteins. Mol. Cell 51, 360–373, (2013).

Nagae, M. et al. 3D structural analysis of protein O-mannosyl kinase, POMK, a causative gene product of dystroglycanopathy. Genes Cells 22, 348–359, (2017).

Xu, Q. et al. Identifying three-dimensional structures of autophosphorylation complexes in crystals of protein kinases. Sci Signal 8, rs13, (2015).

Crooks, G. E., Hon, G., Chandonia, J.-M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004).

Modi, V. & Dunbrack, R. L. Defining a new nomenclature for the structures of active and inactive kinases. Proceedings of the National Academy of Sciences 116, 6818–6827 (2019).

Jaccard, P. La distribution de la flore dans la zone alpine. Revue générale des sciences pures et appliqué 15(Dec), 961–967 (1907).

Xiong, S. et al. Structural basis for auto-inhibition of the NDR1 kinase domain by an atypically long activation segment. Structure 26, 1101–1115. e1106 (2018).

Hanks, S. K., Quinn, A. M. & Hunter, T. The protein kinase family: conserved features and deduced phylogeny of the catalytic domains. Science 241, 42–52 (1988).

Hunter, T. In Methods Enzymol. Vol. 200 3–37 (Elsevier, 1991).

Talavera, G. & Castresana, J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56, 564–577 (2007).

Kumar, S., Stecher, G., Li, M., Knyaz, C. & Tamura, K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol. 35, 1547–1549 (2018).

Letunic, I. & Bork, P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 44, W242–W245 (2016).

Lemoine, F. et al. Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature 556, 452 (2018).

de Cárcer, G., Manning, G. & Malumbres, M. From Plk1 to Plk5: functional evolution of polo-like kinases. Cell cycle 10, 2255–2262 (2011).

Needham, E. J., Parker, B. L., Burykin, T., James, D. E. & Humphrey, S. J. Illuminating the dark phosphoproteome. Sci. Signal. 12, eaau8645 (2019).

Sauder, J. M., Arthur, J. W. & Dunbrack, R. L. Jr. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Structure, Function and Genetics 40, 6–22 (2000).

Yona, G. & Levitt, M. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol. 315, 1257–1275 (2002).

Fox, G., Sievers, F. & Higgins, D. G. Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments. Bioinformatics 32, 814–820 (2015).

Le, Q., Sievers, F. & Higgins, D. G. Protein multiple sequence alignment benchmarking through secondary structure prediction. Bioinformatics 33, 1331–1337 (2017).

Tokumitsu, H., Wayman, G. A., Muramatsu, M. & Soderling, T. R. Calcium/calmodulin-dependent protein kinase kinase: identification of regulatory domains. Biochemistry 36, 12823–12827 (1997).

Osawa, M. et al. A novel target recognition revealed by calmodulin in complex with Ca 2+-calmodulin-dependent kinase kinase. Nat. Struct. Mol. Biol. 6, 819 (1999).

Tokumitsu, H., Muramatsu, M.-a., Ikura, M. & Kobayashi, R. Regulatory mechanism of Ca2+/calmodulin-dependent protein kinase kinase. J. Biol. Chem. 275, 20090–20095 (2000).

Dai, G. et al. Calmodulin activation of polo-like kinase 1 is required during mitotic entry. Biochem. Cell Biol. 91, 287–294 (2013).

Kauselmann, G. et al. The polo-like protein kinases Fnk and Snk associate with a Ca2+-and integrin-binding protein and are regulated dynamically with synaptic plasticity. The EMBO journal 18, 5528–5539 (1999).

Plotnikova, O. V., Pugacheva, E. N., Dunbrack, R. L. & Golemis, E. A. Rapid calcium-dependent activation of Aurora-A kinase. Nature communications 1, 64, (2010).

Mallampalli, R. K., Glasser, J. R., Coon, T. A. & Chen, B. B. Calmodulin protects Aurora B on the midbody to regulate the fidelity of cytokinesis. Cell Cycle 12, 663–673 (2013).

Brinkworth, R. I., Breinl, R. A. & Kobe, B. Structural basis and prediction of substrate specificity in protein serine/threonine kinases. Proceedings of the National Academy of Sciences 100, 74–79 (2003).

Anastassiadis, T., Deacon, S. W., Devarajan, K., Ma, H. & Peterson, J. R. Comprehensive assay of kinase catalytic activity reveals features of kinase inhibitor selectivity. Nat. Biotechnol. 29, 1039 (2011).

Bishop, A. C. et al. A chemical switch for inhibitor-sensitive alleles of any protein kinase. Nature 407, 395 (2000).

Ye, Y. & Godzik, A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19(Suppl 2), 246–255 (2003).

Söding, J., Biegert, A. & Lupas, A. N. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res. 33, W244–248, (2005).

Yamaguchi, M. et al. Cryo-EM of Mitotic Checkpoint Complex-Bound APC/C Reveals Reciprocal and Conformational Regulation of Ubiquitin Ligation. Mol. Cell 63, 593–607, (2016).

Dong, C. et al. The crystal structure of an inactive dimer of PDZ-binding kinase. Biochem. Biophys. Res. Commun. 476, 586–593, (2016).

Eddy, S. R. In Genome Informatics 2009: Genome Informatics Series Vol. 23 205–211 (World Scientific, 2009).

The PyMOL molecular graphics system. (Schrödinger, Inc., San Carlos, CA, 2002).

R: A Language and Environment for Statistical Computing. (R Foundation for Statistical Computing, Vienna, Austria, 2015).
