PHROGs Documentation

Clustering and annotation pipeline

(1) Protein clustering using similarity searches
The 938,864 initial proteins were compared to each other using MMseqs (Hauser et al. 2016). To be considered further, a protein pair had to have (i) at least one HSP with a bit-score greater than 30 and (ii) more than 80% of the residues of each protein involved in at least one HSP found between the two proteins (i.e. coverage >80%). The proteins were then clustered with MCL (inflation 2.0 ; Enright et al., 2002), using each retained protein pair and the lowest E-value found in an HSP between the two proteins.
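The two pair-level thresholds can be illustrated with a small filter. This is only a sketch: the six-column tab-separated input (protein A, protein B, best HSP bit-score, coverage of A, coverage of B, lowest E-value) is a hypothetical layout chosen for the example, not an MMseqs output format, and the function name filter_pairs is ours.

```shell
# Hypothetical input columns (tab-separated), not an MMseqs format:
# 1 proteinA  2 proteinB  3 best HSP bit-score  4 coverage of A  5 coverage of B  6 lowest E-value
filter_pairs() {
    # Keep pairs with bit-score > 30 and coverage > 80% on both proteins.
    awk -F'\t' -v OFS='\t' '$3 > 30 && $4 > 0.80 && $5 > 0.80 { print $1, $2, $6 }'
}

# Three made-up pairs: only the first one passes both thresholds.
kept_pairs=$(printf 'p1\tp2\t120.5\t0.95\t0.90\t1e-30\np1\tp3\t25.0\t0.99\t0.99\t1e-3\np2\tp4\t80.0\t0.85\t0.60\t1e-20\n' | filter_pairs)
printf '%s\n' "$kept_pairs"
```

The second pair fails on bit-score and the third on the coverage of one protein, so a single pair is printed together with its E-value.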
(2) Grouping the protein clusters using remote homology detection
For each of the 63,673 clusters containing at least two proteins, a multiple alignment was built using ClustalOmega (McWilliam et al., 2013) and an HMM profile was computed for each alignment using HHsearch (Söding 2005; Remmert et al., 2012) with the « -M 50 » parameter, so that only columns of the multiple alignment with less than 50 % gaps are match states. All these profiles were then compared with each other. The 85,024 singletons were also compared to the 63,673 cluster profiles. To be considered further, a cluster pair (or a singleton-cluster pair) had to have a hit (i) with a probability greater than 90% and (ii) involving at least 60% of the match states of the two HMM profiles (same thresholds when comparing singletons to clusters). Based on these rules, singletons and clusters were then clustered with MCL (inflation 2.0 ; Enright et al., 2002).
This placed 868,340 of the initial 938,864 proteins into 38,880 « super-families of protein clusters » containing at least two proteins, hereafter named PHROGs. Only 70,524 proteins (7.5 % of the protein dataset) remained as singletons, or ORFans. Multiple alignments and HMM profiles were computed for all 38,880 PHROGs containing at least two proteins using ClustalOmega (McWilliam et al., 2013) and HHsuite (Remmert et al., 2012), and all PHROGs were compared to each other using HHsuite.
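The singleton count and percentage quoted above follow directly from the two protein totals. The short check below recomputes them; the variable names are ours.

```shell
total=938864      # initial proteins
grouped=868340    # proteins placed in PHROGs of at least two proteins
singletons=$((total - grouped))
# Percentage of the dataset left as singletons, rounded to one decimal.
pct=$(LC_ALL=C awk -v s="$singletons" -v t="$total" 'BEGIN { printf "%.1f", 100 * s / t }')
echo "$singletons singletons ($pct % of the dataset)"
```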

User's guide

If you cannot see the embedded guide, download it here :

Comparing your sequence data to PHROGs : MMseqs

With MMseqs you can compare a single protein or a set of distinct unknown proteins to PHROGs via a profile-to-sequence comparison.
Start by downloading the PHROGs profile database for MMseqs :

(1) Untar the database and move to its directory to use it :


                  				tar -xzf phrogs_mmseqs_db.tar.gz


                  				cd phrogs_mmseqs_db

(2) Create a database with your fasta file :


                  				mmseqs createdb your_seq.faa target_seq

(3) Compute the search and convert the results into a tab separated file :


                  				mmseqs search phrogs_profile_db target_seq results_mmseqs ./tmp -s 7


                  				mmseqs createtsv phrogs_profile_db target_seq results_mmseqs results.tsv

-s sets the sensitivity of the search : 5.7 is the default value and 8.5 is the maximum value. This option adjusts the sensitivity of the prefiltering and influences the run time ; details are given in the MMseqs wiki.

Note : ./tmp is a temporary directory ; it is created by default when the command is launched.

The results.tsv file should look like this :


                  				phrog_10000    NC_021299_p59    258    0.949    6.200E-94    0    127    128    0    127    128

The first column is the matching PHROG and the second column is the ID of your sequence. The remaining columns are alnScore, seqIdentity, eVal, qStart, qEnd, qLen, tStart, tEnd and tLen, respectively (note that residue numbering starts at 0).
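When a sequence matches several PHROGs, it can be handy to keep a single row per sequence. The sketch below keeps, for each sequence ID (column 2), the row with the highest alnScore (column 3). Ranking by alnScore is one possible criterion among others (the E-value in column 5 would be another), the function name best_hit_per_sequence is ours, and every sample row except the first is made up for the example.

```shell
best_hit_per_sequence() {
    awk -F'\t' '
        # Keep the row with the highest alnScore (column 3) for each sequence ID (column 2).
        !($2 in best) || $3 + 0 > score[$2] { best[$2] = $0; score[$2] = $3 + 0 }
        END { for (id in best) print best[id] }
    ' | LC_ALL=C sort -t "$(printf '\t')" -k2,2
}

# First row is the example from this guide; the other two are invented.
best_hits=$(
    {
        printf 'phrog_10000\tNC_021299_p59\t258\t0.949\t6.200E-94\t0\t127\t128\t0\t127\t128\n'
        printf 'phrog_20000\tNC_021299_p59\t40\t0.310\t1.000E-05\t3\t90\t128\t10\t97\t128\n'
        printf 'phrog_30000\tmy_protein_2\t75\t0.420\t2.000E-12\t0\t60\t70\t5\t65\t80\n'
    } | best_hit_per_sequence
)
printf '%s\n' "$best_hits"
```

On a real run, the same function can read the file directly : best_hit_per_sequence < results.tsv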
If you wish to know which proteins of the PHROGs hit your sequence you can run :


                  				mmseqs createtsv phrogs_profile_db target_seq results_mmseqs results.tsv --full-header

Table column values are described in a special section of the MMseqs wiki ; please see MMseqs output table for more information.

This protocol was constructed and tested with MMseqs2 v6.f5a1c.

NOTE : If your query is not similar to any PHROG, you can try screening a larger database like UNICLUST or NR to build an enriched query set and have better results against PHROGs.

If you prefer to use your favorite software, all PHROGs are available as raw fasta and multiple alignment files :

Wondering how we built the indexed database? Check out our detailed comparison guide.

Comparing your sequence data to PHROGs : HHsuite

With HHsuite you can easily compare a single sequence, a multiple sequence alignment or an HMM profile to the HMM profiles of all PHROGs using the following database and commands. If you have a multi-fasta file with a set of unknown proteins, you will have to build a custom index as specified below.
A good start is to download the indexed database here :

(1) Untar the database and move to its directory to use it :


                  				tar -xzf phrogs_hhsuite_db.tar.gz


                  				cd phrogs_hhsuite_db

(2) Compute the search :


                  				hhblits -i your_seq.faa -d phrogs -n 1 -o results_your_seq_VS_phrogs -blasttab tsv_file


                  				hhsearch -i your_seq.faa -d phrogs -n 1 -o results_your_seq_VS_phrogs -blasttab tsv_file

Note that with these two commands the input file must be a single sequence, a multiple sequence alignment (accepted formats : fasta, a2m, a3m), or an HMM profile in hhm format.
hhblits is the faster command, its use is recommended when querying a large alignment file, while hhsearch is a bit more sensitive due to the absence of a prefiltering step. Try running hhsearch if hhblits doesn't retrieve any hits.
The -n option specifies the number of iterations ; in this example only one iteration was performed. Up to 8 iterations are possible, each adding the hit sequences to the query in order to increase sensitivity and find distant homologs. The more iterations, the longer the run time.
The -blasttab option writes the results in a tab separated file.

It is possible to query several distinct proteins in a multi-fasta file, by running hhblits_omp or hhsearch_omp after building an index with your query sequences.

This custom index can be built out of a multi-fasta file using ffindex_from_fasta, a tool from the HHsuite module :


                          ffindex_from_fasta -s your_multifasta.ff{data,index} your_multifasta.faa

Then you can compute the search :


                          hhblits_omp -i your_multifasta -d phrogs -o your_multifasta_VS_phrogs -blasttab tsv_file

These last two commands were run in the phrogs_hhsuite_db directory, which means that your multi-fasta should be in that directory to run these exact commands. Otherwise, you can specify the path to each of the indexed databases.

As for the output file, results_your_seq_VS_phrogs will contain information on the query, a table with all the similar PHROGs and the alignment between the query and the hit PHROG.
The tsv_file will only contain the table of hits, presented with the same headers as the BLAST outfmt 6 table.
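A common next step is to keep only the hits below an E-value cutoff. The sketch below assumes the 12-column BLAST outfmt 6 order, in which the E-value is column 11 ; the function name filter_hits, the 1e-5 cutoff and the two sample rows are hypothetical. Any line whose 11th field is not numeric is skipped, in case the file starts with a header.

```shell
filter_hits() {
    # $1: maximum E-value to keep (hypothetical default of 1e-5).
    awk -F'\t' -v max="${1:-1e-5}" '$11 ~ /^[0-9.eE+-]+$/ && $11 + 0 <= max + 0'
}

# Two invented rows in BLAST outfmt 6 column order; only the first passes the cutoff.
significant=$(printf 'my_query\tphrog_30650\t0.45\t120\t60\t2\t1\t120\t5\t124\t1e-20\t95.3\nmy_query\tphrog_123\t0.20\t50\t40\t1\t10\t60\t3\t52\t0.5\t12.1\n' | filter_hits 1e-5)
printf '%s\n' "$significant"
```

On a real run, the function can read the tabulated output directly : filter_hits 1e-5 < tsv_file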

When querying a multi-fasta, all results will be written in your_multifasta_VS_phrogs.ffdata and the tabulated output file will be a binary file in ffdata format. You can convert it to a regular file using :


                          tr -cd '\11\12\15\40-\176' < tsv_file.ffdata > results.tsv
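In this tr command, -c complements the listed set and -d deletes, so every byte is removed except tab (octal 11), line feed (12), carriage return (15) and the printable ASCII characters from space (40) to tilde (176). The sketch below applies the same filter to a small made-up stream in which each entry ends with a NUL byte, which is our understanding of how ffdata separates entries ; the function name clean_ffdata is ours.

```shell
clean_ffdata() {
    # Delete every byte that is not tab, LF, CR or printable ASCII.
    tr -cd '\11\12\15\40-\176'
}

# Two invented entries, each terminated by a NUL byte (\000).
cleaned=$(printf 'q1\tphrog_1\n\000q2\tphrog_2\n\000' | clean_ffdata)
printf '%s\n' "$cleaned"
```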

Also note that, in the unformatted results file, the protein ID following the PHROG ID is the first protein of the PHROG's MSA and not the best hit against your query. For example, in phrog_30650 ## KF192053_p74, the ID after the hash symbols is the first protein of the MSA.
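To separate the PHROG ID from the protein ID in such a label, the text can be split on the hash symbols. This is a minimal sketch that assumes the label has exactly the form shown in the example ; the function name split_hit_label is ours.

```shell
split_hit_label() {
    # Split on "##" (with optional surrounding spaces) and print both parts, tab-separated.
    awk -F' *## *' -v OFS='\t' 'NF >= 2 { print $1, $2 }'
}

label_fields=$(printf 'phrog_30650 ## KF192053_p74\n' | split_hit_label)
printf '%s\n' "$label_fields"
```

The first field is the PHROG to report ; the second only identifies the first sequence of its alignment.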

See HHsuite user's guide for more information.
This protocol was constructed and tested with HHsuite v3.3.0.

NOTE : If your query is not similar to any PHROG, you can try screening a larger database like UNICLUST or NR to build an enriched query set and have better results against PHROGs.

If you prefer to use your favorite software, all PHROGs are available as raw fasta and multiple alignment files :

Wondering how we built the indexed database? Check out our detailed comparison guide.
