User Manual: How to use the PHROG database with my proteins?

1. Download PHROGs database
2. Choose your tool

a. HHsuite3
i. Installation of HHsuite3
ii. Convert MSA fasta format into MSA a3m format
iii. Adaptation of PHROGs database to HHsuite3
iv. Using HHblits
v. Using HHsearch
vi. Using HHblits_omp and HHsearch_omp
vii. How do I know which PHROG did my sequence match?

b. MMseqs2
i. Installation of MMseqs2
ii. Adaptation of the PHROGs database to MMseqs
iii. Searching PHROG profiles to query sequences
iv. How do I know which PHROG did my sequence match?

1. Download PHROGs database

The PHROGs database is freely available at https://phrogs.lmge.uca.fr/index.php , where several datasets can be downloaded as needed. Users can retrieve the amino acid sequences used in the database in FASTA format (FAA_phrog.tar.gz), the multiple sequence alignments in FASTA format (MSA_phrogs.tar.gz), the Hidden Markov Model profiles built from those alignments (HMM_phrog.tar.gz), and a table with the current annotation of every PHROG (phrog_annot_v2.tsv). To use the PHROG database as a target database for either MMseqs2 or HHsuite3, the MSA and HMM files are recommended.

MSA files can be downloaded to a remote server from the command line:

# Create a directory for PHROGs database
mkdir PHROGs
cd PHROGs

mkdir MSA
cd MSA
wget https://phrogs.lmge.uca.fr/downloads_from_website/MSA_phrogs.tar.gz
tar -xvf MSA_phrogs.tar.gz 
cd ../

Note that the PHROG MSA files are in FASTA format.
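
As a quick sanity check after extraction, the alignment files can be counted and one of them inspected. The sketch below is only an illustration: the helper names count_msa_files and looks_like_fasta are hypothetical, and the .fma extension, the MSA/FMA_phrog directory and the example file name are assumptions based on the conversion script used later in this manual.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the PHROG distribution.

# Count the alignment files (assumed to end in .fma) below a directory.
count_msa_files() {
    find "$1" -type f -name '*.fma' | wc -l | tr -d ' '
}

# Succeed if the first character of the file is '>', as expected for FASTA.
looks_like_fasta() {
    [ "$(head -c 1 "$1")" = ">" ]
}

# Example usage (run from the PHROGs directory; the file name is an assumption):
#   count_msa_files MSA/FMA_phrog
#   looks_like_fasta MSA/FMA_phrog/phrog_1.fma && echo "FASTA header found"
```

If the archive was extracted completely, the count should be 38,880, the number of PHROG MSA files.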

2. Choose your tool

The PHROG database can easily be adapted for use with either MMseqs2 (Steinegger and Söding 2016) or HHsuite3 (Remmert et al., 2012). Each tool requires different processing of the PHROG files, as described below.

a. HHsuite3

i. Installation of HHsuite3

HHsuite3 can be installed via Conda or Docker. Please see the installation process and requirements at https://github.com/soedinglab/hh-suite .

ii. Convert MSA fasta format into MSA a3m format

To create A3M files from the PHROG FASTA multiple sequence alignments, the reformat.pl script is used. The following script, from-fasta-2-a3m.sh, must be created in the parent directory of the MSA directory. Since each of the 38,880 PHROG MSA files must be converted independently, this step may take a few minutes.

# Make sure you are in the PHROGs directory
nano from-fasta-2-a3m.sh  # Paste the script shown below
chmod +x from-fasta-2-a3m.sh
mkdir A3M
./from-fasta-2-a3m.sh

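#!/bin/bash

# Code name: from-fasta-2-a3m.sh
# Function: Convert every FASTA alignment in MSA/FMA_phrog into an A3M file in A3M/.
# Requires reformat.pl in the current directory and an existing A3M directory.

# Loop over the files directly instead of parsing the output of ls
for fma in MSA/FMA_phrog/*.fma; do
        name=$(basename "$fma" .fma)
        echo "$name"
        perl reformat.pl fas a3m "$fma" "A3M/$name.a3m"
done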

The reformat.pl script is one of the scripts distributed with HHsuite3. The conversion script calls it as perl reformat.pl, so a copy must be present in the PHROGs directory. If your version of HHsuite3 does not contain reformat.pl, the script can be downloaded from https://github.com/soedinglab/hh-suite/blob/master/scripts/reformat.pl .
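
With this many files, a failed conversion is easy to miss in the scrolling output. The sketch below shows one possible way to list the alignments that did not produce a non-empty A3M file; the function name missing_a3m is hypothetical, and the directory layout (MSA/FMA_phrog and A3M) is the one used by the script above.

```shell
#!/bin/bash
# Hypothetical helper: print the base name of every .fma file
# that has no non-empty .a3m counterpart.
#   $1 = directory with the FASTA alignments (.fma)
#   $2 = directory with the converted alignments (.a3m)
missing_a3m() {
    for fma in "$1"/*.fma; do
        [ -e "$fma" ] || continue          # directory holds no .fma files
        name=$(basename "$fma" .fma)
        [ -s "$2/$name.a3m" ] || echo "$name"
    done
}

# Example usage (run from the PHROGs directory):
#   missing_a3m MSA/FMA_phrog A3M
```

An empty output means every alignment was converted; any name that is printed can be converted again individually with reformat.pl.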

iii. Adaptation of PHROGs database to HHsuite3

To use the PHROG MSA files, the ffindex_build, ffindex_apply, hhmake and cstranslate modules of HHsuite3 must be applied to the newly created A3M files. This step requires the cs219.lib library. If this library is not available in your HHsuite3 version, it can be downloaded from https://github.com/soedinglab/hh-suite/blob/master/data/cs219.lib .

The result is a searchable database. After running the following code, 6 files are created in three formats (a3m, cs219 and hhm), each as an ffdata/ffindex pair. These files are required to use PHROG as a target database with the hhblits or hhsearch modules.

# If you do not have cs219.lib, use the following lines. Make sure you are in the PHROGs directory

nano cs219.lib # Paste the contents of cs219.lib from the HHsuite3 source
chmod 644 cs219.lib

# If you already have cs219.lib, skip the lines above.

# Now, continue with the following commands
# Build sorted ffindex and ffdata files from the files in the A3M directory

ffindex_build -as phrogs_a3m.ffdata phrogs_a3m.ffindex A3M 

ffindex_apply phrogs_a3m.ff{data,index} -i phrogs_hhm.ffindex -d phrogs_hhm.ffdata -- hhmake -i stdin -o stdout -v 0

cstranslate -i phrogs_a3m -A ./cs219.lib -o phrogs_cs219 -b -f -I a3m
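
Before searching, it can help to confirm that all six database files exist and are not empty. The following is a minimal sketch under the assumption that the files follow the naming used in the commands above (phrogs_a3m, phrogs_hhm and phrogs_cs219, each with .ffdata and .ffindex); the function name check_hhsuite_db is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: verify that the six files of an HHsuite3 database exist.
#   $1 = database prefix, e.g. "phrogs"
# Returns 0 if all files are present and non-empty, 1 otherwise.
check_hhsuite_db() {
    status=0
    for suffix in a3m hhm cs219; do
        for ext in ffdata ffindex; do
            file="${1}_${suffix}.${ext}"
            if [ ! -s "$file" ]; then
                echo "missing or empty: $file" >&2
                status=1
            fi
        done
    done
    return "$status"
}

# Example usage (run from the PHROGs directory):
#   check_hhsuite_db phrogs && echo "database looks complete"
```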

iv. Using HHblits

To search a single query sequence, an MSA or an HMM profile against the PHROG database, hhblits is the fastest of the HHsuite3 modules. It takes a query file (here, seq.faa) as input and the newly created database (see code above) as the target database, referenced by the common prefix of its files (phrogs).

hhblits -i seq.faa -d phrogs -n 1 -o results_file

v. Using HHsearch

Another option within HHsuite3 is the hhsearch module. It is slower than hhblits but is reliable as well. Again, a query file (seq.faa) is needed as input, with the newly created database (see code above) as the target database.

hhsearch -i seq.faa -d phrogs -o results_file

vi. Using HHblits_omp and HHsearch_omp

It is also possible to compare several proteins in a multi-FASTA file against all PHROGs, using hhblits_omp or hhsearch_omp.

This type of search requires building an index of your query FASTA file as follows:

ffindex_from_fasta -s seq.ff{data,index} seq.faa

Then run the search:

hhblits_omp -i seq -d phrogs -o results_file

All results for each sequence in seq.faa will be appended to results_file.ffdata.
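
Because results_file.ffdata concatenates the reports of all queries, reading the report of one query means locating its entry first. The sketch below illustrates the idea with standard shell utilities. It rests on several assumptions: that a results_file.ffindex file is written next to results_file.ffdata, that this index is a tab-separated table of entry name, byte offset and length, that each entry ends with a null byte counted in its length, and that entries are named after the query sequences. The function name extract_ffindex_entry is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print one entry of an ffindex/ffdata pair.
#   $1 = entry name, $2 = .ffindex file, $3 = .ffdata file
# Assumes index lines of the form: name<TAB>offset<TAB>length
extract_ffindex_entry() {
    entry=$(awk -F '\t' -v name="$1" '$1 == name { print $2, $3; exit }' "$2")
    if [ -z "$entry" ]; then
        echo "entry not found: $1" >&2
        return 1
    fi
    offset=${entry%% *}
    length=${entry##* }
    # tail -c counts from 1; the last byte of the entry is the null terminator.
    tail -c "+$((offset + 1))" "$3" | head -c "$((length - 1))"
}

# Example usage (the entry name is an assumption):
#   extract_ffindex_entry NC_021299_p59 results_file.ffindex results_file.ffdata
```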

vii. How do I know which PHROG did my sequence match?

First, let’s look at the hhsearch or hhblits output; both have the same format. The name of the protein in the seq.faa file appears on the first line, labelled “Query”. A table of hits then lists the top matches, ordered by their probability of being true positives. In the examples below, the top hit is phrog_10000 ## JN699628_p59: the PHROG identifier is the part before “##”, so the query NC_021299_p59 matches phrog_10000.

# HHsearch result
Query         NC_021299_p59
Match_columns 128
No_of_seqs    4 out of 5
Neff          2.23736
Searched_HMMs 38880
Date          Mon Mar  1 00:24:48 2021
Command       hhsearch -i seq.faa -d /scratch/rbucio/PVP_test/HHsuite/msa-normal -o result_search 

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 phrog_10000 ## JN699628_p59                   100.0 1.4E-61 3.6E-66  361.2   0.0  128    1-128     1-128 (128)
  2 phrog_1389 ## NC_008265_p31                   57.3       4  0.0001   28.9   0.0  109   14-124     1-114 (114)
  3 phrog_13578 ##NC_028809_p50                   30.0      21 0.00055   27.6   0.0   17   55-71     16-32  (155)
  4 phrog_22794 ##KP027195_p115                   29.2      23 0.00058   23.9   0.0   11    9-19     22-32  (61)

# HHblits result
Query         NC_021299_p59
Match_columns 128
No_of_seqs    27 out of 29
Neff          5.16007
Searched_HMMs 184
Date          Tue Mar  2 02:35:22 2021
Command       hhblits -i seq.faa -d /scratch/rbucio/PVP_test/HHsuite/msa-normal -o hhblits_result 

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 phrog_10000 ## JN699628_p59                   100.0 2.2E-74 2.7E-78  430.7   0.0  128    1-128     1-128 (128)
  2 phrog_1389 ## NC_008265_p31                   96.0 3.3E-05 4.8E-09   52.6   0.0  109   14-124     1-114 (114)
  3 phrog_38278 ## p17072 VI_01301                 59.5    0.78   9E-05   33.0   0.0   32   13-46     41-72  (180)
  4 phrog_33209 ## NC_029027_p11                   40.0     2.7  0.0003   27.9   0.0   30   80-109    40-69  (110)

For more information, see the HHsuite user’s guide at http://gensoft.pasteur.fr/docs/hhsuite/3.0-beta.2/hhsuite-userguide.pdf .
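
The hit table can also be read programmatically. The sketch below is one possible way to pull the PHROG identifier, probability and E-value of each hit out of a result file with awk; it is not part of HHsuite3. The function name phrog_hits is hypothetical, the probability cut-off is an arbitrary illustration, and the parsing assumes the table layout shown in the examples above (PHROG identifier as the second field, numeric columns counted from the end of the line).

```shell
#!/bin/bash
# Hypothetical helper: list hits from an hhsearch/hhblits result file.
#   $1 = result file, $2 = minimum probability (optional, default 0)
# Output columns (tab-separated): query, PHROG, probability, E-value
phrog_hits() {
    awk -v min_prob="${2:-0}" '
        /^Query /      { query = $2 }
        /^ *No Hit /   { in_table = 1; next }   # header of the hit table
        in_table && NF == 0 { exit }            # blank line ends the table
        in_table {
            # Drop the trailing "(length)" so that columns can be counted
            # from the end, whatever the number of words in the hit name.
            sub(/\([0-9]+\)[ \t]*$/, "")
            if ($(NF-7) + 0 >= min_prob + 0)
                print query "\t" $2 "\t" $(NF-7) "\t" $(NF-6)
        }
    ' "$1"
}

# Example usage: keep hits with a probability of at least 50
#   phrog_hits results_file 50
```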

b. MMseqs2

i. Installation of MMseqs2

Please see the installation process and requirements of MMseqs2 at https://github.com/soedinglab/mmseqs2/wiki#install-mmseqs2-for-linux .

ii. Adaptation of the PHROGs database to MMseqs

To use the PHROG database with MMseqs2, download the HMM_phrog.tar.gz file available on the PHROGs webpage. This process uses the ffindex_build module from HHsuite3 to build a database from the PHROG HMM profiles, which MMseqs2 then converts into a profile database.

# Create a dedicated directory for MMseqs2
mkdir MMseqs
cd MMseqs

# Download HMM_phrog.tar.gz
wget https://phrogs.lmge.uca.fr/downloads_from_website/HMM_phrog.tar.gz

# Extract the HMM archive
tar -xvf HMM_phrog.tar.gz

# Creation of index and database
ffindex_build -a phrog_hhm_db phrog_hhm_db.index HMM_phrog/

# Conversion of HMM database to profile database
mmseqs convertprofiledb phrog_hhm_db phrog_profile_db

iii. Searching PHROG profiles to query sequences

Now that the PHROG database is ready for use with MMseqs2, PHROG profiles can be searched against a single sequence or a set of sequences. MMseqs2 is faster than the HHsuite3 modules (hhblits or hhsearch) and can search multiple sequences at once if they are in a single file.

First, the sequence file must be converted into an MMseqs2 database to be searchable. It is recommended to use the PHROG database as the query database, while the target database is built from the user’s sequence(s).

# Create an MMseqs2 database from the seq.faa file
mmseqs createdb seq.faa target_seq

mmseqs search phrog_profile_db target_seq results_mmseqs ./tmp

iv. How do I know which PHROG did my sequence match?

The createtsv module of MMseqs2 converts the search output into a readable table. The first column gives the name of the matched PHROG and the second column gives the name of the protein in seq.faa. For details on the remaining columns, please see https://github.com/soedinglab/MMseqs2/wiki .

mmseqs createtsv phrog_profile_db target_seq results_mmseqs tsv_results_mmseqs

# TSV example
# phrog_10000    NC_021299_p59    258    0.949    6.200E-94    0    127    128    0    127    128
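
When several PHROGs match the same protein, it is often convenient to keep only the best hit per protein. The following sketch does this with awk. It assumes that the TSV file is tab-separated, with the PHROG name in column 1, the protein name in column 2 and the E-value in column 5, as the example row suggests; the function name best_phrog_per_protein is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: keep the row with the lowest E-value for each protein.
#   $1 = TSV file produced by mmseqs createtsv
# Assumed columns: 1 = PHROG, 2 = protein, 5 = E-value
best_phrog_per_protein() {
    awk -F '\t' '
        !($2 in evalue) || ($5 + 0) < evalue[$2] {
            evalue[$2] = $5 + 0     # best E-value seen so far for this protein
            line[$2] = $0           # corresponding full row
        }
        END { for (protein in line) print line[protein] }
    ' "$1" | sort -t "$(printf '\t')" -k2,2
}

# Example usage:
#   best_phrog_per_protein tsv_results_mmseqs
```

The PHROG identifiers in the first column can then be looked up in phrog_annot_v2.tsv to retrieve the current annotation of each matched PHROG.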