The PHROGs database is freely available online at https://phrogs.lmge.uca.fr/index.php, where several datasets can be downloaded as needed. Users can retrieve the amino acid sequences used in the database in FASTA format (FAA_phrog.tar.gz), the multiple sequence alignments in FASTA format (MSA_phrogs.tar.gz), the Hidden Markov Model profiles built from those alignments (HMM_phrog.tar.gz), and a table with the current annotation of every PHROG (phrog_annot_v2.tsv). To use the PHROG database as a target database for either MMseqs2 or HHsuite3, the MSA and HMM files are recommended.
MSA files can be downloaded to a remote server from the command line:
# Create a directory for PHROGs database
mkdir PHROGs
cd PHROGs
mkdir MSA
cd MSA
wget https://phrogs.lmge.uca.fr/downloads_from_website/MSA_phrogs.tar.gz
tar -xvf MSA_phrogs.tar.gz
cd ../
Note that the PHROG MSAs are in FASTA format.
The PHROG database can easily be adapted for use with either MMseqs2 (Steinegger and Söding 2016) or HHsuite3 (Remmert et al., 2012). Each tool requires the PHROG files to be processed differently, as described below.
HHsuite3 can be installed via Conda or Docker; please see the installation process and requirements at https://github.com/soedinglab/hh-suite .
To create A3M files from the PHROG FASTA multiple sequence alignments, the reformat.pl script is used. The following script, called from-fasta-2-a3m.sh, must be created in the parent directory of the MSA directory (that is, in PHROGs). Since each of the 38,880 PHROG MSA files must be converted independently, this step may take a couple of minutes. Please be patient.
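As a quick sanity check after extraction, the number of alignment files can be compared with the number of PHROGs (38,880). The following is only a minimal sketch: the helper name count_fma is hypothetical, and it assumes that the alignments carry the .fma extension somewhere below the MSA directory, as the conversion script in this tutorial expects.

```shell
#!/bin/sh
# Hypothetical helper: count the *.fma alignment files below a directory.
# Assumption: every PHROG alignment is stored as one file ending in .fma.
count_fma() (
    dir=$1
    find "$dir" -type f -name '*.fma' | wc -l | tr -d ' '
)

# Usage (run from the PHROGs directory); 38880 is the number of PHROGs.
if [ -d MSA ]; then
    n=$(count_fma MSA)
    if [ "$n" -eq 38880 ]; then
        echo "OK: $n alignments found"
    else
        echo "Warning: expected 38880 alignments, found $n"
    fi
fi
```

If the count differs, the download may have been interrupted and the archive should be fetched again.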
# Make sure you are in the PHROGs directory
nano from-fasta-2-a3m.sh # Paste the code below
chmod +x from-fasta-2-a3m.sh
mkdir A3M
./from-fasta-2-a3m.sh
#!/bin/bash
# Code name: from-fasta-2-a3m.sh
# Function: Convert FASTA files into A3M files.
# reformat.pl is expected in the current directory; adjust the path if it lives elsewhere.
for fma in MSA/FMA_phrog/*.fma; do
    name=$(basename "$fma" .fma)
    echo "$name"
    perl reformat.pl fas a3m "$fma" "A3M/${name}.a3m"
done
reformat.pl is one of the scripts distributed with HHsuite3. If your version of HHsuite3 does not contain reformat.pl, the script can be downloaded from https://github.com/soedinglab/hh-suite/blob/master/scripts/reformat.pl .
To use the PHROG MSA files, the ffindex_build, ffindex_apply, hhmake and cstranslate modules from HHsuite3 must be applied to the newly created A3M files. This step requires the cs219.lib library. If this library is not available in your HHsuite3 version, it can be downloaded from https://github.com/soedinglab/hh-suite/blob/master/data/cs219.lib .
The result is a searchable database: the commands below create 6 files in three formats (a3m, cs219 and hhm). These files are required to use PHROGs as a target database with the hhblits or hhsearch modules.
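Because the conversion loops over thousands of files, a single failed call to reformat.pl is easy to miss. The sketch below lists alignments that have no corresponding non-empty A3M file. The helper name missing_a3m is hypothetical, and the directory names are the ones used by from-fasta-2-a3m.sh.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every alignment in <fma_dir>
# that has no matching, non-empty .a3m file in <a3m_dir>.
missing_a3m() (
    fma_dir=$1
    a3m_dir=$2
    for fma in "$fma_dir"/*.fma; do
        [ -e "$fma" ] || continue          # the directory holds no .fma files
        name=$(basename "$fma" .fma)
        [ -s "$a3m_dir/$name.a3m" ] || echo "$name"
    done
)

# Usage (run from the PHROGs directory after the conversion):
if [ -d MSA/FMA_phrog ] && [ -d A3M ]; then
    missing_a3m MSA/FMA_phrog A3M
fi
```

An empty output means that every alignment was converted; any printed name can be converted again by hand with reformat.pl.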
# If you do not have cs219.lib, use the following two lines. Make sure you are in the PHROGs directory
nano cs219.lib # Paste the content of cs219.lib from the HHsuite3 source
chmod 644 cs219.lib
# If you already have cs219.lib, skip the lines above.
# Now, continue with the following commands
# Build a sorted ffindex/ffdata pair from the files in the A3M directory
ffindex_build -as phrogs_a3m.ffdata phrogs_a3m.ffindex A3M
# Build an HMM profile (hhm) from every alignment with hhmake
ffindex_apply phrogs_a3m.ff{data,index} -i phrogs_hhm.ffindex -d phrogs_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
# Translate the alignments into the cs219 representation
cstranslate -i phrogs_a3m -A ./cs219.lib -o phrogs_cs219 -b -f -I a3m
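Before running a search, it can be useful to confirm that all six database files were written. The following sketch checks for them; the helper name check_hhsuite_db is hypothetical, and the file names are derived from the output names given in the three commands above (prefix phrogs, suffixes _a3m, _hhm and _cs219, each with .ffdata and .ffindex).

```shell
#!/bin/sh
# Hypothetical helper: verify that the six HH-suite database files
# sharing the prefix <db> exist and are not empty.
check_hhsuite_db() (
    db=$1
    status=0
    for kind in a3m hhm cs219; do
        for ext in ffdata ffindex; do
            f="${db}_${kind}.${ext}"
            if [ ! -s "$f" ]; then
                echo "missing or empty: $f"
                status=1
            fi
        done
    done
    exit "$status"
)

# Usage (run from the PHROGs directory):
#   check_hhsuite_db phrogs && echo "phrogs database looks complete"
```

The prefix passed to the helper (phrogs) is the same value that is later given to the -d option of hhblits and hhsearch.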
To search a single query sequence, an MSA or an HMM profile against the PHROG database, hhblits is the fastest option among the HHsuite3 modules. Using hhblits is straightforward: it takes a single-sequence file (here, seq.faa) as input and the newly created MSA database (see code above) as the target database.
hhblits -i seq.faa -d phrogs -n 1 -o results_file
Another option within HHsuite3 is the hhsearch module. It takes more time than hhblits but is reliable as well. Again, a single-sequence file (seq.faa) is needed as input, with the newly created MSA database (see code above) as the target database.
hhsearch -i seq.faa -d phrogs -o results_file
It is also possible to compare several distinct proteins in a multi-FASTA file against all PHROGs, using hhblits_omp or hhsearch_omp. This type of search requires building an index of your query FASTA file as follows:
ffindex_from_fasta -s seq.ff{data,index} seq.faa
Then you can run the search:
hhblits_omp -i seq -d phrogs -o results_file
All results for each sequence in seq.faa will be appended to results_file.ffdata.
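Since all results end up in one data file, the report of a single query has to be cut out of it. The sketch below does this with standard tools. It rests on assumptions that are not stated in this tutorial: that an index file results_file.ffindex is written next to results_file.ffdata, that each index line reads name, offset and length separated by tabs, and that the length includes the NUL byte terminating each entry. The helper name ffdata_entry is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one entry of an ffindex database.
# Assumes index lines of the form "name<TAB>offset<TAB>length", with the
# length counting the NUL byte that terminates each entry in the data file.
ffdata_entry() (
    data=$1
    index=$2
    name=$3
    # Look up offset and length of the requested entry.
    set -- $(awk -F '\t' -v n="$name" '$1 == n { print $2, $3; exit }' "$index")
    [ $# -eq 2 ] || { echo "entry not found: $name" >&2; exit 1; }
    offset=$1
    length=$2
    # tail -c +K starts at byte K (1-based); head drops the trailing NUL byte.
    tail -c "+$((offset + 1))" "$data" | head -c "$((length - 1))"
)

# Usage:
#   ffdata_entry results_file.ffdata results_file.ffindex NC_021299_p59 > NC_021299_p59.hhr
```

The entry name is the sequence identifier as it appears in the first column of the index file, so it is worth looking at that file first.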
Let’s first look at the output of hhsearch and hhblits, which share the same format. The name of the protein in the seq.faa file appears on the first line, labelled “Query”. Below it, a table of hits lists the top matches ordered by their probability of being true positives. In both examples below, the best match is phrog_10000, listed together with the sequence name JN699628_p59.
# HHsearch result
Query NC_021299_p59
Match_columns 128
No_of_seqs 4 out of 5
Neff 2.23736
Searched_HMMs 38880
Date Mon Mar 1 00:24:48 2021
Command hhsearch -i seq.faa -d /scratch/rbucio/PVP_test/HHsuite/msa-normal -o result_search
No  Hit                          Prob   E-value  P-value  Score  SS   Cols  Query HMM  Template HMM
1   phrog_10000 ## JN699628_p59  100.0  1.4E-61  3.6E-66  361.2  0.0  128   1-128      1-128 (128)
2   phrog_1389 ## NC_008265_p31  57.3   4        0.0001   28.9   0.0  109   14-124     1-114 (114)
3   phrog_13578 ##NC_028809_p50  30.0   21       0.00055  27.6   0.0  17    55-71      16-32 (155)
4   phrog_22794 ##KP027195_p115  29.2   23       0.00058  23.9   0.0  11    9-19       22-32 (61)
# HHblits result
Query NC_021299_p59
Match_columns 128
No_of_seqs 27 out of 29
Neff 5.16007
Searched_HMMs 184
Date Tue Mar 2 02:35:22 2021
Command hhblits -i seq.faa -d /scratch/rbucio/PVP_test/HHsuite/msa-normal -o hhblits_result
No  Hit                             Prob   E-value  P-value  Score  SS   Cols  Query HMM  Template HMM
1   phrog_10000 ## JN699628_p59     100.0  2.2E-74  2.7E-78  430.7  0.0  128   1-128      1-128 (128)
2   phrog_1389 ## NC_008265_p31     96.0   3.3E-05  4.8E-09  52.6   0.0  109   14-124     1-114 (114)
3   phrog_38278 ## p17072 VI_01301  59.5   0.78     9E-05    33.0   0.0  32    13-46      41-72 (180)
4   phrog_33209 ## NC_029027_p11    40.0   2.7      0.0003   27.9   0.0  30    80-109     40-69 (110)
For more information, see the HHsuite user’s guide at http://gensoft.pasteur.fr/docs/hhsuite/3.0-beta.2/hhsuite-userguide.pdf .
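To keep only confident hits, the hit table can be filtered by its Prob column. The following sketch assumes the table layout shown in the examples above (a header line starting with "No Hit", one hit per line, and the table closed by a blank line). Because the Hit column may contain spaces, columns are counted from the right. The helper name hhr_hits and the threshold of 50 are illustrative choices, not recommendations from the HHsuite authors.

```shell
#!/bin/sh
# Hypothetical helper: list hits whose probability is at least <min_prob>.
# Prints the PHROG name, Prob and E-value, separated by tabs.
hhr_hits() (
    hhr=$1
    min_prob=$2
    awk -v min="$min_prob" '
        /^ *No +Hit/ { in_table = 1; next }   # header line of the hit table
        in_table && NF == 0 { in_table = 0 }  # a blank line ends the table
        in_table {
            # Detach the template length if it is glued to the range, e.g. "1-114(114)"
            sub(/\([0-9]+\)[ \t]*$/, " &")
            # The hit name may contain spaces, so count columns from the right
            prob = $(NF - 8)
            evalue = $(NF - 7)
            if (prob + 0 >= min + 0) print $2 "\t" prob "\t" evalue
        }
    ' "$hhr"
)

# Usage:
#   hhr_hits results_file 50
```

Applied to the hhsearch example above with a threshold of 50, this sketch would report phrog_10000 and phrog_1389 and drop the two weaker hits.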
Please see the installation process and requirements of MMseqs2 at https://github.com/soedinglab/mmseqs2/wiki#install-mmseqs2-for-linux .
To use the PHROG database with MMseqs2, download the HMM_phrog.tar.gz file available on the PHROGs webpage. The process uses the ffindex_build module from HHsuite3 to create a profile database from the PHROG HMM profiles.
# Create a dedicated directory for MMseqs2
mkdir MMseqs
cd MMseqs
# Download HMM_phrog.tar.gz
wget https://phrogs.lmge.uca.fr/downloads_from_website/HMM_phrog.tar.gz
# Extract the HMM archive
tar -xvf HMM_phrog.tar.gz
# Creation of index and database
ffindex_build -a phrog_hhm_db phrog_hhm_db.index HMM_phrog/
# Conversion of HMM database to profile database
mmseqs convertprofiledb phrog_hhm_db phrog_profile_db
Now that the PHROG database is ready to be used with MMseqs2, PHROG profiles can be searched against any single sequence or set of sequences. Note that MMseqs2 is faster than the HHsuite3 modules (hhblits or hhsearch) and can search multiple sequences at once if they are stored in a single file.
First, the sequence file must be converted into an MMseqs2 database so that it can be searched. It is recommended to use the PHROG database as the query database, while the target database is built from the user’s sequence(s).
# Create an MMseqs2 database from the seq.faa file
mmseqs createdb seq.faa target_seq
# Search the PHROG profiles against the user's sequences
mmseqs search phrog_profile_db target_seq results_mmseqs ./tmp
The createtsv module from MMseqs2 makes the search output easier to read. The first column contains the name of the matched PHROG and the second column the name of the protein in seq.faa. For more information about the remaining columns, please see https://github.com/soedinglab/MMseqs2/wiki .
mmseqs createtsv phrog_profile_db target_seq results_mmseqs tsv_results_mmseqs
# TSV example
# phrog_10000 NC_021299_p59 258 0.949 6.200E-94 0 127 128 0 127 128
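When a protein matches several PHROGs, it is often enough to keep the best one. The sketch below selects, for every protein, the PHROG with the lowest value in the fifth column. It assumes that the file is tab-separated and that the fifth column holds the E-value, which is how the example line above reads but should be checked against the MMseqs2 documentation. The helper name best_phrog_per_protein is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for every protein (column 2) keep the PHROG
# (column 1) with the lowest value in column 5, assumed to be the E-value.
# Prints protein, PHROG and E-value, separated by tabs, sorted by protein.
best_phrog_per_protein() (
    tsv=$1
    awk -F '\t' '
        !($2 in best) || $5 + 0 < evalue[$2] {
            best[$2] = $1        # PHROG name
            evalue[$2] = $5 + 0  # numeric value used for the comparison
            shown[$2] = $5       # original text, printed unchanged
        }
        END { for (p in best) print p "\t" best[p] "\t" shown[p] }
    ' "$tsv" | LC_ALL=C sort
)

# Usage:
#   best_phrog_per_protein tsv_results_mmseqs > best_phrog.tsv
```

The resulting table has one line per protein and can be joined with phrog_annot_v2.tsv to attach an annotation to each sequence.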