GROOT Databases

This is a brief overview of the databases used by GROOT.


Overview

As mentioned on the groot-graphs page, GROOT creates variation graphs from Multiple Sequence Alignments (MSAs) and stores them as groot graphs.

The MSAs represent a clustered database, each cluster is a collection of sequences which share high nucleotide identity. GROOT reads in each MSA file and converts it to a graph.

A set of clustered databases is provided (use the groot get subcommand) or you can generate your own clustered database. To cluster a database yourself, you can use the following commands on any multifasta file containing the sequences you want to use with GROOT:

mkdir CLUSTERED-DB && cd $_

vsearch --cluster_size /path/to/ARGs.fna --id 0.90 --msaout MSA.tmp

awk '!a[$0]++ {of="./cluster-" ++fc ".msa"; print $0 >> of ; close(of)}' RS= ORS="\n\n" MSA.tmp && rm MSA.tmp

cd ..
  • the above snippet will create a clustered database in the directory CLUSTERED-DB - now you can run groot index
groot index -m ./CLUSTERED-DB

groot-db and groot-core-db

As mentioned earlier, the groot get subcommand can download a pre-clustered database for you to use with GROOT. The following databases are available:

  • arg-annot (default)
  • resfinder
  • card
  • groot-db
  • groot-core-db

The groot-db and groot-core-db are both databases that are derived from ResFinder, ARG-annot and CARD. They have been included after requests by several users for a combination of available ARG databases. They were made as follows:

groot-db is made by combining all sequences in ResFinder, ARG-annot and CARD. Duplicates are removed and the sequences are then clustered at 90% identity.

groot-core-db is made by combining sequences that are present in each of the ResFinder, ARG-annot and CARD databases. One copy of each sequence is kept and then this collection is clustered at 90% identity.

Both groot-db and groot-core-db prepend a tag to each reference sequence so that the origin can be determined. For example:

>*groot-db_ARGANNOT__(AGly)APH-Stph:HE579073:1778413-1779213:801

In the directory downloaded by groot get, there will also be a timestamp that tells you when the database was created. The GROOT database will have used the most recently available versions of ResFinder/CARD/ARG-annot. The script used to do this is available in the GROOT repo:

db/groot-database/make-groot-dbs.sh