# GROOT Databases

This is a brief overview of the databases used by **GROOT**.

---

## Overview

As mentioned on the [groot-graphs](https://groot-documentation.readthedocs.io/en/latest/groot-graphs.html) page, **GROOT** creates variation graphs from [Multiple Sequence Alignments](https://en.wikipedia.org/wiki/Multiple_sequence_alignment) (**MSAs**) and stores them as `groot graphs`.

The **MSAs** represent a clustered database: each cluster is a collection of sequences that share high nucleotide identity. **GROOT** reads each **MSA** file and converts it to a graph.

A set of clustered databases is provided (use the `groot get` subcommand), or you can generate your own. To cluster a database yourself, run the following commands on any multi-FASTA file containing the sequences you want to use with **GROOT**:

```bash
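# Create a directory for the clustered database and move into it
mkdir CLUSTERED-DB && cd "$_"

# Cluster the sequences at 90% identity and write the MSA of every cluster to one file
vsearch --cluster_size /path/to/ARGs.fna --id 0.90 --msaout MSA.tmp

# Split the blank-line separated MSAs into cluster-1.msa, cluster-2.msa, ... (exact duplicate blocks are skipped)
awk '!a[$0]++ {of="./cluster-" ++fc ".msa"; print $0 >> of ; close(of)}' RS= ORS="\n\n" MSA.tmp && rm MSA.tmp

# Return to the parent directory
cd ..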
```

The snippet above creates a clustered database in the directory `CLUSTERED-DB`. You can now run `groot index`:

```
groot index -m ./CLUSTERED-DB
```
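
Before (or after) indexing, it can be useful to check what the clustering step produced. The following is a minimal sketch rather than part of **GROOT**: `summarise_clusters` is a hypothetical helper that assumes the `cluster-N.msa` file names written by the `awk` command above, and it simply counts FASTA header lines in each file (so any extra entry written by the clustering tool, such as a consensus sequence, is counted too).

```bash
# Hypothetical helper (not a GROOT command): list every cluster file in a
# directory together with the number of FASTA header lines it contains.
summarise_clusters() {
    local dir=$1 f
    for f in "$dir"/cluster-*.msa; do
        [ -e "$f" ] || continue    # the glob matched nothing: skip quietly
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
    done
}

summarise_clusters ./CLUSTERED-DB
```

A cluster file with an unexpectedly low count may be worth inspecting by hand before it is turned into a graph.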

## groot-db and groot-core-db

As mentioned earlier, the `groot get` subcommand can download a pre-clustered database for you to use with **GROOT**. The following databases are available:

- arg-annot (default)
- resfinder
- card
- groot-db
- groot-core-db

`groot-db` and `groot-core-db` are both derived from ResFinder, ARG-annot and CARD. They were added after several users requested a combination of the available **ARG** databases, and are built as follows:

`groot-db` is made by combining all sequences in ResFinder, ARG-annot and CARD. Duplicates are removed and the sequences are then clustered at 90% identity.

`groot-core-db` is made by combining only the sequences that are present in all three of the ResFinder, ARG-annot and CARD databases. One copy of each sequence is kept and the collection is then clustered at 90% identity.
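
The difference between the two collections (a union versus an intersection of the source databases) can be illustrated with a short sketch. This is not the script used to build the released databases. The function names are hypothetical, and treating two sequences as duplicates only when they match exactly (ignoring case) is an assumption made for the example.

```bash
# Illustrative sketch only; names and the exact-match rule are assumptions.

# Print each FASTA record on one line as "<SEQUENCE><TAB><header>".
linearise() {
    awk '/^>/ { if (h != "") print toupper(s) "\t" h; h = $0; s = ""; next }
         { s = s $0 }
         END { if (h != "") print toupper(s) "\t" h }' "$@"
}

# Union: keep the first record seen for every distinct sequence.
union_db() {
    linearise "$@" | awk -F '\t' '!seen[$1]++ { print $2; print $1 }'
}

# Core: keep one record for every sequence that occurs in all input files.
core_db() {
    local n=$# f
    for f in "$@"; do
        linearise "$f" | awk -F '\t' '!seen[$1]++'    # one copy per file
    done | awk -F '\t' -v n="$n" '
        { count[$1]++; if (!($1 in first)) first[$1] = $2 }
        count[$1] == n { print first[$1]; print $1 }'
}

# Example call (file names are placeholders):
#   union_db resfinder.fna argannot.fna card.fna > combined.fna
```

Either output could then be clustered at 90% identity in the same way as any other multi-FASTA file.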

Both `groot-db` and `groot-core-db` prepend a tag to each reference sequence so that the origin can be determined. For example:

```
>*groot-db_ARGANNOT__(AGly)APH-Stph:HE579073:1778413-1779213:801
```
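
The tag makes it straightforward to recover the origin of a reference from its header. The sketch below assumes the layout seen in the example above (`>`, an optional `*`, the database name, one underscore, the origin tag, then a double underscore). `origin_of` and `tally_origins` are hypothetical helpers, not part of **GROOT**.

```bash
# Hypothetical helper: print the origin tag of a groot-db style FASTA header.
origin_of() {
    local header=${1#>}              # drop the FASTA ">" marker
    header=${header#\*}              # drop a leading "*" if present
    header=${header#*_}              # drop the database name up to the first "_"
    printf '%s\n' "${header%%__*}"   # keep everything before the first "__"
}

# Hypothetical helper: count how many headers in a FASTA file carry each tag.
tally_origins() {
    grep '^>' "$1" | while IFS= read -r h; do origin_of "$h"; done | sort | uniq -c
}

origin_of '>*groot-db_ARGANNOT__(AGly)APH-Stph:HE579073:1778413-1779213:801'
# prints: ARGANNOT
```

Headers that do not follow this layout are not handled specially, so the output should be checked against a few known references first.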

In the directory downloaded by `groot get`, there is also a timestamp that records when the database was created. The **GROOT** database is built from the most recent versions of ResFinder, CARD and ARG-annot that were available at that time. The script used to build it is available in the **GROOT** repo:

```
db/groot-database/make-groot-dbs.sh
```