Using GROOT¶
GROOT uses a subcommand syntax; call the main program with groot
and follow it by the subcommand to indicate what action to take.
This page will cover a worked example, details on the available GROOT commands and some tips for using the program.
For more information on the graphs that GROOT uses, please read the groot-graphs page.
An example¶
This example will take us from raw reads to resistome profile for a single metagenome
Get some sequence data:
fastq-dump SRR4454613
Get a pre-clustered ARG database:
groot get -d resfinder
Create variation graphs from the ARG-database and index:
groot index -m resfinder.90 -i grootIndex -w 100 -p 8
Align the reads and report ARGs
groot align -i grootIndex -f SRR4454613.fastq -p 8 | groot report -c 0.95
GROOT subcommands¶
get¶
The get
subcommand is used to download a pre-clustered ARG database that is ready to be indexed. Here is an example:
groot get -d resfinder -o . --identity 90
The above command will download the ResFinder database, which has been clustered at 90% identity, check its integrity and unpacks it to the current directory. The database will be named resfinder.90
and contain several *.msa
files.
Flags explained:
-d
: which database to get-o
: directory to save the database to--identity
: the identity threshold at which the database was clustered (note: only 90 currently available)
The following databases are available:
- arg-annot (default)
- resfinder
- card
- groot-db
- groot-core-db
These databases were clustered by sequence identity and stored as Multiple Sequence Alignments (MSAs). See groot-databases for more info.
index¶
The index
subcommand is used to create variation graphs from a pre-clustered ARG database and then index them. Here is an example:
groot index -m resfinder.90 -i grootIndex -w 100 -p 8
The above command will create a variation graph for each cluster in the resfinder.90
database and initialise an LSH forest index. It will then move through each graph traversal using a 100 node window, creating a MinHash sketch for each window. Finally, each sketch is added to the LSH Forest index. The index will be named grootIndex
and contain several files.
Flags explained:
-m
: a directory of MSA files (the database fromgroot get
)-i
: where to save the index-w
: the window size to use (should be similar to the length of query reads)-p
: how many processors to use
The same index can be used on multiple samples, however, it should be re-indexed if you wish to change the seeding parameters.
Some more flags that can be used:
-k
: size of k-mer to use for MinHashing-s
: size of MinHash sketch-x
: number of partitions in the LSH Ensemble index-y
: maxK in the LSH Ensemble index--maxSketchSpan
: max number of identical neighbouring sketches permitted in any graph traversal
Important: GROOT can only accept MSAs as input. You can cluster your own database or usegroot get
to obtain a pre-clustered one.
align¶
The align
subcommand is used to align reads against the indexed variation graphs. Here is an example:
groot align -i grootIndex -f file.fastq -t 0.97 -p 8 > ARG-reads.bam
The above command will seed the fastq reads against the indexed variation graphs. It will then perform a hierarchical local alignment of each seed against the variation graph traversals. The output alignment is essentially the ARG classified reads (which may be useful) and can then be used to report full-length ARGs (using the report
subcommand).
Flags explained:
-i
: which index to use-f
: what FASTQ reads to align-t
: the containment threshold for seeding reads-p
: how many processors to use
Data can streamed in and out of the align subcommand. For example:
gunzip -c file.gz | ./groot align -i grootIndex -p 8 | ./groot report
Multiple FASTQ files can be specified as input, however all are treated as the same sample and paired-end info isn’t used. To specify multiple files, make sure they are comma separated (-f fileA.fq,fileB.fq
) or use gunzip/cat with a wildcard (gunzip -c *.fq.gz | groot…).
Some more flags that can be used:
--noAlign
: if set, no exact alignment will be performed (graphs will still be weighted using approximate read mappings)
report¶
The report
subcommand is used to processes graph traversals and generate a resistome profile for a sample. Here is an example:
groot report --bamFile ARG-reads.bam -c 1
Flags explained:
--bamFile
: the input BAM file (output fromgroot align
subcommand)-c
: the coverage needed to report an ARG (e.g. 0.95 = 95% ARG bases covered by reads)
Some more flags that can be used:
--lowCov
: overridesc
option and will report ARGs which may not be covered at the 5’/3’ ends