Using GROOT

GROOT uses a subcommand syntax: call the main program with groot, followed by the subcommand that indicates which action to take.

This page covers a worked example, details of the available GROOT subcommands and some tips for using the program.

For more information on the graphs that GROOT uses, please read the groot-graphs page.


An example

This example will take us from raw reads to a resistome profile for a single metagenome.

Get some sequence data:

fastq-dump SRR4454613

Get a pre-clustered ARG database:

groot get -d resfinder

Create variation graphs from the ARG database and index them:

groot index -m resfinder.90 -i grootIndex -w 100 -p 8

Align the reads and report ARGs:

groot align -i grootIndex -f SRR4454613.fastq -p 8 | groot report -c 0.95
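
The four steps above can be wrapped in one shell function so that the accession and processor count are set in a single place. This is only a sketch: run_example is a hypothetical name, it assumes fastq-dump and groot are on the PATH, and it reuses exactly the flags shown in the example.

```shell
# Sketch only: run_example is a hypothetical wrapper, not part of GROOT.
# Assumes fastq-dump and groot are both available on the PATH.
run_example() {
    accession="$1"   # e.g. SRR4454613
    threads="$2"     # e.g. 8

    # Download the reads and the pre-clustered database
    fastq-dump "$accession" || return 1
    groot get -d resfinder || return 1

    # Build and index the variation graphs
    groot index -m resfinder.90 -i grootIndex -w 100 -p "$threads" || return 1

    # Align the reads and stream the result straight into the report step
    groot align -i grootIndex -f "$accession.fastq" -p "$threads" \
        | groot report -c 0.95
}

# Usage: run_example SRR4454613 8
```

The function stops at the first failing step, so a failed download does not lead to an index or alignment being attempted.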

GROOT subcommands

get

The get subcommand is used to download a pre-clustered ARG database that is ready to be indexed. Here is an example:

groot get -d resfinder -o . --identity 90

The above command will download the ResFinder database, which has been clustered at 90% identity, check its integrity and unpack it to the current directory. The database will be named resfinder.90 and contain several *.msa files.

Flags explained:

  • -d: which database to get
  • -o: directory to save the database to
  • --identity: the identity threshold at which the database was clustered (note: only 90 currently available)

The following databases are available:

  • arg-annot (default)
  • resfinder
  • card
  • groot-db
  • groot-core-db

These databases were clustered by sequence identity and stored as Multiple Sequence Alignments (MSAs). See groot-databases for more info.
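
Since a downloaded database is a directory of *.msa files, a quick check before indexing can catch an incomplete or misplaced download. The helper below is a minimal sketch; count_msa is a hypothetical name and is not part of GROOT.

```shell
# Sketch only: count_msa is a hypothetical helper, not a GROOT command.
# Prints the number of *.msa files in a database directory and
# returns non-zero if there are none.
count_msa() {
    db_dir="$1"
    n=0
    for f in "$db_dir"/*.msa; do
        # When nothing matches, the glob is left unexpanded; skip it
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
    [ "$n" -gt 0 ]
}

# Usage: count_msa resfinder.90 || echo "no MSA files found" >&2
```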

index

The index subcommand is used to create variation graphs from a pre-clustered ARG database and then index them. Here is an example:

groot index -m resfinder.90 -i grootIndex -w 100 -p 8

The above command will create a variation graph for each cluster in the resfinder.90 database and initialise an LSH Forest index. It will then move through each graph traversal using a 100-node window, creating a MinHash sketch for each window. Finally, each sketch is added to the LSH Forest index. The index will be named grootIndex and contain several files.

Flags explained:

  • -m: a directory of MSA files (the database from groot get)
  • -i: where to save the index
  • -w: the window size to use (should be similar to the length of query reads)
  • -p: how many processors to use
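
Because -w should be similar to the length of the query reads, it can help to measure the reads before indexing. The sketch below prints the rounded mean read length of a FASTQ file; mean_read_length is a hypothetical helper, and it assumes the usual four-lines-per-record layout with unwrapped sequences.

```shell
# Sketch only: mean_read_length is a hypothetical helper, not part of GROOT.
# Assumes a 4-line-per-record FASTQ file (sequence on every 2nd line of 4).
mean_read_length() {
    awk 'NR % 4 == 2 { total += length($0); n++ }
         END {
             if (n > 0) printf "%d\n", total / n + 0.5   # round to nearest
             else print 0
         }' "$1"
}

# Usage:
#   groot index -m resfinder.90 -i grootIndex \
#       -w "$(mean_read_length SRR4454613.fastq)" -p 8
```

The mean is only a rough guide; if read lengths vary widely, it may be worth inspecting the distribution before choosing a window size.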

The same index can be used on multiple samples; however, the database should be re-indexed if you wish to change the seeding parameters.

Some more flags that can be used:

  • -k: size of k-mer to use for MinHashing
  • -s: size of MinHash sketch
  • -x: number of partitions in the LSH Ensemble index
  • -y: maxK in the LSH Ensemble index
  • --maxSketchSpan: max number of identical neighbouring sketches permitted in any graph traversal

Important: GROOT can only accept MSAs as input. You can cluster your own database or use groot get to obtain a pre-clustered one.

align

The align subcommand is used to align reads against the indexed variation graphs. Here is an example:

groot align -i grootIndex -f file.fastq -t 0.97 -p 8 > ARG-reads.bam

The above command will seed the FASTQ reads against the indexed variation graphs. It will then perform a hierarchical local alignment of each seed against the variation graph traversals. The output alignment is essentially the set of ARG-classified reads (which may be useful in its own right) and can then be used to report full-length ARGs (using the report subcommand).

Flags explained:

  • -i: which index to use
  • -f: what FASTQ reads to align
  • -t: the containment threshold for seeding reads
  • -p: how many processors to use
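
As one index can be reused across samples, several FASTQ files can be aligned in a loop, one BAM file per sample. The sketch below assumes each input file ends in .fastq and that groot is on the PATH; align_samples and the output naming are hypothetical choices.

```shell
# Sketch only: align_samples is a hypothetical wrapper, not part of GROOT.
# Aligns each FASTQ file separately against the same index and writes
# <sample>.bam into the current directory.
align_samples() {
    index="$1"
    threads="$2"
    shift 2
    for fq in "$@"; do
        sample=$(basename "$fq" .fastq)
        groot align -i "$index" -f "$fq" -p "$threads" > "$sample.bam" || return 1
    done
}

# Usage: align_samples grootIndex 8 sampleA.fastq sampleB.fastq
```

Unlike passing several files to a single -f, this keeps each sample in its own alignment file.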

Data can be streamed in and out of the align subcommand. For example:

gunzip -c file.gz | groot align -i grootIndex -p 8 | groot report

Multiple FASTQ files can be specified as input; however, all are treated as the same sample and paired-end information is not used. To specify multiple files, make sure they are comma-separated (-f fileA.fq,fileB.fq) or use gunzip/cat with a wildcard (gunzip -c *.fq.gz | groot…).
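
When the file list comes from a shell glob or a variable, the comma-separated value for -f can be built with a small helper. This is a sketch; join_fastq is a hypothetical name and is not part of GROOT.

```shell
# Sketch only: join_fastq is a hypothetical helper, not part of GROOT.
# Joins its arguments with commas, giving the form that -f expects.
join_fastq() {
    joined=""
    for fq in "$@"; do
        # Prepend a comma only when something has already been added
        joined="${joined:+$joined,}$fq"
    done
    echo "$joined"
}

# Usage:
#   groot align -i grootIndex -f "$(join_fastq fileA.fq fileB.fq)" -p 8
```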

Some more flags that can be used:

  • --noAlign: if set, no exact alignment will be performed (graphs will still be weighted using approximate read mappings)

report

The report subcommand is used to process graph traversals and generate a resistome profile for a sample. Here is an example:

groot report --bamFile ARG-reads.bam -c 1

Flags explained:

  • --bamFile: the input BAM file (output from the groot align subcommand)
  • -c: the coverage needed to report an ARG (e.g. 0.95 = 95% ARG bases covered by reads)

Some more flags that can be used:

  • --lowCov: overrides the -c option and will report ARGs which may not be covered at the 5’/3’ ends
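
Because the alignment is stored in a BAM file, the report step can be repeated with different coverage thresholds without realigning. The sketch below assumes that groot report writes its results to standard output; report_thresholds and the output file names are hypothetical.

```shell
# Sketch only: report_thresholds is a hypothetical wrapper, not part of GROOT.
# Assumes groot report prints its results to standard output.
report_thresholds() {
    bam="$1"
    shift
    for cov in "$@"; do
        # One output file per coverage threshold, e.g. report.c0.95.txt
        groot report --bamFile "$bam" -c "$cov" > "report.c$cov.txt" || return 1
    done
}

# Usage: report_thresholds ARG-reads.bam 0.95 1
```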