wolfpack: running a virus scan

This can be run on a single file or on a directory. It will try to guess from the naming scheme if it is a Miseq output directory (i.e. with Data/Intensities/BaseCalls/ structure) and analyze all fastq files in there. The extension must be .fastq or .fastq.gz. It will then run a filtering step based on quality, length and entropy (in short: reads with a lot of repeats will be discarded), followed by a decontamination step where reads of human/bacterial/bovine/fungal origin will be discarded. Finally, remaining reads are blasted against the viral database. The list of organisms with the count of reads is in files orgs_list.csv in the output directory (naming is virmet_output_...). For example, if we have a directory named exp_01 with files

exp_01/AR-1_S1_L001_R1_001.fastq.gz
exp_01/AR-2_S2_L001_R1_001.fastq.gz
exp_01/AR-3_S3_L001_R1_001.fastq.gz
exp_01/AR-4_S4_L001_R1_001.fastq.gz

we could run

virmet wolfpack --run exp_01

and, after some time, find the results in virmet_output_exp01. Many files are present, the most important ones being orgs_list.csv and stats.tsv. The first lists the viral organisms found with a count of reads that could be matched to them.

[user@host test_virmet]$ cat virmet_output_exp_01/AR-1_S1/orgs_list.tsv
organism    reads
Human adenovirus 7  126
Human poliovirus 1 strain Sabin 45
Human poliovirus 1 Mahoney  29
Human adenovirus 3+11p  19
Human adenovirus 16 1

The second file is a summary of all reads analyzed for this sample and how many were passing a specific step of the pipeline or matching a specific database.

[user@host test_virmet]$ cat virmet_output_exp01/AR-1_S1/stats.tsv
raw_reads       6250
trimmed_too_short       462
low_entropy     1905
low_quality     0
passing_filter  3883
matching_humanGRCh38    3463
matching_bact1  0
matching_bact2  0
matching_bact3  0
matching_fungi1 0
matching_bt_ref 0
reads_to_blast  420
viral_reads     257
undetermined_reads      163

Additional files

At the end of a run a directory for each sample (fastq file analyzed) is created containing the following files:

good_humanGRCh38_bact1_bact2_bact3_fungi1_bt_ref.cram
good_humanGRCh38_bact1_bact2_bact3_fungi1_bt_ref.err
...
good_humanGRCh38_bact1.cram
good_humanGRCh38_bact1.err
good_humanGRCh38.cram
good_humanGRCh38.err

orgs_list.tsv
prinseq.err
prinseq.log
stats.tsv
undetermined_reads.fastq.gz
unique.tsv.gz
viral_reads.fastq.gz

Files orgs_list.tsv and stats.tsv report the main output of the tool as reported above, while unique.tsv.gz reports blast hits to viral database.

As the names say, viral_reads.fastq.gz and undetermined_reads.fastq.gz contain, respectively, reads identified as of viral origin and reads not matching any of the considered genomes.

prinseq.err and prinseq.log are, respectively, the standard error and log file of prinseq, used to filter reads. By inspecting this log file, VirMet determines how many reads were discarded because of low entropy or low quality.

In the decontamination step, reads are aligned against human genome first, those matching are discarded while those not matching are aligned against the first set of bacterial genomes, and so on. File good_humanGRCh38.cram is the alignment of high quality reads (good) to human genome, saved in CRAM format. File good_humanGRCh38_bact1.cram contains the alignment to bacterial genomes in set bact1 of high quality reads (good) minus those that were identified as matching human genome, and so on. File ending in err contain the standard error of the conversion bam -> cram.

A typical cram workflow, also used in VirMet, can be found here.

Hot run (not fully tested)

A virus scan on a full MiSeq run typically lasts a few hours, many of which are spent in the decontamination phase. Sometimes, after a run is completed, we would like to run it again with a new viral database. In these cases, wolfpack would run skipping the previous phases to save time. It relies on the presence of intermediate files that, if present, signals the pipeline that a specific step must be skipped.

These are the rules (must be intended for each sample):

  • if both prinseq.log and prinseq.err exist, skip quality filtering,
  • if good_humanGRCh38.err exists, skip the human reads decontamination,
  • if good_humanGRCh38_bact1.err exists, skip the decontamination against first bacterial database, and so on.

Blasting against viral database will always be performed. If both viral_reads.fastq.gz and undetermined_reads.fastq.gz exist, their content will be copied into a file, they will be removed, and this new file will be blasted against the viral database.

In short, if we change the viral database after a run has already been analyzed, simply running virmet wolfpack again will skip the quality filtering and go straight to blast against viral database.