wolfpack: running a virus scan
This can be run on a single file or on a directory. It will try to guess from
the naming scheme if it is a Miseq output directory (i.e. with
Data/Intensities/BaseCalls/ structure) and analyze all fastq files in there.
The extension must be
.fastq.gz. It will then run a filtering
step based on quality, length and entropy (in short: reads with a lot of
repeats will be discarded), followed by a decontamination step where reads
of human/bacterial/bovine/fungal origin will be discarded. Finally, remaining
reads are blasted against the viral database. The list of organisms with the
count of reads is in files
orgs_list.csv in the output directory
virmet_output_...). For example, if we have a directory named
exp_01 with files
exp_01/AR-1_S1_L001_R1_001.fastq.gz exp_01/AR-2_S2_L001_R1_001.fastq.gz exp_01/AR-3_S3_L001_R1_001.fastq.gz exp_01/AR-4_S4_L001_R1_001.fastq.gz
we could run
virmet wolfpack --run exp_01
and, after some time, find the results in
virmet_output_exp01. Many files are
present, the most important ones being
first lists the viral organisms found with a count of reads that could be
matched to them.
[user@host test_virmet]$ cat virmet_output_exp_01/AR-1_S1/orgs_list.tsv organism reads Human adenovirus 7 126 Human poliovirus 1 strain Sabin 45 Human poliovirus 1 Mahoney 29 Human adenovirus 3+11p 19 Human adenovirus 16 1
The second file is a summary of all reads analyzed for this sample and how many were passing a specific step of the pipeline or matching a specific database.
[user@host test_virmet]$ cat virmet_output_exp01/AR-1_S1/stats.tsv raw_reads 6250 trimmed_too_short 462 low_entropy 1905 low_quality 0 passing_filter 3883 matching_humanGRCh38 3463 matching_bact1 0 matching_bact2 0 matching_bact3 0 matching_fungi1 0 matching_bt_ref 0 reads_to_blast 420 viral_reads 257 undetermined_reads 163
At the end of a run a directory for each sample (fastq file analyzed) is created containing the following files:
good_humanGRCh38_bact1_bact2_bact3_fungi1_bt_ref.cram good_humanGRCh38_bact1_bact2_bact3_fungi1_bt_ref.err ... good_humanGRCh38_bact1.cram good_humanGRCh38_bact1.err good_humanGRCh38.cram good_humanGRCh38.err orgs_list.tsv prinseq.err prinseq.log stats.tsv undetermined_reads.fastq.gz unique.tsv.gz viral_reads.fastq.gz
stats.tsv report the main output of the tool as
reported above, while
unique.tsv.gz reports blast hits to viral database.
As the names say,
contain, respectively, reads identified as of viral origin and reads not
matching any of the considered genomes.
prinseq.log are, respectively, the standard error and log
file of prinseq, used to filter reads. By inspecting this log file, VirMet
determines how many reads were discarded because of low entropy or low quality.
In the decontamination step, reads are aligned against human genome first,
those matching are discarded while those not matching are aligned against
the first set of bacterial genomes, and so on.
good_humanGRCh38.cram is the alignment of high quality reads (good) to
human genome, saved in CRAM format. File
the alignment to bacterial genomes in set bact1 of high quality reads (good) minus
those that were identified as matching human genome, and so on. File ending in
err contain the standard error of the conversion bam -> cram.
A typical cram workflow, also used in VirMet, can be found here.
Hot run (not fully tested)
A virus scan on a full MiSeq run typically lasts a few hours, many of which are spent
in the decontamination phase. Sometimes, after a run is completed, we would like to
run it again with a new viral database. In these cases,
wolfpack would run skipping
the previous phases to save time. It relies on the presence of intermediate files that,
if present, signals the pipeline that a specific step must be skipped.
These are the rules (must be intended for each sample):
- if both
prinseq.errexist, skip quality filtering,
good_humanGRCh38.errexists, skip the human reads decontamination,
good_humanGRCh38_bact1.errexists, skip the decontamination against first bacterial database, and so on.
Blasting against viral database will always be performed. If both
undetermined_reads.fastq.gz exist, their content will be copied into a file,
they will be removed, and this new file will be blasted against the viral database.
In short, if we change the viral database after a run has already been analyzed, simply
virmet wolfpack again will skip the quality filtering and go straight to
blast against viral database.