SNV and INDEL calling, annotation and filtering

See the snv_indels hydra-genetics module documentation for more details on the softwares for variant calling, annotation hydra-genetics module for annotation and filtering hydra-genetics module for filtering. Default hydra-genetics settings/resources are used if no configuration is specified.

dag plot

Pipeline output files:

results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf
results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf
bam_dna/mutect2_indel_bam/{sample}_{type}.bam

SNV and INDEL calling

Small variants are called with GATK Mutect2 v4.1.9.0 and Vardict v1.8.3.

GATK Mutect2 variant calling

SNVs and INDELs are called by Mutect2 on individual chromosome bamfiles.

Configuration

Reference files

reference fasta genome
design bed region file (split by bed_split rule into chromosome chunks)

Cluster resources

Options	Value
time	"48:00:00"

GATK Mutect2 merging

The stats file from GATK Mutect2 calling are merged with GATK MergeMutectStats v4.1.9.0 and the vcf files are merged with bcftools concat v1.15.

GATK Mutect2 vcf soft filtering

Merged Mutect2 vcf files are softfiltered with GATK FilterMutectCalls v4.1.9.0, which puts filter flags in the vcf FILTER column.

GATK Mutect2 vcf hard filtering

Hardfilter Mutect2 vcf files based on the FILTER flags using the in-house script mutect_pass_filter.py (rule). and will only keep variants flagged as:

PASS
multiallelic

Vardict variant calling

SNVs and INDELs are called by Vardict on individual chromosome bamfiles.

Configuration

References

reference fasta genome
design bed region file (split by bed_split rule into chromosome chunks)

Software settings

Options	Value	Description
bed_columns	-c 1 -S 2 -E 3 -g 4	bed column definitions
extra	-Q 1	remove reads with 0 mapping quality
allele_frequency_threshold	0.01	minimal reported allele frequency

Allele frequency threshold set w.r.t. the noise level in the called variants, which itself depends on the sequencing machine and on the region of the genome that is analyzed.

Cluster resources

Options	Value
time	"48:00:00"

Vardict vcf merging

The Vardict vcf files from individual chromosomes are merged with bcftools concat v1.15.

Variant vcf decomposition and normalization

Variants called by Vardict and Mutect2 are decomposed by vt decompose followed by vt decompose_blocksub v2015.11.10. The vcf files are then normalized by vt normalize v2015.11.10.

Variant ensemble

Variant vcf files from the two callers are ensembled into one vcf file using bcbio-variation-recall ensemble v0.2.6. All variants from both caller are retained. When both callers call the same variant the INFO and FORMAT data is taken from the Vardict vcf file.

Configuration

Software settings

Options	Value	Description
support	-n 1	keep all variant call by at least one caller
sort_order	--names vardict, gatk_mutect2	priority order for retaining variant information

Annotation

The ensembled vcf file is annotated firstly using VEP, followed by artifact annotation and background annotation. See the annotation hydra-genetics module for additional information.

VEP

The ensembled vcf file is annotated using VEP v105. VEP adds a pletora of information for each variant which is specified by the configuration flags listed below. Of note are --pick which picks only one representative transcript for each variant, --af_gnomad which adds germline information, and --cache which uses a local copy of the databases for better performance. See VEP options for more information.

Configuration

References

VEP cache including all databases adapted for reference genome GRCh37 and VEP version 105

* Fasta reference genome

Software settings

Options	Value
vep_cache	path_to_vep_cache
mode	--offline --cache --merged
extra	--assembly GRCh38 --check_existing --pick --variant_class --everything --pick_order mane_select,mane_plus_clinical,canonical,biotype,rank,appris,tsl,ccds,length,ensembl,refseq

--everything: see documentation of VEP

Custom order of the criteria to pick the relevant annotation for a variant.

Resources

Options	Value
mem_mb	30720
mem_per_cpu	6144
threads	5
time	"6:00:00"

Artifact annotation

Identifying artifacts is crucial in a Tumor-only FFPE pipeline such as the GMS560 Twist Solid pipeline. The artifact annotation is performed using the in-house script artifact_annotation.py (rule). The annotation is based on variants called in a number of normal FFPE samples sequenced using the same panel and on the same sequencing machine type as the analysed tumor samples. See references for more information on how the Panel of Normal was created.

Example annotation for one variant added to a vcf file in the INFO field:

Field	Value	Description
Artifact	12,35,36	Nr of calls made in the PoN using Vardict, Mutect2, and total of samples in the PoN
ArtifactMedian	0.29,0.25	Median MAF of the calls
ArtifactNrSD	0.58,0.56	Number of standard deviation between the median allele frequency in the PoN and the call in the variant

Configuration

References

Panel of Normal with position specific artifact information for each caller and variant type

Hotspot annotation

Annotate clinically important variants in the vcf file using the in-house script add_hotspot_annotation.py (rule) and a hotspot list.

Configuration

Reference

Hotspot positions file

Background SNV annotation

In positions with high background noise it can be hard to distinguish low MAF variants. The background level for all SNVs is therefor added in the vcf file. The background annotation is performed using the in-house script background_annotation.py (rule). It is based on a panel of normal with position specific alternative alleles frequencies obtained from genome VCF files created by GATK Mutect2 v4.1.9.0. See references for more information on how the Panel of Normal was created.

Example annotation for one variant added to a vcf file in the INFO field:

Field	Value	Description
PanelMedian	1.0013	Median fraction of alternative alleles
PositionNrSD	12.17	Number of standard deviation between the Median fraction in the PoN and allele frequency of the call in the variant

Configuration

References

Panel of Normal with position specific background information

Filtering

Annotated vcfs are hard filtered first by removing regions outside exons and then filtered by a number of filtering criteria described below. See the filtering hydra-genetics module for additional information. A soft filtered version of the exonic regions is also provided for development and other investigations.

Extract exonic regions

Use bcftools filter -R v1.15 to extract variants overlapping exonic regions (including 20 bp padding) defined in a bed file which is a sub bed file of the general design bed file.

Configuration

References

Bed file with exonic regions including 20 bp padding

Hard filter vcf

The exonic vcf files are filtered using the hydra-genetics filtering functionality included in v0.15.0. The filters are specified in the config file config_hard_filter_uppsala.yaml and consists of the following filters:

Configuration

Software settings

Filter	Description
intron	Filter intronic variants with the following exceptions; splice variant, in genes MET or TERT, or in the COSMIC database
low vaf	Filter variants with variant allele frequency below 1%
artifacts	Filter variants found in more than 3 normal samples except if the allele frequency is more than 5 standard deviations above the median allele frequency found in the normals
background	Filter SNV variants where the positions background median noise plus 4 standard deviations is higher than the variant allele frequency except for variants in hotspots
germline	Filter germline variants when the GnomAD global population allele frequency is above 0.5%
variant observations	Filter variants with fewer than 20 supporting reads except for variants in hotsports and in the TERT gene which only need 10 and 4 supporting reads respectively

Combine SNVs in the same codon

Two or more variants affecting the same codon can have different clinical implications when considered individually compared to in combination. This is because the combined variant could end up coding for a different amino acid then the when only looking at the variant individually. Variants within the same codon are therefore combined and added to the vcf file using the in-house script add_multi_snv_in_codon.py (rule). Codon information is based on the VEP annotation. Annotation information is taken from the variant with the highest allele frequency. After adding the combined variants the vcf is sorted and annotated again.

Configuration

Software settings

Options	Value	Description
af_limit	0.00	No lower limit for allele frequency
artifact_limit	10000	Allow any number of observations (10000) in the PoN as they are already filtered

Result file

results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.vcf

QCI AF correction of vcf

The clinical interpretation tool QCI calculates allele frequency from the AD FORMAT field instead of using the AF FORMAT field supplied by the callers. This has shown to be wrong especially for INDELs. The AD field is therefore corrected so that the allele frequency based on the AD field corresponds to the AF field.

Result file

results/dna/{sample}_{type}/vcf/{sample}_{type}.annotated.exon_only.filter.hard_filter.codon_snv.qci.vcf

GATK Mutect2 variant bam file

When GATK Mutect2 finds INDEL candidates it realignes reads in this regions and outputs a realigned bam-file covering these INDEL regions. This makes it possible to inspect INDELs called by Mutect2 in IGV. As Mutect2 runs on individual chromosomes these bam-files are then merged, sorted and indexed before.

Result file

bam_dna/mutect2_indel_bam/{sample}_{type}.bam