Setup and configurations

Remember that you need both the GMS Poppy and the poppy-uppsala repositories.

Requirements

Recommended hardware

  • CPU: >10 cores per sample
  • Memory: 6GB per core
  • Storage: >75GB per sample

Note: Running the pipeline with fewer resources may work, but has not been tested.
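The recommendations above can be turned into a rough estimate for a batch of samples. The following is a minimal sketch, assuming that the samples are processed concurrently and that the lower bounds (10 cores and 75GB per sample) are used; the batch size N_SAMPLES is a hypothetical value.

```shell
# Sketch: estimate the recommended resources for a batch of samples.
# Lower bounds from the list above: 10 cores and 75 GB of storage per sample,
# and 6 GB of memory per core.
N_SAMPLES=4   # hypothetical batch size, adapt to your run

CORES=$(( N_SAMPLES * 10 ))
MEMORY_GB=$(( CORES * 6 ))
STORAGE_GB=$(( N_SAMPLES * 75 ))

echo "samples=${N_SAMPLES} cores=${CORES} memory_gb=${MEMORY_GB} storage_gb=${STORAGE_GB}"
```

These numbers are only a planning aid; the actual requirements depend on the data and on how the jobs are scheduled.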

Software

Nice to have

Installation

Note: The steps presented hereafter are intended to be run on a Linux system with restricted access to the internet, typically a compute server where the login nodes have internet access but the compute nodes run offline. Consequently, online resources like Github repositories or images from Dockerhub must be downloaded before running the pipeline.

Clone both the GMS Poppy repo and the poppy-uppsala git repo

We recommend that the poppy-uppsala repository is cloned to your working directory, on the same level as the GMS Poppy repository.

# Set up a working directory path
WORKING_DIRECTORY="/path_to_working_directory"

Lists of releases can be found at:

  • GMS Poppy pipeline: Releases
  • poppy-uppsala pipeline: Releases

Choose the release you need for both GMS Poppy and poppy-uppsala, for instance v0.2.0 and v0.2.1:

# Set versions
VERSION="v0.2.0"
VERSION_UU="v0.2.1"

# Clone the selected versions; use the SSH URL if you have configured a local SSH key for Github (preferred)
# Each repository gets its own subdirectory, since git refuses to clone into a non-empty directory
git clone --branch "${VERSION}" https://github.com/genomic-medicine-sweden/poppy.git "${WORKING_DIRECTORY}/poppy"
git clone --branch "${VERSION_UU}" https://github.com/clinical-genomics-uppsala/poppy_uppsala.git "${WORKING_DIRECTORY}/poppy_uppsala"

Create python environment

To run the poppy-uppsala pipeline, a Python virtual environment is needed.

# Enter working directory
cd ${WORKING_DIRECTORY}

# Create a new virtual environment
python3 -m venv ${WORKING_DIRECTORY}/venv-poppy-uu

Install pipeline requirements

Activate the virtual environment and install pipeline requirements specified in requirements.txt.

# Enter working directory
cd ${WORKING_DIRECTORY}

# Activate python environment
source venv-poppy-uu/bin/activate

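# Install requirements; pip resolves requirements.txt relative to the current directory,
# so run this from the directory that contains the file
pip install -r requirements.txt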
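Before installing, it can be useful to confirm that the virtual environment is really active in the current shell. The following is a minimal sketch that relies on the VIRTUAL_ENV variable exported by the venv activate script; the function name venv_status is hypothetical.

```shell
# Sketch: report whether a Python virtual environment is active.
# The activate script of venv exports VIRTUAL_ENV; the variable is unset otherwise.
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active: ${VIRTUAL_ENV}"
    else
        echo "inactive"
    fi
}

venv_status
```

If the function prints "inactive", run the source command above again before calling pip.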

Setup required data and config

Download the data

# Make sure that hydra-genetics is available
# Make sure that TMPDIR points to a location with plenty of storage;
# it is needed to fetch the reference data
# export TMPDIR=/PATH_TO_STORAGE
hydra-genetics --debug --verbose references download \
    -o design_and_ref_files \
    -v config/references/references.hg19.yaml \
    -v config/references/design_files.hg19.yaml \
    -v config/references/nextseq.hg19.pon.yaml
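Since the download needs a lot of temporary storage, the free space behind TMPDIR can be checked first. The following is a minimal sketch using df and awk; the function name free_gb is hypothetical, and the amount of space that is actually needed is not stated here, so no threshold is enforced.

```shell
# Sketch: print the free space (in GB, rounded down) of the file system holding a directory.
free_gb() {
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available"
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Falls back to /tmp when TMPDIR is unset
free_gb "${TMPDIR:-/tmp}"
```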

Update the config

You need to specify the local paths to some of the required data, for instance which files should be used for the reference genome GRCh38. For an example of how those local files are specified in the configurations for poppy-uppsala, please see some reference section.

Note that poppy-uppsala overloads some config parameters of GMS Poppy and also adds new parameters to the dictionary of parameters, for instance:

  • Mosdepth coverage in exon regions only,
  • Home folder of the analysis,
  • Parameters for the bamsnap tool to take screenshots in IGV.

When starting the pipeline, the dictionary of parameters used in the workflow is created by Snakemake from the options in the command line snakemake [options] <snakefile> with:

  • the YAML entries in the --configfile for the pipeline,
  • the single parameters passed via the --config argument.
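The combination of the two sources can be sketched as a command line. In Snakemake, values given with --config set or overwrite the entries loaded from the config file. The Snakefile path, the config file path and the parameter name below are hypothetical placeholders, not the ones used by poppy-uppsala, and the command is only printed, not executed.

```shell
# Hypothetical values, for illustration only
SNAKEFILE="workflow/Snakefile"
CONFIGFILE="config/config.yaml"
OVERRIDE="some_parameter=some_value"

# Assemble the command as a string and print it, so that it can be inspected first
CMD="snakemake --snakefile ${SNAKEFILE} --configfile ${CONFIGFILE} --config ${OVERRIDE}"
echo "${CMD}"
```

Printing the command before running it makes it easier to see which single parameters will take precedence over the YAML entries.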

Pulling containers

In order to have the versions of the software used in the pipeline parsed correctly and reported in MultiQC, the following is required:

  • each rule must have a container specified in the container section of the rule,
  • the image must be available on Dockerhub or locally, and have a name in the form path_or_url/<image_name>:<version_tag>.

Example if pulling the image from Dockerhub

If the image is to be pulled from Dockerhub, for example with container: "docker://hydragenetics/bamsnap:0.2.19", then the <image_name> is bamsnap and the <version_tag> is 0.2.19.
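The naming convention can be illustrated with plain shell parameter expansion. This is only a sketch of how the two parts relate to the container string, not the way the pipeline or MultiQC parses it; the variable names are hypothetical.

```shell
# Container string from the example above
CONTAINER="docker://hydragenetics/bamsnap:0.2.19"

# Drop everything up to the last "/" to keep "<image_name>:<version_tag>"
NAME_AND_TAG="${CONTAINER##*/}"

# Split on the ":" separator
IMAGE_NAME="${NAME_AND_TAG%%:*}"
VERSION_TAG="${NAME_AND_TAG##*:}"

echo "image_name=${IMAGE_NAME} version_tag=${VERSION_TAG}"
```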

When starting the pipeline, the image is pulled automatically if it is not available locally, before the first rule is executed:

Building DAG of jobs...
Pulling singularity image docker://hydragenetics/bamsnap:0.2.19.
Using shell: /usr/bin/bash
Provided cores: ...

By default, Snakemake saves the images in the local cache directory .snakemake/singularity. You can change the location of the cache directory by:

  • using the --use-singularity --singularity-prefix options in the snakemake command to specify the cache directory,
  • setting the entry singularity-prefix in config/profile: singularity-prefix: "path/to/singularity/cache".

With the bamsnap example above, you will see the following printout when the rule bamsnap is executed:

[Timestamp]
rule bamsnap:
    input: 
        ...
    output: 
        ...
    jobid: ...
    reason: Missing output files: ...
    wildcards: ...
    resources: ...

Activating singularity image .snakemake/singularity/<a_very_long_id_with_[a-z0-9]>.simg

Example if using a pulled image in a local path

You may also use an image that is built locally, or that you manually pull first:

# Pull the image from Dockerhub
cd <your_location_for_images>
docker pull hydragenetics/bamsnap:0.2.19
# Convert the image to Singularity format
singularity build --force bamsnap_0.2.19.simg docker-daemon://hydragenetics/bamsnap:0.2.19

Then you can use the image in the pipeline by specifying the path to the image in the container section of the rule:

container: "your_location_for_images/bamsnap_0.2.19.simg"
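Because the compute nodes may be offline, it can help to confirm that the converted image exists before starting the pipeline. The sketch below derives the file name used in the example above from the image reference and checks for the file; the helper name simg_name and the IMAGE_DIR value are assumptions made for illustration.

```shell
# Hypothetical helper: map "<repo>/<image_name>:<version_tag>"
# to "<image_name>_<version_tag>.simg"
simg_name() {
    name_and_tag="${1##*/}"
    echo "${name_and_tag%%:*}_${name_and_tag##*:}.simg"
}

IMAGE_DIR="your_location_for_images"   # placeholder, as in the example above
IMAGE_PATH="${IMAGE_DIR}/$(simg_name "hydragenetics/bamsnap:0.2.19")"

# Report whether the image file is present
if [ -f "${IMAGE_PATH}" ]; then
    echo "found: ${IMAGE_PATH}"
else
    echo "missing: ${IMAGE_PATH}"
fi
```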

Input sample files

The pipeline uses sample input files (samples.tsv and units.tsv) that describe the samples, the sequencing metadata, and the location of the fastq files. The specification of the input files can be found at Poppy GMS schemas. Using the Python virtual environment created above, these files can be generated automatically with hydra-genetics create-input-files:

hydra-genetics create-input-files -d path/to/fastq-files/directory/

NB: You might need to adapt the regular expression used to parse the names of the fastq files; please refer to the documentation for the options, in particular -s and -n.
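After generating the files, a quick structural check can catch rows that were parsed incorrectly. The following is a generic sketch that only verifies that every row has as many tab-separated fields as the header; it does not validate the files against the schemas, and the function name check_tsv is hypothetical.

```shell
# Sketch: check that every row of a TSV file has as many fields as its header.
check_tsv() {
    awk -F '\t' '
        NR == 1 { n = NF; next }
        NF != n { printf "line %d: expected %d fields, found %d\n", NR, n, NF; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Check the generated files if they exist in the current directory
for f in samples.tsv units.tsv; do
    if [ -f "${f}" ]; then
        check_tsv "${f}" && echo "${f}: ok"
    else
        echo "${f}: not found"
    fi
done
```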

Information about known variants and custom filtering

Writing the Excel report requires some additional information that depends on the end usage of the reported data. It is up to each (group of) user(s) to adapt this additional information, which drives the filtering of the reported variants. The analysis used at CGU uses for instance:

  • A specific table with known variants that should be reported separately from other variants,
  • Files that list artifacts (machine-specific and/or variant caller-specific): local TSV files,
  • Files used to denoise the variant calls,
  • pindel regions to limit the computational cost: a local BED file,
  • Sets of normal references for CNVkit and GATK,
  • Custom annotation database for annotation with bcftools,
  • Custom filters that set which variants are shown by default when the report is opened: YAML files in ./config/filters, with a description line for each filter.

For an example of how those local files are specified in the configurations for poppy-uppsala, please see:

  • The configurations when the sequencing is done on a NextSeq machine,
  • The configurations when the sequencing is done on a NovaSeqX machine.