Wandy: A program for CNV/Aneuploidy detection from WGS sequencing data
Wandy is designed for Copy Number Variation (CNV) and aneuploidy detection in large genomes such as human. It takes a sorted BAM file as input, reports predicted chromosome regions that have amplifications or deletions using the log2 ratio, and generates graphical reports.
There are two download packages available:
Wandy contains five major steps:
- BAM2BIN conversion: the usually very large sorted and indexed BAM files are converted to much smaller per-genomic-bin coverage files (default bin size = 10kb; adjustable in the configuration). The raw coverage in each bin is computed by counting the read-starts that fall into that bin, with the constraint that only aligned reads with a mapping quality score greater than a pre-specified threshold (default MAPQ value = 30) are counted.
- Bin-filtering: To avoid irreproducibility caused by regions of the genome that are either difficult to map or contain too many repeats, we created a blacklist of these regions with the following procedure: whole-genome sequencing data of germline samples from normal patients were processed according to the previous step to obtain coverage per bin. Across these samples, the median and MAD (median absolute deviation) of bin coverage were summarized per bin. Bins that do not meet a heuristic threshold are discarded. The blacklist also includes centromeres.
- Bias-corrections: we mainly correct for PCR-induced GC-coverage bias. This bias has to be adjusted to make different samples comparable and to avoid producing false positives. Since sequencing coverage often depends strongly on GC content, we correct GC bias with a smooth-spline model.
- Quality-metrics summary: based on our experience, several critical QC metrics are summarized in three major categories:
- sequencing depth
- Original sequencing depth: total number of read-starts
- Effective sequencing depth after GC-correction: the total corrected coverage after GC-correction
- GC-coverage bias
- Max/min ratio of GC-coverage bias:
- Max/min linear slope of GC-coverage bias:
- GC-coverage nonlinear distortion percentage
- sample-level variability
- Smoothness of normalized coverage
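Here, to make samples with different coverage comparable, we first define a normalized coverage based on the corrected coverage:
nmlz.cvg = crt.cvg / median(crt.cvg)
The smoothness score is then the MAD of the differences in normalized coverage between adjacent bins:
smooth.score = MAD(nmlz.cvg_i - nmlz.cvg_(i+1))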
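As an illustration of the two formulas above, here is a minimal Python sketch. It is not Wandy's own code (Wandy is written in Java, R and shell); the function names are hypothetical, and it assumes an unscaled MAD. R's `mad()` multiplies by 1.4826 by default, and the text does not say which variant Wandy uses.

```python
from statistics import median


def normalize_coverage(crt_cvg):
    """nmlz.cvg = crt.cvg / median(crt.cvg), one value per bin."""
    med = median(crt_cvg)
    if med == 0:
        raise ValueError("median corrected coverage is zero")
    return [c / med for c in crt_cvg]


def mad(values):
    """Unscaled median absolute deviation: median(|x - median(x)|)."""
    values = list(values)
    center = median(values)
    return median([abs(v - center) for v in values])


def smooth_score(crt_cvg):
    """MAD of the differences between adjacent normalized bins."""
    nmlz = normalize_coverage(crt_cvg)
    diffs = [a - b for a, b in zip(nmlz, nmlz[1:])]
    return mad(diffs)
```

A perfectly flat coverage profile gives a score of 0, and noisier profiles give larger scores, so the score can be compared across samples regardless of their sequencing depth.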
- CNV-calling and A-score computation
Although aneuploidy detection is closely related to copy number variation (CNV) detection, we encountered additional challenges beyond the conventional CNV detection problem:
- Clinical samples are heterogeneous in terms of tissue sources and DNA quality. Therefore, the sample level variability has to be taken into consideration.
- With low sequencing coverage used in this study, the copy number detection may be prone to outliers or sequencing errors.
- Unlike the standard CNV problem, which relies on consecutive segments of similar signal intensity (hybridization intensity for microarrays, read depth for NGS), our purpose is to summarize a global aneuploidy score that reflects overall genomic instability on a single-sample basis.
To address the above-mentioned issues, we developed the A-Score.
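The first two steps described above (read-start counting per bin and median/MAD bin filtering) can be sketched in Python as follows. This is a simplified illustration, not Wandy's implementation: reads are assumed to be already parsed into hypothetical `(chrom, start, mapq)` tuples rather than read from a BAM file, raw counts are compared without depth normalization, and the thresholds `min_median` and `max_rel_mad` are placeholders because the text only mentions "a heuristic threshold".

```python
from collections import Counter
from statistics import median

BIN_SIZE = 10_000  # default bin size given in the text
MIN_MAPQ = 30      # default MAPQ threshold given in the text


def count_read_starts(reads, bin_size=BIN_SIZE, min_mapq=MIN_MAPQ):
    """Count read starts per (chrom, bin_index).

    `reads` is an iterable of (chrom, start, mapq) tuples with 0-based
    start positions (hypothetical pre-parsed records).
    """
    counts = Counter()
    for chrom, start, mapq in reads:
        # The text says "greater than" the threshold, so use a strict test.
        if mapq > min_mapq:
            counts[(chrom, start // bin_size)] += 1
    return counts


def mad(values):
    """Unscaled median absolute deviation."""
    values = list(values)
    center = median(values)
    return median([abs(v - center) for v in values])


def blacklist_bins(per_sample_counts, bins, min_median=1.0, max_rel_mad=0.5):
    """Flag bins with low or unstable coverage across normal samples.

    Both thresholds are hypothetical placeholders for illustration.
    """
    flagged = set()
    for b in bins:
        cov = [counts.get(b, 0) for counts in per_sample_counts]
        med = median(cov)
        if med == 0 or med < min_median or mad(cov) / med > max_rel_mad:
            flagged.add(b)
    return flagged
```

Bins returned by `blacklist_bins` would be dropped before GC correction, together with any fixed regions such as centromeres.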
Wandy is a program that predicts aneuploidy and CNV from Illumina whole-genome sequencing data and outputs a genome-wide CNV plot (*.png), CNV plots for each chromosome (*.pdf), and a table of CNV segments (*.txt) that records the start, end, and other attributes of each segment. It is developed in a Linux environment using Java, R, and shell scripts, and relies on the third-party tool SAMtools. Prerequisite tools are specified in configuration files, which allow users to customize their workflow. It supports parallel computing under Sun Grid Engine (SGE) for processing multiple samples simultaneously. The total computing time and maximum memory usage are generally determined by the sample with the highest sequencing depth rather than by the number of input samples, so it normally takes a few hours to process hundreds of samples. Comprehensive log files are generated to monitor processes and report errors. A variety of data quality metrics are summarized in tables and figures, such as chromosome-based GC correction and coverage normalization (please refer to our user manual for details). It has been developed and tested with the human reference genomes hg19 and hg38. The Wandy package, user manual, and a small test dataset can be downloaded from https://bioinformaticstools.mayo.edu/research/Wandy.
Java, Samtools, R, and R lib ‘gplots’
1) Go to the directory where you want to install Wandy
2) Uncompress Wandy.tar.gz
3) Open tool.info in the source directory and set the absolute paths of the prerequisite programs/modules
4) The file ‘Wandy’ is the program entry point
A sorted and indexed BAM file generated from Illumina WGS data.
1) Go to the folder where you want to save your results (your work directory).
2) Type one of the commands below:
i) /absolute/path/to/Wandy -i /path/to/my.bam
ii) /absolute/path/to/Wandy -i /path/to/my/bam/files/directory/
3) The output will be in your current directory
i) sampleID_within_run.id_WGS_corrected_CNV.png: a genome-wide overview of CNV (each dot represents a bin)
ii) sampleID_within_run.id_CNV_segmentation.txt: predicted CNV segments
iii) sampleID_within_run.id_CNV_seg.pdf: a plot of the predicted CNV segments
-i [required] a bam file or directory containing bam files
-t [required] sample type (1: human solid cell for hg19, 2: human cell free/plasma for hg19, 3: human solid cell for hg38, 4: human cell free/plasma for hg38)
-l [optional] read length (default 50bp)
-q [optional] mapping quality of reads that will be taken into account (default 30)
-e [optional] read type (paired-end: 2, single-end: 1, default: 1)
-B [optional] bin size (base pair) of input BED file (use 10000bp or 500bp, default 10000)
-F [optional] overwrite a previous run
-S [optional] include BAM files in subdirectory
-M [optional] contact email (i.e., your email)
-h [optional] help
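When Wandy is launched from a script, the options listed above can be assembled programmatically. The helper below is a hypothetical Python sketch, not part of the Wandy package: it only builds the argument list from the documented value-taking flags (-i, -t, -l, -q, -e, -B) and does not run anything. The paths are the placeholder paths used in the examples above.

```python
import shlex

ALLOWED_BIN_SIZES = (10000, 500)  # bin sizes named in the option list


def build_wandy_command(wandy_path, bam_input, sample_type, *,
                        read_length=None, mapq=None,
                        read_type=None, bin_size=None):
    """Return the Wandy command line as a list of arguments.

    Only flags documented in the option list are used; options left as
    None are omitted so that Wandy applies its own defaults.
    """
    if bin_size is not None and bin_size not in ALLOWED_BIN_SIZES:
        raise ValueError("bin size should be 10000 or 500")
    cmd = [wandy_path, "-i", bam_input, "-t", str(sample_type)]
    optional = (("-l", read_length), ("-q", mapq),
                ("-e", read_type), ("-B", bin_size))
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Example: solid-tissue hg19 sample with 500bp bins.
example = build_wandy_command("/absolute/path/to/Wandy", "/path/to/my.bam",
                              1, mapq=30, bin_size=500)
print(shlex.join(example))
```

The resulting list can be passed to `subprocess.run` from the intended work directory, since Wandy writes its output to the current directory.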
1) Your current directory will be your working directory; all intermediate files and final results will be generated under it.
2) If your input is a folder of BAM files, the program will take every BAM file as input, so make sure all of them are sorted and indexed.
3) As a rough guide, a 4G (BAM file size) low-coverage WGS BAM normally takes about 30 minutes to run, and a 300G (BAM file size) high-coverage WGS BAM takes about 10 hours.
4) If you encounter any issue, please contact Chen.email@example.com or Wang.firstname.lastname@example.org
1) The test sample can be downloaded from https://bioinformaticstools.mayo.edu/research/
2) The test sample directory includes two subdirectories: one for the test sample, another for the results of the test sample
Plot of A scores against alpha and beta values, and A scores per segment on each chromosome
table of A scores against alpha and beta values
number of read starts per bin after GC correction and normalization
summary of corrected mapped read numbers per chromosome, and relative mapped read numbers per chromosome over the whole genome
CNV segmentation regions and scores
num.mark: number of bins that support this segment.
plot of each CNV segmentation region
GC bias and how it is corrected
GC plot of whole genome and coverage plot per chromosome before/after GC correction
total raw reads in each chromosome
a table of sample level summary
case.fraction : estimated fetal fraction for pregnancy samples
YXratio : chrY and chrX coverage ratio
classified.gender: Classified gender based on YXratio
max.min.GCCovg.ratio : GC bias metric
max.min.GCCovg.slope : GC bias metric
GC.area.distortion.perc : GC bias metric
mad.diff.nmlz.cvg : variability (MAD) of the difference in normalized coverage between adjacent bins
Genome wide raw read coverage plot
Genome wide gc content corrected read coverage plot
Genome wide smoothed gc content corrected read coverage plot
15) test_sample_within_run.id_WGS_estm_CNV.png (when stat ref file is used)
Genome wide estimated CNV (compared to reference samples) plot
16) test_sample_within_run.id_WGS_smoothed_estm_CNV.png (when stat ref file is used)
Genome wide smoothed estimated CNV (compared to reference samples) plot
intermediate file containing the coverage percentage of N (default 50) combined 10kb bins
Page last modified: November 3, 2020