Wandy: A program for CNV/Aneuploidy detection from WGS sequencing data
Wandy is designed for copy number variation (CNV) and aneuploidy detection in large genomes such as the human genome. It takes a sorted BAM file as input, reports predicted chromosome regions with amplifications or deletions based on the log2 ratio, and generates graphical reports.
There are two download packages available:
Wandy contains five major steps:
- BAM2BIN conversion: the sorted and indexed BAM files, which are usually very large, are converted to much smaller per-bin coverage files (default bin size = 10 kb; adjustable in the configuration). The raw coverage of each genomic bin is the number of read starts falling into that bin, counting only aligned reads whose mapping quality score is greater than a pre-specified threshold (default MAPQ = 30).
- Bin-filtering: to avoid irreproducible results from regions of the genome that are difficult to map or contain too many repeats, we created a blacklist of these regions with the following procedure: whole-genome sequencing data of germline samples from normal patients were processed as in the previous step to obtain coverage per bin. Across these samples, the median and MAD (median absolute deviation) of coverage were computed for each bin. Bins that do not meet a heuristic threshold are discarded.
- Bias-corrections: we mainly correct for PCR-induced GC-coverage bias, which must be adjusted to make different samples comparable and to avoid false positives. Since sequencing coverage often depends strongly on GC content, we correct the GC bias with a smoothing-spline model.
- Quality-metrics summary: based on our experience, several critical QC metrics are summarized in three major categories:
  - Sequencing depth
    - Original sequencing depth: total number of read starts
    - Effective sequencing depth after GC-correction: the total corrected coverage after GC-correction
  - GC-coverage bias
    - Max/min ratio of GC-coverage bias:
    - Max/min linear slope of GC-coverage bias:
    - GC-coverage nonlinear distortion percentage
  - Sample-level variability
    - Smoothness of normalized coverage
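Here, to make samples with different coverage comparable, we first define a normalized coverage based on the GC-corrected coverage (crt.cvg):
nmlz.cvg = crt.cvg / median(crt.cvg)
The smoothness score is then the MAD of the differences in normalized coverage between adjacent bins i and i+1:
smooth.score = MAD(nmlz.cvg[i] - nmlz.cvg[i+1])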
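As a rough illustration of the two formulas above, the following Python sketch computes the normalized coverage and the smoothness score from a list of GC-corrected bin coverages. The function names (`mad`, `smooth_score`) are hypothetical, and the MAD here is unscaled; an implementation based on R's `mad()` would by default multiply the result by 1.4826.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation of a sequence of numbers."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def smooth_score(crt_cvg):
    """MAD of the differences between adjacent normalized bin coverages.

    crt_cvg: GC-corrected coverage per bin, in genomic order
    (assumed to have a non-zero median and at least two bins).
    """
    med = median(crt_cvg)
    # nmlz.cvg = crt.cvg / median(crt.cvg)
    nmlz_cvg = [c / med for c in crt_cvg]
    # nmlz.cvg[i] - nmlz.cvg[i+1] for every pair of adjacent bins
    diffs = [a - b for a, b in zip(nmlz_cvg, nmlz_cvg[1:])]
    return mad(diffs)


if __name__ == "__main__":
    flat = [100, 100, 100, 100]
    noisy = [90, 110, 90, 110, 90]
    print(smooth_score(flat), smooth_score(noisy))
```

A perfectly flat profile gives a score of zero, and the score grows as neighbouring bins disagree more, which is why it can serve as a sample-level variability metric.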
- CNV-calling and A-score computation
Although aneuploidy detection is closely related to copy number variation (CNV) detection, we encountered additional challenges beyond those of conventional CNV detection:
- Clinical samples are heterogeneous in terms of tissue sources and DNA quality. Therefore, sample-level variability has to be taken into consideration.
- With the low sequencing coverage used in this study, copy number detection may be prone to outliers or sequencing errors.
- Unlike the standard CNV problem, which relies on consecutive segments of similar signal intensity (hybridization intensity for microarrays, read depth for NGS), our purpose is to summarize a global aneuploidy score that reflects overall genomic instability on a single-sample basis.
To address the above-mentioned issues, we developed the A-Score.
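For reference, the first two pipeline steps described earlier (read-start counting per bin and blacklist construction from normal samples) could be sketched in Python roughly as follows. This is not Wandy's actual code: reads are represented as plain `(chrom, start, mapq)` tuples rather than parsed from a BAM file, and the function names and the filtering thresholds (`min_median`, `max_rel_mad`) are assumptions, since the text only states that a heuristic threshold is used.

```python
from collections import Counter
from statistics import median

BIN_SIZE = 10_000  # default bin size stated in the text (10 kb)
MIN_MAPQ = 30      # default MAPQ threshold stated in the text


def count_read_starts(reads, bin_size=BIN_SIZE, min_mapq=MIN_MAPQ):
    """Count read starts per (chrom, bin_index).

    reads: iterable of (chrom, start, mapq) tuples, a hypothetical
    stand-in for aligned records read from a sorted BAM file.
    """
    counts = Counter()
    for chrom, start, mapq in reads:
        if mapq > min_mapq:  # strictly greater than the threshold
            counts[(chrom, start // bin_size)] += 1
    return counts


def mad(values):
    """Unscaled median absolute deviation."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def blacklist_bins(normal_counts, min_median=1.0, max_rel_mad=0.5):
    """Return bins whose coverage across normal samples looks unreliable.

    normal_counts: list of per-sample Counters from count_read_starts.
    min_median, max_rel_mad: hypothetical thresholds for illustration.
    Bins with no reads in any sample never appear in the input and
    would need to be handled separately.
    """
    bins = set().union(*normal_counts)
    bad = set()
    for b in bins:
        cov = [sample[b] for sample in normal_counts]
        med = median(cov)
        # Flag bins with too little coverage or too much spread.
        if med <= 0 or med < min_median or mad(cov) / med > max_rel_mad:
            bad.add(b)
    return bad
```

Bins returned by `blacklist_bins` would then be dropped from every sample before GC correction and CNV calling.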
Page last modified: March 7, 2017