
---------------------------------------------------------------    README     --------------------------------------------------------------------------------

Wandy is a program that predicts aneuploidy and CNV from Illumina whole genome sequencing data. It outputs a CNV plot for the whole genome (*.png),
CNV plots for each chromosome (*.pdf), and CNV segments (*.txt) recording the start, end, etc. of each segment. It was developed in a Linux environment using Java,
R and shell scripts, and relies on the third-party module SAMTools. Prerequisite tools are specified in configuration files, which lets users customize their workflow.
It supports parallel computing under a Sun Grid Engine (SGE) environment for processing multiple samples simultaneously.
Total computing time and maximum memory usage are generally determined by the sample with the highest sequencing depth rather than by the number of input samples.
Thus, it normally takes a few hours to process hundreds of samples. Comprehensive log files are generated to monitor processes and report errors.
A variety of data quality metrics are summarized in tables and figures, such as chromosome-based GC correction and coverage normalization (please refer to our user manual for details).
It has been developed and fully tested with the human reference genome hg19. The Wandy package, user manual, and a small test dataset can be downloaded from
(http://bioinformaticstools.mayo.edu/research/Wandy).

Prerequisites:
	Java, R, Samtools

INSTALLATION:

 1) Go to the directory where you want to install Wandy
 2) Uncompress Wandy.tar
 3) Open tool.info in the source directory and set the absolute paths for the prerequisite programs/modules
 4) The file 'Wandy' is the program's start script
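
 Before editing tool.info (step 3), it can help to find where the prerequisite
 programs are installed. The Python sketch below is not part of Wandy. It assumes
 the executables are named java, Rscript and samtools and are reachable through
 your PATH; adjust the names if your system differs.

```python
import os
import shutil

# Assumed executable names for the prerequisites listed above (Java, R, Samtools).
PREREQUISITES = ("java", "Rscript", "samtools")


def locate_tools(names=PREREQUISITES):
    """Return {name: absolute path, or None if not found on PATH}."""
    located = {}
    for name in names:
        path = shutil.which(name)
        located[name] = os.path.abspath(path) if path else None
    return located


def missing_tools(located):
    """Return the sorted names of tools that could not be found."""
    return sorted(name for name, path in located.items() if path is None)
```

 The paths returned by locate_tools() can be copied into tool.info by hand. The
 layout of tool.info itself is not described here, so the sketch does not write it.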
 
INSTRUCTIONS:
  
 Input requirement:
    A sorted and indexed BAM file generated from Illumina WGS data.
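
    One way to confirm that a BAM file is sorted is to read the sort order recorded
    in its header. The Python sketch below is not part of Wandy. It assumes the
    header has an @HD line with an SO field, which samtools sort sets to
    "coordinate", and uses only the standard library.

```python
import gzip
import struct


def bam_sort_order(bam_path):
    """Return the SO value from the BAM header's @HD line, or None if absent."""
    # BAM files are BGZF-compressed, which the gzip module can read directly.
    with gzip.open(bam_path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError(f"{bam_path} does not look like a BAM file")
        # The magic bytes are followed by the header text length (little-endian int32).
        (text_length,) = struct.unpack("<i", handle.read(4))
        header = handle.read(text_length).decode("utf-8", errors="replace")
    for line in header.rstrip("\x00").splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None
```

    A return value of "coordinate" indicates a position-sorted file. Any other value,
    or None, suggests the file should be sorted and indexed again before running Wandy.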
 
 Usage:
 1) Go to the folder where you want to save your results (your working directory).
 2) Type one of the commands below:
	 i) /absolute/path/to/Wandy -i /path/to/my.bam
	ii) /absolute/path/to/Wandy -i /path/to/my/bam/files/directory/
 3) The output will be in your current directory
 
 Options:
 -i [required] a bam file or directory containing bam files
 -l [optional] read length (default 50bp)
 -q [optional] mapping quality of reads that will be taken into account (default 30)
 -e [optional] read type (paired-end: 2, single-end: 1, default: 1)
 -B [optional] bin size (in base pairs) of the input BED file (use 10000 or 500, default 10000)
 -F [optional] overwrite a previous run
 -S [optional] include BAM files in subdirectory
 -M [optional] contact email (i.e., your email)
 -h [optional] help
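
 When Wandy is launched from another script, the options above can be assembled
 programmatically. The Python sketch below is only an illustration. It assumes
 that each option takes its value as the next argument, as -i does in the Usage
 examples; the function names are hypothetical. Only options listed above are used.

```python
import subprocess


def build_wandy_command(wandy_path, bam_input, read_length=50, min_mapq=30,
                        paired_end=False, bin_size=10000):
    """Build the argument list for one Wandy run (defaults mirror the README)."""
    if bin_size not in (10000, 500):
        raise ValueError("bin size must be 10000 or 500")
    return [
        str(wandy_path),
        "-i", str(bam_input),               # BAM file or directory of BAM files
        "-l", str(read_length),             # read length
        "-q", str(min_mapq),                # mapping quality
        "-e", "2" if paired_end else "1",   # 2 = paired-end, 1 = single-end
        "-B", str(bin_size),                # bin size in base pairs
    ]


def run_wandy(command, work_dir):
    """Run Wandy with work_dir as the current directory, where output is written."""
    return subprocess.run(command, cwd=work_dir, check=True)
```

 Running with cwd set to the working directory matches step 1 of the Usage
 section, since Wandy writes its output to the current directory.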
 
 Notes:
 1) Your current directory is your working directory; all intermediate files and final results will be generated under it.
 2) If your input is a folder of BAM files, the program takes every BAM file in it as input, so make sure all of them are sorted and indexed.
 3) As a rough guide, a 4 GB low-coverage WGS BAM file normally takes about 30 minutes to run, and a 300 GB high-coverage WGS BAM file about 10 hours.
 4) If you have any issues, please contact Chen.xianfeng@mayo.edu or Wang.chen@mayo.edu
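
 For note 2, a quick scan of the input folder can list BAM files that have no
 index next to them. The Python sketch below is not part of Wandy. It assumes the
 common samtools naming, where the index of my.bam is my.bam.bai or my.bai in the
 same folder; how Wandy itself locates the index is not stated here.

```python
from pathlib import Path


def find_bams(directory, include_subdirs=False):
    """List *.bam files in a directory, optionally descending into subdirectories."""
    pattern = "**/*.bam" if include_subdirs else "*.bam"
    return sorted(Path(directory).glob(pattern))


def has_index(bam):
    """True if my.bam.bai or my.bai exists alongside my.bam."""
    bam = Path(bam)
    candidates = (bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai"))
    return any(candidate.is_file() for candidate in candidates)


def bams_missing_index(directory, include_subdirs=False):
    """Return the BAM files that have no index file next to them."""
    return [bam for bam in find_bams(directory, include_subdirs)
            if not has_index(bam)]
```

 Setting include_subdirs=True mirrors the intent of the -S option, which includes
 BAM files in subdirectories.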

TEST SAMPLE:
 1) A test sample can be downloaded from http://bioinformaticstools.mayo.edu/research/
 2) The test sample directory includes two subdirectories: one for the test sample and another for the results of the test sample




Wandy Scripts:

|-- readme.txt
|-- reference
|   |-- DNA_summary_10kbin_info_5_12_2017.txt
|   |-- summary_10kbin_info_100NIPS_5_12_2017.txt
|   |-- run_info_Feb2_2015.ini
|   `-- summary_10000bin_info_initial.txt
|-- script
|   |-- jar
|   |   `-- HumanGenomeReadInfoWithBedNipt.jar
|   `-- r
|       |-- Rcall.R
|       |-- Rcallrun.R
|       `-- Rlib_v0p98
|           |-- functions
|           |   |-- computeAscore.R
|           |   |-- correctGCbias.R
|           |   |-- getCNVseg.R
|           |   |-- processWholeGenomeCovg.R
|           |   |-- resampleCNVseg.R
|           |   |-- wandy.analyzeRun.R
|           |   |-- wandy.analyzeSample.R
|           |   `-- wandy.miscFunctions.R
|           |-- ver.txt
|           `-- wandyCNV.loadPackage.R
|-- tool.info
`-- Wandy


 1) readme.txt
	this readme file
 2) DNA_summary_10kbin_info_5_12_2017.txt
	generated reference file for germline samples
	Columns:
	chr
	start.pos
	end.pos
	GC_Content
	median
	MAD
	SNR
 3) summary_10kbin_info_100NIPS_5_12_2017.txt
	generated reference file for cell free samples
	Columns:
	chr
	start.pos
	end.pos
	GC_Content
	median
	MAD
	SNR

 4) run_info_Feb2_2015.ini
 5) summary_10000bin_info_initial.txt
	pre-generated reference file
	Columns:
	chr
	start
	stop
	median
	MAD
	SNR
	GC_Content
	usable_bin
 6) HumanGenomeReadInfoWithBedNipt.jar
	reads a BAM file and assigns the number of read starts to each bin
 7) Rcall.R
	starts the per-sample analysis
 8) Rcallrun.R
	starts the per-run summary
 9) computeAscore.R
10) correctGCbias.R
11) getCNVseg.R
12) processWholeGenomeCovg.R
13) resampleCNVseg.R
14) wandy.analyzeRun.R
15) wandy.analyzeSample.R
16) wandy.miscFunctions.R
17) ver.txt
	log file for script changes
18) wandyCNV.loadPackage.R
19) tool.info
	specifies paths to prerequisite modules/programs
20) Wandy
	the start script of Wandy 
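
 Item 6 in the list above describes counting read starts per bin. The Python
 sketch below illustrates that idea only and is not the code in the jar file. It
 assumes 0-based read start positions and fixed-width bins that begin at position
 0 of each chromosome.

```python
from collections import Counter


def count_read_starts(read_starts, bin_size=10000):
    """Count read starts per bin.

    read_starts: iterable of (chromosome, 0-based start position) pairs.
    Returns a Counter keyed by (chromosome, bin start position).
    """
    counts = Counter()
    for chrom, start in read_starts:
        # Integer division maps a position to the start of its bin.
        bin_start = (start // bin_size) * bin_size
        counts[(chrom, bin_start)] += 1
    return counts
```

 With the default 10000 bp bins, positions 0 to 9999 fall in the first bin and
 position 10000 starts the second.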


OUTPUTS:

|-- BinInfo
|   `-- test_sample_q30_b10000.txt
|-- run.id_wandy_ver0p95
|   `-- test_sample_within_run.id
|       |-- test_sample_within_run.id_Ascore_res.pdf
|       |-- test_sample_within_run.id_Ascore_res.txt
|       |-- test_sample_within_run.id_bin_normalized_covg.txt
|       |-- test_sample_within_run.id_Chr_summary.txt
|       |-- test_sample_within_run.id_CNV_segmentation.txt
|       |-- test_sample_within_run.id_CNV_seg.pdf
|       |-- test_sample_within_run.id_GC_bias_QC.pdf
|       |-- test_sample_within_run.id_GC_covg_perChr.pdf
|       |-- test_sample_within_run.id_rawChr_million_read.pdf
|       |-- test_sample_within_run.id_SampleLevel_stat.txt	
|       |-- test_sample_within_run.id_sampling_SegMtx.txt
|       |-- test_sample_within_run.id_WGS_covg.png
|       `-- test_sample_within_run.id_zoomout_BinCovg.txt	
|-- run_info_Feb2_2015.ini
`-- sample.info	


 1) test_sample_q30_b10000.txt
	number of read starts per bin
 2) test_sample_within_run.id_Ascore_res.pdf
	plot of A scores against alpha and beta values, and A scores per segment on each chromosome
 3) test_sample_within_run.id_Ascore_res.txt
	table of A scores against alpha and beta values
 4) test_sample_within_run.id_bin_normalized_covg.txt
	number of read starts per bin after GC correction and normalization
	Columns:
	Chr
	Start.Pos
	nmlz.cvg
	corrected.count
	orginal.count
	gc.content
	is.reliable.bin
 5) test_sample_within_run.id_Chr_summary.txt
	summary of corrected mapped read numbers per chromosome, and relative mapped read numbers per chromosome over the whole genome
	Columns:
	Chr
	total.crct.cvg
	crct.cvg.perc.vs.autosome
 6) test_sample_within_run.id_CNV_segmentation.txt
	CNV segmentation regions and scores
	Columns:
	ID
	chrom
	loc.start
	loc.end
	num.mark
	seg.mean
 7) test_sample_within_run.id_CNV_seg.pdf
	plot of each CNV segmentation region
 8) test_sample_within_run.id_GC_bias_QC.pdf
	GC bias and how it is corrected
 9) test_sample_within_run.id_GC_covg_perChr.pdf
	GC plot of whole genome and coverage plot per chromosome before/after GC correction
10) test_sample_within_run.id_rawChr_million_read.pdf
	total raw reads in each chromosome
11) test_sample_within_run.id_SampleLevel_stat.txt
	a table of sample level summary
	Columns:
	sample.ID
	run.ID
	sample.type
	contact.person.ID
	project.name
	bin.record.file
	sequencing.finish.time
	alignment.finish.time
	reported.gender
	case.fraction
	YXratio
	classified.gender
	max.min.GCCovg.ratio
	max.min.GCCovg.slope
	GC.area.distortion.perc
	mad.diff.nmlz.cvg
	total.raw.cvg.Million
	autosome.crct.cvg.Million
	ori.binsize
	CNV.binsize
	CNV.diff.MAD
12) test_sample_within_run.id_sampling_SegMtx.txt
	Columns: 
13) test_sample_within_run.id_WGS_covg.png
	Genome read coverage plot
14) test_sample_within_run.id_zoomout_BinCovg.txt
	Columns:
15) run_info_Feb2_2015.ini
	parameters used for CNV calling
16) sample.info
	Summary of input BAM files
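
 The CNV segmentation table (item 6 above) can be loaded for downstream filtering.
 The Python sketch below assumes the file is tab-delimited with a header row that
 uses the column names listed for that file. The delimiter and the cutoff used in
 the usage note are assumptions, not documented Wandy behavior.

```python
import csv


def read_segments(path):
    """Read a *_CNV_segmentation.txt file into a list of dicts with typed fields."""
    segments = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            segments.append({
                "ID": row["ID"],
                "chrom": row["chrom"],
                # int(float(...)) tolerates values written as 1e+06
                "loc.start": int(float(row["loc.start"])),
                "loc.end": int(float(row["loc.end"])),
                "num.mark": int(float(row["num.mark"])),
                "seg.mean": float(row["seg.mean"]),
            })
    return segments


def filter_segments(segments, min_abs_mean):
    """Keep segments whose |seg.mean| is at least min_abs_mean."""
    return [seg for seg in segments if abs(seg["seg.mean"]) >= min_abs_mean]
```

 For example, filter_segments(read_segments(path), 0.2) keeps segments whose mean
 deviates from zero by at least 0.2. The value 0.2 is purely illustrative; a
 suitable cutoff depends on the scale of seg.mean, which is not specified here.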

