
Introduction
---------------------------------------

The scripts provided here will allow you to perform differential expression 
analysis comparing two groups of samples (referred to later as group A and 
group B).

You must manually pre-process the Report/GeneCount.csv data into an
edgeR-compatible format and arrange the columns into groups prior to
execution.

Not all of the software required to execute this analysis is included in the
MAP-RSeq package. Please carefully review the documentation below to ensure a
successful execution.


Pre-Requisites
---------------------------------------

You must have the following installed prior to running this analysis:

1) Perl, typically packaged with your Linux distribution or available at:

	http://www.perl.org

2) The R Statistical Computing Package, also commonly available in many 
   Linux distributions.  A standalone version can be downloaded at:

	http://cran.us.r-project.org/

3) The Bioconductor edgeR package, available at:

	http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

	You can install it manually with the following commands:


	Start an interactive R session with:
	   R

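	Install the base Bioconductor package with:
	   source("http://bioconductor.org/biocLite.R")

	Install the edgeR module with:
	   biocLite("edgeR")

	On Bioconductor 3.8 and later, biocLite has been replaced by the
	BiocManager package. There, install edgeR with:
	   install.packages("BiocManager")
	   BiocManager::install("edgeR")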

	PLEASE NOTE:

	   If you have installed edgeR to a non-default location you
	   may need to customize the edgeR_pipe.R script.  Please see
	   the instructions below.
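
Before going further, it can help to confirm that the prerequisites listed
above are actually visible from your shell. The following is a minimal sketch,
not part of MAP-RSeq: the check_cmd helper is a hypothetical name, and the
sketch assumes that R was installed together with its Rscript front end and
that edgeR sits in one of R's default library locations.

```shell
# Report whether a command is on the PATH; returns non-zero if it is missing.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "MISSING: $1"
        return 1
    fi
}

missing=0
for cmd in perl Rscript; do
    check_cmd "$cmd" || missing=1
done

# Only try to load edgeR if R itself is available.
if command -v Rscript >/dev/null 2>&1; then
    if Rscript -e 'suppressMessages(library(edgeR))' >/dev/null 2>&1; then
        echo "found: edgeR"
    else
        echo "MISSING: edgeR"
        missing=1
    fi
fi

if [ "$missing" -eq 0 ]; then
    echo "All prerequisites found."
else
    echo "Some prerequisites are missing."
fi
```

If edgeR is reported missing even though you installed it, it is probably in
a non-default library location.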



Customizing the Code
---------------------------------------

The edgeR_pipe.R script may need to be updated to point at your edgeR
software install. This can vary depending on how you added it to 
your system. 

When installing as a non-root user, edgeR will often be installed to a
home directory path like: 

   ~/R/x86_64-pc-linux-gnu-library/3.0/edgeR/

This path will vary depending on the versions of Linux and R you're using.

It is also possible to install edgeR elsewhere on your system. If you do, 
it may be necessary to modify the edgeR_pipe.R code to point at that path.

Line 20 of the script currently reads:

   library(edgeR)

If your edgeR package is installed elsewhere, you may point to it by
changing the code to:

   library(edgeR,lib.loc="/path/to/your/R/lib")




MAP-RSeq data preparation
---------------------------------------

The GeneCount.csv file produced by MAP-RSeq can be found in the "Report/" 
output sub-directory. It contains a list of gene names, their chromosome, 
start and stop positions, length, gene counts, and RPKM values.

To perform differential expression analysis, this data must first be 
converted to the edgeR input format using the included script:

  GeneReport2EdgeR.pl

Perform the conversion by piping the GeneCount.csv file through this script:

   cat Report/GeneCount.csv | perl GeneReport2EdgeR.pl > Report/EdgeR.GeneCount.csv

The converted file will be located at:

   Report/EdgeR.GeneCount.csv

This file contains a header row followed by one row per gene:

   GeneID, SAMPLE1  SAMPLE2  SAMPLE3
   CCNF        100       10      200
   ...

Each row holds a gene ID followed by the gene count for each sample. The
"SAMPLE#" names are not fixed: they are drawn directly from the MAP-RSeq
run_info.txt file and will match whatever names you specified there.

To perform differential expression, the sample names must be replaced with
the appropriate group names, and the columns arranged so that samples in the
same group sit together. For instance, if SAMPLE1 and SAMPLE3 were in
group A and SAMPLE2 was in group B, you would need to re-arrange
EdgeR.GeneCount.csv as below:


   GeneID, GROUP_A  GROUP_A  GROUP_B
   CCNF        100      200       10
   ...

Here the first GROUP_A column was SAMPLE1, the second GROUP_A column was 
SAMPLE3, and the GROUP_B column was SAMPLE2. 

Only two groups are supported by this code. 
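
The re-arrangement described above can be done by hand in a spreadsheet or
text editor, but it can also be scripted. The following is a rough sketch
only: regroup_columns is a hypothetical helper, not part of MAP-RSeq, and it
assumes that EdgeR.GeneCount.csv is tab-delimited with the gene ID in the
first column. Check the delimiter of your own file before relying on it.

```shell
# regroup_columns FILE ORDER LABELS   (hypothetical helper)
#   FILE   - tab-delimited count file with a header row (assumed format)
#   ORDER  - comma-separated sample positions in the desired output order,
#            where 1 is the first sample column after the gene ID
#   LABELS - comma-separated group name for each output column
regroup_columns() {
    awk -F'\t' -v OFS='\t' -v order="$2" -v labels="$3" '
        BEGIN { n = split(order, idx, ","); split(labels, lab, ",") }
        {
            out = $1
            for (i = 1; i <= n; i++) {
                # idx[i] + 1 skips over the gene ID column.
                # On the header row, write the group label instead.
                out = out OFS (NR == 1 ? lab[i] : $(idx[i] + 1))
            }
            print out
        }' "$1"
}

# Example matching the text above (the output file name is illustrative):
#   regroup_columns Report/EdgeR.GeneCount.csv 1,3,2 \
#       GROUP_A,GROUP_A,GROUP_B > Report/EdgeR.Grouped.csv
```

With the three-sample example above, the order 1,3,2 places SAMPLE1 and
SAMPLE3 first as GROUP_A and SAMPLE2 last as GROUP_B.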



Executing the Analysis
---------------------------------------

Execute the differential expression analysis with:

   perl MAPSeq-DE.pl -f /path/to/EdgeR.GeneCount.csv \
      -c GROUP_A_NAME -nc numSamplesInGroupA \
      -t GROUP_B_NAME -nt numSamplesInGroupB
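
The -nc and -nt values must match the number of columns carrying each group
name in the header. As a hedged sketch, the counts could be derived from the
file itself rather than typed by hand. count_group_columns is a hypothetical
helper, not part of MAP-RSeq, and it assumes a tab-delimited header row.

```shell
# count_group_columns FILE GROUP   (hypothetical helper)
# Print how many header columns of FILE are named exactly GROUP.
# Assumes the header is the first line and is tab-delimited.
count_group_columns() {
    head -n 1 "$1" | tr '\t' '\n' | grep -c -x -- "$2"
}

# Example (paths and group names are placeholders):
#   nc=$(count_group_columns /path/to/EdgeR.GeneCount.csv GROUP_A)
#   nt=$(count_group_columns /path/to/EdgeR.GeneCount.csv GROUP_B)
#   perl MAPSeq-DE.pl -f /path/to/EdgeR.GeneCount.csv \
#      -c GROUP_A -nc "$nc" -t GROUP_B -nt "$nt"
```

A count of zero usually means the group name is misspelled or the header has
not yet been renamed.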


An example dataset is provided. You can execute a test run with:

   perl MAPSeq-DE.pl -f example.GeneCount.txt -c GROUPA -nc 3 -t GROUPB -nt 3


This will produce the following output:

   all.GROUPA.vs.GROUPB.csv

      Despite the .csv extension, this file contains tab-separated (TSV)
      data with the following fields:

      Gene - the gene for which the expression is measured
      logFC - log2 fold change between the two groups
      logCPM - log2 counts per million
      PValue - p-value computed using Cmn.Disp (common dispersion)
      FDR - false discovery rate computed using Cmn.Disp
      PValue - p-value computed using Tgw.Disp (tagwise dispersion)
      FDR - false discovery rate computed using Tgw.Disp
      Cmn.Disp - Common Dispersion
      Tgw.Disp - Tagwise Dispersion
      UpDown.Cmn - Up/Down regulation rate computed via common dispersion
      UpDown.Tgw - Up/Down regulation rate computed via tagwise dispersion
      GroupA - gene counts for the first sample in Group A
      GroupA.1 - gene counts for the second sample in Group A
      GroupB - gene counts for the first sample in Group B
      GroupB.1 - gene counts for the second sample in Group B
      ...

   Along with various plots of the data:

      Differential expression_all genes plot.png
      MA plot_all genes.png
      MA plot_top 500 genes.png
      MDS plot.png
      Mean Variance plot.png

