
CAP-miRNASeq Pipeline
A comprehensive analysis pipeline for deep microRNA sequencing

Introduction
-------------------------------------------------------------------------------

miRNAs play a key role in normal physiology and in various diseases such
as cancer. Hybridization-based microarray technology has been used
for miRNA profiling, but it is hindered by its narrow detection range,
greater susceptibility to technical variation, and inability to
characterize novel miRNAs and sequence variation. miRNA profiling through
next-generation sequencing overcomes those limitations and provides a
new avenue for biomarker discovery and clinical applications. However,
analyzing miRNA sequencing data is challenging: a significant amount of
computational resources and bioinformatics expertise is needed. Several
analytical tools have been developed over the past few years; however,
most of these tools are web-based and can only process one sample or a
pair of samples at a time, which is not suitable for a large-scale study
with tens or even hundreds of samples. Lack of flexibility and reliability
in the web services (outdated references, undisclosed parameters, server
downtime, and slow performance) is also a common issue. Although some tools
provide differential miRNA analysis, they are either limited to a pair of
samples or use a model that does not suit the study design. Moreover,
miRNA SNVs and mutations are becoming increasingly important, but none of
the tools provides SNV/mutation detection. Herein, we present a
comprehensive analysis pipeline for deep microRNA sequencing (CAP-miRSeq)
that integrates read preprocessing, alignment, mature/precursor/novel
miRNA quantification, variant detection in miRNA coding regions, and
flexible differential expression analysis between experimental conditions.
Depending on the available computational infrastructure, users can run
samples sequentially or in parallel for fast processing. In either case,
summary and expression reports for all samples are generated for easier
quality assessment and downstream analyses. Using well-characterized
data, we demonstrated the pipeline's superior performance, flexibility,
and practical use in research and biomarker discovery.

The package and full documentation can be downloaded from the CAP-miRSeq website: 

http://bioinformaticstools.mayo.edu/research/cap-mirseq/



Installation
-------------------------------------------------------------------------------

There are two install options for the CAP-miRNASeq pipeline: the Quickstart
Virtual Machine and an install to your local environment. The Quickstart
Virtual Machine contains all the software, references, and sample data
necessary to test the pipeline. It can also be configured to run your own
sample data, provided the host system has adequate storage and memory.  The
local install requires that you install, or locate already-installed versions
of, the software dependencies required by the pipeline. These are listed later
in this document.  The included scripts/check_install tool can help locate
these pre-requisites on your system and populate the tool_info.txt file
the pipeline uses to execute them.



The Quickstart Virtual Machine
-------------------------------------------------------------------------------

This requires either Oracle's VirtualBox software, available free for Windows,
Mac, and Linux at:

https://www.virtualbox.org/

or VMware Player, available free for Windows and Linux at:

https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0

We recommend VirtualBox.  Please see the complete instructions with screenshots in 
the CAP-miRNASeq_UserGuide.pdf available at:

http://bioinformaticstools.mayo.edu/research/cap-mirseq/



Performing a Local Install
-------------------------------------------------------------------------------

The CAP-miRNA pipeline relies on a large number of system packages and 
bioinformatics tools.  Please make certain that the system packages are all
installed before proceeding with the bioinformatics tool installs.  See the 
"System Packages" section below for details.

Next, there are several non-system package pre-requisites which must be installed
manually:

	GATK

	Java

	R and its Bioconductor packages

Details on where to download and how to install these are outlined in the 
"Other Pre-Requisites" section below.  We also include brief instructions on 
building each of the bioinformatics tool dependencies manually should you
need to do so.

Last, an install.sh script is provided that attempts to download and build most 
of the CAP-miRNA prerequisite tools (except those mentioned above).  See the 
"Install Script" section for details on using this to build the tools necessary 
to run the pipeline.

It's likely that your infrastructure already has many of these installed. If 
that's the case, you can point the entries in the 
sample_config/tool_info_hg19.txt file at your existing copies and run:

scripts/check_install -t sample_config/tool_info_hg19.txt 

to confirm that the configuration is complete. 

Below is a list of all required packages and tools. 


System Packages
-------------------------------------------------------------------------------

The CAP-miRNA pipeline relies on the following system packages. 

On an Ubuntu 13/Debian 7 based system, apt-get install the following:

		build-essential
		dos2unix
		gfortran
		libgd2-xpm-dev
		libice-dev
		libncurses5-dev
		libpdf-api2-perl
		libpdf-api2-simple-perl
		libreadline6-dev
		libxt-dev
		python2.7-dev
		python-dev
		python-matplotlib
		python-numpy
		ttf-mscorefonts-installer
		unzip
		zlib1g-dev

On a RedHat 6.x/CentOS 6.x based system, yum install the following:

		dos2unix
		g2-devel
		gcc
		gcc-gfortran
		gcc-c++
		gd-devel
		gd-progs
		libICE-devel
		libXt-devel
		make
		libpdf
		numpy
		python-devel
		python-matplotlib
		readline-devel
		zlib-devel
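
The two package lists above can be turned into a single install command per
platform. The snippet below is a minimal sketch, not part of the pipeline: the
function name install_command is made up for illustration, and it only prints
the command so you can review it before running it yourself.

```shell
# Package names copied from the two lists above.
DEB_PACKAGES="build-essential dos2unix gfortran libgd2-xpm-dev libice-dev libncurses5-dev libpdf-api2-perl libpdf-api2-simple-perl libreadline6-dev libxt-dev python2.7-dev python-dev python-matplotlib python-numpy ttf-mscorefonts-installer unzip zlib1g-dev"
RPM_PACKAGES="dos2unix g2-devel gcc gcc-gfortran gcc-c++ gd-devel gd-progs libICE-devel libXt-devel make libpdf numpy python-devel python-matplotlib readline-devel zlib-devel"

# Print (do not run) the install command for the given package manager.
install_command() {
    case "$1" in
        apt-get) echo "sudo apt-get install -y $DEB_PACKAGES" ;;
        yum)     echo "sudo yum install -y $RPM_PACKAGES" ;;
        *)       echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}

# Show the command for whichever package manager is present on this host.
if command -v apt-get >/dev/null 2>&1; then
    install_command apt-get
elif command -v yum >/dev/null 2>&1; then
    install_command yum
fi
```

Copy the printed line into your terminal once you are satisfied with it.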


Other Pre-requisites
-------------------------------------------------------------------------------

You must manually install GATK, Java, R, and the Bioconductor edgeR package.
	
	GATK
	http://www.broadinstitute.org/gatk/auth?package=GATK
	Download and unpack. Then place the GenomeAnalysisTK.jar in a 
	directory on your $PATH (for example /usr/local/bin/).


	Java 7
	http://www.oracle.com/technetwork/java/javase/downloads/
	If you choose to install via a .deb or .rpm, system paths should
	automatically be set correctly.  If you choose to install via the
	tar.gz file, you will need to update your $PATH to point at the
	JDK's bin directory and set the $JAVA_HOME environment variable.
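
For the tar.gz route, the environment setup might look like the sketch below.
The JDK location is hypothetical; replace it with the directory where you
actually unpacked the archive.

```shell
# Hypothetical location of the unpacked JDK; change this to your own path.
JAVA_HOME="$HOME/software/jdk1.7.0"
export JAVA_HOME

# Put the JDK's bin directory first so its java is found before any other.
PATH="$JAVA_HOME/bin:$PATH"
export PATH
```

Adding these lines to your shell startup file (for example ~/.profile) makes
the settings persist across logins.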


	R
	http://cran.wustl.edu/src/base/R-3/R-3.0.2.tar.gz
	This install has two parts: installing R itself and then 
	installing the Bioconductor packages.

	Install via:
		tar xvfz R-3.0.2.tar.gz
		cd R-3.0.2
		./configure
		make
		sudo make install

	It's possible to install R without sudo privileges. When running ./configure
	specify a prefix you have write access to. Be sure to add the R bin directory to
	your PATH. If you do this, you can omit the sudo in the R invocation below for
	the bioconductor install.
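
A non-sudo build along those lines could be sketched as follows. The prefix
directory and the function name build_r_local are assumptions made for this
example; --prefix is the standard option of R's configure script.

```shell
# Hypothetical install location; any directory you can write to will do.
R_PREFIX="$HOME/software/R-3.0.2"

# Build and install R under R_PREFIX without sudo.
# Call this from inside the unpacked R-3.0.2 source directory.
build_r_local() {
    ./configure --prefix="$R_PREFIX" &&
    make &&
    make install
}

# Make the locally installed R the first one found on PATH.
PATH="$R_PREFIX/bin:$PATH"
export PATH
```

After running build_r_local, start R with plain "R" (no sudo) for the
Bioconductor install.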

	Additional R packages are required for the CAP-miRNA pipeline.
	Install via:
		sudo R 
		This will launch the R interpreter. Then run:

		source("http://bioconductor.org/biocLite.R")
		biocLite()
		biocLite("edgeR")
		q()
		Answer "y" at the "Save workspace image?" prompt.



Install Script
-------------------------------------------------------------------------------

Once the system packages, GATK, Java, and R are installed, the remaining tools 
necessary to execute the CAP-miRNA pipeline can be installed via the included 
install.sh script. You can run it via:

./install.sh -p /path/to/software

where "/path/to/software" is the location where the script will install the
binaries. 

You must meet all the system package requirements listed above prior to running
install.sh or the builds will fail. 

You must have already installed Java, GATK, and R prior to running install.sh or 
the build check will fail.
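
A quick pre-flight check before running install.sh can save a failed build.
The sketch below is not part of the pipeline: missing_commands is a helper
written for this example, and the GATK jar location is assumed to be the
/usr/local/bin/ path suggested earlier (override it with GATK_JAR).

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# GATK ships as a jar rather than an executable, so test for the file itself.
GATK_JAR="${GATK_JAR:-/usr/local/bin/GenomeAnalysisTK.jar}"

missing="$(missing_commands java R)"
[ -f "$GATK_JAR" ] || missing="$missing GenomeAnalysisTK.jar"

if [ -n "$missing" ]; then
    echo "Install these before running install.sh:" $missing
else
    echo "Java, R, and GATK found; install.sh can be run."
fi
```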

If you need assistance debugging a failed build please feel free to send a 
transcript of the install.sh output to bockol.matthew@mayo.edu and we can help 
you to identify the problem.

Once the install completes building the software it will check each tool by
running: 

scripts/check_install -t sample_config/tool_info_hg19.txt 

This tests whether everything can execute properly. If modules are missing it
will provide a summary to guide your troubleshooting. Please see the 
check_install section below for more detail on this script.

The install.sh and check_install scripts update the paths in your

sample_config/tool_info_hg19.txt

file, so the pipeline can find all the necessary software and references.



Running the test samples
-------------------------------------------------------------------------------

Once the install is complete you can execute the pipeline against the included
test samples:

	scripts/CAP-miRseq.sh sample_config/run_info.txt

Example output will be created in the sample_output/ directory.


Bioinformatics Tools
-------------------------------------------------------------------------------

The workflow will need access to the bioinformatics tools listed below.  We 
have provided the versions and sources tested with the pipeline.  Other 
versions may work fine.  For many packages, installation is as simple as 
unpacking the archive and placing the files somewhere in your $PATH. For others, 
we have provided basic install instructions.  These assume you have sudo access 
on your host. Most of these tools can be installed as an unprivileged user by 
using a "--prefix=/path/to/install" or similar option in the setup or configure
scripts. Please see documentation in each package for details.

PLEASE NOTE: All of these packages can be automatically downloaded and installed 
via the install.sh script described above. If you executed that script 
successfully you may safely skip the installs below.

	These tools can be installed via install.sh:

		BedTools v2.17.0
		http://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz
		make
		sudo cp bin/* /usr/local/bin/ 


		CutAdapt 0.9.5
		http://cutadapt.googlecode.com/files/cutadapt-0.9.5.tar.gz
		Install via:
			python setup.py build
			sudo python setup.py install


		FastQC 0.10.1
		http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.10.1.zip
		Unpack and update your $PATH to point at the unpacked directory.


		HTSeq
		https://pypi.python.org/pypi/HTSeq
		Install via:
			python setup.py build
			sudo python setup.py install


		Samtools
		http://downloads.sourceforge.net/project/samtools/samtools/0.1.19/samtools-0.1.19.tar.bz2
		Install via:
			make
			sudo cp samtools /usr/local/bin

			
		Picard 1.77
		http://downloads.sourceforge.net/project/picard/picard-tools/1.77/picard-tools-1.77.zip
		Unpack and place the picard directory in /usr/local/
		Update your $PATH to point at the /usr/local/picard-1.77 directory.


		VCFTools v0.1.11
		http://downloads.sourceforge.net/project/vcftools/vcftools_0.1.11.tar.gz
		Install via:
			make
			sudo make install 


		miRDeep 2.0.0.5
		https://www.mdc-berlin.de/38350089/en/research/research_teams/systems_biology_of_gene_regulatory_elements/projects/miRDeep/mirdeep2_0_0_5.zip

		miRDeep's install.pl will attempt to install the following packages as well:
			bowtie
			Vienna
			Squid
			Randfold
			PDF-API2

			PLEASE NOTE:  In our experience the included mirdeep install script is broken. 
			The install.sh included with the CAP-miRNA pipeline replaces it, but you can download the
			components manually if you wish. 


		Bowtie v0.12.7
		http://downloads.sourceforge.net/project/bowtie-bio/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip


		Vienna 1.8.5
		http://www.tbi.univie.ac.at/~ivo/RNA/ViennaRNA-1.8.5.tar.gz

		Install via:
			./configure
			make
			sudo make install

			PLEASE NOTE:
			Building the RNAforester component of Vienna can fail on some 
			systems.  We have worked around it by upgrading the embedded G2 
			graphical library package included with the source. The original 
			source code includes:

			ViennaRNA-1.8.5/RNAforester/g2-0.70 

			You can download the 0.72 release here:

			http://downloads.sourceforge.net/project/g2/g2/g2-0.72/g2-0.72.tar.gz

			Because Vienna's configure and Make files expect the g2-0.70 
			version, it's simplest to erase that version and rename the 
			unpacked 0.72 version so it appears in the same place, for example:

			rm -r ViennaRNA-1.8.5/RNAforester/g2-0.70 
			tar xvfz g2-0.72.tar.gz
			mv g2-0.72 ViennaRNA-1.8.5/RNAforester/g2-0.70 

			Then you should be able to run the ./configure, make, sudo make 
			install without an error.


		Squid 1.9g
		ftp://selab.janelia.org/pub/software/squid/squid-1.9g.tar.gz
		
		Install via:
			./configure
			make
			sudo make install


		RandFold v2.0
		http://bioinformatics.psb.ugent.be/supplementary_data/erbon/nov2003/downloads/randfold-2.0.tar.gz
		
		Install via:
			make
			sudo cp randfold /usr/local/bin/


		PDF-API2 perl module
		http://search.cpan.org/CPAN/authors/id/A/AR/AREIBENS/PDF-API2-0.73.tar.gz
	
		Install via:
			perl Makefile.PL
			make
			sudo make install


Tool Configuration
-------------------------------------------------------------------------------

Once all the tools are installed, you will need to let the pipeline know
where to find them. This is done via the tool_info.txt file (referred to 
below as the toolinfo file).  If you've successfully followed the install 
steps above then the install.sh and check_install scripts have already 
configured your toolinfo file.

If you're attempting to configure the toolinfo file against pre-existing tools
on your system, you can use the check_install script to test whether they are
being found properly.

Update the toolinfo file with the paths to your binaries, then run:

scripts/check_install -t /path/to/your/toolinfo.txt

The script will check each tool, prompting you for an updated path if it cannot
find the correct one. See the check_install script section below for more
details.



The Toolinfo File
-------------------------------------------------------------------------------

In the sample_config directory you will find an example tool_info_hg19.txt
file. It is divided into four sections:

	# Tool Paths

	# Tool Parameters
	
	# Reference Files

	# Memory Parameters

An explanation for each entry is available in the full documentation on the 
CAP-miRNA website:

http://bioinformaticstools.mayo.edu/research/cap-mirseq/

Update each field to match your installation. You can do this manually or 
use the check_install script to update the file.



The check_install script
-------------------------------------------------------------------------------

The check_install script will read in an existing tool_info.txt file and check
each entry. You can invoke it with:

scripts/check_install --toolinfo /path/to/your/tool_info.txt 

As an example:

user@host:~/CAP-miRNA$ ./scripts/check_install --toolinfo sample_config/tool_info_hg19.txt 
Checking for Bedtools -- Failed for the following reasons:
	Path "bedtools" does not exist
Try Again (y/n): y
Bedtools
Please specify a path (bedtools): /usr/local/bedtools/bedtools
Checking for Bedtools -- OK

Providing the full path to the bedtools binary satisfies the requirement, and the
tool_info_hg19.txt file will be updated with the following entry:

BEDTOOLS_PATH=/usr/local/bedtools

The new tool_info file is only saved once the script is complete.
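
To confirm what was written, you can read a single entry back from the file.
The helper below is a small sketch that assumes every entry is a plain
KEY=VALUE line, as in the BEDTOOLS_PATH example above. The function name
tool_info_value is made up for illustration.

```shell
# Print the value assigned to KEY in a KEY=VALUE file such as tool_info.txt.
# Usage: tool_info_value KEY FILE
tool_info_value() {
    # Select only lines that start with "KEY=" and strip that prefix.
    sed -n "s/^$1=//p" "$2"
}

# Example call (commented out; run it from the pipeline directory):
# tool_info_value BEDTOOLS_PATH sample_config/tool_info_hg19.txt
```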




Running Samples
-------------------------------------------------------------------------------

Once all of the tool_info fields have been correctly populated the pipeline should be
ready to use. 

Sample data and references are provided.  Update the following fields in 
sample_config/run_info.txt so that each points at the corresponding path on 
your system:

INPUT_DIR=/home/mayo/sample_data/
OUTPUT_DIR=/home/mayo/sample_output
TOOL_INFO=/home/mayo/sample_config/tool_info_hg19.txt
SAMPLE_INFO=/home/mayo/sample_config/sample_info.txt

Please be sure that the TOOL_INFO path is the one with all the values set 
through the process above.
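
If you prefer not to edit run_info.txt by hand, the fields can be rewritten
from the shell. This is a minimal sketch assuming plain KEY=VALUE lines. The
set_run_info helper and the example path are hypothetical, and values that
contain "|" or "&" would need escaping.

```shell
# Replace the value of KEY in a KEY=VALUE file, keeping all other lines.
# Usage: set_run_info KEY VALUE FILE
set_run_info() {
    key=$1 value=$2 file=$3
    tmp="$file.tmp"
    # "|" is used as the sed delimiter because the values are paths with "/".
    sed "s|^$key=.*|$key=$value|" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example call (commented out; the path is a placeholder):
# set_run_info OUTPUT_DIR "$HOME/CAP-miRNA/sample_output" sample_config/run_info.txt
```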

Execute the workflow with:

scripts/CAP-miRseq.sh sample_config/run_info.txt

Assuming no errors, the results will be written to the OUTPUT_DIR. Begin by examining the 
Main_Document.html file you should find there.
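
A simple way to confirm that a run produced its report is to test for the
file. The find_report function below is a sketch written for this example and
is not shipped with the pipeline.

```shell
# Print the path of the summary report inside an output directory, or
# return non-zero with a message if the report is not there.
# Usage: find_report OUTPUT_DIR
find_report() {
    report="${1%/}/Main_Document.html"
    if [ -f "$report" ]; then
        echo "$report"
    else
        echo "No Main_Document.html in $1; check the pipeline output for errors" >&2
        return 1
    fi
}

# Example call (commented out), using the OUTPUT_DIR from run_info.txt:
# find_report /home/mayo/sample_output
```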


The next step is to configure the pipeline for your own samples.  

Please see the full documentation for doing this in the pipeline user guide,
available on the CAP-miRNA website:

http://bioinformaticstools.mayo.edu/research/cap-mirseq/

