HiChIP Pipeline

HiChIP: A high-throughput pipeline for integrative analysis of ChIP-Seq data

The HiChIP pipeline is designed for comprehensive analysis of chromatin immunoprecipitation and sequencing (ChIP-Seq) data. It can be used to analyze profiles of transcription factor binding, histone modifications, histone variants, and chromatin regulators. Paired-end and single-end next-generation sequencing data from ChIP experiments with different antibodies, with or without biological replicates, can be analyzed accordingly. It has been tested on human and mouse ChIP-Seq datasets, and should be applicable to other organisms provided the reference files are available.

It contains five major steps:

  1. Read quality check using FastQC
  2. Read mapping via BWA, processing of mapped reads and library quality assessment
  3. Peak calling and consistency analysis between replicates
  4. Data visualization and summary report
  5. Downstream analysis

For single-end mapping, the output BAM file is processed to retain either uniquely mapped reads only, or uniquely mapped reads plus a single random hit for each multi-mapped read. For paired-end mapping, three options have been implemented: extract pairs with both ends uniquely mapped; pairs with at least one of the two ends uniquely mapped; or pairs with at least one of the two ends uniquely mapped, plus a random match from properly mapped pairs with multiple matches. Optionally, the retained reads are further filtered to remove reads with low mapping quality using Samtools and duplicates using Picard.
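The retention options above can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function names, the mode names ("both_unique", "one_unique", "one_unique_or_random") and the input representation (a count or list of alignment hits per read) are all hypothetical, and the third paired-end mode reflects one reading of the description, namely that a properly paired fragment whose ends both map to multiple places contributes one random match.

```python
import random


def select_single_end_hits(hits_by_read, keep_random_multi=False, seed=0):
    """Pick one alignment per read (hypothetical helper, not pipeline code).

    hits_by_read maps a read name to the list of its alignment hits.
    Uniquely mapped reads are always kept; multi-mapped reads contribute
    one randomly chosen hit only when keep_random_multi is True.
    """
    rng = random.Random(seed)  # fixed seed so the random choice is repeatable
    kept = {}
    for name, hits in hits_by_read.items():
        if len(hits) == 1:
            kept[name] = hits[0]
        elif len(hits) > 1 and keep_random_multi:
            kept[name] = rng.choice(hits)
    return kept


def keep_pair(end1_hits, end2_hits, mode, proper_pair=True):
    """Decide whether a read pair is retained under an assumed mode name.

    end1_hits and end2_hits are the numbers of alignment hits for each end.
    """
    unique1 = end1_hits == 1
    unique2 = end2_hits == 1
    if mode == "both_unique":
        return unique1 and unique2
    if mode == "one_unique":
        return unique1 or unique2
    if mode == "one_unique_or_random":
        # Assumption: a properly paired, multi-mapped pair yields one random match.
        multi_pair = proper_pair and end1_hits > 1 and end2_hits > 1
        return unique1 or unique2 or multi_pair
    raise ValueError(f"unknown mode: {mode}")
```

In practice the number of hits per read would have to be derived from the aligner output; how the pipeline does this is not stated here.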

Library complexity is calculated from the BWA output BAM file as the ratio of the number of duplicate-filtered reads to the total number of uniquely mapped reads. This ratio indicates the level of genomic coverage at the sequencing depth of the library.
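As a worked illustration of this ratio, the following Python sketch computes library complexity from two read counts. The function name and the example counts are hypothetical; the pipeline derives the counts from the BAM file.

```python
def library_complexity(n_dedup_reads, n_unique_reads):
    """Ratio of duplicate-filtered reads to uniquely mapped reads."""
    if n_unique_reads <= 0:
        raise ValueError("n_unique_reads must be positive")
    return n_dedup_reads / n_unique_reads


# Made-up example: 10 million uniquely mapped reads, of which
# 8 million remain after duplicate removal.
complexity = library_complexity(8_000_000, 10_000_000)
print(f"library complexity: {complexity:.2f}")
```

A value close to 1 means few duplicates were removed, while a low value means a large share of the uniquely mapped reads were duplicates.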

Two peak finders, MACS and SICER, have been implemented to identify binding sites from the filtered reads described above. MACS is one of the most popular peak-finding packages and is designed primarily for analyzing punctate binding events, whereas SICER was developed specifically for scoring broad binding events. Using an in-house tool, each identified peak is assigned to nearby genes whose transcription start site or end site lies within a maximal distance of the peak center (default: 10 kb).
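The peak-to-gene assignment rule can be sketched in Python as follows. The in-house tool itself is not shown in this document, so the data layout (tuples for peaks, dictionaries for genes) and the function name are assumptions made for illustration; only the distance rule and the 10 kb default come from the description above.

```python
def assign_peak_to_genes(peak, genes, max_dist=10_000):
    """Return names of genes whose TSS or end site is near the peak center.

    peak  : (chrom, start, end)
    genes : iterable of dicts with keys "name", "chrom", "tss", "tes"
            (hypothetical layout; "tes" is the transcription end site)
    """
    chrom, start, end = peak
    center = (start + end) // 2
    assigned = []
    for gene in genes:
        if gene["chrom"] != chrom:
            continue
        near_tss = abs(center - gene["tss"]) <= max_dist
        near_tes = abs(center - gene["tes"]) <= max_dist
        if near_tss or near_tes:
            assigned.append(gene["name"])
    return assigned


genes = [
    {"name": "A", "chrom": "chr1", "tss": 5_000, "tes": 20_000},
    {"name": "B", "chrom": "chr1", "tss": 50_000, "tes": 60_000},
    {"name": "C", "chrom": "chr2", "tss": 1_000, "tes": 3_000},
]
print(assign_peak_to_genes(("chr1", 1_000, 1_200), genes))
```

This linear scan is only meant to make the rule concrete; for genome-scale gene lists an interval index or a sorted lookup would be the more usual choice.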

This pipeline also uses the IDR (irreproducible discovery rate) package to perform consistency analysis between biological replicates, by which a set of highly reliable peaks can be identified for downstream analysis.

For data visualization, we use Bedtools in combination with in-house tools to generate a raw tag density profile at a user-defined step size (20 bp by default) and fragment size (200 bp by default). The raw tag density is normalized to the number of tags per million mapped reads, and the resulting BedGraph, Wig and TDF files can be uploaded to a genome browser for visualization.
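The normalization step can be illustrated with a small Python sketch for a single chromosome. It does not reproduce the Bedtools-based implementation: the function name, the read representation (5' position and strand), and the way reads are extended to the fragment size are assumptions. Only the default step size, the default fragment size, and the tags-per-million scaling come from the description above.

```python
from collections import Counter


def tag_density(reads, total_mapped, step=20, fragment=200):
    """Normalized tag density per bin for one chromosome (illustrative only).

    reads        : iterable of (five_prime_pos, strand), 0-based positions
    total_mapped : total number of mapped reads in the library
    Returns {bin_start: tags per million mapped reads}.
    """
    counts = Counter()
    for pos, strand in reads:
        # Extend each read to the assumed fragment size in its 3' direction.
        if strand == "+":
            start, end = pos, pos + fragment
        else:
            start, end = pos - fragment + 1, pos + 1
        start = max(start, 0)
        # Count the fragment once in every bin it overlaps.
        for b in range(start // step, (end - 1) // step + 1):
            counts[b * step] += 1
    scale = 1_000_000 / total_mapped
    return {bin_start: n * scale for bin_start, n in sorted(counts.items())}


density = tag_density([(0, "+"), (199, "-")], total_mapped=2_000_000)
```

Each key is the start coordinate of a bin, so the result maps directly onto BedGraph-style rows of the form (chromosome, bin start, bin start + step, value).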

We also include modules for de novo motif discovery using MEME, binding profiling over key genomic features using CEAS, and GO enrichment analysis using an in-house tool.

Pipeline Process

The workflow can be configured to run on a single Linux machine as well as in a cluster environment to fully leverage multiple processors.

Page last modified: March 9, 2015