
This README gives an overview of the GeneSetScan program and describes 
the contents of each subdirectory.

GeneSetScan is a method for scanning gene sets for association in GWAS, 
as described in this paper:
 
Schaid DJ, Sinnwell JP, Jenkins GD, McDonnell SK, Ingle JN, Kubo M, Goss PE, 
Costantino JP, Wickerham DL, Weinshilboum RM. "Using the gene ontology to 
scan multilevel gene sets for associations in genome wide association studies."
Genet Epidemiol. 2011 Dec 7



MEMORY AND DISK SPACE USAGE:
----------------------------------------------------
These are the two main issues to consider before running this program.
First, we define "memory" as system memory needed while the program is 
running from start to finish, and we define "disk space" as the hard-disk 
space needed to store any file on your system.
 
We have tested GeneSetScan on several datasets, each genotyped on ~500K SNP 
chips and containing 1000 to 2500 subjects. In those studies, the large 
SNPScore.txt file required 8GB to 13GB of disk space, and the program needed 
4GB to 14GB of memory, depending on the size of the gene sets. Because a 
program compiled for a 32-bit machine cannot address more than 4GB of memory, 
GeneSetScan should be run only on 64-bit machines. We also recommend that the 
machine have 8GB or more of memory available at run time.
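
As a rough sketch (not part of the distributed package), the two requirements
above can be checked before a run. The helper names and the MIN_GB threshold
are hypothetical, /proc/meminfo is assumed to exist (Linux only), and MemTotal
reports installed memory rather than the memory that is free at run time.

```shell
#!/bin/sh
# Pre-flight sketch: is this a 64-bit machine with roughly 8GB of memory?
# All names below are illustrative; none of them come from GeneSetScan.

# is_64bit ARCH: succeed if ARCH (as printed by `uname -m`) is a 64-bit platform.
is_64bit() {
    case "$1" in
        x86_64|amd64|aarch64|arm64|ppc64|ppc64le|s390x) return 0 ;;
        *) return 1 ;;
    esac
}

# kb_to_gb KB: convert kilobytes to gigabytes, rounded to the nearest integer.
kb_to_gb() {
    echo $(( ($1 + 524288) / 1048576 ))
}

# mem_total_kb FILE: print the MemTotal value (in kB) from a meminfo-style file.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' "$1"
}

MIN_GB=8   # assumption: the 8GB suggestion from the text above
arch=$(uname -m)
if is_64bit "$arch"; then
    echo "architecture $arch: 64-bit"
else
    echo "architecture $arch: not recognized as 64-bit"
fi

if [ -r /proc/meminfo ]; then
    total_gb=$(kb_to_gb "$(mem_total_kb /proc/meminfo)")
    if [ "$total_gb" -ge "$MIN_GB" ]; then
        echo "installed memory about ${total_gb}GB: meets the ${MIN_GB}GB suggestion"
    else
        echo "installed memory about ${total_gb}GB: below the ${MIN_GB}GB suggestion"
    fi
fi
```

Large gene sets may still need more than the suggested amount, since the
memory use we observed went up to 14GB.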


RUN-TIME:
---------------------------------------------------
The run-time for GeneSetScan ranges from 1 to 3 hours on our system with 
the default of 1000 simulations.  The program reuses memory after every 
simulation, so running more simulations does not require more memory. However, 
the run-time increases linearly with the number of simulations.
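
To illustrate the linear scaling, here is a small back-of-the-envelope sketch.
It assumes run-time is simply proportional to the number of simulations, which
ignores any fixed start-up cost such as reading the input files, so it is only
an approximation. The function name is hypothetical.

```shell
#!/bin/sh
# scale_minutes BASE_MINUTES BASE_SIMS NEW_SIMS
# Scale a measured run-time to a different number of simulations,
# assuming time is proportional to the simulation count.
scale_minutes() {
    echo $(( $1 * $3 / $2 ))
}

# Our 1-3 hour range (60-180 minutes) at 1000 simulations, scaled to 5000.
echo "5000 simulations: roughly $(scale_minutes 60 1000 5000) to $(scale_minutes 180 1000 5000) minutes"
# prints "5000 simulations: roughly 300 to 900 minutes"
```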


SUBDIRECTORIES AND WHAT THEY CONTAIN:
--------------------------------------

--------------------------------------
bin
--------------------------------------
Contains the GeneSetScan executable, version 0.01-beta, compiled with 
gcc (version 4.1.2) for an x86_64 (64-bit) CentOS (RedHat) 
Linux distribution.


----------------------------------
doc
----------------------------------
Includes the LICENSE file, which describes the license under which we 
distribute this software. We also provide a user manual in this directory.


---------------------------------
data
---------------------------------
Data files needed to run GeneSetScan.  All sources for data were downloaded 
in March 2011, and some details for how the data were created are given in
DATA-LIST. Before you can use this package, you need to make sure all the 
data files are unzipped; we provide instructions in INSTALL.
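
A quick way to see whether any data files are still compressed is sketched
below. The file extensions and the DATA_DIR variable are assumptions made for
illustration; INSTALL remains the authoritative description of how to unpack
the data.

```shell
#!/bin/sh
# list_compressed DIR: print files under DIR whose names suggest they are
# still compressed. The extensions checked here are a guess, not a list
# taken from the package.
list_compressed() {
    find "$1" -type f \( -name '*.gz' -o -name '*.zip' -o -name '*.bz2' \) | sort
}

DATA_DIR=${DATA_DIR:-data}   # hypothetical: path to the data directory
if [ -d "$DATA_DIR" ]; then
    leftover=$(list_compressed "$DATA_DIR")
    if [ -n "$leftover" ]; then
        echo "These files still look compressed:"
        echo "$leftover"
    else
        echo "No compressed files found under $DATA_DIR"
    fi
fi
```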


---------------------------------
example
---------------------------------
We provide a script, rungss.sh, to show how to call GeneSetScan with the 
parameter files provided in the same directory. The parameter files are fully 
operational from within the /example directory, with correct paths to all data 
sources that we provide. The one exception is the SNPScore.txt file, which is 
not included and must be created first (see the SNPScores directory).
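
Because the SNP score file is missing from the package, a small guard like the
following sketch can be run before rungss.sh. The SNPSCORE variable, its
default value, and the function name are hypothetical; the actual location of
the file is whatever the parameter files specify.

```shell
#!/bin/sh
# snpscore_ready FILE: succeed only if FILE exists and is not empty.
snpscore_ready() {
    [ -s "$1" ]
}

SNPSCORE=${SNPSCORE:-SNPScore.txt}   # hypothetical default path
if snpscore_ready "$SNPSCORE"; then
    echo "$SNPSCORE found; ready to run: sh rungss.sh"
else
    echo "$SNPSCORE is missing or empty; create it before running rungss.sh"
fi
```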

The parameter files are set up to do gene and gene-set scoring that is used
in the paper, and that we currently recommend. Those settings are:
gene-level: sqrtmean
gene-moments: nor
gene-set-level: wtdmean (where applicable)
  
The par files for the 3 GO Namespaces, bp (Biological Process), 
cc (Cellular Component), mf (Molecular Function), are all similar, 
with the only difference being the go_type. We also provide an example
of how to do the complete GO structure with go_type set to "all". 

The par file for gene tests has a method of GENE, and does not require a 
gene-set-level scoring method, a go_edges file, or a go_type.  

The par file for KEGG pathways has a method of KEGG, has a larger max_size to 
allow for large pathways (though there are far fewer gene sets than in GO),
and has kegg_gene specified in place of go_edges and go_gene.


-------------------------------------
SNPScores
-------------------------------------
README.snpscores describes the steps we suggest for creating the SNP scores for 
each subject at all markers.  We provide some sample shell and R scripts for
creating the SNP Scores one chromosome at a time from plink binary (bed) files 
and adjusting by covariates that are in a text file.

Some of the steps are generalized, with incomplete file paths and a non-specific 
call to plink, so we suggest editing these scripts to work on your system.
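
For orientation only, the chromosome-by-chromosome pattern might look like the
dry-run sketch below, which prints one plink command per autosome instead of
running it. This is not the provided script: the file prefix, the output
directory, the function names, and the choice of plink options (in particular
"--recode A", the plink 1.9 spelling for exporting additive allele counts) are
all assumptions, and the covariate adjustment done in R is not shown.

```shell
#!/bin/sh
# plink_cmd BFILE_PREFIX CHR OUTDIR
# Build (but do not run) a plink command that extracts one chromosome
# from a binary fileset. Options are illustrative assumptions.
plink_cmd() {
    echo "plink --bfile $1 --chr $2 --recode A --out $3/chr$2"
}

# all_chr_cmds BFILE_PREFIX OUTDIR: print one command per autosome (1-22).
all_chr_cmds() {
    chr=1
    while [ "$chr" -le 22 ]; do
        plink_cmd "$1" "$chr" "$2"
        chr=$(( chr + 1 ))
    done
}

# Dry run with hypothetical names; review the output before executing anything.
all_chr_cmds mystudy snpscores_by_chr
```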
