XXX This documentation is obsolete and refers to the first SNPPicker beta code.

This package contains a program, CommonSnps, that takes as input multiple tag-SNP files with bins pre-computed.
CommonSnps outputs a minimal set of tagging SNPs that can tag multiple populations.
It was developed under Java 1.4.2 but may work with 1.3.

java -jar CommonSnp [-s score_file] [-o outputfile] [-e errorfile] -i file -p popname -i file -p popname

IO parameters
[-o outputfile]		file where program output is sent (default: stdout stream)
[-e errorfile]		file where stderr is sent (default: stderr stream)

parameter files
[-s scorefile]		contains the SNP id and a designability flag (0, 0.5, 1). If a SNP is not in this file, the default score is assumed.
[-S default_score]	default score used when a SNP is not in the scorefile.
[-D default_design]	default design score.
[-V default_validation]	default validation class (a string).
[-c configfile]		config file listing the rules for picking multiple SNPs per bin and for assigning probabilities to scores (default is Illumina).
			The config file can also be used to change the format to parse and to change the default probabilities.

[-o obligate_include]	file containing SNP ids of SNPs that must be included.
[-x obligate_exclude]	file containing SNP ids of SNPs that must never be picked. A SNP cannot be both obligate include and obligate exclude.

Each population is specified by a pair of parameters:
-i file 		ld-select or tagzilla input-format file
-p popname		string with the population name
[-l filename]	file containing the pairwise r^2 values, used with -optimal

Options 
-cpu nnnnn	[maximum CPU time per cluster, in seconds; default is 600 = 10 minutes]
-X nn		[minimum number of base pairs between SNPs (e.g. 60), unless they can be assigned to multiple OPAs]
-O nn		[number of different assay batches (OPAs for Illumina) across which overlapping SNPs can be distributed; 0 means do not care]
-optimal	[if this flag is set, the algorithm tries to find a truly optimal solution by allowing bin break-up]
-success	[if this flag is set, greedily select only the SNPs with the best probability of success; population overlap is incidental]
-U nn		[if non-zero, prevent selection of tag SNPs that are closer than nn bases; uses scorefile information to get SNP positions.
		 // U stands for "Under", e.g. SNPs under the probe.
		 // Minimum distance between SNPs (default 0);
		 // e.g. if a SNP has neighbors that are too close, ignore it.
		 // The scorefile should contain only SNPs relevant to the population of interest, but
		 // it should include SNPs with freq > 0 (and thus above the user cut-off) if they need to be
		 // considered for the snpUnderProbeDistance > 0 too-close test.]
-U1 nn		[require that at least one side have nn bases between this SNP and the next]

To understand how this code works, one has to consider that a bin of SNPs in one population is not spatially contiguous; nevertheless,
we can think of it as a node in a graph. SNPs common to more than one bin form an edge in that graph; an edge can span more than two bins.
The normal operation of this code is first to cluster (using single-linkage clustering) all bins that are linked by an edge.
Bins with no link allow the selection of a single optimally scoring solution (if there are two equal solutions, one is selected at random).
Pairs of bins with only one edge between them (an edge can carry multiple SNPs) are also easy to pick.
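The bins-as-nodes, shared-SNPs-as-edges model above can be sketched with single-linkage clustering via union-find. This is a minimal illustration, not the shipped Java code; the data layout (a dict of bin id to SNP-id set) is an assumption.

```python
def cluster_bins(bins):
    """bins: dict bin_id -> set of SNP ids. Returns a list of clusters (sets of bin ids)."""
    if not bins:
        return []
    parent = {b: b for b in bins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # any SNP shared by two bins is an edge linking those bins
    snp_to_bin = {}
    for b, snps in bins.items():
        for s in snps:
            if s in snp_to_bin:
                union(b, snp_to_bin[s])
            else:
                snp_to_bin[s] = b

    clusters = {}
    for b in bins:
        clusters.setdefault(find(b), set()).add(b)
    return list(clusters.values())
```

Bins A and B share SNP s2 and end up in one cluster; bin C, with no shared SNP, stays alone.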

The algorithm is a greedy solution with refinement:

we find a decent solution, then explore all possibilities up to the CPU-time budget per cluster, keeping the overall
best solution found so far.

The idea is to first generate an initial solution that is optimal in some sense,
	e.g. greedy minimum number of tagging SNPs, or greedy optimal.
	
   Sort the SNPs by genotyping success probability and by the number of bins they touch.

     Initially we go through each cluster and
      successively pick the most likely SNP that touches the most bins to yield a successful genotyping assay.
     Once a bin has Nsb SNPs, this breaks up the cluster, and
      we can then focus on the subclusters.

Then we use this solution as the seed and try to replace i SNPs in the solution with n SNPs not in the solution.
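The seed-solution steps above can be sketched as follows. Field names and the exact tie-breaking order are assumptions, not the program's actual data model.

```python
def greedy_seed(snps, bins_touched, p_success, quota):
    """Greedy initial pick: best success probability and widest bin coverage first.

    snps: list of SNP ids
    bins_touched: snp -> set of bin ids it tags
    p_success: snp -> genotyping success probability
    quota: bin -> wanted number of tag SNPs (Nsb)
    """
    filled = {b: 0 for b in quota}
    picked = set()
    # rank by success probability, then by how many bins the SNP touches
    order = sorted(snps,
                   key=lambda s: (p_success[s], len(bins_touched[s])),
                   reverse=True)
    for s in order:
        # take the SNP only if it still contributes to an unsaturated bin
        if any(filled[b] < quota[b] for b in bins_touched[s]):
            picked.add(s)
            for b in bins_touched[s]:
                filled[b] += 1
    return picked
```

With two bins of quota 1, a SNP touching both bins is taken first and saturates them, so the single-bin SNPs are skipped.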



# Looking back at a greedy solution (or the most-reliable solution)
  we have
    snps_in_solution = snps_picked_in_greedy_solution + obligate_snps + snps_fixed_by_too_close
    snps_not_in_solution = snps_not_picked + snps_excluded_by_input + snps_excluded_by_too_close

    For each SNP, store the pattern of bins it touches in each population, as well
    as the uniqueness of that SNP for each bin.

    Use this info to split SNPs into
  @unique_selected_snps + obligate selected SNPs
  @redundant_selected_snps


  "overlap_redundant" = for each redundant selected SNP, find the non-selected SNPs that overlap it in bin coverage
  

    # this hash keeps track of which discarded SNPs are "singly connected"; each such
    # SNP overlaps with only one redundant selected SNP

    my %is_snp_singly_connected;


    $discarded_snp_info{$discarded_snp}->{overlap} = \@overlapping_selected_snps;
    $discarded_snp_info{$discarded_snp}->{other}   = \@other_selected_snps;  # non-overlapping selected SNPs


    # cluster discarded SNPs based on shared overlapping selected SNPs
    # (don't cluster singly connected SNPs: they have only one alternative, and if the POP pattern
    #         is the same, they would already have been selected)

    

 
    Create binary "instructions" (subset choices) for all possible subsets of SNPs that are already picked
       and in the cluster,
     and for all possible subsets that are not already picked and in the cluster.
 
 
    Loop over clusters
      Find SNPs in the solution that are not to be replaced:
         non_replaced_snps = (intersection of "other" SNPs) over any combination.

      Loop over not-in-solution patterns
         Loop over in-solution-and-overlapping patterns
           Quick checks to see if the candidate can be eliminated based on the number of SNPs.
           Create subsets:
              replacement_snps = formerly discarded SNPs that would replace in-solution SNPs
              not_combined = formerly discarded SNPs in the cluster but not selected
              replaced_snps = SNPs formerly in the solution
              snps_in_new_solution = replacement_snps + non_replaced_snps + not_combined

           If this is a better solution, replace the old one.
           If this is an equivalent solution, save it.
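For a single small cluster, the subset-replacement loop above might look like this brute-force sketch. It is illustrative only: the real code adds the quick size checks and the CPU-time cutoff, and the scoring here is a stand-in.

```python
from itertools import combinations

def refine(selected, unselected, covers, score, quota):
    """Try replacing subsets of selected SNPs with smaller subsets of unselected SNPs.

    covers: snp -> set of bin ids; score: snp -> float; quota: bin -> Nsb.
    Keeps a swap only if every bin's quota stays covered.
    """
    def coverage(snps):
        c = {b: 0 for b in quota}
        for s in snps:
            for b in covers[s]:
                c[b] += 1
        return c

    best, best_score = set(selected), sum(score[s] for s in selected)
    for r_out in range(1, len(selected) + 1):
        for out in combinations(selected, r_out):
            for r_in in range(0, r_out):           # strictly shrink the solution
                for inc in combinations(unselected, r_in):
                    cand = (set(selected) - set(out)) | set(inc)
                    cov = coverage(cand)
                    if all(cov[b] >= quota[b] for b in quota):
                        s = sum(score[x] for x in cand)
                        # fewer SNPs wins; ties broken by higher score
                        if (len(cand), -s) < (len(best), -best_score):
                            best, best_score = cand, s
    return best
```

Here SNP c covers both bins, so the two single-bin SNPs a and b are replaced by c alone.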



Introducing the constraints of obligate and excluded SNPs does not change the algorithm.
Introducing the constraint of picking multiple SNPs per bin for large bins (see -c configfile)
does not change the algorithm much, except that it changes the weight of edges.

Introducing the constraint that SNPs be no closer than a certain distance requires us to pre-process the SNPs.
However, if those SNPs are considered first, the lower part of the dynamic search can proceed with the heuristic.

Optimality of the Solution:
By respecting the bins' constitution, no further optimization is possible using a local algorithm. However, if we allow a bin to
be tagged by N tags (instead of 1), it is possible to reduce the total number of SNPs. Consider a
triumvirate of overlapping bins A, B, C. Bin A is in the "center" and overlaps the two other bins B and C
in their SNP composition, but NO tag SNP of bins B or C overlaps a SNP that is also a tag SNP for A.
The initial solution would require 3 tag SNPs. Suppose we can find a di-tag (two SNPs that together tag the whole bin) for
A, where the two SNPs in the di-tag are not tag SNPs in their own right, but are tag SNPs in bins B and C.
In that case only 2 SNPs are necessary to tag these 3 bins: the di-tag tags bin A, one SNP of the di-tag tags bin B, and the other SNP tags bin C.
The same situation can happen with an N-tag and N bins.
A 3-umvirate and a 4-umvirate both lower the SNP count by 1. For N>2, the N-tag pattern has an
additional twist: we can also have patterns where A in population 1 overlaps with B in
population 2, C in population 2, and D in population 3, where C and D share some SNPs.


It is also possible that bin C is the center of another triumvirate with C, A, and D. The general solution
is then to find all N-tag sets and cluster them using single linkage. Again do a dynamic-programming
cluster break-up search over this N-tag conversion, and then do the optimal solution search as
before. The dynamic break-up and exploration of solutions also takes care of the multi-population twist
(multiple bins in different populations overlapping with the N-tag).
Once an N-tag solution is explored, the SNPs that are part of that N-tag become obligate.

Goals:
------
Find a pretty good solution fast, if any solution exists. To that end, dynamically explore all solutions
where there might be conflicts. If there are conflicts, handle them reasonably well, excluding one bin if need
be. There is no guarantee that the bin we chose to exclude to resolve the conflict was the absolute best choice, but the algorithm
is designed so that in most cases it is the least costly bin to exclude.

There are 3 types of users:
	Users with too much science to do and not enough money (most people).
	This type of user can run only a limited number of SNPs and has too many SNPs to run.
	For them, the choice is between complete power for one area versus coverage (more bins = more coverage),
	where complete power means minimizing the number of SNPs that fail.

	Users with a fixed amount of genotyping science to do who are interested in savings.
	This type wants complete power with the minimum number of SNPs. Eliminating a SNP doesn't afford them
		the opportunity to genotype another bin. These users don't care much about the cost; they'll take
		a reduced number of SNPs, but not at the cost of a reduced probability of successful genotypes. The scoring function
		does not use coverage as a criterion.

	Users who have a fixed amount of genotyping and want the absolute best results, money no object.
		They should genotype all the SNPs and not use this program.

Optimality of Solution:
-----------------------

If we set aside the too-close constraint and multiple SNPs per bin, this is the vertex cover problem for N=2, which
for general N is NP-hard (one of Karp's 21 original NP-complete problems): pick the smallest subset of edges (SNPs)
whose endpoints cover all the edges.
N=2 is a special case: it is a bipartite vertex cover, which for N=2 is equivalent to the maximum-weight matching problem,
and an optimal polynomial-time algorithm is known. For N>2 our problem is no longer a vertex cover
problem, because our edges (each a SNP) can now pass through multiple nodes (bins in different populations).
Nevertheless, we can learn from the algorithms that solve those problems. The greedy algorithm for this kind of
problem successively picks the highest-weight edges (weight calculated in the original graph). Our algorithm is
like a greedy algorithm, except that we consider the currently highest-weight edge and that our weight is
a function of the topological density.

For the N=2 vertex cover problem, the pure greedy algorithm is known to produce solutions at worst
2x worse than the best possible solution. In practice, simulations show behavior 2-5% worse than optimal.
Our algorithm is expected to have better optimality behavior than this.

From the literature, we know that solutions that consider not only the local weight of a SNP but also
the weight of distant paths achieve better optimality. Our cluster break-up strategy does exactly that, by
homing in on the densest regions of the graph.
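The greedy edge-picking idea above can be sketched as weighted set cover over bins. This is a simplification of the text's edge formulation; using `p_success` as the tie-breaking weight is an assumption.

```python
def greedy_cover(covers, p_success, all_bins):
    """Repeatedly take the SNP covering the most still-uncovered bins.

    covers: snp -> set of bin ids; p_success: snp -> weight for tie-breaking.
    """
    uncovered = set(all_bins)
    picked = []
    while uncovered:
        # best marginal gain first, ties broken by success probability
        best = max(covers,
                   key=lambda s: (len(covers[s] & uncovered), p_success[s]))
        gain = covers[best] & uncovered
        if not gain:
            break  # some bins have no candidate SNP left
        picked.append(best)
        uncovered -= gain
    return picked
```

With three bins, SNP x (covering A and B) is taken first, then y finishes off C, so z is never needed.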

If we allow clusters to break up, our algorithm finds the first level of improvement, but it does not
completely explore the optimal solution space.

For example:
Imagine that we have two partial N-umvirates.
For the first partial N-umvirate (A center, with edges B and C; say the C edge does not satisfy the criteria),
 it is possible to imagine that we could choose a di-tag for both A and C, and we might then be lucky that one of the
 di-tag SNPs in C perfectly overlaps a di-tag SNP in A and that the other overlaps another tag in cluster D
 (C and D also form a partial di-tag). Then the total number of tags before is A(1), B(1), C(1), D(1) ==> 4; after,
 it's A-B(1), A-C(1), C-D(1) = 3. We could make things even better with a twist (another bin E that shares the
 C-D SNP in another population). A completely optimal algorithm would not only have to consider the marriage of partial N-tags,
 it would have to consider all paths that link partial N-tags through chains of bins split into N-tags. While this algorithm
 would still not be factorial in nature, it is hard to justify the added computation and programmatic complexity on the basis of
 a small cost saving. Also, starting from greedily obtained clusters from single populations and trying to break them up may not be
 the optimal approach for N populations. A better strategy for the N-population case might be to generate haplotypes for the N populations by
 reconstructing the parsimonious evolution of a few ancestral haplotypes (or use a Bayesian model averaging approach on top of it).

We have chosen to halt the complexity of our algorithm at the N-umvirate level.

Including Obligate SNPs:
-------------------------------------
Obligate-include SNPs are usually SNPs for which the user has prior knowledge from other studies or from
their potential effect on a gene's function or regulation (e.g. SNPs in conserved regions or non-synonymous SNPs).
Another use of obligate SNPs is to add tag SNPs to an existing panel assay, based either on additional
data or on the opportunity to generate more assays: the run is then repeated on the same data
and additional SNPs get selected. For this second strategy to work, the obligate SNPs must have been
input to the initial bin-building algorithm.

We pre-mark those SNPs as picked. If they are tag SNPs, we also remove edges and update counts as appropriate.
This may result in more than Nsb tag SNPs per bin.

If obligate SNPs create a cluster of too close SNPs, they may get rejected.

One problem occurs when the user includes non-tag SNPs in the obligate-include list.
Suppose we try to break up a bin into an N-tag, an obligate tag SNP is already in the bin, and
there is also an obligate SNP in the bin that is NOT a tag SNP. What's the right thing to do?
Break the bin and still maintain the total number of obligates?
We have decided not to break a cluster if there are any obligate SNPs inside it.

Excluding Obligate Exclude SNPs:
--------------------------------

Obligate-exclude SNPs are really a convenience for the user, who could in fact simply remove
them from the input. The case arises when the bins are built with SNPs that are later found to be
undesirable for an assay, or that have already been genotyped in the population of interest (multi-stage
or pilot study).

We remove them from the list of tag SNPs. If this leaves a bin with no tag-SNP choice, we consider
breaking up the bin into N-tags. In this case, we just pick the N-tag that maximizes the population overlap.

If a tag SNP is in both obligate include and obligate exclude, we report an error and abort.

Including multiple tag-snps per bin in the algorithm:
-----------------------------------------------------

If a bin has K obligate-include tag SNPs (where K is the number of wanted tag SNPs for that bin), then
it may only be selected for N-tag break-up if the N-tags include the obligates (no additional SNPs
selected).

In that case, the criterion for an N-umvirate has to be changed. Consider the overlap between A and B. If A is required to have Ka
tag SNPs and B Kb tag SNPs, then the criterion goes from
"Has N-tag candidate SNPs for A overlapping with a tag SNP for B, but no tag SNP of A overlaps tag SNPs of B."
to
"Has N-tag candidate SNPs for A overlapping with a tag SNP for B
   AND
   A and B have fewer than min(Ka,Kb)-1 overlapping tag SNPs before the break-up

   AND the N-tag includes the obligate tag SNPs

   AND no N-tag SNP is picked too close to another SNP (the least restrictive rule would
   		be that creating the N-tag does not create a cluster of SNPs of size O)

   AND
		    the cluster break-up must guarantee a higher-scoring solution:
			the sum of the K-tag population weights must be greater than
				the sum of the weights of the already-selected SNPs
					+ maxweight(of the top Nsb selected) SNPs available for selection."


   Do not consider the bin for break-up if there are any obligate tag SNPs in it.
   An obligate-exclude SNP cannot be picked for selection.

   If it is a large bin, only apply the multiple-SNPs-per-bin policy to the post-break-up bins.


Including the snps that are too close in the search:
----------------------------------------------------

SNPs that are too close present problems for Illumina and ABI SNPlex custom panels.
If spread across multiple OPAs (panels), they are allowed. The #O parameter specifies
that up to #O SNPs can be too close:
#O==1 means that no two SNPs can be too close, so only pick one.
#O==0 means reject any SNP too close to another one.
	Ideally these SNPs should be obligate excludes in the tag-SNP picking algorithm.
#O==-1 means ignore any proximity criteria.
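A sketch of the proximity grouping implied by k and #O. Function and parameter names are invented for illustration; downstream handling (exclude all vs. keep one per run) differs for #O==0 and #O==1 as described above.

```python
def too_close_runs(positions, k):
    """Group SNPs into runs whose consecutive sorted positions are <= k bases apart.

    positions: dict snp_id -> bp position. Returns a list of runs (lists of SNP ids).
    """
    if not positions:
        return []
    ordered = sorted(positions, key=positions.get)
    runs, run = [], [ordered[0]]
    for prev, cur in zip(ordered, ordered[1:]):
        if positions[cur] - positions[prev] <= k:
            run.append(cur)          # still within the too-close distance
        else:
            runs.append(run)
            run = [cur]
    runs.append(run)
    return runs

def offending_runs(runs, opa_count):
    """Runs that violate the #O policy. opa_count == -1 ignores proximity entirely;
    0 and 1 both flag any run of 2+ SNPs (handled differently by the caller)."""
    if opa_count == -1:
        return []
    limit = max(opa_count, 1)
    return [r for r in runs if len(r) > limit]
```

SNPs at 100 and 150 bp form one run for k=60; the SNP at 500 bp stands alone.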


Considering SNPs that are too close is incompatible with our heuristic of choosing the SNPs that touch the
most populations.
Because of that, we must dynamically explore solutions for SNPs that are too close ahead
of selecting any SNPs. This does worsen the algorithmic time scaling of our algorithm, but it is only
expected to be an issue for a small number of SNPs (~1% at typical HapMap resolutions).

Because we use a dynamic-programming exploration of the solution space, this does mean that
our "too-close" feature should not be used to limit the density of genotyped
SNPs (e.g. SNPs no closer than 5 kb).

We first look for solutions that do not involve breaking up bins.

For some too-close SNPs, there may not be any solution that respects the bins.
(XXX In that case, the N-tag bin break-up strategy could also be used to find solutions for when SNPs are too close. We will
not attempt this in this implementation.)

(XXX The last thing to try: although a SNP is not in a bin, it might have LD above the r^2 cutoff (owing to
the way the greedy bin build-up algorithm works) with another single SNP.)

If we cannot find a solution at all, we have to reject some SNPs.
This occurs naturally during the course of the dynamic programming, and we just
skip over those SNPs. The only caveat is that we select the SNP to reject based
on the order in which we chose to process bins/SNPs, not necessarily based on a global criterion
(future version).


Proximity algorithm:
Let #O = number of OPAs == maximum size of a cluster of too-close SNPs
Let k = maximum proximity between SNPs (SNP positions cannot be at that separation or lower)

Start with the bin-edge clustering.
Assign Nsb according to rules.
Pre-tag obligates (excludes and includes).

// Process SNPs that are too close
if (#O == -1) {
	skip this proximity option.
} else {
	Assign Nsb = number of SNPs for the bin (from the rule engine),
				  except that Nsb <= number of tag SNPs in the bin.
	Serially look at SNPs in order, creating a growing list of lists of clustered SNPs.
	Group runs of SNPs that are too close (i.e. positions separated by k or less),
		as long as they are not obligate excludes.

	if (#O == 0) {
		mark all those SNPs as excluded
		issue errors for true singletons (1 SNP, one tag)
		// if a bin with no leftover tag SNPs has >= 2 more SNPs,
		//   could use the di-tag break-up

	} else {
	 	Consider only SNP clusters of size > #O; assign snpclusterids.
	 	// Merge SNP clusters that involve common bins:
	 	// consider the snpcluster a node and the binid the edge.
		// Need to code a generic single-linkage clustering routine.
	 	For each snpcluster, collect the bins that include those SNPs
	 	      and store the snpclusterids in the bins.
		Create a linked list of objects.
	 	Now loop over the bins and create edges whenever there are 2 or more snpclusterids.

	 	Collect the list of bins that contain the SNPs involved in a cluster.
			// Must assign up to Nsb SNPs per bin until all the SNPs
			//   in the danger zone have either been selected or eliminated.

	 	// The goal is to find a solution where every bin is guaranteed Nsb SNPs,
	 	//   not necessarily to pick the SNPs unless they are in the danger zone.

	    Iterate over clusters of bins and SNPs
	 

	    snp_in_danger0 = an array indexed by snpid flagging SNPs in the danger zone.
	           // "danger zones" == regions whose SNPs may touch others
	           // This array will get updated.
	    snp_in_danger = a copy of that array that will get updated.
	    snp_excluded = an array indexed by snpid flagging SNPs that are excluded from selection.


		Mark as selected all SNPs with only a single choice.
		If those marked SNPs make up a cluster of size > #O,
			remove every "O-th" one (mark the SNP as unavailable and reject the bin). (XXX Here we could do an exhaustive di-tag search.)
		Re-cluster SNPs if any have been set unavailable.
		
		//
		// Process bins until no more SNPs in the updated danger zone can be processed.
		// We don't necessarily have to pick a SNP: if there are enough SNPs outside the
		// danger zone to pick, then mark leftover SNPs in the danger zone as available
		// and let the general algorithm handle them.

		input: list of bins, list of SNPs in the cluster, snp_in_danger, snp_excluded

		Brute force: bin picking order doesn't matter if there is a solution
				that involves all bins. So pick a bin order that minimizes combinatorics and works through
					the bins with the most constraints first (because we will stop once the CPU limit is reached).

		    create binToPick = array of binids to pick a SNP from (if Nsb > 1, put Nsb copies of the bin in the list).
		    				   Order is: bins with only 1 choice;
		    				   then bins with the fewest alternatives to putting their SNPs in the danger zone (i.e. bins that must pick most of their SNPs in the danger zone)
							   and the fewest alternative SNPs to pick in the danger zone
		    				      (minimal snps_available_for_picking_not_in_dangerzone - Nsb; if negative, the bin must pick SNPs in the danger zone;
		    				       if positive or 0, it doesn't have to),
		    				      then, if equal, minimal snps_in_danger_zone;
		    				   then the bins with minimal snps_not_in_danger_zone.
			create binFullFlag = array indexed by binid telling whether a bin is full.
			create snpsOutsideDanger = array indexed by binid containing the number of SNPs outside the danger zone.
			Loop over each element of binToPick
				loop over all single-SNP choices for the bin.
					(Choices are:
								the best leftover tag outside the danger zone, and
								each leftover tag in the danger zone, best scoring first.)
					Mark the picked SNP as selected as long as it does not create a size-#O region of selected SNPs
						(e.g. look at +/- #O).
				    Else, if this is the last SNP and no single choice could be made,
				    	eliminate this bin (if Nsb==1); if Nsb>1 and at least one SNP was picked,
				    		just issue an error message.

				    Else move on to the next SNP.

				    If this is the last bin and the last SNP processed, check whether this is
				       the best-scoring solution (only add up scores in the danger zone).

				       The best-scoring solution maximizes bang for the buck:
				        (expected # of SNPs represented) / (# of tag SNPs genotyped).
					    To penalize solutions with low-design-score SNPs, we can add a risk-averse factor:
					    risk_factor * expected_failure_prob * bin_size + expected_success * bin_size

					    where expected_success = 1 - Product_1_Nsb(1 - P(SNP)).

					    For Illumina, design score 1 has a success probability of 0.95 and design score 0.5 has 0.9.
					    With risk_factor=5, only singleton bins are worth the risk of lowering the bin size;
				       i.e. maximize total expected number of SNPs successfully represented by successful tag SNPs / # of tag SNPs.
				       	risk_factor=10 eliminates preferentially choosing a low-success SNP to merge even singleton bins;
						with a very large risk factor, design score 0.5 is essentially never chosen unless there is no choice.


				       All other things being equal, the solution with the smallest sum(functional category) wins.

				       If a bin is excluded, its probability of failure is 1 and its probability of success is 0.
				       This takes care of comparing partial solutions.
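The scoring rule above can be written out as follows. Note one interpretation made here: the "+" in the risk formula is read as a subtracted penalty, since the risk term is meant to penalize failure; the default probabilities and risk_factor are the illustrative Illumina values from the text.

```python
def bin_score(bin_size, tag_probs, risk_factor=5.0):
    """Expected payoff of a bin tagged by SNPs with the given success probabilities.

    expected_success = 1 - prod(1 - P(snp)); an excluded bin (no tags) scores
    as certain failure. The risk term penalizes the chance that all tags fail.
    """
    p_all_fail = 1.0
    for p in tag_probs:
        p_all_fail *= (1.0 - p)
    expected_success = 1.0 - p_all_fail
    return expected_success * bin_size - risk_factor * p_all_fail * bin_size

def solution_score(bin_sizes, picks_per_bin, risk_factor=5.0):
    """Bang for the buck: expected SNPs represented per tag SNP genotyped."""
    total = sum(bin_score(bin_sizes[b], picks_per_bin[b], risk_factor)
                for b in bin_sizes)
    n_tags = sum(len(v) for v in picks_per_bin.values()) or 1
    return total / n_tags
```

A singleton bin tagged by a design-score-1 SNP (P=0.95) scores 0.95 - 5*0.05 = 0.70; an excluded bin of size 3 scores -15.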
				       
		            
		            
	 	       // If we expect a large number of bins, it is possible to mark SNPs as "excluded" once all Nsb
	 	       //   XXX SNPs in a bin have been picked.
	 	       //   XXX SNPs left unpicked in the danger zone (and not also in another bin) can be
	 	       //   XXX marked as excluded.
	 	       //   XXX This allows one to re-cluster SNPs in the danger zone, as the excluded SNPs break runs of SNPs.
	 	       //   XXX Left as a project for a student.
	 	       //   XXX We would then have to change the criteria for a good solution, as we won't pick everything.
			   //   XXX A good solution ends up with:
		       //   XXX   no cluster of size >= #O,
			   //   XXX   and either at least 1 SNP/bin selected
			   //   XXX     or at least 1 SNP/bin not unavailable,
			   //   XXX   and no SNP in the danger zone that is not either picked or unavailable.
			
			
	How does this too-close SNP algorithm interact with the cluster break-up?
	  Once SNPs are fixed, restrict umvirate break-up so that it does not pick k-tags that
	  	are too close to any other SNPs.




Constraint collisions:
----------------------

Our philosophy is that the user wants a solution. Some combinations of constraints will result in no solution.

The user may request too many SNPs per bin (e.g. too many obligates): we will issue solutions
	with more SNPs per bin than requested.
The user may request more SNPs per bin (via the rules) than is consistent with the too-close constraint or with the total number of SNPs in the bin.

The requirement of no too-close SNPs may result in no solution: we try to
    break the bin, eventually reject the least important SNP, and warn the user.

If the user specifies both obligate include and obligate exclude for the same SNP,
	or if obligates are too close to each other,
   the program aborts without attempting a solution. These are the only such cases; the fix is easy, and the condition may signify
a problem in the user's pipeline.

Obligate-include tag SNPs in a bin halt further break-up of the bin, both for proximity avoidance and
for N-tag selection.

The competition between picking fewer tag SNPs (SNPs that tag multiple bins) and reliability is handled by a risk-reward model that maximizes
the expected number of SNPs successfully represented divided by the number of tag SNPs needed, with a penalty factor
to penalize losing SNPs. This also allows large bins with Nsb>1 to include some design-score-0.5 SNPs if they are shared between
multiple bins.


Software options not included on purpose:
-----------------------------------------
We do not offer an option to select SNPs based on uniform coverage density. This is only a valid heuristic
if no genotypes are available to build LD. It also doesn't make sense to try to optimize both the minimal number
of SNPs and uniform density; one has to win. We suggest that users who want this write a simple Perl script
to do a post-processing pass and add SNPs in regions deemed too low density.

Complete Algorithm:
-------------------

Putting all those elements together, we first have to pick SNPs that are too close. In some cases we can
pick them outright; in other cases we also have to dynamically explore all solutions below. It is also possible
that too-close SNPs will lead to no solution; in that case, we throw one out (and report the conflict).
Next we use the N-tag cluster break-up, followed by the greedy cluster break-up algorithm on the rest.

Usage of SNP categories: use them to weight SNPs by functional importance.

Default SNP categories were selected using the SIFT/PolyPhen categorization as well as dbSNP and other tools.
Category weight = 1 + importance * (1/r^2 - 1)/10   [should use the exact r^2, but this implementation uses the parameter r^2]

importance	SNP category (the categories are arbitrary; any string defined in the configfile will override these)
15		nonsyn-linked				Linked to a disease (suggest use obligate instead)
10		nonsyn-intolerant			computational prediction of intolerance
10		nonsyn-probablydamaging		computational prediction of intolerance.
10		nonsyn-possiblydamaging
10		coding-nonsyn                     (unknown syn)
8		nonsyn-tolerant
8		nonsyn-benign
8		splice
5		factor
5		polyasignal
5		initiation
5		genomicconserved
4		coding						unknown whether SYN or non-SYN
4		3UTR
3		5UTR
3		UTR							Illumina category.. how come they don't know which UTR?
2		flanking_3UTR				Illumina category
2		flanking_5UTR				Illumina category
2		coding-SYNON							synonymous substitution
1		intron
0		unspecified					This is the default for no preference.
-1		repeat        				(Hard to genotype)
-1		microsat      				(hard to genotype)
-1		lowcomplexity  				(hard to genotype)
-5		indel						impossible to genotype on most platforms.
-5		msnp						impossible to genotype on most platforms
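The weight formula above, as code. The importance table is excerpted, and unknown categories default to the "unspecified" importance of 0, matching the table's default.

```python
# Excerpt of the importance table above (full table in the text).
IMPORTANCE = {
    "nonsyn-linked": 15,
    "nonsyn-intolerant": 10,
    "coding": 4,
    "intron": 1,
    "unspecified": 0,
    "indel": -5,
}

def category_weight(category, r_squared):
    """weight = 1 + importance * (1/r^2 - 1) / 10, using the parameter r^2."""
    importance = IMPORTANCE.get(category, 0)
    return 1.0 + importance * (1.0 / r_squared - 1.0) / 10.0
```

An "unspecified" SNP always weighs 1.0; an intron SNP at r^2 = 0.5 weighs 1 + 1*(2-1)/10 = 1.1.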




SNP priorities: greedy non-local cluster break-up.
Pick the SNPs that touch the most bins;
  to break ties, sum the number of other bins touching all bins touched by that SNP.


Once we get down to a single bin or pair (all multiple-population SNPs have been picked), SNPs are picked according to:
1) Designability (coarse-grained: 0-0.2 = 0, 0.2-0.4 = 0.25, 0.4-0.6 = 0.5, 0.6-0.8 = 0.75, 0.8-1.0 = 1.0).
2) Never pick a SNP with designability < 0.5 (0.5 is OK).
3) Within a design-score bucket, rank by validation class.

4) To maximize the detection probability of a SNP in an important category, we have
   to correct for r^2 (i.e. other SNPs are not perfect proxies)
   by multiplying the probability of success by 1/r^2; but one is better off with an imperfect proxy with high chances of success
   than with a SNP with low odds of success. This achieves the right balance.
   The top rank gets the full correction; lower ranks get proportionally smaller corrections.

   This is in opposition to the multi-population selection, which picks location over probability of success.
5) Compute the score based on all SNPs chosen so far, considering all the other bins a SNP touches.
6) Else pick according to the lowest-rank SNP category.
7) Else pick the one with the smallest id (on the theory that, since it has been around longer, it had
    more of a chance to get validated).
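The pick order above can be condensed into one sort key. This is a sketch: the field names are assumptions, and rule 2 (never pick designability < 0.5) would be applied as a filter before sorting.

```python
def design_bucket(d):
    """Coarse-grain a design score into the 0, 0.25, 0.5, 0.75, 1.0 buckets."""
    for cutoff, bucket in ((0.2, 0.0), (0.4, 0.25), (0.6, 0.5), (0.8, 0.75)):
        if d < cutoff:
            return bucket
    return 1.0

def pick_key(snp):
    """snp: dict with keys design, validation_rank, category_rank, id.
    Lower tuples win under min()."""
    return (-design_bucket(snp["design"]),  # higher design bucket first
            snp["validation_rank"],         # better validation class first
            snp["category_rank"],           # lower-rank category preferred
            snp["id"])                      # smallest (oldest) id as last resort

def pick_best(candidates):
    return min(candidates, key=pick_key)
```

Two SNPs in the same design bucket and validation class are split by category rank, then by id.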

When considering 2 or more bins linked by 1 SNP, one has to sum the scores over all bins to decide
whether to take that SNP or not. This is the basis of our hypothesis testing.

   Loop over disconnected (single-linkage) SNP clusters.
   Simpler algorithm:
        Step 1: Assign a pop-pattern to each SNP.

        Step 2: Loop over SNPs sorted by score category, then by size.
				Add SNPs as long as they contribute to at least 1 bin that is not saturated.

		We now have a baseline solution that selected the most reliable SNPs first.

		In looking for more optimal solutions,
			look for a solution that ends up with a lower total number of SNPs
			and yet a lower score,
			OR
			a solution with more SNPs but a lower score.

		So we now have 2 categories of SNPs:
		SNPs that are picked and SNPs that are not picked.

		Can we pick combinations of unpicked SNPs that will replace picked SNPs?
		Repeat the search for a shrinking solution until there are no changes (or until the CPU limit is reached):

			Shrink solution:
				We want to look for solutions where K SNPs that are not picked
				replace P (> K) SNPs that are picked.
				Consider SNPs not in the solution that share a bin with SNPs in the solution.
					If a SNP not in the solution touches 2 or more SNPs in the solution, it might contribute to the P-K difference;
					  if it touches a single solution SNP, skip it.
					  Now cluster all multiply connected not-in-solution SNPs with SNPs_in_solution.
					  Define BC(P) = maximum coverage of each bin.
					  				For each category of SNP, we only need to consider the
					  					top S, where S is determined by MAX(BC(P){bin i in category}).
					  Bin coverage for TS is defined as the bin coverage provided and needed to meet
					  			the Nsb quotas, so a SNP that contributes to overfilling a bin is not necessary.

					  					[This should also limit singletons/bin to max(Nsb).]
					  So define NOT_PICKED_MIN = the subset of the not-picked SNPs that meet the MAX(BC(P)) criterion.
					  Consider all combinations of TS SNPs (1 .. #PICKED_IN_CLUSTER) that were picked
					  	(if two picked SNPs are in the same class, only include the lower-scoring one in this search).
						Iterate through all combinations of selecting TS SNPs (order irrelevant) from PICKED_IN_CLUSTER
						(use a recursive function with a "callback", e.g. Class.method).
					    Picking these TS SNPs would create a bin coverage vector BC(TS).
					  	These TS SNPs will touch a subset KNS of the not-selected SNPs.
					  			Define KNSU as the subset of KNS that contains no SNPs in the same class as SNPs in TS
					  			(otherwise we know that the SNP already in TS beat the other one, and if
					  				that SNP is not needed, then a smaller combination will beat it).

					  	We are looking for a subset K of those KNSU SNPs such that BC(K) >= BC(TS) and score(K) > score(TS).
					  	    If MINIMAL_SNP is selected, then do not consider the score, only the SNP count.
					  	We get to make one replacement per cluster per iteration.
					  	




Pre-tests: if a tag SNP is both obligate include and obligate exclude, abort. (This may be an error the user needs to correct.)

Pre-filter: remove any SNPs with a designability of 0 (unless they are obligate).

0) Single-linkage clustering of bins according to edges (SNPs common to 2 bins).
1) Mark obligate-include SNPs.
    Sequentially scan the list of obligate includes and make sure that they do not create
	a cluster of size #O.
    	Report statistics:
    		position, size of the bin it tags, number of other tags in the bin, number of populations it touches.

     If any error, abort.

2) Exclude obligate-exclude SNPs.
     If an obligate exclude tags a singleton bin, report an error but do not abort.
     If an obligate exclude is a single tag and tags one or two other SNPs, choose those 1 or 2 others
       (unless the 2 are within 60 bp of each other or of other SNPs).
     If an obligate-exclude bin had a single tag {
     	/// XXX when doing k-tag splitting, could do something here.
      	report an error
      }
3) Cluster all SNPs that are within 60 bp (parameter) of each other. (See algorithm above.)


4) // (We'll skip this for now)
   // Look for all permissible k-tag N-umvirates.
     Cluster N-umvirates.
     Dynamically explore those solutions by cluster break-up.

5) Run clustering based on an extension of Howie and Carlson.

6) Output results.

