BioR: Rapid, Flexible System for Genomic Annotation

BioR (Biological Reference Repository)

A critical problem in applied clinical bioinformatics is rapidly and accurately turning next-generation sequencing (NGS) results into clinically significant recommendations, which requires integrating and prioritizing them against annotation data spread across hundreds of flat files, databases, tools, and online resources. The Biological Reference Repository (BioR) catalog format is a flexible, readable, indexable, and schema-free format for storing and rapidly accessing arbitrary structured data such as genomic features, diseases, conditions, genetic tests, and drugs. The catalogs are modular: each is based on a specific data source or tool and can be built and queried independently of the others. This makes it much easier and quicker to integrate and update data sources than with traditional database approaches. Users can request that new data sources be added as catalogs, or supply their own catalogs and add them to the BioR repository. BioR tools let users query this data using fast coordinate overlaps and indexed keys. As user files are processed, annotation data is appended to the end of each input line as parseable tab-delimited text consisting of key-value pairs, which can be extracted and used as the basis for further operations.
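To illustrate how the appended annotation might be consumed downstream, here is a minimal Python sketch. It assumes the key-value pairs arrive as a JSON object in the last tab-delimited column; this page does not specify the serialization, so treat that as an assumption. The function name `split_annotation`, the sample line, and the keys `gene` and `minor_allele_freq` are all hypothetical and are not taken from any BioR catalog.

```python
import json


def split_annotation(line):
    """Split one annotated line into (original columns, annotation dict)."""
    columns = line.rstrip("\n").split("\t")
    # Assumption: the final column holds the key-value pairs as a JSON object.
    annotation = json.loads(columns[-1])
    return columns[:-1], annotation


# Hypothetical annotated line: five input columns plus one appended column.
example = 'chr17\t41196312\trs0000\tG\tA\t{"gene": "BRCA1", "minor_allele_freq": 0.01}'

original, annotation = split_annotation(example)
print(original)            # the user's original columns, unchanged
print(annotation["gene"])  # a single value extracted by key
```

Because the original columns are left untouched and the annotation sits in its own trailing column, a later step can filter or sort on an extracted key without re-parsing the input format.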

If you encounter trouble with any files on this page, find a bug with the BioR code, or have a feature request, please use this form:

BioR Bug Submission Form


BioR QuickStart 2.2.x

BioR QuickStart 2.1.x

BioR User Guide (see section on Installation instructions)

BioR User Guide 2.2.x

BioR User Guide 2.1.x

(Editable Google Doc – note: this may not work well in Internet Explorer)

Catalog descriptions

(Editable Google Doc – note: this may not work well in Internet Explorer)

Release Notes

Downloads – Toolkit

(For descriptions of changes, see Release Notes above)

BioR Toolkit (for Linux and Mac OS X)

Source Code – Java

BioR Pipeline (main project) – 2.2.2 (beta), 2.2.1, 2.1.1

Pipes (subproject to handle Unix-style pipes in Java) – 2.2.4 (beta), 2.2.3, 2.1.1

BioR Catalog (to create BioR catalogs from source data) – 2.2.2 (beta), 2.2.1, 2.1.1

Maven jars needed to build the projects

(extract this to your user home directory – this will create a “.m2” directory in your home directory)

GitHub source repository

BioR-Supported Catalogs (tar-gzip files)

Small version of the catalogs – Click here if you would like a small version of the entire catalog list (note: some of the large catalogs are restricted to chromosome 17; see the list of which ones)

  • Partial catalog: All catalogs are required for the bior_annotate command to run, but to get up and running quickly, you may try a sample version of some of the larger catalogs that includes chromosome 17 variants only.
  • Additional catalogs: These are catalogs added by users of the toolkit, and are not guaranteed to be updated or maintained in the future. They are provided as-is.

| Data Source | Version | Build | Size | Partial catalog | Description |
| --- | --- | --- | --- | --- | --- |
| dbSNP | 137 | GRCh37 | 3.1GB | Sample – contains chr17 variants only (90MB) | The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms |
| 1000 Genomes | 2010-11-23 | GRCh37 | 2.2GB | Sample – contains chr17 variants only (60MB) | The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied, by sequencing the genomes of a large number of people |
| HapMap | 2010-08 Phase II+III | GRCh37 | 682MB | Sample – contains chr17 variants only (17MB) | The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation |
| UCSC (partial catalog – see below for full) | hg19 | GRCh37 | 548MB | | The UCSC Genome Browser provides a rapid and reliable display of any requested portion of genomes at any scale, together with dozens of aligned annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG islands, assembly gaps and coverage, chromosomal bands, mouse homologies, and more) |
| ESP | build 37 | GRCh37 | 192MB | | The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders |
| COSMIC | v63 | GRCh37 | 149MB | | Catalogue Of Somatic Mutations In Cancer is designed to store and display somatic mutation information and related details and contains information relating to human cancers |
| HGNC | 2012-08-12 | GRCh37 | 7.0MB | | The HUGO Gene Nomenclature Committee (HGNC) has assigned unique gene symbols and names to over 34,000 human loci, of which around 19,000 are protein coding. HGNC is a curated online repository of HGNC-approved gene nomenclature and associated resources including links to genomic, proteomic and phenotypic information, as well as dedicated gene family pages |
| BGI | hg19 | GRCh37 | 5.4MB | | BGI and Danish researchers have sequenced 200 human exomes of European ancestry from Denmark with an average of 12-fold coverage depth per sample to discover new, low-frequency variants by aggregating data from all 200 individuals |
| NCBIGene | 2013-04-08 | GRCh37, patch 10 | 5.2MB | | Integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide |
| OMIM | 2013-02-27 | GRCh37 | 1.4MB | | Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. It contains all known Mendelian disorders and over 12,000 genes, and focuses on the relationship between phenotype and genotype. |
| miRBase | release 19 | GRCh37 | 59KB | | The miRBase database is a searchable database of published miRNA sequences and annotation. Each entry in the miRBase Sequence database represents a predicted hairpin portion of a miRNA transcript, with information on the location and sequence of the mature miRNA sequence (termed miR). |

Additional Catalogs

| Data Source | Version | Build | Size | Partial catalog | Description |
| --- | --- | --- | --- | --- | --- |
| UCSC (full catalogs) | hg19 | GRCh37 | 135GB | | (see description above) NOTE: Due to the size (135GB) and number (8344) of these files, please contact the BioR team using the form at the top of this page |
| DrugBank | Release 3.0 (2011-01) | | 43MB | | DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information |
| PharmGKB | 2013-07-11 | | 4.6MB | | The PharmGKB is a pharmacogenomics knowledge resource that encompasses clinical information including dosing guidelines and drug labels, potentially clinically actionable gene-drug associations and genotype-phenotype relationships. PharmGKB collects, curates and disseminates knowledge about the impact of human genetic variation on drug responses |
| Therapeutic Targets Database (TTD) | 4.3.02 | | 1.2MB | | Therapeutic Target Database (TTD) is a database to provide information about the known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs directed at each of these targets. |
| UniProt | 2013-07-10 | | 2.1GB | | UniProt provides the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. |
| KEGG (coming soon) | 2013-07-11 | | 127KB | | KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY database records networks of molecular interactions in the cells, and variations specific to particular organisms |

Page last modified: September 30, 2016