Department of Quantitative Health Sciences

BioR: Rapid, Flexible System for Genomic Annotation

BioR (Biological Reference Repository)

A critical problem in applied clinical bioinformatics is rapidly and accurately integrating next-generation sequencing (NGS) results with annotation data spread across hundreds of flat files, databases, tools, and online resources, and prioritizing that information into clinically significant recommendations. The Biological Reference Repository (BioR) catalog format is a flexible, readable, indexable, and schema-free format for storing and rapidly accessing arbitrary structured data such as genomic features, diseases, conditions, genetic tests, and drugs. The catalogs are modular: each is based on a specific data source or tool and can be built and queried independently of the others. This makes it much easier and quicker to integrate and update additional data sources than with traditional database approaches. Users can request that new data sources be added as catalogs, or supply their own catalogs and add them to the BioR repository. BioR tools allow users to query this data by fast coordinate overlap and by indexed key. As user files are processed, annotation data is appended to each input line as parseable tab-delimited text consisting of key-value pairs, which can be extracted and used as the basis for further operations.
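The extraction step described above can be illustrated with a short sketch. This page does not spell out how the key-value pairs are encoded, so the example assumes the annotation arrives as a JSON object in the last tab-delimited column; the function name, field names, and sample line are hypothetical and are not taken from the BioR toolkit. The User Guide is the authority on the actual output layout.

```python
import json


def split_annotated_line(line):
    """Split one tab-delimited line into its original columns and the
    annotation, which is ASSUMED to be a JSON object in the last column."""
    columns = line.rstrip("\n").split("\t")
    original, annotation_text = columns[:-1], columns[-1]
    annotation = json.loads(annotation_text)
    return original, annotation


# Hypothetical annotated line: five input columns plus one appended column.
sample = "17\t41196312\trs0000\tG\tA\t" + json.dumps(
    {"gene": "EXAMPLE1", "allele_freq": 0.01}
)

original, annotation = split_annotated_line(sample)
print(original)            # the untouched input columns
print(annotation["gene"])  # one value pulled out for further processing
```

Keeping the original columns separate from the parsed annotation means downstream steps can filter or sort on annotation values without disturbing the input record.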

If you encounter trouble with any files on this page, find a bug in the BioR code, or have a feature request, please use this form:

BioR Bug Submission Form

Documentation

BioR QuickStart 2.2.x

BioR QuickStart 2.1.x

BioR User Guide (see section on Installation instructions)

BioR User Guide 2.2.x

BioR User Guide 2.1.x

(Editable GoogleDoc – note: this may not work well in Internet Explorer)

Catalog descriptions

(Editable GoogleDoc – note: this may not work well in Internet Explorer)

Release Notes

Downloads – Toolkit

(For descriptions of changes, see Release Notes above)

BioR Toolkit (for Linux and Mac OS X)

Source Code – Java

BioR Toolkit (main project)

Pipes (sub project to handle unix-style pipes in Java) – 2.2.4 (beta), 2.2.3, 2.1.1

BioR Catalog (to create BioR catalogs from source data) – 2.2.2 (beta), 2.2.1, 2.1.1

Maven jars needed to build the projects

(extract this to your user home directory – this will create a “.m2” directory in your home directory)

GitHub source repository

BioR-Supported Catalogs (tar-gzip files)

Small version of the catalogs – a reduced version of the entire catalog list is available. Note: some of the large catalogs are restricted to chromosome 17 (see the list of which ones).

Partial catalog: All catalogs are required for the bior_annotate command to run, but to get up and running quickly, you may try a sample version of some of the larger catalogs that includes chromosome 17 variants only.
Additional catalogs: These are catalogs added by users of the toolkit, and are not guaranteed to be updated or maintained in the future. They are provided as-is.
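
Because the sample catalogs contain chromosome 17 variants only, test input on other chromosomes would not be expected to match anything in them. The sketch below trims a test file to chromosome 17 before trying the sample catalogs. It assumes a VCF-like tab-delimited input whose first column is the chromosome and whose header lines start with `#`; the function name and sample rows are hypothetical and are not part of the BioR toolkit.

```python
def keep_chromosome(lines, chromosome="17"):
    """Yield header lines plus data lines whose first column names the
    requested chromosome, accepting both "17" and "chr17" styles."""
    wanted = {chromosome, "chr" + chromosome}
    for line in lines:
        if line.startswith("#") or line.split("\t", 1)[0] in wanted:
            yield line


# Hypothetical input rows for illustration only.
sample_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t12345\t.\tA\tG\n",
    "17\t41196312\t.\tG\tA\n",
    "chr17\t41200000\t.\tC\tT\n",
]

filtered = list(keep_chromosome(sample_lines))
for row in filtered:
    print(row, end="")
```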

Data Source	Version	Build	Size	Partial catalog	Description
Reference Assemblies	N/A	GRCh37, GRCh38	1.6GB		Reference assemblies containing 100kbp sequences, with tabix indexes, plus chromosome sizes and sort orders
dbSNP	137	GRCh37	3.1GB	(Sample – contains chr17 variants only) (90MB)	The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms
1000 Genomes	2010-11-23	GRCh37	2.2GB	(Sample – contains chr17 variants only) (60MB)	The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied, by sequencing the genomes of a large number of people
HapMap	2010-08 Phase II+III	GRCh37	682MB	(Sample – contains chr17 variants only) (17MB)	The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation
UCSC (partial catalog – see below for full)	hg19	GRCh37	548MB		The UCSC Genome Browser provides a rapid and reliable display of any requested portion of genomes at any scale, together with dozens of aligned annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG islands, assembly gaps and coverage, chromosomal bands, mouse homologies, and more)
ESP	build 37	GRCh37	192MB		The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders
COSMIC	v63	GRCh37	149MB		Catalogue Of Somatic Mutations In Cancer is designed to store and display somatic mutation information and related details and contains information relating to human cancers
HGNC	2012-08-12	GRCh37	7.0MB		The HUGO Gene Nomenclature Committee (HGNC) has assigned unique gene symbols and names to over 34,000 human loci, of which around 19,000 are protein coding. HGNC is a curated online repository of HGNC-approved gene nomenclature and associated resources including links to genomic, proteomic and phenotypic information, as well as dedicated gene family pages
BGI	hg19	GRCh37	5.4MB		BGI and Danish researchers have sequenced 200 human exomes of European ancestry from Denmark with an average of 12-fold coverage depth per sample to discover new, low-frequency variants by aggregating data from all 200 individuals
NCBIGene	2013-04-08	GRCh37, patch 10	5.2MB		Integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide
OMIM	2013-02-27	GRCh37	1.4MB		Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. It contains all known Mendelian disorders and over 12,000 genes, and focuses on the relationship between phenotype and genotype.
miRBase	release 19	GRCh37	59KB		The miRBase database is a searchable database of published miRNA sequences and annotation. Each entry in the miRBase Sequence database represents a predicted hairpin portion of a miRNA transcript, with information on the location and sequence of the mature miRNA sequence (termed miR).

Additional Catalogs

Data Source	Version	Build	Size	Description
UCSC (full catalogs)	hg19	GRCh37	135GB	(see description above) NOTE: Due to the size (135GB) and number (8344) of these files, they are not offered for direct download; to request them, please contact the BioR team using the form at the top of this page
DrugBank	Release 3.0 (2011-01)		43MB	DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information
PharmGKB	2013-07-11		4.6MB	The PharmGKB is a pharmacogenomics knowledge resource that encompasses clinical information including dosing guidelines and drug labels, potentially clinically actionable gene-drug associations and genotype-phenotype relationships. PharmGKB collects, curates and disseminates knowledge about the impact of human genetic variation on drug responses
Therapeutic Target Database (TTD)	4.3.02		1.2MB	The Therapeutic Target Database (TTD) provides information about known and explored therapeutic protein and nucleic acid targets, the targeted diseases, pathway information, and the corresponding drugs directed at each of these targets.
UniProt	2013-07-10		2.1GB	UniProt provides the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
KEGG (coming soon)	2013-07-11		127KB	KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY database records networks of molecular interactions in the cells, and variations specific to particular organisms

Page last modified: November 17, 2023
