BioR: Rapid, Flexible System for Genomic Annotation
BioR (Biological Reference Repository)
A critical problem in applied clinical bioinformatics is rapidly and accurately integrating next-generation sequencing (NGS) data with annotation data spread across hundreds of flat files, databases, tools, and online resources, and prioritizing the result into clinically significant recommendations. The Biological Reference Repository (BioR) catalog format is a flexible, readable, indexable, and schema-free format for storing and rapidly accessing arbitrary structured data such as genomic features, diseases, conditions, genetic tests, and drugs. The catalogs are modular: each is based on a specific data source or tool and can be built and queried independently of the others. This makes it much easier and quicker to integrate and update data sources than with traditional database approaches. Users can request that new data sources be added as catalogs, or supply their own catalogs and add them to the BioR repository. BioR tools let users query this data through fast coordinate overlaps and indexed keys. As user files are processed, annotation data is appended to each input line as parseable tab-delimited text consisting of key-value pairs, which can be extracted and used as the basis for further operations.
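As a rough illustration of extracting appended key-value pairs from an annotated line, the sketch below assumes the pairs are serialized as a JSON object in the last tab-delimited column. This page does not specify the serialization, so treat that as an assumption. The function name `extract_keys`, the sample line, and the key names are hypothetical and are not part of the BioR toolkit.

```python
import json


def extract_keys(line, keys):
    """Append selected annotation values to a tab-delimited line.

    Assumes (hypothetically) that the last column holds the key-value
    pairs as a JSON object. Missing keys are written as ".".
    """
    columns = line.rstrip("\n").split("\t")
    annotation = json.loads(columns[-1])  # parse the appended key-value pairs
    values = [str(annotation.get(key, ".")) for key in keys]
    return "\t".join(columns + values)


if __name__ == "__main__":
    # Hypothetical annotated line: chromosome, position, then the annotation column.
    sample = "17\t100\t" + json.dumps({"gene": "GENE1"})
    print(extract_keys(sample, ["gene", "missing"]))
```

Keeping the original columns and appending the extracted values mirrors the append-to-the-line behavior described above, so the output can be passed to a further processing step.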
If you encounter trouble with any files on this page, find a bug with the BioR code, or have a feature request, please use this form:
Documentation
BioR User Guide (see section on Installation instructions)
(Editable GoogleDoc – note: this may not work well in Internet Explorer)
Downloads – Toolkit
(For descriptions of changes, see Release Notes above)
BioR Toolkit (for Linux and Mac OS X)
- v5.1.4
- v5.1.0
- v5.0.0
- v4.1.2
- v2.4.2
- Old Versions (contain a critical tabix bug and should NOT be used unless necessary)
Source Code – Java
BioR Toolkit (main project)
Pipes (sub project to handle unix-style pipes in Java) – 2.2.4 (beta), 2.2.3, 2.1.1
BioR Catalog (to create BioR catalogs from source data) – 2.2.2 (beta), 2.2.1, 2.1.1
Maven jars needed to build the projects
(extract this to your user home directory – this will create a “.m2” directory in your home directory)
BioR-Supported Catalogs (tar-gzip files)
Small version of the catalogs – Click here if you would like a small version of the entire catalog list (Note: some of the large catalogs are restricted to chromosome 17 (see list of which ones))
- Partial catalog: All catalogs are required for the bior_annotate command to run, but to get up and running quickly, you may try a sample version of some of the larger catalogs that includes chromosome 17 variants only.
- Additional catalogs: These are catalogs added by users of the toolkit, and are not guaranteed to be updated or maintained in the future. They are provided as-is.
Data Source | Version | Build | Size | Partial catalog | Description |
---|---|---|---|---|---|
Reference Assemblies | N/A | GRCh37, GRCh38 | 1.6GB | | Reference assemblies containing 100kbp sequences, with tabix indexes, plus chromosome sizes and sort orders |
dbSNP | 137 | GRCh37 | 3.1GB | (Sample – contains chr17 variants only) (90MB) | The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms |
1000 Genomes | 2010-11-23 | GRCh37 | 2.2GB | (Sample – contains chr17 variants only) (60MB) | The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied, by sequencing the genomes of a large number of people |
HapMap | 2010-08 Phase II+III | GRCh37 | 682MB | (Sample – contains chr17 variants only) (17MB) | The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation |
UCSC (partial catalog – see below for full) | hg19 | GRCh37 | 548MB | | The UCSC Genome Browser provides a rapid and reliable display of any requested portion of genomes at any scale, together with dozens of aligned annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG islands, assembly gaps and coverage, chromosomal bands, mouse homologies, and more) |
ESP | build 37 | GRCh37 | 192MB | | The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders |
COSMIC | v63 | GRCh37 | 149MB | | Catalogue Of Somatic Mutations In Cancer is designed to store and display somatic mutation information and related details and contains information relating to human cancers |
HGNC | 2012-08-12 | GRCh37 | 7.0MB | | The HUGO Gene Nomenclature Committee (HGNC) has assigned unique gene symbols and names to over 34,000 human loci, of which around 19,000 are protein coding. HGNC is a curated online repository of HGNC-approved gene nomenclature and associated resources including links to genomic, proteomic and phenotypic information, as well as dedicated gene family pages |
BGI | hg19 | GRCh37 | 5.4MB | | BGI and Danish researchers have sequenced 200 human exomes of European ancestry from Denmark with an average of 12-fold coverage depth per sample to discover new, low-frequency variants by aggregating data from all 200 individuals |
NCBIGene | 2013-04-08 | GRCh37, patch 10 | 5.2MB | | Integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide |
OMIM | 2013-02-27 | GRCh37 | 1.4MB | | Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. It contains all known mendelian disorders and over 12,000 genes, and focuses on the relationship between phenotype and genotype. |
miRBase | release 19 | GRCh37 | 59KB | | The miRBase database is a searchable database of published miRNA sequences and annotation. Each entry in the miRBase Sequence database represents a predicted hairpin portion of a miRNA transcript, with information on the location and sequence of the mature miRNA sequence (termed miR). |
Additional Catalogs
Data Source | Version | Build | Size | Partial catalog | Description |
---|---|---|---|---|---|
UCSC (full catalogs) | hg19 | GRCh37 | 135GB | | (see description above) NOTE: Due to the size (135GB) and number (8344) of these files, please contact the BioR team using the form at the top of this page |
DrugBank | Release 3.0 (2011-01) | | 43MB | | DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information |
PharmGKB | 2013-07-11 | | 4.6MB | | The PharmGKB is a pharmacogenomics knowledge resource that encompasses clinical information including dosing guidelines and drug labels, potentially clinically actionable gene-drug associations and genotype-phenotype relationships. PharmGKB collects, curates and disseminates knowledge about the impact of human genetic variation on drug responses |
Therapeutic Targets Database (TTD) | 4.3.02 | | 1.2MB | | Therapeutic Target Database (TTD) is a database to provide information about the known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs directed at each of these targets. |
UniProt | 2013-07-10 | | 2.1GB | | UniProt provides the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. |
KEGG (coming soon) | 2013-07-11 | | 127KB | | KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY database records networks of molecular interactions in the cells, and variations specific to particular organisms |
Page last modified: November 17, 2023