Ensembl 2017

Ensembl (www.ensembl.org) is a database and genome browser for enabling research on vertebrate genomes. We import, analyse, curate and integrate a diverse collection of large-scale reference data to create a more comprehensive view of genome biology than would be possible from any individual dataset. Our extensive data resources include evidence-based gene and regulatory region annotation, genome variation and gene trees. An accompanying suite of tools, infrastructure and programmatic access methods ensure uniform data analysis and distribution for all supported species. Together, these provide a comprehensive solution for large-scale and targeted genomics applications alike. Among many other developments over the past year, we have improved our resources for gene regulation and comparative genomics, and added CRISPR/Cas9 target sites. We released new browser functionality and tools, including improved filtering and prioritization of genome variation, Manhattan plot visualization for linkage disequilibrium and eQTL data, and an ontology search for phenotypes, traits and disease. We have also enhanced data discovery and access with a track hub registry and a selection of new REST end points. All Ensembl data are freely released to the scientific community and our source code is available via the open source Apache 2.0 license.


INTRODUCTION
Over the past several years, large-scale genomics consortia have come together to address key biological questions by creating datasets of sufficient size and scope that they become widely used references. These efforts include the 1000 Genomes Project (1), ENCODE (2), the Genotype-Tissue Expression (GTEx) project (3), the Exome Aggregation Consortium (ExAC) (4), the Mouse Genomes Project (5) and the various component projects of the International Human Epigenome Consortium (IHEC). The data and results from these projects have created a strong foundation on which genomics research can build.
The Ensembl project was originally founded to annotate the human genome and has grown into a central hub of genomic information. When a new genome assembly is included in Ensembl, we integrate diverse data to produce a collection of Ensembl resources for gene annotation (6), genome variation (7), gene regulation (8) and comparative genomics (9).
We also develop and distribute a suite of databases, tools (10,11), APIs (12,13) and web interfaces (14) for generating, querying and distributing these data and in doing so we ensure consistent data analysis and access for all of our species.
D636 Nucleic Acids Research, 2017, Vol. 45, Database issue
The outputs of the large-scale projects listed above are important components within the overall collection of Ensembl resources. By integrating all of these genomics data resources into a coherent informatics infrastructure we enable further research by simplifying and standardizing the methods for data access and visualization. We also help make these data resources easily accessible to a wide variety of researchers.
Our data and software are updated at regular intervals following a formal release process that ensures data and software provenance tracking via an Ensembl release number. Ensembl release data are archived and can be reliably retrieved into the future. In addition, the release process ensures that data are synchronized across all of Ensembl. For example, updates to the human gene set will trigger updates to the orthologs for all species.
We collaborate with other informatics resources and tools including the Genome Reference Consortium (GRC) (15), the UCSC Genome Browser (16), UniProt (17), model organism databases (18,19) and relevant resources at the NCBI (20) to coordinate data presentation and standards.
We use and support ontological and other standard formats for our data and have worked directly with the Sequence Ontology (SO) to address gaps in the current representations (21). Increasingly, these efforts are taking place in the context of the Global Alliance for Genomics and Health (GA4GH), which works to create interoperable approaches to facilitate genomic data sharing (22). For example, in the past year, we have developed GA4GH-compliant services that offer Ensembl data. These new Ensembl REST endpoints return sequence features, genotype calls, variant annotation, lists of reference sequences and associated metadata in standard GA4GH formats.
In this report we highlight new data and tools for human genome interpretation, with an emphasis on new resources for gene regulation and population genomics. We describe new and updated data for other species, and the accompanying tools and methods for searching, browsing, downloading and analyzing these new features.

Transcriptional regulation
This year, we significantly expanded our catalog of human cell types with evidence-based annotated regulatory elements, which are now available for 68 cell types and tissues as of Ensembl release 86 (October 2016). The increase is largely based on datasets from the IHEC member projects BLUEPRINT (23) and Roadmap Epigenomics (24), which were uniformly annotated using the Ensembl Regulatory Build methodology (25). This process results in a defined location and predicted function for each regulatory element and, for each available cell type or tissue, an activity status such as 'active', 'poised', 'repressed' or 'inactive'. As a result, we now cover a considerable fraction of the epigenomes thus far generated by ENCODE and IHEC, and we will increase our regulatory annotations as more data become available.
We have also recently incorporated expression quantitative trait loci (eQTL) data from GTEx to provide unfiltered SNP-to-gene correlation statistics from 44 tissues (3). This rich dataset can be viewed on our website (Figure 1) and accessed through our REST API, facilitating advanced post-Genome Wide Association Studies (GWAS) functional analysis without the overhead of handling the associated large data files.

Gene annotation and transcript haplotypes
Ensembl's primary gene annotation on the latest human reference assembly, GRCh38, is GENCODE. It was updated regularly over the past year, to include manually annotated transcripts and new gene models on the alternate sequence regions defined by the GRC (26). GENCODE remains the most comprehensive human gene set (27-29) and this year's updates have also included our presentation of supporting analyses including APPRIS (30), Transcript Support Levels, and the GENCODE Basic set which can be used to identify a subset of the GENCODE transcripts suitable for most applications.
For each GENCODE transcript, we have also calculated the list of observed haplotypes in the 1000 Genomes Project phase 3 data and present these as a series of alterations from the reference sequence for the transcript's coding sequence and protein product. We also provide haplotype frequencies, by population, for each transcript via our new Transcript Haplotype view. To enable further analysis, alignments of the individual haplotypes against the reference assembly are available and the entire set of sequences and metadata can be downloaded in JSON format.

Discovery, prioritization and annotation of sequence variants
We now identify small-scale variants (such as insertions and deletions) as 'equivalent' on our Variant page when they lead to the same alteration to the reference assembly. Equivalent variants can receive separate accession numbers and nominal genomic mappings in databases such as dbSNP when they occur within lower complexity sequence such as dinucleotide repeats. Identifying these variants is particularly useful when one of them carries frequency information that also applies to the others. For example, rs397714540 has no associated frequency data whereas the equivalent variant rs36021200 does have such data from the 1000 Genomes Project.
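To make the notion of equivalence concrete, the following minimal sketch applies two differently positioned deletions to a toy reference containing a dinucleotide repeat and checks that both produce the same altered sequence. The reference string, coordinates and function names are hypothetical and chosen only for illustration; this is not the procedure Ensembl uses to detect equivalent variants.

```python
def apply_variant(reference, start, ref_allele, alt_allele):
    """Return the reference with ref_allele at 0-based `start` replaced by alt_allele."""
    end = start + len(ref_allele)
    if reference[start:end] != ref_allele:
        raise ValueError("reference allele does not match the sequence at this position")
    return reference[:start] + alt_allele + reference[end:]


def are_equivalent(reference, variant_a, variant_b):
    """Two variants are treated as equivalent if they yield the same altered sequence."""
    return apply_variant(reference, *variant_a) == apply_variant(reference, *variant_b)


# Toy reference with an (AC)4 dinucleotide repeat (hypothetical data).
reference = "GGACACACACTT"

# Deleting one "AC" unit can be described at several positions in the repeat.
deletion_left = (2, "AC", "")
deletion_right = (6, "AC", "")

# A deletion outside the repeat changes the sequence differently.
deletion_other = (10, "T", "")

same = are_equivalent(reference, deletion_left, deletion_right)
different = are_equivalent(reference, deletion_left, deletion_other)
```

Here `same` is true because both deletions remove one repeat unit, whereas `different` is false.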
To aid prioritizing of variants within a gene, enhanced filtering and sorting is now available for the variant tables on our web site. The new tables can manage many hundreds of thousands of rows, and can be customized to display only variants with a range of SIFT (31) or PolyPhen (32) scores; those with particular consequence types or minor allele frequencies; or other properties.
To facilitate data discovery and querying across our various input sources for phenotype, trait and disease annotations, including ClinVar (33), OMIM Morbid (34), the GWAS Catalog (35) and Orphanet (36), we now map their descriptions onto the Experimental Factor Ontology (EFO) (37), Human Phenotype Ontology (HPO) (38) and Orphanet Rare Disease Ontology. This process helps to rationalize the different descriptions these resources use for similar concepts. By bringing these together, it is now possible to search Ensembl for a disease or phenotype, and to discover variants associated with its synonyms. For example, a search for 'Keratosis follicularis' will now reveal variant rs121912732, which is reported by ClinVar as pathogenic and associated with Darier disease.

Confidence scores and visualization options for homology relationships
We added two new confidence scores to the homology predictions that arise from our TreeFam phylogenetic gene trees (39), which are the basis for inferring homology relationships, including within-species and cross-species events such as gene duplication and gene loss. The first confidence score is based on coverage across all genome sequence alignments, including both pairwise and multiple sequence alignments. This score relies on the assumption that high-quality 'true' orthologs should be well aligned to each other, and it weights alignments over exons more highly than alignments over introns. The second confidence score is based on how well the local (upstream and downstream) gene order is conserved. This score is based on the observation that evolutionary genome rearrangements are likely to happen to a group of contiguous genes, thereby conserving the local gene order surrounding any one gene. Both scores are displayed in the Orthologues table available from each Gene view page. Together, they make it easier to identify high-confidence orthologs by using them alongside the existing filters, such as a threshold on the percentage of sequence identity.
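The idea behind the second score can be sketched in a few lines. The example below computes, for a candidate ortholog pair, the fraction of flanking genes in one species whose orthologs also flank the partner gene in the other species. The window of two genes on each side, the function name and the toy gene lists are assumptions made for illustration; this is a simplified stand-in and not the scoring method implemented in Ensembl.

```python
def gene_order_conservation(genes_a, genes_b, orthologs, gene_a, gene_b, flank=2):
    """Fraction of gene_a's neighbours whose ortholog is also a neighbour of gene_b.

    genes_a, genes_b: gene identifiers in chromosomal order for each species.
    orthologs: mapping from species-A gene identifiers to species-B identifiers.
    flank: number of upstream and downstream neighbours considered (assumed value).
    """
    idx_a = genes_a.index(gene_a)
    idx_b = genes_b.index(gene_b)
    neighbours_a = genes_a[max(0, idx_a - flank):idx_a] + genes_a[idx_a + 1:idx_a + 1 + flank]
    neighbours_b = set(
        genes_b[max(0, idx_b - flank):idx_b] + genes_b[idx_b + 1:idx_b + 1 + flank]
    )
    if not neighbours_a:
        return 0.0
    conserved = sum(1 for gene in neighbours_a if orthologs.get(gene) in neighbours_b)
    return conserved / len(neighbours_a)


# Toy data (hypothetical identifiers).
species_a = ["a1", "a2", "a3", "a4", "a5"]
species_b_collinear = ["b1", "b2", "b3", "b4", "b5"]
species_b_rearranged = ["b1", "b2", "b3", "x1", "x2"]
ortholog_map = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}

score_collinear = gene_order_conservation(
    species_a, species_b_collinear, ortholog_map, "a3", "b3")
score_rearranged = gene_order_conservation(
    species_a, species_b_rearranged, ortholog_map, "a3", "b3")
```

A fully collinear neighbourhood scores 1.0, while the rearranged example keeps only the two upstream neighbours and scores 0.5.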
To explore the protein sequence alignments supporting our gene trees, the GeneTree view (also available from each Gene view page) now provides a link to the Wasabi interactive alignment visualization tool (40).

Protein family classification
To quickly and accurately infer the function of genes in newly sequenced genomes, we have created a new Hidden Markov Model (HMM) library for matching new protein sequences to existing, well-studied proteins from other species. This HMM library uses the PANTHER families as a base, is supplemented with our own data, and has been defined across all eukaryote genomes, including non-vertebrates in Ensembl Genomes. This HMM library is available for download (ftp://ftp.ensembl.org/pub/current_compara), and provides a stable and scalable means to classify new protein sequences into our protein families resource.

Mouse strain genomes
Whole genome sequencing of key laboratory mouse strains has been ongoing over the last several years (5,41). Following a transition of the Mouse Genomes Project from a resequencing to de novo assembly strategy for a core set of 16 inbred mouse strains and subspecies, these mouse assemblies now fit Ensembl's data model and were introduced in Ensembl release 86 (October 2016) (Figure 2). We have annotated repeats, CpG islands, and promoter regions on these assemblies. Gene annotation for the 16 strain assemblies is provided directly by the Mouse Genomes Project using a process of whole genome alignments, annotation projection and various filters. We aligned UniProt proteins and annotated protein features on the protein coding transcripts. We also computed rodent-specific phylogenetic trees ('gene trees') on the protein coding genes, and inferred orthologs and paralogs from them.
In contrast to the annotation for the 16 mouse strains, the gene annotation for the C57BL/6J reference mouse genome assembly, GRCm38, is produced by GENCODE. The mouse GENCODE annotation has been updated several times this year and combines the standard Ensembl gene annotation approach (6) with manual annotation directly on the reference assembly.

Updated chicken genome assembly and annotation
Our chicken resources were updated to the latest chicken assembly, Gallus_gallus-5.0 (GCA_000002315.3), in Ensembl release 86 (October 2016). In a first for any species in Ensembl, we incorporated PacBio Iso-Seq data from brain and embryo libraries to support annotation of alternate splicing. These data supplemented the standard collection of evidence used for annotation including, in this case, protein sequences, cDNA sequences, and Illumina RNA-seq data from 20 different tissues. As with all cases when we update a species to a new assembly, we propagated gene stable identifiers from the old assembly to ensure consistency across the assembly update. All comparative genomics resources for chicken were also updated including the relevant TreeFam gene trees and homology (ortholog and paralog) annotation based on the updated gene annotation and our pairwise whole genome alignments from chicken to 12 other species, including seven birds. Our sauropsid Enredo Pecan Ortheus (EPO) alignments (42,43), and our amniote Mercator Pecan multiple alignments (42,44) were fully recomputed to include the new chicken assembly.

Annotation for other species
The zebrafish and rat gene sets have both been updated to include manual annotation from HAVANA (45). We annotated additional gene models for zebrafish based on RNA-seq data taken from the embryo at six hours post-fertilization and 24 hours post-fertilization.
Annotation for rhesus macaque and mouse lemur was updated to include the latest assemblies, Mmul_8.0.1 (GCA_000772875.3) and Mmur_2.0 (GCA_000165445.2), respectively. For both primates, we annotated gene models using an improved version of our gene annotation system that produces more transcript variants per gene than the previous version. We also updated the TreeFam gene trees, homology annotation, and pairwise whole genome alignments to human as well as our primates and mammals EPO multiple alignments to include both new primate assemblies.

Variant Effect Predictor
The Ensembl Variant Effect Predictor (VEP) is a tool for annotating and prioritizing genomic variants, and relies on our comprehensive and up-to-date data (46). Significant improvements this year include speed and memory optimizations. We have also implemented powerful new filtering options for the VEP results, including support for nested filters. For example, the following filtering statement is now possible:
GMAF < 0.1 and ((Consequence is missense_variant and (SIFT is deleterious or PolyPhen is probably_damaging)) or Consequence match stop)
To better support RefSeq transcripts (47), VEP now reports information on matched regions between Ensembl and RefSeq transcripts and mismatches between RefSeq transcripts and the reference genome assembly (46).
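The logic of the nested filter statement given above can be spelled out as an ordinary boolean predicate. The sketch below is not VEP's filter implementation. It assumes that each annotated variant is available as a plain dictionary keyed by the field names used in the statement, that consequence and prediction terms are written with underscores (missense_variant, probably_damaging), and that a record without a GMAF value fails the frequency test. The example records are invented.

```python
import re


def passes_filter(record):
    """Evaluate: GMAF < 0.1 and ((Consequence is missense_variant and
    (SIFT is deleterious or PolyPhen is probably_damaging))
    or Consequence match stop)."""
    gmaf = record.get("GMAF")
    # Assumption: a missing GMAF value cannot satisfy "GMAF < 0.1".
    if gmaf is None or not gmaf < 0.1:
        return False
    consequence = record.get("Consequence", "")
    damaging_missense = consequence == "missense_variant" and (
        record.get("SIFT") == "deleterious"
        or record.get("PolyPhen") == "probably_damaging"
    )
    # "match" is read here as a pattern search, so stop_gained and stop_lost both qualify.
    stop_related = re.search("stop", consequence) is not None
    return damaging_missense or stop_related


# Invented example records.
records = [
    {"GMAF": 0.01, "Consequence": "missense_variant", "SIFT": "deleterious", "PolyPhen": "benign"},
    {"GMAF": 0.01, "Consequence": "missense_variant", "SIFT": "tolerated", "PolyPhen": "benign"},
    {"GMAF": 0.05, "Consequence": "stop_gained"},
    {"GMAF": 0.30, "Consequence": "stop_gained"},
    {"Consequence": "stop_gained"},
]

kept = [passes_filter(record) for record in records]
```

Only the first and third records pass: the second is a missense variant without a damaging prediction, the fourth is too common, and the fifth has no frequency.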
This year has seen us release new and updated plugins for the VEP, and we continue to encourage the community to submit their VEP plugins to our dedicated GitHub repository (https://github.com/Ensembl/VEP_plugins). To further promote the re-use of these plugins, we have added functionality so that VEP plugins can be run via our website or using our REST API. The offline script version has also been updated to support output of conservation scores and ExAC frequency data (4).

Population genomics
We have improved the access methods for linkage disequilibrium (LD) data by developing a faster and more robust RESTful API and Perl API method to retrieve LD values between a specific pair of SNPs. We use this method ourselves to display LD values as a Manhattan plot accessible from the Variant pages.
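For readers less familiar with the quantities involved, the sketch below computes the two commonly reported pairwise LD statistics, r² and |D'|, from phased haplotypes using their textbook definitions. It illustrates what an LD value between a pair of SNPs represents; it is not the code behind the Ensembl API methods, and the function name, the 0/1 allele coding and the toy haplotypes are assumptions.

```python
def pairwise_ld(haplotypes):
    """Return (r_squared, abs_d_prime) for two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples, one per phased
    chromosome, with alleles coded 0 (reference) or 1 (alternate).
    Returns None if either SNP is monomorphic in the sample.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # coefficient of disequilibrium
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return None
    r_squared = d * d / denominator
    # D' normalizes D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r_squared, abs(d) / d_max


# Toy haplotypes (invented): complete LD, then no association.
complete = pairwise_ld([(1, 1)] * 5 + [(0, 0)] * 5)
independent = pairwise_ld([(1, 1), (1, 0), (0, 1), (0, 0)])
```

In the first example the two alternate alleles always occur together, so both statistics equal 1; in the second all four haplotypes are equally frequent and both are 0.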
We have also migrated three tools to support genome variation analysis that were previously only available on the 1000 Genomes Browser (48). The Allele Frequency Calculator determines population-wide allele frequencies for sites within the chromosomal region defined from a VCF file and populations defined in a sample file. The VCF to PED Converter transforms a VCF file to a linkage pedigree (PED) file and a marker information file, which together may be loaded into linkage disequilibrium display tools such as Haploview (49). The Variant Pattern Finder identifies shared variation between individuals in a chromosomal region of interest. These tools use data from the 1000 Genomes Project phase 1 and 3 studies, and are currently only available on our GRCh37 archive site. All tools can be accessed via the Tools link at the top of each Ensembl page.
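The kind of calculation performed by a per-population allele frequency tool can be sketched as follows. This is a minimal illustration, not the Allele Frequency Calculator itself: the function name, the in-memory VCF lines, the sample names and the population labels are all invented, and the sketch simply counts non-reference alleles in the GT field for samples that have a population assignment.

```python
from collections import defaultdict


def allele_frequencies(vcf_lines, sample_population):
    """Return {(chrom, pos): {population: non_reference_allele_frequency}}."""
    samples = []
    result = {}
    for line in vcf_lines:
        if line.startswith("##"):
            continue                                   # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]                       # sample columns follow FORMAT
            continue
        chrom, pos = fields[0], int(fields[1])
        gt_index = fields[8].split(":").index("GT")
        alt_counts = defaultdict(int)
        totals = defaultdict(int)
        for sample, call in zip(samples, fields[9:]):
            population = sample_population.get(sample)
            if population is None:
                continue                               # sample not in the sample file
            genotype = call.split(":")[gt_index]
            for allele in genotype.replace("|", "/").split("/"):
                if allele == ".":
                    continue                           # missing call
                totals[population] += 1
                if allele != "0":
                    alt_counts[population] += 1
        result[(chrom, pos)] = {pop: alt_counts[pop] / totals[pop] for pop in totals}
    return result


# Invented example input.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\t0|0",
]
example_populations = {"S1": "POP_A", "S2": "POP_A", "S3": "POP_B"}

frequencies = allele_frequencies(example_vcf, example_populations)
```

For the single site in this toy input, three of the four POP_A alleles are non-reference and neither POP_B allele is.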

CRISPR/Cas9 target regions
The CRISPR/Cas9 system has recently inspired a new array of laboratory techniques for targeted genome editing, knock-out screens and functional assays. Short single guide RNA molecules (sgRNA) are used to lead the enzyme to precise genomic locations. However, as with PCR primers, not all regions of the genome are equally accessible and sgRNA sequences with few off-target sites are more likely to be specific in their binding. To assist experimental design, we annotated the human and mouse genomes with all possible CRISPR/Cas9 single guide RNA binding sites in a new 'WGE CRISPR sites' track on our browser's Location view (Region in Detail). Each site can be clicked separately to reveal an information window with specificity statistics produced by the Wellcome Trust Sanger Institute Genome Editing group (50).

Track Hub Registry
Ensembl has supported display of external datasets stored in track data hubs since 2013 (51) and has watched them develop into a popular method for many projects to organize, share and display genome-wide datasets (52). Widespread use of track hubs has made finding relevant data increasingly difficult. To address this, we have designed the Track Hub Registry (http://www.trackhubregistry.org) to catalog and search publicly accessible track hubs. Hubs can be searched and attached via the Track Hub Registry website or from a specialized search from our custom data interfaces.

File Chameleon
We have developed the File Chameleon tool to help address the perennial bioinformatics problem of ensuring that input files match the format specified by a specific software package. For example, some analysis software requires the 'chr' string at the start of a chromosome name, or will not allow genes longer than 2Mb. Pre-processing the input files is time-consuming, requires domain knowledge and could lead to errors. File Chameleon makes downloading customized versions of the files on our FTP site easy. Instead of searching our FTP site, the dataset and format requirements are provided to File Chameleon, which will then produce the correctly formatted files for download. Access to the online version of File Chameleon is at http://www.ensembl.org/Homo_sapiens/Tools/FileChameleon; it is also available as a standalone script (https://github.com/FAANG/faang-format-transcriber) so it can be run locally on any file.
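The two adjustments mentioned as examples above, adding a 'chr' prefix and excluding genes longer than 2 Mb, are the kind of pre-processing that users would otherwise script by hand. The sketch below shows one way this might be done for tab-separated GFF3 or GTF lines. It is not File Chameleon's implementation; the function names and the toy records are hypothetical.

```python
def add_chr_prefix(line):
    """Prefix the sequence name (first column) with 'chr' unless already present."""
    if line.startswith("#") or not line.strip():
        return line                                    # leave headers and blank lines alone
    seqname, separator, rest = line.partition("\t")
    if seqname.startswith("chr"):
        return line
    return "chr" + seqname + separator + rest


def drop_long_genes(lines, max_length=2_000_000):
    """Yield lines, skipping 'gene' features longer than max_length bases.

    Simplification: child features (transcripts, exons) of a skipped gene are
    not removed here; a real tool would need to handle them as well.
    """
    for line in lines:
        fields = line.split("\t")
        if not line.startswith("#") and len(fields) >= 5 and fields[2] == "gene":
            length = int(fields[4]) - int(fields[3]) + 1   # coordinates are 1-based, inclusive
            if length > max_length:
                continue
        yield line


# Toy GFF3 records (invented).
gff_lines = [
    "##gff-version 3",
    "1\tensembl\tgene\t100\t200\t.\t+\t.\tID=gene:A",
    "1\tensembl\tgene\t1000\t3000000\t.\t+\t.\tID=gene:B",
]

converted = [add_chr_prefix(line) for line in drop_long_genes(gff_lines)]
```

With this toy input, `converted` retains the header and gene A on 'chr1', and gene B is dropped because it spans almost 3 Mb.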

TRAINING, OUTREACH AND USER SUPPORT
We offer extensive in-person training (http://training.ensembl.org) as well as online courses, live webinars, YouTube tutorials (https://www.youtube.com/user/EnsemblHelpdesk) and static text-based courses. This year saw the first iteration of our live online course (http://www.ebi.ac.uk/training/online/course/ensembl-browser-webinar-series-2016/), consisting of a series of seven live webinars on using the Ensembl website, with accompanying exercises and catch-up videos on the EBI's Train Online platform.
Queries about hosting Ensembl workshops and any other questions about Ensembl can be directed to our helpdesk (helpdesk@ensembl.org). We can also be contacted informally via social media platforms, including Twitter (@ensembl) and Facebook (Ensembl.org). Our blog posts include detailed descriptions of every Ensembl release and other information (http://www.ensembl.info).

CONCLUSION
Ensembl is a central hub of genomic data that creates and presents high-quality reference datasets in a consistent, accessible infrastructure. Among other updates, over the past year we have expanded our human genome resource with extensive regulatory data and major external datasets and included 16 new mouse strain assemblies. In response to increasing data size and complexity, we expanded our tools and methods for searching, filtering and prioritizing data. New and updated genomes, annotation, datasets and tools are part of every Ensembl release. We believe these efforts will ensure that Ensembl remains a valuable source of data and tools for interpreting biology on assembled genome sequences.