EmAtlas: a comprehensive atlas for exploring spatiotemporal activation in mammalian embryogenesis

Abstract The growing importance of embryonic development research is rapidly increasing the demand for a dedicated multi-omics data resource. However, the lack of a global embryogenesis repository and systematic analysis tools limits progress in stem cell research, human congenital diseases and assisted reproduction. Here, we developed EmAtlas, which collects the most comprehensive multi-omics data and provides multi-scale tools to explore spatiotemporal activation during mammalian embryogenesis. EmAtlas contains data on multiple types of gene expression, chromatin accessibility, DNA methylation, nucleosome occupancy, histone modifications, and transcription factors, displaying the complete spatiotemporal landscape in mouse and human across several time points, spanning gametogenesis, preimplantation, and even the fetal and neonatal stages, with each tissue comprising various cell types. To characterize signatures at the tissue, cell, genome, gene and protein levels during mammalian embryogenesis, analysis tools on these five scales were developed. Additionally, we proposed EmRanger to deliver extensive development-related biological background annotations. Through the user-friendly interface, users can apply these tools to analyze, browse, visualize, and download data. EmAtlas is freely accessible at http://bioinfor.imu.edu.cn/ematlas.


For transcriptome data
The filtered RNA-seq reads were aligned to the GRCh37 human and GRCm38 mouse reference genomes (NCBI), respectively, using STAR (v2.7.10a) [1]. Samtools (v1.15.1) [2] was used to convert SAM files into binary BAM files. The BAM files of each sample were then used to quantify expression levels with Salmon (v1.9.0) [3]. The individual quantification files were merged and normalized with a custom script. Gene identifier conversion was performed based on the gene annotation information of this study, including MGI, HGNC, UniProt, Ensembl, Entrez, etc.
For bulk RNA-seq datasets, differential expression analysis was performed with the R package DESeq2 [5]. For each comparison, genes with P-value < 0.05 and Log2FC > 1 were regarded as differentially expressed genes (DEGs) [6]. All samples in each analysis task undergo dimension reduction and clustering analysis. Finally, successful analysis results are imported into the resource database, and error information is written to log files.
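The DEG thresholds stated above can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the `GeneResult` record, the `select_degs` helper and the example values are hypothetical, and the input is assumed to be a table of per-gene log2 fold changes and P-values such as the one DESeq2 produces.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    """One row of a differential expression table (hypothetical layout)."""
    gene: str
    log2fc: float
    pvalue: float


def select_degs(results, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return gene names with P-value < p_cutoff and log2FC > lfc_cutoff.

    The cutoffs mirror the thresholds stated in the text. Rows whose
    P-value is NaN fail the comparison and are therefore excluded.
    """
    return [
        r.gene
        for r in results
        if r.pvalue < p_cutoff and r.log2fc > lfc_cutoff
    ]


# Hypothetical values, for illustration only.
example = [
    GeneResult("GeneA", 2.3, 0.001),   # passes both thresholds
    GeneResult("GeneB", 0.4, 0.020),   # fold change too small
    GeneResult("GeneC", 1.8, 0.200),   # P-value too large
    GeneResult("GeneD", -2.1, 0.003),  # negative fold change fails log2FC > 1
]
degs = select_degs(example)
```

The sketch applies the criterion literally as written (Log2FC > 1), so strongly down-regulated genes such as the hypothetical `GeneD` are not selected.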
For single-cell RNA-seq (scRNA-seq) datasets, scanpy [7] was used to conduct differential expression analysis. This strategy detects highly variable genes (HVGs) across different tissue types, cell clusters and unsupervised clusters. To detect different cell clusters, principal component analysis (PCA) was first used to reduce the dimension, and the resulting principal components were taken as input to UMAP (uniform manifold approximation and projection) and t-SNE (t-distributed stochastic neighbor embedding) analysis. This dimension reduction information is also stored in EmAtlas. To accurately identify biomarker genes in mammalian development, candidate genes were annotated from three sources: differential genes captured by bulk RNA-seq, HVGs detected by scRNA-seq, and known biomarkers obtained by manual curation of PubMed (https://pubmed.ncbi.nlm.nih.gov/) literature (Review Genes).
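The three-source annotation of candidate genes can be sketched as a simple set-merging step. This is only an illustration under assumptions: the function name `annotate_candidates`, the source labels and the gene names are hypothetical and do not come from the EmAtlas code base.

```python
def annotate_candidates(bulk_degs, sc_hvgs, review_genes):
    """Map each candidate gene to the list of evidence sources supporting it.

    Labels are appended in a fixed order (bulk, single-cell, literature),
    so the per-gene list is deterministic.
    """
    sources = {
        "bulk_DEG": set(bulk_degs),        # differential genes from bulk RNA-seq
        "scRNA_HVG": set(sc_hvgs),         # highly variable genes from scRNA-seq
        "review_gene": set(review_genes),  # manually curated literature genes
    }
    annotations = {}
    for label, genes in sources.items():
        for gene in genes:
            annotations.setdefault(gene, []).append(label)
    return annotations


# Hypothetical gene lists, for illustration only.
annotations = annotate_candidates(
    bulk_degs=["GeneA", "GeneB"],
    sc_hvgs=["GeneB", "GeneC"],
    review_genes=["GeneB", "GeneD"],
)
# Genes supported by all three sources.
supported_by_all = sorted(g for g, s in annotations.items() if len(s) == 3)
```

Keeping the source labels per gene, rather than a single merged list, lets a user later filter candidates by how many independent lines of evidence support them.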

For chromatin accessibility, histone modification and transcription factor binding data
This pipeline is applicable to ATAC-seq data of chromatin accessibility and to ChIP-seq data of histone modifications (HMs) and transcription factor (TF) binding.
After the peak signal is normalized, the pipeline calculates three evaluation indicators of the epigenetic signals in different genomic regions (such as the promoter, untranslated region (UTR), coding sequence (CDS), first exon, etc.). These three indicators are the modification signal area, the single-base modification value, and the modification coverage percentage.
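The text does not give formulas for the three indicators, so the following Python sketch rests on assumed definitions: signal area is the signal value summed over every overlapping base, the single-base value is that area divided by the region length, and the coverage percentage is the share of region bases with a positive signal. The function name `region_indicators` and the bedGraph-like input (half-open, non-overlapping `(start, end, value)` intervals) are likewise hypothetical.

```python
def region_indicators(intervals, region_start, region_end):
    """Summarize a normalized signal track over one genomic region.

    intervals: iterable of (start, end, value) tuples with half-open
    coordinates, assumed not to overlap one another.
    The indicator definitions below are assumptions, not the paper's own.
    """
    length = region_end - region_start
    if length <= 0:
        raise ValueError("region must have positive length")

    area = 0.0    # signal summed over all overlapping bases
    covered = 0   # number of region bases carrying a positive signal
    for start, end, value in intervals:
        overlap = min(end, region_end) - max(start, region_start)
        if overlap > 0 and value > 0:
            area += value * overlap
            covered += overlap

    return {
        "signal_area": area,
        "per_base_value": area / length,
        "coverage_percent": 100.0 * covered / length,
    }


# Hypothetical promoter region [100, 200) and signal intervals.
track = [(90, 120, 2.0), (150, 180, 1.0), (250, 300, 5.0)]
promoter = region_indicators(track, 100, 200)
```

In this example the first interval contributes 20 bases at value 2.0 and the second contributes 30 bases at value 1.0, while the third lies outside the region and is ignored.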

For DNA methylation data
This pipeline is applicable to bisulfite sequencing (BS-seq) data of DNA methylation and to other omics data generated by similar sequencing technologies.
The filtered reads were aligned to the GRCh37 human and GRCm38 mouse reference genomes (NCBI) using Bismark (v0.23.0) [12]. The SAM files obtained from the alignment step were converted into BAM files with Samtools (v1.15.1). The BAM files were then converted into BigWig files using Bedtools (v2.30.3) [10] and the wigToBigWig tool [11]. Finally, normalization and epigenetic signal calculation followed the same process as above, using custom scripts.

Figure S1. The workflow of the multi-omics analysis management system based on Nextflow.