The on-premise data sharing infrastructure e!DAL: Foster FAIR data for faster data acquisition

Abstract Background The FAIR data principle as a commitment to support long-term research data management is widely accepted in the scientific community. Although the ELIXIR Core Data Resources and other established infrastructures provide comprehensive and long-term stable services and platforms for FAIR data management, a large quantity of research data is still hidden or at risk of getting lost. Currently, high-throughput plant genomics and phenomics technologies are producing research data in abundance, the storage of which is not covered by established core databases. This concerns the data volume, e.g., time series of images or high-resolution hyper-spectral data; the quality of data formatting and annotation, e.g., with regard to structure and annotation specifications of core databases; uncovered data domains; or organizational constraints prohibiting primary data storage outside institutional boundaries. Results To share these potentially dark data in a FAIR way and master these challenges, the ELIXIR Germany/de.NBI service Plant Genomics and Phenomics Research Data Repository (PGP) implements a “bring the infrastructure to the data” approach, which allows research data to be kept in place and wrapped in a FAIR-aware software infrastructure. This article presents new features of the e!DAL infrastructure software and the PGP repository as a best practice on how to easily set up FAIR-compliant and intuitive research data services. Furthermore, the integration of the ELIXIR Authentication and Authorization Infrastructure (AAI) and data discovery services are introduced as means to lower technical barriers and to increase the visibility of research data. Conclusion The e!DAL software has matured into a powerful and FAIR-compliant infrastructure, while keeping the focus on flexible setup and integration into existing infrastructures and into the daily research process.



Introduction
The FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, drafted by the FORCE11 workgroup in 2015 [1] and published in 2016 by Wilkinson et al. [2], are widely accepted and increasingly adopted in research data management policies. The scientific community is showing a rising awareness of the scientific value of reusable research data. This has already resulted in the FAIR principles being formally accepted in several data management guidelines, e.g. in the Horizon 2020 program [3] of the European Commission, and integrated into research funding policy [4,5]. Their technical implementation is supported by data repositories, which store and share research data in a FAIR manner. These can be classified into (i) general purpose data repositories, e.g. figshare [6], Zenodo [7], Dryad [8] and FAIRDOM [9], (ii) core data deposition databases, i.e. ELIXIR deposition databases for life science data [10] and NCBI database resources [11], and (iii) specific databases and repositories hosted by research institutes. All have in common that the research data has to be transferred by its owner from the place of data generation to these repositories. This involves considerable effort for data compilation, cleansing, homogenisation, metadata enrichment, formatting and upload. As a result, the published datasets are condensed and generally limited to insufficiently documented supplementary material for publications in scientific journals. Where data are to be submitted to database systems, e.g. the EBI and NCBI core data resources, bioinformaticians are tasked and trained to meet the specific submission requirements and to support biologists. Examples are the preparation of data for submission to the EBI ENA archive [12,13], the European Variation Archive (EVA) [14] or the preparation of ISA-TAB compatible data submissions for plant phenotyping data [15,16]. Alternatively, institutes could set up project-related data repositories.
This in turn requires skilled technicians and computer scientists as well as long-term access to appropriate network and storage infrastructure. Such repositories frequently have a short lifetime, whether due to staff fluctuation, long-term maintenance costs or resource consumption. Another reason may be that the repository's niche is too specific to attract substantial data volume, which in turn strongly depends on policies and cost-benefit considerations.
Thus, there is a need for an additional class of repositories that supports data sharing for this class of research data by moving the infrastructure to the data. The concept is to apply an on-premise, infrastructure-to-the-data (I2D) principle. The basic idea of the I2D approach is shown in Figure 1. Conventional data publication pipelines to journal-accepted databases usually involve a time-consuming data upload to an external platform and possibly additional costs depending on the required storage space. In contrast, the underlying e!DAL software [17] encapsulates an existing storage infrastructure in a data publication layer. This layer is a broker to the DataCite [18] data publication service agent and provides an API and a tooling infrastructure for data submission, DOI delivery, reporting and data quality reviewing. This finally enables the assignment of DOIs, with a minimal set of technical metadata based on DublinCore, to data stored in-house, and their approved FAIR referencing by journals or data lookup services.
As a proof of concept, the Plant Genomics and Phenomics Data Repository (PGP) was implemented [19] to publish digital plant genetic resources (PGR) [20] according to the FAIR principles. PGRs are the basis of food security and comprise the diversity of seeds and planting materials of modern cultivars and crop wild relatives [21]. Approximately seven million PGR accessions are conserved in genebank collections worldwide. The valorisation of PGRs through genotyping and phenotyping is of special focus in the public and private sectors [22,23]. The data management of digital PGRs has been identified as one of the most important challenges for a long-term strategy to enhance the productivity, sustainability and resilience of crop varieties and agricultural systems. In contrast to successful studies on genomics-assisted genebank management and the utilization of germplasm collections [22], the special focus of the PGP repository is the publication of buckets of research data that do not fit into general purpose sharing platforms or core data deposition databases due to their volume, objective, structure or incomplete analysis. Examples are primary data from imaging, field phenotyping, SNP matrices, 3D plant models, metabolite screenings and environmental sensor data. The experience gained during the four-year operation of the repository has led to a growing acceptance of this approach for the publication of digital PGRs collected in the context of the German Federal ex situ Genebank of Agricultural and Horticultural Crop Species [24]. This experience, together with the addition of PGP to the list of services in the European life-sciences Infrastructure for biological Information (ELIXIR) [25], resulted in novel features, which were implemented with the aim of further improving its acceptance and enabling increased sharing of digital PGRs. After an update on the state of the art, the new features of the e!DAL data sharing software and its application for the publication of digital PGRs are explained.

Related Work
Just as there are many different data types from several domains, there is also a variety of domain-specific archives and information systems. Most of them have evolved over many years and are widely accepted by the research community [26], e.g. ENA for genomic data [27], UniProt for protein data [28], PRIDE for proteome data [29], BioModels for systems biology data [30] and many more. As a guideline, research journals and other publishers require the sustainable publication of data according to FAIR criteria. For this purpose, the use of established domain-specific databases or of data repositories with a long-term commitment is recommended. To avoid getting lost in the diversity of archives, there are several registries like re3data.org or FAIRsharing.org, as well as consortia like GFBio, which collect and categorize repositories to help researchers find suitable storage for their data.
Infrastructure programs like the European Open Science Cloud (EOSC) [4] and the European life-sciences Infrastructure for biological Information (ELIXIR) [31] coordinate the maintenance and interoperability of research data repositories as federated services run by member organisations and hosting institutions. Furthermore, the ELIXIR organisation aims to establish a stable and sustainable infrastructure for biological information. In doing so, it defines important core resources and deposition databases, like BRENDA [32] or SILVA [33], in support of the research community [10]. Most of these systems accept only very specific datasets and require specialised metadata based on schemes that have been improved by the community over years. Unfortunately, there are several, mostly relatively new data types, e.g. plant phenotypic data, which currently do not fit into existing databases, mainly because of their strong heterogeneity and high volume. Public data sharing services like figshare or DRYAD provide an alternative solution for publishing these datasets. They are easy to use and offer comprehensive functionality, such as version control and the assignment of persistent identifiers. One important deficiency of such services is the limited free space, which is usually enough for sharing some reduced graphics or aggregated tables, but not for storing large datasets. Establishing and configuring an in-house infrastructure based on existing software packages like CKAN or Dataverse could overcome this shortcoming, but this requires considerable technical prerequisites and know-how.

Infrastructure
To lower the technical barriers and minimize the effort for scientists to archive and share their research data, we developed the generic e!DAL software infrastructure [17]. The usual "Data Publication as-a-Service" procedure includes the transfer of selected datasets to external databases and storage infrastructures after data generation and analysis. In this way research data can be referenced in a future research publication, as shown on the left side of Figure 1 (A). In contrast, the e!DAL infrastructure provides a "Data Publication on-Premise" approach, which enables the publication of locally stored, high-volume research data through the assignment of widely accepted and long-term stable Digital Object Identifiers (DOIs). This is illustrated on the right side of Figure 1 (B). Using DOIs for referencing provides multiple advantages for sharing and accessing research data. Besides being added as supplements to a research article, such datasets can also be the basis for a comprehensive data paper [34]. Furthermore, the well-connected infrastructure of the DataCite consortium strongly increases the visibility of research data that have been assigned a DOI. Such a dataset is automatically linked with the ORCID accounts of its authors, can be found via the DataCite Search and other common search engines, and can be harvested via an OAI-PMH interface.
The PGP repository was implemented [19] as a powerful infrastructure for the publication of comprehensive plant genomics and phenomics research data. The repository covers in particular cross-domain datasets, which are not being published in public repositories for reasons of data volume or data domain, such as phenotyping images, genotyping data, visualizations of morphological models, data from mass spectrometry as well as software and related documents. PGP currently provides 200 data records, which can be referenced via DOIs and are annotated with technical metadata. These records comprise more than 1.4 million files with an overall volume of over 2.6 terabytes (see Figure 3).
To ensure data discoverability, PGP provides landing pages with JSON-LD formatted metadata and is therefore discoverable through web crawler services that follow the schema.org recommendations, such as Google, Microsoft, Yandex etc. Furthermore, e!DAL implements the Protocol for Metadata Harvesting of the Open Archives Initiative (OAI-PMH). To support scientists in disseminating their research data, the PGP infrastructure is accepted as an institutional repository by the journals Scientific Data (Nature Publishing Group) and GigaScience (Oxford Academic) and is registered in re3data.org, FAIRsharing.org, OpenAIRE and DataCite.
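To illustrate how a harvester addresses an OAI-PMH interface, the following Python sketch builds a request URL for the standard `ListRecords` verb with the mandatory `oai_dc` metadata format. It is not part of e!DAL (which is written in Java); the base URL and the helper name `build_oai_request` are hypothetical placeholders and must be replaced by the endpoint of an actual repository.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with the OAI-PMH endpoint of a real repository.
BASE_URL = "https://repository.example.org/oai"


def build_oai_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# ListRecords with the Dublin Core format that every OAI-PMH server must offer.
list_records_url = build_oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(list_records_url)
```

A harvester would fetch this URL, parse the XML response and follow any resumption token to page through the remaining records.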
The benefits of this wide support of data discovery technologies, and of data publication in general, are demonstrated by the steadily increasing number of dataset accesses. By June 2020, PGP had delivered 300 terabytes of data and the provided datasets had been accessed by 100,000 unique clients.

Improvements
This section summarizes the main enhancements and updates of the e!DAL infrastructure, which comprise new general features, comprehensive changes to several frontend components and important performance improvements. Furthermore, an extensive update prompted by the latest changes in the Java programming language and an improved build and deployment process are described.

Performance
After releasing the first productive version of the PGP repository in 2015, we received many diverse data submissions from several research domains and with very heterogeneous data files. Since then we have observed that the e!DAL infrastructure software scales very well and is able to handle millions of data files, which confirms previous calculations and performance tests [17]. However, it also became apparent that the performance sometimes decreases, e.g. when uploading comprehensive datasets with several hundred thousand small files. Since this is a very common case, e.g. for plant phenotyping datasets, an improvement of the implementation of the e!DAL infrastructure was necessary. Some major performance improvements are described below.
One important feature of e!DAL is the automatic calculation of several essential technical metadata, like the MIME type, the data volume or the checksum of every file, when storing new datasets. This is convenient because users do not need to provide this information on their own, but these computations are resource- and runtime-intensive. Therefore, the functionality to determine the technical metadata mentioned above and the procedure to transfer the actual binary data have been improved towards parallel processing of multiple files. This results in better performance, especially on today's multi-core systems. Furthermore, we optimized several settings for the streaming buffer size and the remote transfer to improve memory usage and upload performance for the case of numerous small files. Additionally, the checksum calculation was updated to use the more collision-resistant SHA-256 algorithm instead of the older and insecure MD5 function.
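The idea of computing technical metadata for many files in parallel can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the e!DAL implementation (which is written in Java): the function names, the buffer size and the number of workers are chosen here for illustration only. A thread pool is used because file reads and `hashlib` digests on larger buffers can run concurrently.

```python
import hashlib
import mimetypes
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read buffer in bytes (an assumed value, not e!DAL's setting)


def technical_metadata(path):
    """Compute size, guessed MIME type and SHA-256 checksum of a single file."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Stream the file in chunks so that large files do not exhaust memory.
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
            size += len(chunk)
    mime_type, _ = mimetypes.guess_type(str(path))
    return {
        "path": str(path),
        "size": size,
        "mime_type": mime_type or "application/octet-stream",
        "sha256": digest.hexdigest(),
    }


def scan_directory(root, workers=4):
    """Collect technical metadata for all files below root using a thread pool."""
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as the sorted file list.
        return list(pool.map(technical_metadata, files))
```

Calling `scan_directory("/path/to/dataset")` returns one dictionary per file; for datasets with very many small files, the number of workers is the main parameter to tune.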

New Features
The previous version of the e!DAL infrastructure already fulfilled several recommendations of the FAIR data principles, such as the support of standardized metadata based on the DublinCore schema or the provision of persistent DOIs for accessing and referencing research datasets. The e!DAL infrastructure has been further updated to optimize the usability and the general user experience. Additional features were implemented to increase the visibility of published data and the acceptance of the infrastructure, which in the end also made it even more FAIR-compliant. In doing so, the roadmap for scholarly data repositories [35] was taken into account. The most important extensions are described below.

ORCID
To efficiently find and access specific research data files across millions of datasets, persistent identifiers like DOIs or URNs are very helpful and well established. Nevertheless, the research community is also quite large, and it is sometimes very difficult to distinguish data authors because of similar names or to identify the same researcher after a change of affiliation. The Open Researcher and Contributor ID (ORCID) offers an easy and persistent solution to uniquely identify authors and to solve issues with name ambiguity [36]. An important advantage is its interdisciplinarity, because ORCID is used across nearly all research domains and organizations; e.g. by mid-2019 there were already 150,000 ORCIDs registered in Germany [37]. By linking authors with publications, affiliations or funding agencies, it helps to find relationships between researchers, their work and the corresponding research data.
Since the e!DAL infrastructure is generic and suitable for different kinds of research data, the ORCID system gives us an ideal solution to identify authors and improve the collected metadata for published datasets. Furthermore, the authors and their research data gain better visibility due to the connection between the ORCID infrastructure and the infrastructure of the DataCite consortium, which handles the DOIs.
To make it possible to assign an ORCID to every data creator or contributor in the e!DAL infrastructure, the original PERSON data type [17] in the e!DAL metadata schema was extended. e!DAL uses the REST API of the ORCID registry to search for the ORCID of a given name. In addition, it can be validated whether an entered ORCID belongs to the corresponding name, to prevent accidental linking with a wrong ORCID. All these API functions were integrated into the graphical user interface of the data submission tool for the PGP repository. Furthermore, the content pages of published and DOI-linked datasets were improved to provide direct links to the ORCID profiles of the associated authors and contributors of the data.
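Independently of the registry lookup described above, an ORCID can be checked offline for typing errors, because its last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm. The following Python sketch shows this syntactic check only; it is an illustrative complement and an assumption on our part, not a description of how e!DAL validates ORCIDs, and it cannot tell whether an identifier is registered or belongs to a given name. The function names are hypothetical.

```python
def orcid_check_digit(base_digits):
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """Check length, character set and check digit of an ORCID such as 0000-0002-1825-0097."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    if not all(char in "0123456789" for char in compact[:15]):
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


print(is_well_formed_orcid("0000-0002-1825-0097"))  # sample iD used in ORCID documentation
print(is_well_formed_orcid("0000-0002-1825-0098"))  # last digit altered
```

Such a check is cheap enough to run on every keystroke in a submission form, before any request is sent to the registry.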

JSON-LD & DC meta tags
Another method of making research data interoperable as well as machine-readable is to embed the describing metadata using the JavaScript Object Notation for Linked Data (JSON-LD) format. This approach provides comprehensive possibilities to harvest and reuse research data. JSON-LD is a data serialization and exchange method that was developed to be easily embeddable into various systems for providing interoperable web services [38]. The dynamic HTML templates for the content pages of the embedded webserver of e!DAL, which provides the URLs for resolving the assigned DOIs, have been extended accordingly.

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Dataset",
      "@id": "https://doi.org/10.5447/IPK/2016/7",
      "name": "Raw images files from quantitative monitoring of ...",
      "publisher": {
        "@type": "Organization",
        "name": "IPK Gatersleben"
      },
      "description": "This dataset contains 30426 raw image files ...",
      "keywords": "high throughput plant phenotyping, growth protocol ...",
      "inLanguage": "en",
      "author": [
        {
          "@type": "Person",
          "givenName": "Astrid",
          "familyName": "Junker",
          "address": "IPK Gatersleben"
        }
      ],
      "contributor": [
        {
          "@type": "Person",
          "givenName": "Thomas",
          "familyName": "Altmann",
          "address": "IPK Gatersleben"
        }
      ]
    }
    </script>

Listing 1. Reduced example of the JSON-LD data from the content page of a DOI assigned with e!DAL, which is stored in the PGP repository

[Figure: total data volume and number of stored data files]

Listing 1 shows an example of the JSON-LD description of a dataset in the PGP repository. The attributes are based on the schema.org ontology, which is a well-established and community-driven vocabulary used to structure digital data on websites. It is used and harvested by several common search engines [39] and provides interoperability between datasets from separate resources and platforms.
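How such an embedded description could be generated from stored metadata is sketched below in Python. This is an illustration under stated assumptions rather than the Velocity-template-based Java code of e!DAL: the function names and all example values (DOI, title, names) are hypothetical, and only a subset of the attributes of the schema.org `Dataset` type is covered.

```python
import json


def dataset_json_ld(doi, title, publisher, description, authors):
    """Serialize basic dataset metadata as a schema.org Dataset in JSON-LD."""
    record = {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": "https://doi.org/" + doi,
        "name": title,
        "publisher": {"@type": "Organization", "name": publisher},
        "description": description,
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
    }
    return json.dumps(record, indent=2, ensure_ascii=False)


def as_script_tag(json_ld):
    """Wrap JSON-LD in a script element for embedding into an HTML page."""
    # Escape "</" so that metadata text cannot close the script element early.
    safe = json_ld.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + safe + "\n</script>"


# Hypothetical example values, not an existing PGP record.
snippet = as_script_tag(
    dataset_json_ld(
        doi="10.1234/example",
        title="Example dataset title",
        publisher="Example Institute",
        description="Example description of the dataset.",
        authors=[("Jane", "Doe")],
    )
)
print(snippet)
```

The resulting element can be placed anywhere in the landing page; crawlers that follow schema.org read it without rendering the page.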
An alternative to JSON-LD are so-called HTML meta tags. They are embedded in the <head> section of an HTML document and likewise make it possible to harvest the metadata and to describe connections between datasets from different infrastructures. As the metadata schema of the e!DAL infrastructure is already inspired by the DublinCore metadata schema [40], the embedded HTML templates for the content pages of published datasets were extended to provide the technical metadata of every object also as HTML meta tags (see Listing 2).

    <meta name="DC.Title" content="Screening of wild potato genetic ...">
    <meta name="DC.Identifier" content="https://doi.org/10.5447/IPK/2019/1">
    <meta name="DC.Publisher" content="e!DAL - Plant Genomics and Phenomics ...">
    <meta name="DC.Language" content="en">
    <meta name="DC.Description" content="This data set contains results of ...">
    <meta name="DC.Rights" content="CC BY-NC-SA 4.0">
    <meta name="DC.Creator" content="Bachmann-Pfabe, Silvia ...">
    <meta name="DC.Contributor" content="Dehmer, Klaus ...">
    <meta name="DC.Subject" content="Phytophthora infestans">
    <meta name="DC.Subject" content="germplasm collection">

Listing 2. Reduced example of the DublinCore meta tags from the content page of a DOI assigned with e!DAL
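The generation of such meta tags from a metadata record can be sketched in a few lines of Python. Again, this is a hedged illustration and not e!DAL code; the function name and the example record are hypothetical. The essential detail is that attribute values are HTML-escaped, so that quotation marks in a title cannot break the tag.

```python
from html import escape


def dublin_core_meta_tags(metadata):
    """Render a mapping of DublinCore elements to HTML meta tags.

    A value may be a single string or a list of strings (repeated element).
    """
    lines = []
    for element, values in metadata.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append(
                '<meta name="DC.{}" content="{}">'.format(element, escape(value, quote=True))
            )
    return "\n".join(lines)


# Hypothetical example record.
tags = dublin_core_meta_tags(
    {
        "Title": 'A "quoted" example title',
        "Language": "en",
        "Subject": ["first keyword", "second keyword"],
    }
)
print(tags)
```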

Content Negotiation
Persistent DOIs provide a solution for long-term stable resolvability and referencing of all published datasets. In addition, for several reasons such as citing the datasets or harvesting the metadata, it is necessary to provide content negotiation to serve resources in different formats. Therefore, the possibility to get different representations of the public datasets stored in an e!DAL infrastructure was implemented. It can be used via several export functions, which were added to the corresponding content pages as shown in Figure 4. They provide the option to get textual representations, citation formats like BibTeX or RIS, and linked data formats like schema.org/JSON-LD and RDF for every dataset. Because the DataCite service already provides a content negotiation feature, it was not necessary to implement a separate function for the embedded webserver of e!DAL. Instead, the HTTP handler uses the provided function for the different formats via a REST call and redirects the responses to the e!DAL infrastructure.
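From a client's perspective, content negotiation means requesting the DOI URL with an `Accept` header that names the desired format. The Python sketch below only constructs such requests and does not send them. The media types listed are, to our understanding, among those documented for DOI content negotiation by DataCite and should be verified against the current documentation before use; the mapping keys and the function name are hypothetical.

```python
from urllib.request import Request

# Assumed media types for DOI content negotiation; verify against DataCite's documentation.
FORMATS = {
    "bibtex": "application/x-bibtex",
    "ris": "application/x-research-info-systems",
    "jsonld": "application/vnd.schemaorg.ld+json",
    "rdf": "application/rdf+xml",
}


def negotiation_request(doi, fmt):
    """Build (but do not send) an HTTP request for one representation of a DOI."""
    return Request("https://doi.org/" + doi, headers={"Accept": FORMATS[fmt]})


# DOI taken from Listing 1 of this article.
request = negotiation_request("10.5447/IPK/2016/7", "bibtex")
print(request.full_url, request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would follow the redirects of the DOI resolver and return the chosen representation.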

ELIXIR AAI
The e!DAL infrastructure provides a flexible and embedded security concept based on the Java Authentication and Authorisation Service (JAAS). To provide the research data management and publication capabilities to a wide range of users from universities, research institutes or further organisations, a new login module using the ELIXIR Authentication and Authorization Infrastructure (AAI) [41] was implemented. The ELIXIR AAI was designed to provide a single sign-on service for authenticating researchers to services that are part of the ELIXIR portfolio. In doing so, it brings the large number of existing organisational identity providers from institutes associated with ELIXIR under one roof.
The new e!DAL login module follows the OAuth protocol [42] to authenticate users via the ELIXIR AAI and to automatically receive their email address, which is necessary for the communication between the data-submitting researcher and the reviewers in the embedded review process. Furthermore, the email address is used as a kind of internal ID to authenticate the user within the e!DAL security system [17]. As the first use case, the new ELIXIR AAI-based login was integrated into the PGP repository to open up the infrastructure and the data submission process to a wide range of researchers without the need to create a separate account. The ELIXIR AAI allows researchers to use their existing organisational accounts (see Figure 5), which lowers the barrier to using the infrastructure and helps to reach a larger group of data providers.
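The first step of an OAuth 2.0 authorization code flow, redirecting the user to the identity provider, can be sketched as follows. This Python example is a generic illustration and not the e!DAL login module: the endpoint URL, client identifier, redirect URI and scope value are hypothetical placeholders, and the real values have to be taken from the registration with the respective AAI.

```python
import secrets
from urllib.parse import urlencode

# Hypothetical authorization endpoint; the real one is defined by the identity provider.
AUTHORIZATION_ENDPOINT = "https://aai.example.org/oauth/authorize"


def authorization_url(client_id, redirect_uri, scope="email"):
    """Return the login URL and the random state value that guards against CSRF."""
    state = secrets.token_urlsafe(16)
    query = urlencode(
        {
            "response_type": "code",  # authorization code flow
            "client_id": client_id,
            "redirect_uri": redirect_uri,
            "scope": scope,
            "state": state,
        }
    )
    return AUTHORIZATION_ENDPOINT + "?" + query, state


login_url, expected_state = authorization_url(
    "demo-client", "https://repository.example.org/callback"
)
print(login_url)
```

After login, the provider redirects back with a `code` and the same `state`; the service compares the state, exchanges the code for a token and can then request the user's email address.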
Furthermore, the opportunity to use the ELIXIR AAI further reduced the already low effort necessary to establish additional e!DAL installations. Thus, at the end of 2018 a further e!DAL-based repository was established at the Jülich Plant Phenotyping Center (JPPC) using the ELIXIR AAI login provider.

Amended frontend
The Apache Velocity template engine is used to render all HTML-based content of the e!DAL embedded webserver, such as the landing pages of published datasets and e-mail messages. This spares the infrastructure from storing a massive number of very similar web pages and text drafts, which saves storage and provides high performance for delivering content via the HTTP handler. All web pages are provided dynamically on demand and created from only a few reusable templates. For the latest e!DAL version, all content pages and the underlying templates were fully redesigned to provide a pleasing visual look and a functional user experience. Frontend frameworks and libraries like Bootstrap and jQuery ensure that the user interface is responsive and works on modern desktop browsers as well as on mobile devices. Figure 6 shows the new layout, using a screenshot of the embedded access statistics page of the PGP repository as an example. Together with the new design for the frontend components of e!DAL, the project website was also renewed to provide comprehensive information for users and developers in a more concise manner.

Deployment and Usability
Since the last major release of the e!DAL infrastructure software a lot of optimizations and several new functionalities, which were described in the previous sections, have been implemented. Together with these improvements, changes in the general build and release process and in the usability have also been integrated. The most relevant of them are explained subsequently.

Gradle Multi-Build Project
After using the Maven build system for several years to develop and release the e!DAL software components, we switched to the Gradle build tool. Because of the constantly increasing size of the project and its source code, caused by new functionalities, several extensions and additional unit tests to guarantee high software quality, the build process using Maven took quite a long time. This made the regular release of stable versions very time-intensive. Furthermore, the build configuration became more complex and difficult to maintain. Gradle is strongly focused on a fast and specific build cycle. It supports multi-core systems to a high degree and allows, e.g., the execution of several test suites in parallel. With the change of the build infrastructure, we also decided to redesign the entire project build hierarchy and created a multi-build project for the e!DAL infrastructure. It contains the main API components including the reference implementation as well as the components for the server-client architecture, which is directly based on this core implementation. This approach massively reduces the build time, simplifies maintenance and allows a more frequent deployment of new versions. The project is now available in a new Bitbucket repository.
Nevertheless, the API is still released as an artifact in the central Maven Repository and can be integrated into other software projects using Maven or Gradle, as shown in Listing 3.

OS-specific executables
Due to the completely new development and release cycle introduced by Oracle, the Java programming environment, which is the basis for the e!DAL infrastructure, has changed a lot in recent years. In addition, the comprehensive redesign and reconstruction of the language itself, like the introduction of the new module concept or the removal of popular and formerly native APIs and frameworks like JavaFX or the Java Network Launching Protocol (JNLP), which was the basis for Java Web Start applications, were very substantial changes. This strongly influenced the e!DAL implementation, because these components were a significant part of the previous version. Unfortunately, this impeded the further development of the e!DAL infrastructure at some points, because many of the frameworks and libraries used needed several months to update their code to be compatible with the latest Java versions. With the new version 3.0.0, the e!DAL infrastructure is fully based on the Java Runtime Environment (JRE) 12. Therefore, some comprehensive changes were necessary. In order to run e!DAL with the different existing runtimes, e.g. the official runtime from Oracle, but also the alternative and widely used OpenJDK, it was necessary to integrate the JavaFX library directly into the implementation. This increases the actual size of the API package, but it makes the infrastructure much more compatible and even more independent of the system preconditions than before.
The removal of the support for the popular and well-known JNLP was also a major challenge, because the Java Web Start tool was used to give users an intuitive and platform-independent way to run the graphical data submission tool. Nevertheless, this solution also had some shortcomings, like the need for an installed and compatible Java runtime. With the recently developed jpackage, Java provides a powerful tool to package self-contained applications along with a suitable JRE. We used jpackage to create a full image of the e!DAL data submission tool together with a reduced JRE, which contains only the necessary Java modules, and to provide separate executables for the most common operating systems (Windows, Unix, macOS). This is very convenient for data submitters and again makes the infrastructure more compatible and independent of the system preconditions of the users.

Web-based submission application
In parallel to the updates prompted by the previously mentioned changes in the Java Runtime Environment and the development of the build process for the self-executable applications of the submission dialog, a new web-based application was implemented to provide an alternative way to upload research data to an e!DAL-based infrastructure. The goal was the deployment of a user-friendly web application with functionality similar to the corresponding desktop tool, but without the need to download an executable or additional plugins. The Vaadin framework for Rich Internet Applications (RIA) was used for the implementation. A screenshot of the web application is shown in the corresponding figure. By using several REST APIs, e.g. from the ORCID Registry or the ELIXIR AAI, a lightweight application could be created that provides the same functionality as the full desktop client. Furthermore, users now have the possibility to submit research data from mobile devices or other browser-compatible devices. The only small shortcoming of data submission via the web application is currently that not all browsers support the upload of comprehensive file folders. The latter is only possible if a recent version of Google Chrome or Mozilla Firefox is used. Other web browsers only allow the upload of single files.

Results
In this article, the basic 'on-premise' data management and publication concept of the e!DAL infrastructure as well as several new features and technical developments were presented. As a result, e!DAL has matured into a comprehensive and FAIR-compliant infrastructure, while always keeping the focus on simple and flexible setup and integration into existing infrastructures and into the daily research process. With the described 'bring the infrastructure to the data' approach, it differs fundamentally from generic publication platforms like figshare or DRYAD, which, depending on the storage needed, can incur considerable financial costs and time costs for transferring the data. e!DAL allows the use of available in-house storage capacities without complex requirements, additional technical infrastructure or comprehensive adaptations. All functionalities are already included, and the provided reference implementation contains the required components, such as a database and a web server. This is a crucial advantage over similar software infrastructures, like Dataverse or CKAN, and lowers the barrier to establishing a publication infrastructure even for small research institutions with limited resources and know-how. FAIR compliance is fulfilled by several e!DAL functions and components:
• Findable: By providing embedded, machine-readable metadata based on established standard formats, datasets published with e!DAL can easily be found using common search engines like Google or the DataCite Metadata Search. Because DOIs are widely established and used, the DataCite consortium is also involved in several projects and interacts with different systems like ORCID, CrossRef or Scholix. This further improves the findability of e!DAL datasets.
• Accessible: e!DAL fully supports the use of DOIs as persistent identifiers to guarantee long-term stable availability of published datasets. The DataCite resolver for DOIs allows simple access to the data and makes it easy to reference datasets, e.g. in a research article or as part of a data publication. If the storage location of the underlying data changes, the corresponding DOI remains stable, and uninterrupted access to the data is maintained by updating the resource path. The embedded web server of e!DAL ensures that every published DOI resolves to a comprehensive content page. This page allows users to navigate through the dataset and download individual files, and it also provides access to the metadata and direct links to the ORCID registry.
• Interoperable: To provide interoperable datasets and to allow the aggregation of information about the relationships between datasets from different sources, the e!DAL infrastructure supplies embedded metadata on the content page of every data object. The metadata are stored using standardized formats and vocabularies, namely JSON-LD and schema.org.
• Reusable: By collecting a standardized set of mainly technical metadata, e!DAL guarantees long-term readability and usability of all published datasets. The schema is inspired by the DublinCore metadata format and meets community-established standards. Furthermore, clear and easy license handling makes it possible to assign a suitable license, which defines by whom and how the data can be used. This information is available both on the content page of every data object and embedded in the HTML source.
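To illustrate what embedded, machine-readable metadata on a content page can look like, the following sketch builds a schema.org `Dataset` description as JSON-LD and wraps it in an HTML `script` element. It is a minimal example under stated assumptions, not the markup that e!DAL actually generates: the function names are hypothetical, the choice of properties is illustrative, and the DOI is a placeholder. The ORCID iD is the one ORCID documents for its fictitious example researcher.

```python
import json


def dataset_jsonld(doi, title, description, license_url, creators):
    """Build a schema.org Dataset description; creators is a list of (name, orcid)."""
    doi_url = f"https://doi.org/{doi}"
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": doi_url,
        "identifier": doi_url,
        "name": title,
        "description": description,
        "license": license_url,
        "creator": [
            {"@type": "Person", "name": name, "@id": f"https://orcid.org/{orcid}"}
            for name, orcid in creators
        ],
    }


def embed_in_html(metadata):
    """Serialize the metadata and wrap it in a JSON-LD script element."""
    payload = json.dumps(metadata, indent=2)
    # Escape "</" so the payload cannot close the script element early;
    # "\/" is a valid JSON escape and decodes back to "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    metadata = dataset_jsonld(
        doi="10.5072/example-dataset",  # placeholder DOI, not a real dataset
        title="Example phenotyping time series",
        description="Illustrative record only.",
        license_url="https://creativecommons.org/licenses/by/4.0/",
        creators=[("Josiah Carberry", "0000-0002-1825-0097")],
    )
    print(embed_in_html(metadata))
```

Search engines and aggregators that understand schema.org can read such a block directly from the page source, which is what links the Findable and Interoperable aspects described above.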
The e!DAL concept of exposing even dark [43] and semi-structured research data also applies to metadata. Metadata are divided into technical metadata, which are stored within e!DAL, and specific semantic metadata. This implies a trade-off between a high volume of research data that is FAIR in technical terms and the exposure of high-quality semantic metadata. We therefore advocate a two-step procedure. The first step is to share and preserve research data even without semantic metadata, since such data would otherwise tend to end up in the attic of dark data. The reason is that there is frequently a discrepancy between community-accepted policies and the resources available to put them into practice during data capture and publication. Whether semantic data annotation is regarded as a reputable task within the scientific credit system is still a matter of resources and research data management culture, and it has to be supported by research institute policies and data steward concepts. Until such a cultural change has been widely implemented in the research landscape, we aim at a minimum to expose research data even with technical metadata only. The major goal of the e!DAL development is to provide a generic and data-domain-agnostic infrastructure that can be set up and integrated easily. Therefore, the second step, semantic metadata annotation, must be anchored within the research organizations and their designated reviewers, who ensure that published datasets are within the scope of their repository and provide suitable metadata. This enables e!DAL to ensure FAIR data through technical metadata, to mint DOIs and to guarantee long-term preservation of research data. For example, the PGP repository mentioned above, which is hosted at the IPK Gatersleben, focuses among other things on plant phenotyping data. Its reviewers therefore carefully check whether submitted datasets providing phenotypic data contain MIAPPE-compliant [15] metadata.
A further challenge for scientists is choosing a suitable public and community-approved repository for sharing their data when they have no recommendation, e.g. from journal publishers or from the research data management plan defined by their project. This is addressed as part of the review process for the PGP repository. If a more appropriate repository for a submission is available, the submission is rejected and the suggested deposition alternatives are communicated to the author. Other e!DAL-based repositories may focus on other domain-related policies and evaluate submissions under different aspects, which is made possible by the generic and open concept of the e!DAL software.
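The review rules described for the PGP repository can be summarized in a small decision function. This is only a sketch of the stated policy: in practice the checks are carried out by human reviewers, and all class, field and function names below are hypothetical and not part of the e!DAL API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    ACCEPT = "accept"
    REVISE = "revise"
    REJECT = "reject"


@dataclass
class Submission:
    title: str
    has_technical_metadata: bool             # step 1: required to mint a DOI
    data_domain: str                         # e.g. "phenotyping", "genomics"
    has_miappe_metadata: bool = False        # step 2: semantic metadata check
    better_repository: Optional[str] = None  # filled in by the reviewer


def review(sub: Submission) -> Tuple[Decision, str]:
    """Apply the review rules in order and return a decision with a reason."""
    # A more appropriate repository exists: reject and suggest it to the author.
    if sub.better_repository is not None:
        return Decision.REJECT, f"Please deposit in {sub.better_repository}."
    # Technical metadata are the minimum for sharing and preserving the data.
    if not sub.has_technical_metadata:
        return Decision.REVISE, "Technical metadata are incomplete."
    # Domain-specific rule: phenotypic data need MIAPPE-compliant metadata.
    if sub.data_domain == "phenotyping" and not sub.has_miappe_metadata:
        return Decision.REVISE, "MIAPPE-compliant metadata are missing."
    return Decision.ACCEPT, "Dataset is in scope and sufficiently annotated."


if __name__ == "__main__":
    example = Submission("Barley time series", True, "phenotyping")
    decision, reason = review(example)
    print(decision.value, "-", reason)
```

Another e!DAL-based repository could keep the first two rules and replace the domain-specific one with its own policy.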

e!DAL Usage
Established in 2016, the PGP repository is the first productive repository based on the e!DAL infrastructure and is part of the service portfolio of the GCBN unit (German Crop BioGreenformatics Network) [25] of de.NBI (German Network for Bioinformatics Infrastructure) [44], which heads ELIXIR Germany. After more than three years of productive use, the PGP repository currently shares comprehensive plant-related research datasets containing mainly genomic and phenomic information, but also metabolic datasets as well as software components and pipelines. Most of the datasets belong to a corresponding research paper and allow authors from IPK, but also from other institutes, to improve their manuscripts by enriching them with the underlying research data in a FAIR-compliant way. The overall download volume and the large number of distinct user accesses show the high visibility of the provided datasets and the interest of the research community in this kind of research data.
The integration of the ELIXIR AAI into the login mechanism of the PGP repository is a prime example of how established platforms can benefit from the ELIXIR network. The provided services help to increase visibility, to lower the obstacles to using available infrastructures and to support FAIR-compliant access to research data. Support for the ELIXIR single sign-on service enables collaborators to use the PGP repository as a service to publish their research data. Furthermore, the ELIXIR AAI login is fully integrated into the e!DAL infrastructure software, which makes it possible to set up further FAIR in-house repository instances following the presented I2D approach. In this way, a second repository based on the e!DAL infrastructure was established at the Forschungszentrum Jülich in June 2018. Thanks to the auto-configuring installation, it was possible to run the system and provide the submission and review workflow with only little effort. The integrated ELIXIR AAI login allows researchers from Jülich to use their existing institutional accounts. The complete infrastructure is hosted and maintained by the Jülich Plant Phenotyping Center (JPPC). The establishment of further e!DAL-based repositories at the Julius-Kühn Institute and the Helmholtz Centre München is currently underway.

Outlook
In this work, we presented the newly designed I2D concept for FAIR-compliant data publication using in-house storage infrastructures, together with new features of the e!DAL platform. After several years of operating a productive instance of this infrastructure as the basis for the PGP repository, we have recorded high numbers of accesses and downloads. Although researchers have more and more possibilities to share their research data with the community, the incentive to do so is still not high enough for some researchers [45]. In contrast to common peer-reviewed journal publications, it is not easy to measure the impact of research data itself, because data citation is still not common practice [46], although it is becoming more and more important and accepted [47]. This is not only a cultural problem but also a technical challenge, and therefore an issue of practicability [48]. One of the first metrics to count data citations was the commercial Data Citation Index. In the meantime, some free and community-initiated projects like Make Data Count have been developed. Furthermore, popular journals are starting to demand that authors include their research data as data citations in the regular reference list [49]. This makes it easier to measure the impact of research data through a citation index and improves its visibility to readers, which in turn increases the general acceptance of research data as valuable scientific assets. In the future, we will investigate several approaches for counting data citations and gaining more credit for publishing research data. We plan to integrate a generic and open-source solution into the e!DAL infrastructure to show users comprehensive information on how their data is reused and referenced.
ORCID provides a widely accepted and widely used solution to unambiguously identify researchers. Its integration within the e!DAL infrastructure is very intuitive and facilitates the handling of multiple ORCIDs for comprehensive lists of authors. Besides the identification of persons, it can also be quite challenging to handle the diverse affiliations with research institutes, universities or companies that focus on different scientific topics. Some authors have multiple affiliations, organizations may be renamed from time to time, the official address may change due to infrastructural developments, or an institute may be closed. The Research Organization Registry (ROR) provides an open and sustainable approach, which is led by the community and supported by popular organizations like DataCite or Dryad. The concept of ROR identifiers is very similar to that of ORCIDs and makes it possible to uniquely identify all kinds of research organizations. Therefore, one of the next functional improvements for the e!DAL infrastructure will be the integration of ROR using the provided ROR API. This will cause some changes in the basic data structure, which however will result in a much easier and FAIRer way to handle author affiliations [50].
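A possible shape of such a data structure is sketched below: each author carries one ORCID iD and a list of affiliations, and each affiliation is identified by a ROR ID rather than by free text. This is an assumption about how the change could look, not the e!DAL data model. The class and function names are hypothetical, and the ROR ID used in the example is a placeholder that does not denote a real organization. The ORCID check follows the ISO 7064 MOD 11-2 checksum that ORCID documents for its identifiers; the ROR check only tests the identifier format as we understand it from the ROR documentation and does not verify the checksum or query the registry.

```python
import re
from dataclasses import dataclass, field
from typing import List

ORCID_FORMAT = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")
# Assumed ROR format: leading 0, six base32 characters, two checksum digits.
ROR_FORMAT = re.compile(r"0[a-hj-km-np-tv-z0-9]{6}[0-9]{2}")


def orcid_is_valid(orcid: str) -> bool:
    """Check the format and the ISO 7064 MOD 11-2 check digit of an ORCID iD."""
    if not ORCID_FORMAT.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:15]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[15] == expected


@dataclass
class Affiliation:
    name: str    # display name, may change when an organization is renamed
    ror_id: str  # stable identifier, e.g. the last part of https://ror.org/...

    def __post_init__(self):
        if not ROR_FORMAT.fullmatch(self.ror_id):
            raise ValueError(f"not a well-formed ROR ID: {self.ror_id}")

    @property
    def url(self) -> str:
        return f"https://ror.org/{self.ror_id}"


@dataclass
class Author:
    name: str
    orcid: str
    affiliations: List[Affiliation] = field(default_factory=list)

    def __post_init__(self):
        if not orcid_is_valid(self.orcid):
            raise ValueError(f"invalid ORCID iD: {self.orcid}")


if __name__ == "__main__":
    author = Author(
        name="Josiah Carberry",       # ORCID's fictitious example researcher
        orcid="0000-0002-1825-0097",
        affiliations=[Affiliation("Example Institute", "012345678")],  # placeholder
    )
    print(author.name, [a.url for a in author.affiliations])
```

Storing the identifier instead of the organization name means that a renamed or merged institute only requires an updated display name, while existing dataset records keep pointing to the same organization.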