A molecular map of lung neuroendocrine neoplasms

Abstract Background Lung neuroendocrine neoplasms (LNENs) are rare solid cancers, with most genomic studies including a limited number of samples. Recently, generating the first multi-omic dataset for atypical pulmonary carcinoids and the first methylation dataset for large-cell neuroendocrine carcinomas led us to the discovery of clinically relevant molecular groups, as well as a new entity of pulmonary carcinoids (supra-carcinoids). Results To promote the integration of LNENs molecular data, we provide here detailed information on data generation and quality control for whole-genome/exome sequencing, RNA sequencing, and EPIC 850K methylation arrays for a total of 84 patients with LNENs. We integrate the transcriptomic data with other previously published data and generate the first comprehensive molecular map of LNENs using the Uniform Manifold Approximation and Projection (UMAP) dimension reduction technique. We show that this map captures the main biological findings of previous studies and can be used as reference to integrate datasets for which RNA sequencing is available. The generated map can be interactively explored and interrogated on the UCSC TumorMap portal (https://tumormap.ucsc.edu/?p=RCG_lungNENomics/LNEN). The data, source code, and compute environments used to generate and evaluate the map as well as the raw data are available, respectively, in a Nextjournal interactive notebook (https://nextjournal.com/rarecancersgenomics/a-molecular-map-of-lung-neuroendocrine-neoplasms/) and at the EMBL-EBI European Genome-phenome Archive and Gene Expression Omnibus data repositories. Conclusions We provide data and all resources needed to integrate them with future LNENs transcriptomic studies, allowing meaningful conclusions to be drawn that will eventually lead to a better understanding of this rare understudied disease.


Background
Lung neuroendocrine neoplasms (lung NENs or LNENs) are rare understudied diseases with limited therapeutic opportunities. Lung NENs include poorly di erentiated and highly aggressive lung neuroendocrine carcinomas (NECs)-i.e., smallcell lung cancer (SCLC) and large-cell neuroendocrine carcinoma (LCNEC)-as well as well-di erentiated and less aggres-sive lung neuroendocrine tumors (NETs)-i.e., typical and atypical carcinoids (WHO classi cation 2015 [1]). Over the past years several genomic studies have investigated the molecular characteristics of these diseases in order to provide some evidence for a more personalized clinical management [2,3,4,5,6,7]. Although lung NECs and NETs are broadly considered as di erent diseases, several recent studies have suggested that they may share some molecular characteristics [8,9,7,10,11].
However, due to the rarity of these diseases, the sample sizes of these studies individually are limited, and the integration of independent datasets is not an easy task.
Providing a way to interactively visualize and analyze these pan-LNEN data would be of great interest for the scienti c community, not only to further explore the proposed molecular link between lung NECs and NETs, but also to integrate data from studies including fewer samples to reach the statistical power needed to draw meaningful conclusions.

Data Description
Recently [7], we performed the rst integrative and comparative genomic analysis of lung NEN samples from all histological types, based on newly sequenced and publicly available wholeexome data (WES, 16 samples), whole-genome data (WGS, 3 samples), RNA-Seq data (20 samples), and EPIC 850K methylation data (76 samples). These data correspond to the most extensive multi-omic dataset of lung NENs, including the rst methylation data for LCNEC and the rst molecular characterization of the rarest lung NEN subtype (atypical carcinoids) [7]. This dataset, which provides the missing pieces for a complete molecular characterization of lung NENs, have been deposited at the EMBL-EBI European Genome-phenome Archive (EGA accession number EGAS00001003699). In order to facilitate the reuse of the data generated in the previous manuscript [7], we provide here a complementary data descriptor by outlining the preprocessing and the quality control (QC) steps performed on each omic dataset available on EGA.
Also, other previous studies have generated sequencing data and performed a molecular characterization of lung NEN samples: pulmonary carcinoids (mostly typical carinoids) have been characterized by Fernandez-Cuesta et al. [4], LCNEC by George et al. [6] and SCLC by George et al. [5] and Peifer et al. [2]. We therefore generate the rst pan-LNEN molecular tumor map by integrating the transcriptomic data from Alcala et al. [7] and the other published lung NEN transcriptomic data [2,4,5,6]. This map provides an interactive way to explore the molecular data and allows statistical interrogation, based on the UCSC TumorMap portal [12]. The integrated transcriptomic dataset resulting from these studies is available on GitHub [13]. Figure 1 provides a schematic view of the preprocessing steps and the associated quality controls performed for each omic dataset generated by Alcala and colleagues [7].

WES and WGS data
Sequencing reads from 16 atypical carcinoids whole-exomes and 3 carcinoids whole-genomes were processed using an in-house Next ow [14] work ow (see Code Availability section) consisting in four steps: mapping reads to the reference genome (GRCh37), marking duplicates, sorting reads, and performing local realignment (for WES data only) using bwa (v0.7.12-r1044) [15], samblaster (v0. 1.22) [16], sambamba (v0.5.9) [17] and ABRA (v0.97bLE [18]) respectively. The quality controls of the WES and WGS data were performed using FastQC v0.11.8 [19] and QualiMap v2.2.1 [20] and the results aggregated using MultiQC v1.7 [21] (Figure 1, left panel). Figure 2A-B, show the per base sequence quality scores (left panels) and the per sequence mean quality scores (right panels). Regarding the per base sequence quality scores, the majority of the base calls were of very good quality (>28, green area, Figure 2A left panel) and of reasonable quality (>20, orange area, Figure 2B left panel) for WES and WGS data respectively. The most frequently observed sequence mean quality score was around 30 for both techniques, which is equivalent to an error probability of 0.1%. Table 1 provides the general statistics associated to the WES and WGS quality controls. The observed median coverage for each sample was above the expected coverage (30X for the WGS samples and 120X for the WES samples).
Concerning the alignment quality, all WES samples had more than 99% of the reads aligned and all WGS samples had more than 98% of the reads aligned.

Before filtering
After filtering

Mapping statistics Mapping location statistics Expression quantification statistics
Methylation QCs LNEN003  LNEN004  LNEN005  LNEN006  LNEN007  LNEN008  LNEN009  LNEN010  LNEN011  LNEN012  LNEN013  LNEN014  LNEN015  LNEN016  LNEN017  LNEN018  LNEN019  LNEN020  S00716_B   0  10  20  30  40 50 100 Figure 2. Quality controls performed on each omic dataset. A) Reads quality control using FastQC for WES data. B) Reads quality control using FastQC for WGS data. C) Reads quality control using FastQC for RNA-Seq data. For A, B, and C, the left panels correspond to the sequence quality plots, the x-axis representing the base position in the read and the y-axis the mean quality value; the right panels correspond to the per sequence quality scores plots, the x-axis representing the mean quality score and the y-axis the number of reads. D) Quality control of the RNA-Seq data after trimming. Left panel: barplot representing the percentages of reads uniquely mapped ("Uniquely mapped"), mapped to multiple loci ("Mapped to multiple loci" or "Mapped to too many loci" if the number of loci is higher than 10), unmapped because the mapped reads' proportion was too small ("Unmapped: too short"), or unmapped for other reasons ("Unmapped: other"). Middle panel: cumulative barplot representing the percentages of reads mapped, using RSeQC, at di erent locations in the genome (exons, introns, 5' and 3' UTR, TSS, and TES). Right panel: cumulative barplot representing the cumulative percentages associated to the di erent reads assignments at the expression quanti cation step using HTSeq ("Assigned": reads assigned to one gene, "Ambiguous": reads assigned to multiple overlapping genes, "Aligned not unique": reads assigned to multiple non-overlapping genes, "No Feature": reads assigned to none of the features). E) Left panel: samples' quality based on log median intensities. The x-axis and y-axis correspond to the median of log2 methylated and unmethylated intensities, respectively. Right panel: representation of the between-sample similarities based on the two rst MDS dimensions. F) Histogram of the median detection p-value for each sample. G) Distribution of the beta values for each sample before and after the ltering step (left and right panel respectively). Figure 2C shows that the base calls, before trimming, are of good quality since all samples have a mean per base sequence quality score higher than 28 (left panel) and for all samples the most frequently observed per sequence mean quality is above 35, corresponding to an error probability of 0.03%, (right panel). None of the samples presented more than 1% of overrepresented sequences, which assures a proper library diversity. RSeQC v2.6.4 [26] was used to control the alignment quality and to assign mapped reads to di erent genomic features (exons, introns, TSS, TES). Figure 2D (left panel) shows that every sample had more than 80% of reads uniquely mapped and the reads distribution for each sample is represented on Figure 2D (middle panel). All samples had more than 80% reads mapped in coding regions (CDS-exons, 5' and 3' UTR exons).
The reads counting was performed at the gene level for 57,822 genes (genecode annotation, release 19) using HTSeq [25]. Figure 2D (right panel) shows the reads assignments, the percentage of assigned reads ranges from 65.7 to 82.7%. STAR, RSeQC and HTSeq metrics for each sample are provided in Supplementary Tables 1-3. Note that three samples, LNEN008, LNEN014 and LNEN017, have a higher proportion of reads classi ed as "Unmapped too short" ( Figure 2D, left panel), reads mapped in intronic regions ( Figure 2D, middle panel) and a lower proportion of reads assigned by HTSeq ( Figure 2D, right panel) in comparison to the other samples. Unexpected results concerning those samples should be thus considered with caution.
Finally, in order to apply dimensionality reduction methods to the RNA-Seq data (see below), the DESeq2 package v1.14.1  [27] was used to transform the read counts to variance stabilized read counts (vst), enabling the comparison of samples with di erent library sizes. To reduce sex in uence on expression pro les, the genes located on sex chromosomes were not considered for subsequent analyses. Genes located on mitochondria chromosomes were as well not considered.

Methylation data
The methylation analyses were performed based on the EPIC 850K methylation arrays and the In nium EPIC DNA methylation beadchip platform (Illumina) for 33 typical carcinoids, 23 atypical carcinoids, 20 LCNEC and 19 technical replicates in total. These arrays interrogate more than 850,000 CpGs and contain internal control probes that can be used to assess the overall e ciency of the sample preparation steps. The raw intensity data (IDAT les) were processed using the R package min (v.1.24.0) [28]. Figure 1 (right panel) provides the packages, functions and publication used for the data processing, quality control and ltering steps. Figure 2E shows that no outliers were detected: i) the left panel, representing the median log2 of the methylated and unmethylated intensities, indicates that all samples cluster together with a log median intensity above 11 for both channels, which supports the absence of failed samples, ii) on the right panel, the multidimensional scaling (MDS) plot shows that the samples cluster together by histological groups. We used the depectionP function (min package), which compares the DNA signal to the background signal based on the negative control probes to provide a detection p-value per probe, lower p-value indicating reliable CpGs. Figure 2F represents the mean detection p-values per sample and shows that all samples mean detection p-values were lower than 0.01. To correct for the variability identi ed in the control probes, a normalization step was applied to the raw intensities using the preprocessFunnorm function from min .
After between-array normalization, di erent sets of probes that could generate artefacts were removed successively from the methylation dataset: i) 19634 probes on the sex chromosomes, in order to identify di erences related to tumors but unrelated to sex chromosomes, ii) 41818 cross-reactive probes which are probes co-hybridizing with multiple CpGs on the genome and not only to the one it has been designed for [29], iii) 10588 probes associated with common SNPs (present in db-SNP build 137), iv) 24363 probes with multi-modal beta-value distribution, and v) 9697 probes having a detection p-value higher than 0.01 in at least one sample. Supplementary Table  4 lists the sets of ltered probes. To assess the experimental quality of the assay, the distributions of the beta values were analyzed. As described previously, probes with multi-modal distributions were removed at the ltering step and overall distributions of beta values for each sample before and after ltering were plotted ( Figure 2G). As expected, after ltering all samples showed a bimodal pro le, indicative of the good quality of the experiment. No experimental batch e ects were identi ed after functional normalization (see Supplementary Fig.  33 from [7]). Based on all the quality controls performed, none of the samples analyzed were identi ed as outlier. However, one sample available on EGA (201414140007_R06C01), was removed from the analyses because it came from a metastatic tumor rather than the primary tumor.

Generation of an integrative molecular map
Here we have generated a pan-LNEN molecular map with the whole-transcriptomic (RNA-Seq) data available from individual studies of each lung NEN tumor types [2,4,5,6,7]. This dataset includes the RNA-Seq data for a total of 51 SCLC, 69 LCNEC, 88 carcinoids including 27 atypical and 58 typical carcinoids. The di erent data underwent the same processing steps described above since the generation of the molecular map requires a homogenized dataset.

UMAP method
The pan-LNEN map was obtained using the Uniform Manifold Approximation and Projection (UMAP) method [30] on the genes with the most variable expression (genes explaining 50% of the total variance). UMAP is a dimensionality reduction method based on manifold learning techniques, which are adapted to non-linear data in contrast with the commonly used PCA method. Firstly, it builds a topological representation of the high-dimensional data, and secondly it nds the best lowdimensional representation of this topological structure [30]. UMAP representations were generated using the umap function from the R package umap (v. 0.2.4.0) [31]. All the parameters were set to their default values except the n_neighbors parameter. This parameter de nes the number of neighbors considered to learn the structure of the topological space. Varying this parameter from small to large values enables the user to  nd a trade-o between local and global preservation of the space, respectively. In order to preserve the global structure of the data (see "quality control of the UMAP projection" section below), we built the pan-LNEN map setting the n_neighbors parameter to 208, which corresponds to the total number of samples. Figure 3 shows the pan-LNEN map available on TumorMap [32] (see "Re-use potential" section below), with colors representing the main molecular subtypes. To evaluate the accuracy of the generated pan-LNEN map we rstly veri ed if it is consistent with the main biological ndings from the original studies, in particular if it represents the molecular subtypes of lung NENs previously identi ed, and their relationship with histological types. We speci cally tested if groups of samples previously described as having discordant molecular and histopathological groups were identi ed in our map. To do so, given a focal molecular subtype and two reference histopathological types, we assessed whether samples from the focal molecular subtype were closer to one of the two references using a onesided Wilcoxon test between the euclidean distances of samples to the centroid of each reference type.

Biological interpretation of the pan-LNEN TumorMap
First, the SCLC/LCNEC-like samples [6], which are SCLCs presenting the molecular pro le of LCNEC, tend to cluster with the LCNECs rather than with the SCLCs (Wilcoxon p-value = 3.9×10 -3 ). Similarly, the LCNEC/SCLC-like samples [6], which are LCNECs having the molecular pro le of SCLC, tend to cluster with the SCLCs rather than with the LCNECs (Wilcoxon pvalue = 9.6 × 10 -3 ). In 2018, George et al. showed that LCNEC samples can be subdivided into the type-I and type-II molecular groups [6]. We observed that the type-I and type-II LCNECs were closer to each other than to the SCLC/SCLC-like (Wilcoxon p-value = 1.5 × 10 -11 ) and that SCLC/LCNEC-like samples were closer to type-II than type-I LCNECs [6] (Wilcoxon p-value = 9.3 × 10 -4 ). Like the LCNECs, pulmonary carcinoids have been subdivided in molecular groups. Alcala et al. [7] identi ed three clinically relevant molecular clusters, using a multi-omics fac-tor analysis (MOFA): Carcinoid A1, Carcinoid A2, and Carcinoid B [7]. In the pan-LNEN map generated using UMAP, those three clusters are clearly visible ( Figure 3). Also, in the study from Alcala and colleagues [7], two carcinoids that clustered with the carcinoids B (S00118 and S00089) were borderline and located between cluster A1 and A2. Similarly, a LCNEC sample and a SCLC sample clustered with the carcinoids A1 [7]. These observations are also visible on the TumorMap representation. Finally, in the same study, a novel entity of carcinoids, named the supra-carcinoids was unveiled. These samples were characterized by a morphology similar to that of pulmonary carcinoids but the molecular features of LCNEC samples. In the pan-LNEN TumorMap, the supra-carcinoids also cluster with the LCNEC samples and were molecularly closer to LCNECs than to SCLCs (Wilcoxon p-value = 5 × 10 -2 ).

Quality control of the UMAP projection
In any dimensional reduction technique, there is a trade-o between preserving the global structure of the data and the ne scale details, and UMAP has been designed to reach a better balance compared to previous methods.
Based on the previously published analyses of lung NEN data [2,4,5,6,7], we expect the global structure of the data to be composed of six molecular groups (SCLCs, type I and type II LCNECs, Carcinoid A1, A2 and B). For this reason, an ideal projection able to capture this large scale variation should contain ve dimensions. To assess the quality of the 2-dimensional representation generated by UMAP, we propose a comparative analysis between UMAP and the traditional principal component analysis (PCA) based on the ve rst principal components of PCA (PCA-5D) as implemented in the dudi.pca function from the ade4 R package (v1.7-13) [33]. Because UMAP is aiming at preserving the global structure in only two dimensions, we also compared it to the traditional PCA based only on the two rst principal components (PCA-2D). We evaluated the performance of the methods based on the preservation of: (i) the samples' neighborhood and (ii) the spatial auto-correlations.

Preservation of the samples' neighborhood
We used the sequence di erence view (SD) metric (eq. 3 from [34]) to evaluate the preservation of the samples' neighborhood. This dissimilarity metric compares, for a given sample, its neighborhood in the low-dimensional space with that in the original space, taking into account that preserving the rank of a close neighbor is more important than for a distant neighbor (see [34] for details). SD values are positive (SD ∈ [0 ; +∞)), with small values indicating a good preservation of the samples neighborhood. We denote by SD k the value of SD averaged across samples for a xed number of neighbors k; SD k gives a sense of the overall preservation of the neighborhood at di erent scales: local for low k values and global for large k values. We calculated SD k for PCA-5D, PCA-2D, UMAP with n_neighbors = 208 and UMAP with the default value n_neighbors = 15. Because we are interested in the relative values of SD k for the di erent dimensionality reduction methods, and because we use PCA as a reference, for each dimensionality reduction method X we scaled the values of SD k using that of PCA-5D and PCA-2D: By de nition, SD k,PCA-5D = 0 and SD k,PCA-2D = 1. Thus values of SD k,X close to 0 indicate that X preserves k neighborhoods as well as PCA-5D, whereas values close to 1 indicate that X preserves k neighborhoods worse than PCA-5D but as well as PCA-2D, and values greater than 1 indicate that X preserves k neighborhoods worse than PCA-2D and PCA-5D. Note that SD k,X can be negative if X preserves k neighborhoods better than SD k,PCA-5D . For the UMAP projection, we iterated the computation of SD k 1000 times, because the algorithm uses a stochastic optimization step to de ne the projection. As expected, increasing the n_neighbors UMAP parameter from 15 to 208 leads to a better preservation of the global structure, clearly visible for k > 20 ( Figure 4A; mean SD k>20 equals to 2.144 and 0.758 respectively), while only marginally reducing the preservation of the local structure for k < 20 (mean SD k<20 equals to -0.042 and 0.120 respectively), which is approximately the size of the smallest cluster. Globally, the SD k values over all k levels are lower for a n_neighbors value of 208 than 15 (paired t-test p-value = 2.76 × 10 -7 ). With n_neighbors = 208, the UMAP projection provides a clear improvement over PCA-2D for k < 150 (mean SD k < 1), o ering a good trade-o for visualisation in only two dimensions while being able to maintain the global structure of the data, in particular the six molecular groups previously identi ed. This observation highlights the importance of varying the n_neighbors parameter according to the purpose of the projection. Some analyses would require to maintain the local structure of the samples neighborhood while others the global structure.

Preservation of spatial auto-correlations
Under the hypothesis that close points on projections share a similar molecular pro le, spatial auto-correlations were measured according to the Moran Index (MI) metric [35]. MI values range from -1 to 1, the extreme values indicating negative (nearby locations have dissimilar gene expression) or positive (nearby locations have similar gene expression) spatial auto-correlation, respectively. The spatial auto-correlation of the expression of each gene helps to identify the genes contributing to the structure of the molecular map (MI 1), and conversely, the genes that are randomly distributed spatially (MI 0). The computation of MI requires a weight matrix that determines the spatial scale at which auto-correlation is assessed; we gave a weight of 1 to the k nearest neighbors based on Euclidean distance, and 0 otherwise, so that we can control the scale at which MI is computed with parameter k. The mean MI across k values was computed for all gene expression features for: (i) the original space, (ii) the PCA-5D projection, and (iii) the UMAP projection (with n_neighbors = 208). We used the implementation of MI from the Moran.I function of R package ape (v. 5.3) [36].
To evaluate the preservation of the spatial autocorrelations, we ranked the top N genes based on the mean MI values for these three cases and calculated the overlap between the lists ( Figure 4B). We found that the PCA-5D is only slightly more conservative of the spatial auto-correlations found in the original space than UMAP (unilateral paired t-test p.value = 2.2 × 10 -16 ). For example, for N = 1000 (see Euler diagram inserted in Figure 4B), 85% of the genes with the highest MI overlap between the PCA-5D, UMAP and the original space.

Re-use potential An interactive TumorMap
Newton and colleagues have recently developed a portal called TumorMap [12,37], an online tool dedicated to omics data visualization. This new type of integrated genomics portal uses the Google Maps technology design to facilitate visualization, exploration, and basic statistical interrogation of high dimensional and complex datasets. The pan-LNEN molecular map that we generated in this work ( Figure 3) has been shared on the TumorMap platform. Along with the molecular map, the main clinical, histopathological and molecular features highlighted in the previous studies were uploaded as attributes. The interface enables users to explore and navigate through the map: zooming in and out, coloring and ltering samples based on attributes. The users can also create their own attributes based on pre-existing ones by using operators such as union or intersection. In addition, multiple statistical tests are pre-implemented and available, for example: comparison of attributes without considering the samples positions on the map, comparison of attributes considering samples positions on the map, and ordering attributes based on their potential to di erentiate two groups of samples. The interactive nature of the map and the fact that its manipulation does not require computational expertise, could enable the generation of new hypotheses and expand the reuse potential of the dataset.

An interactive computational notebook
In the rst part of the paper, we described the pre-processing and quality control steps applied on the recently published lung NEN multi-omics dataset [7] in order to facilitate its reuse. To generate the pan-LNEN molecular map, the same pre-processing steps were followed to homogenize independently published transcriptomic data [2,4,5,6,7]. For that purpose, reproducible pipelines, developed in house, were used and are available for reuse to the scienti c community on GitHub [38] (see the "availability of source code" section). In addition, the code used to generate the molecular map and to evaluate the quality of the dimensionality reduction is provided as a notebook published on Nextjournal [39]. Along with the code, the notebook provides the data and the dependencies required to run the analyses performed in this paper. Interested researchers can thus make a copy of this publicly available notebook (called "Remix") to reproduce our results but also interactively modify the code and explore the in uence of di erent parameters.

Integration of new samples
The homogenized read counts of the pan-LNEN data are available on GitHub [13]. Along with the available code, these data could be used to integrate new samples for which RNA-Seq data are available. The raw read counts of the new samples should rstly be generated following the same processing steps described in the section "Data quality controls" (Figure 1,middle panel) and integrated to the pan-LNEN read counts. We also provide in the Nextjournal notebook, the Next ow command allowing to obtain the read counts. The variance stabilized transformation (DESeq2 [27]) should then be applied on the combined data set and UMAP should nally be rerun to project all samples together in a two dimensional space. All together, we provide the resources to integrate additional samples into our molecular map, starting from raw sequencing read counts.

Discussion
Genomic projects focused on rare cancers encounter the limitation of availability of good quality biological material suitable for such studies. This translates in small series of samples usually underpowered to draw meaningful conclusions. Thus, tools facilitating the integration of independent datasets into larger sample series will lead to more informative studies. Recently, the rst multi-omic dataset for the understudied atypical pulmonary carcinoids and the rst methylation dataset for LCNECs was published [7]. Here we provide a parallel description of the pre-processing of these molecular data and provide evidence of the good quality of the di erent 'omics data generated. This data collection associated with previous datasets [2,4,5,6] completes the lung NENs molecular landscape and provides thus a valuable resource to improve the molecular characterization of lung NEN tumors.
However, even when primary genomic data is available, barriers to accessing the data still exist, often limiting its reuse by the community [40]. In particular, downloading and rereprocessing large raw sequencing data requires dedicated infrastructure and bioinformatics skills. Indeed, in order to minimize batch e ects when integrating data from di erent studies, one need to process it exactly in the same way (with the same software and the same versions, the same reference genome, the same annotation databases etc.). As more and more data are generated, the previously mentioned reprocessing will become untenable and replicating these e orts for each new study in each research group represents a waste of resources. Standardization of laboratory and computational protocols might become a reality when large national medical genomics initiatives will be fully operational [41]. In the meantime there is a need for better data sharing strategies than the traditional "supplementary spreadsheet / raw data" combination that can accelerate the translational impact of molecular ndings.
One step in this direction is the generation of so called "tumor maps", which provide an interactive way to explore the molecular data and allow easy statistical interrogation, including generating new hypotheses, but also projecting data from future studies including fewer samples [12]. This integration method has some limitations though. A xed reference map could be of interest for easier biological interpretations, but the overall sample size of the datasets used to build the pan-LNEN map remains relatively small. Thus, the map does probably not capture the complete molecular diversity of the lung NENs, and integrating new samples will in uence the map and potentially change the clusters obtained after dimensionality reduction. Also, if the harmonization of the new dataset to integrate is not enough to correct for strong batch e ects, the interpretation of the projections would be erroneous. Another approach would be to project the new samples into a xed reference map. However, the stochastic nature of UMAP embedding and its sensibility to parameter tuning can lead to unstable projection results, thus this task is for now not straightforward and requires further development [42]. In the meantime, favoring the integration of datasets will, over the years, yield to the constitution of molecular maps that will probably be more and more accurate and more adapted to the projection of new samples.

Conclusion
Here we provide a molecular map based on homogenized transcriptomic data available for the four types of lung NENs from ve di erent studies. We show that this map represents well both the local and global structure of the data, and captures the main biological features previously reported. We provide a full spectrum of data and tools to maximize its re-use potential for a wide range of users: raw sequencing reads, gene expression matrix, bioinformatics pipelines, interactive computational notebooks and an interactive TumorMap. In particular, we indicate how one can update the molecular map by integrating new samples starting from raw sequencing reads. Considering the small sample sizes of molecular studies on rare lung NENs, promoting data integration will empower more reliable statistical testing, and this map will therefore serve as a reference in future studies.