Multi-omic dataset of patient-derived tumor organoids of neuroendocrine neoplasms

Abstract

Background: Organoids are 3-dimensional experimental models that summarize the anatomical and functional structure of an organ. Although a promising experimental model for precision medicine, patient-derived tumor organoids (PDTOs) have currently been developed only for a fraction of tumor types.

Results: We have generated the first multi-omic dataset (whole-genome sequencing [WGS] and RNA-sequencing [RNA-seq]) of PDTOs from the rare and understudied pulmonary neuroendocrine tumors (n = 12; 6 grade 1, 6 grade 2) and provide data from other rare neuroendocrine neoplasms: small intestine (ileal) neuroendocrine tumors (n = 6; 2 grade 1 and 4 grade 2) and large-cell neuroendocrine carcinoma (n = 5; 1 pancreatic and 4 pulmonary). This dataset includes a matched sample from the parental sample (primary tumor or metastasis) for a majority of samples (21/23) and longitudinal sampling of the PDTOs (1 to 2 time points), for a total of n = 47 RNA-seq and n = 33 WGS. We here provide quality control for each technique and the raw and processed data as well as all scripts for genomic analyses to ensure an optimal reuse of the data. In addition, we report gene expression data and somatic small variant calls and describe how they were generated, in particular how we used WGS somatic calls to train a random forest classifier to detect variants in tumor-only RNA-seq. We also report all histopathological images used for medical diagnosis: hematoxylin and eosin–stained slides, brightfield images, and immunohistochemistry images of protein markers of clinical relevance.

Conclusions: This dataset will be critical to future studies relying on this PDTO biobank, such as drug screens for novel therapies and experiments investigating the mechanisms of carcinogenesis in these understudied diseases.

Comment 3. The primary site of mLCNEC23 is unknown. Could you infer its primary site based on gene expression patterns or driver mutations? Answer: We now mention on p. 8 that the molecular data of mLCNEC23 did support its LCNEC nature (it clusters with other LCNEC and carries mutations characteristic of LCNEC, such as in the gene TP53), but that since LCNEC from multiple organs clustered together and had similar driver mutations, the organ of origin could not be determined based on molecular data.
Comment 4. I have concerns about the generalizability of your random forest model because it was trained using only 22 somatic mutations. Could you assess your prediction model using publicly available datasets of cancer genomes (e.g., TCGA)? Answer: We now provide additional elements to support the generalization of our model on pp. 7-8. We show in a new Figure S1 that the random forest model trained on these 22 somatic mutations from known neuroendocrine neoplasm cancer genes produces similar accuracy when tested on the 223 somatic and 395 non-somatic mutations from all other recurrently mutated genes from the cohort (AUC = 0.90, sensitivity up to 73% with a specificity above 87%). We also mention that we have used a similar approach in the past and validated it on WGS tumor-only samples, which allowed us to "classify variants called from tumor-only WGS data as somatic or germline with high performance (accuracy greater than 92%; di Genova et al. GigaScience 2023)".
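For readers less familiar with the metrics quoted in this answer, the following minimal Python sketch shows how AUC, sensitivity, and specificity can be computed from classifier scores. It is an illustration only, not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 threshold are all hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Call a variant somatic when its score is >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = somatic, 0 = non-somatic.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05]

auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"AUC={auc:.3f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Moving the threshold trades sensitivity against specificity, which is why a single AUC value is usually reported alongside one or more operating points.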

Reviewer #2
Alcala et al. did excellent work on rare cancer types by creating a PDTO molecular fingerprint, which has a direct impact for researchers working on these rare cancer types. As a data note, this is an excellent resource that covers a huge gap in this rare cancer field.
These PDTOs have high impact, especially for such cancers, which are slow growing and not easy to culture in the lab. The authors covered details regarding each technique used in this study, and the figures are clear to understand, with exceptional writing. Answer: We thank the reviewer for their positive assessment of our work.
Minor comments:
Comment 1. Did the authors compare the PDTOs to a tumor molecular dataset? This will be key to understanding how closely and qualitatively the PDTOs are related to the molecular profile of actual tumor datasets. It is not clear in the current version, and it will help readers decide whether the PDTO molecular fingerprint system is valuable to them. This is not required for this manuscript to address, but a note will help readers make an informed decision about using such resources and with what limitations. Answer: We agree with the reviewer that, although this is beyond the scope of such a data note paper, the comparison of PDTOs and parental tumors is key. Thus, we have added on p. 8 a summary of the results from the associated research article (Dayton, Alcala et al. In press), where this comparison is thoroughly investigated. In addition, in this data note we report all the scripts used to perform the thorough comparison between PDTOs and parental tumors presented in the associated research article. In particular, we mention the location of the scripts showing that the organoids faithfully represent the gene expression profile and genomic profile of their parental tumors (freely available at https://github.com/IARCbioinfo/MS_panNEN_organoids/tree/main/Rscripts).
Comment 2. The authors covered longitudinal samples in this system for 1 to 2 time points. A description of the molecular changes they observed across longitudinal time points will be helpful for readers. Also, based on the authors' experience with longitudinal sampling, do the authors have key suggestions for researchers? A brief discussion will be helpful. Answer: This is also a key aspect covered in the associated research article. We now summarize in this data note (p. 8) the main results from the associated research article, in particular mentioning that "PDTOs preserve the genetic diversity and clonal architecture of their parents across long periods of time (6 months to more than a year)". We also report all the scripts for the longitudinal evolutionary analyses of the samples, so researchers can use them as inspiration for their own studies.
Comment 3. The authors did a comprehensive small variant analysis from WGS and RNA-seq. Did the authors find known somatic variations for these samples, mainly comparing against the known published mutational landscape? A note on this will be helpful. Answer: We now mention on p. 8 that "both variants identified with WGS and variants identified with RNA-seq include driver mutations in key recurrently altered LCNEC driver genes such as TP53 (mutated in 5/5 LCNEC) and STK11 (mutated in 3/5 LCNEC). We also identified mutations or structural variants in known driver genes in all but one neuroendocrine tumors (17/18), but as previously reported, they involve multiple genes instead of recurrently mutated genes (Fernandez-Cuesta et al. 2014). This confirms that PDTOs recapitulate the genomic profile of neuroendocrine neoplasms". We also now mention on p. 8 that we report all the scripts necessary to perform these analyses.
Comment 4. A comment on the limitations of PDTOs and of the molecular fingerprint created from such PDTOs will be valuable. Answer: We now mention in the re-use potential section (p. 9) the limitations of this PDTO biobank, notably "that the slow passage time of low-grade PDTOs makes them appropriate models to study the biology of neuroendocrine tumors, but challenges their use for drug testing. This is particularly true of small intestine NETs, which were only short term cultures that did not grow past four passage. Finally, as noted in most molecular studies of PDTOs (Lee et al Cell 2018), one of the main differences between PDTOs and their parental tumors is the absence of microenvironment. Future work would ideally focus on creating cocultures of PDTOs and immune cells to remedy this shortcoming".
Comment 5. The authors briefly comment on using such molecular datasets from PDTOs and combining them with other datasets to improve statistical power to discover informative molecular features of these cancers. This relates to my first point on how similar PDTOs are to the tumor molecular profile. Answer: We now provide in the GitHub repository the expression matrix processed exactly as in our past (Gabriel et al. GigaScience 2020) and future studies (Sexton-Oates et al. In prep), and mention its public location in the re-use potential section. We refer the reviewer to the answer to comment 1 for details about similarities with reference tumors.

Reviewer #3
The authors conducted a study where they generated multi-omics datasets, including whole-genome sequencing and RNA sequencing, for rare neuroendocrine tumors in the lungs, small intestine, and large cells. They used patient-derived tumor organoids and performed quality control analysis on the datasets. Additionally, they developed a random forest classifier specifically for detecting mutations in the RNA-seq data. The pipeline used in this study is well-organized, but I have a few queries that I would like to clarify before recommending it for publication.
Major concerns:
Comment 1. The data processing and quality control procedures would be valuable for other researchers working with similar datasets. It would be beneficial to add these procedures to the GitHub repository (https://github.com/IARCbioinfo/MS_panNEN_organoids). Furthermore, it would be helpful to provide insights into what constitutes good quality reads, such as the number of unique reads and the ratio of duplicate reads. Answer: We thank the reviewer for noting this oversight in our code availability. We have now added all the command lines used for processing the RNA-seq and WGS data to the GitHub repository (see readme at https://github.com/IARCbioinfo/MS_panNEN_organoids) and mention them in the code availability section on p. 10. Note that we also provide the processed data per the other reviewers' suggestions (gene expression and small variants, see answer to comment 2 from reviewer 1). Finally, we now also mention on p. 4 the general guidelines from the FastQC software regarding the quality of reads.
Comment 2. Regarding the random forest (RF) model, it is mentioned that there are 10 features. Could you clarify if these features are from public information, or are all the features extracted solely from the RNA-seq data? Also, does the RF model work for WGS data as well? Was there any specific design implemented to address the issue of imbalanced positive and negative samples? Answer: We now explicitly mention on p. 6 which of the 10 features come from the RNA-seq data (4/10 features) and which are annotations coming from public databases (6/10 features). We also mention on p. 7 that a similar model was indeed used in Di Genova et al. on WGS data to classify variants from tumor-only samples. Regarding the issue of imbalanced positive and negative samples, we now mention on pp. 6-7 that we preferred to keep this imbalance in the training set to force the algorithm to take into account the fact that most variants are not somatic, and thus that having a very good specificity is key to avoiding large false discovery rates.
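The link between class imbalance, specificity, and false discovery rate can be made concrete with a small calculation. The Python sketch below is illustrative only: the function name, the candidate counts, and the sensitivity and specificity values are hypothetical and are not taken from the manuscript.

```python
def false_discovery_rate(n_somatic, n_non_somatic, sensitivity, specificity):
    """Expected fraction of predicted-somatic calls that are not somatic."""
    true_pos = sensitivity * n_somatic
    false_pos = (1.0 - specificity) * n_non_somatic
    if true_pos + false_pos == 0:
        raise ValueError("no variant is predicted somatic")
    return false_pos / (true_pos + false_pos)


# Hypothetical imbalanced call set: 1% of candidate variants are somatic.
n_somatic, n_non_somatic = 100, 9900

# Same sensitivity, two different specificities.
fdr_moderate = false_discovery_rate(n_somatic, n_non_somatic, 0.75, 0.90)
fdr_strict = false_discovery_rate(n_somatic, n_non_somatic, 0.75, 0.999)
print(f"FDR at 90% specificity:   {fdr_moderate:.2f}")
print(f"FDR at 99.9% specificity: {fdr_strict:.2f}")
```

With these assumed numbers, a specificity that looks respectable in isolation (90%) still yields a call set dominated by false positives, because non-somatic candidates vastly outnumber somatic ones.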
Comment 3. The RNA-seq data are not used to generate gene expression data here, which would waste important information. Answer: We now make it explicit that we did generate gene expression data and that they are available on the GitHub and EGA repositories (see revised data availability section on p. 9).
Minor concerns:
Comment 4. In Figure 6C, what does "Mean minimum depth" refer to? Answer: We now describe in the figure legend that "Mean minimum depth" refers to the "tree depth (1: root, value $\gg 1$: leaves) of the first time the feature is used for classification, averaged across all trees; low values indicate features often used at the root and thus particularly important".
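The definition quoted in this answer can be sketched in a few lines of Python on toy trees. This is a simplified illustration, not the software used for Figure 6C: the `node` helper and the tree shapes are hypothetical, and the sketch assumes that a feature's depth is averaged only over the trees in which it appears, which may differ from how a given package handles trees that never use a feature.

```python
def node(feature, left=None, right=None):
    """Hypothetical minimal split node; a child of None stands for a leaf."""
    return {"feature": feature, "left": left, "right": right}


def min_depths(tree, depth=1, found=None):
    """Map each feature to the shallowest depth at which it splits (root = 1)."""
    if found is None:
        found = {}
    if tree is None:  # reached a leaf, nothing to record
        return found
    feature = tree["feature"]
    found[feature] = min(depth, found.get(feature, depth))
    min_depths(tree["left"], depth + 1, found)
    min_depths(tree["right"], depth + 1, found)
    return found


def mean_minimum_depth(forest):
    """Average each feature's minimum depth over the trees that use it."""
    totals, counts = {}, {}
    for tree in forest:
        for feature, depth in min_depths(tree).items():
            totals[feature] = totals.get(feature, 0) + depth
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Toy forest of three trees; feature names borrowed from the text,
# tree shapes invented for illustration.
forest = [
    node("REVEL", node("TLOD"), node("COSMIC")),
    node("REVEL", None, node("TLOD", None, node("REVEL"))),
    node("TLOD", node("REVEL"), None),
]

mmd = mean_minimum_depth(forest)
for feature, value in sorted(mmd.items(), key=lambda kv: kv[1]):
    print(f"{feature}: {value:.2f}")
```

In this toy forest the feature that most often splits at the root gets the lowest mean minimum depth, matching the interpretation that low values flag particularly important features.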
Comment 5. Is the most important feature identified by the RF model a good predictor? Answer: We now mention on p. 8 that while this feature is particularly important, "using the most important feature alone (the REVEL score) led to a much lower accuracy, consistent with the importance of other features such as TLOD and pathogenic annotations (COSMIC, InterVar)".