1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset

Abstract Background The presence of lymph node metastases is one of the most important factors in breast cancer prognosis. The most common way to assess regional lymph node status is the sentinel lymph node procedure. The sentinel lymph node is the most likely lymph node to contain metastasized cancer cells and is excised, histopathologically processed, and examined by a pathologist. This tedious examination process is time-consuming and can lead to small metastases being missed. However, recent advances in whole-slide imaging and machine learning have opened an avenue for analysis of digitized lymph node sections with computer algorithms. For example, convolutional neural networks, a type of machine-learning algorithm, can be used to automatically detect cancer metastases in lymph nodes with high accuracy. To train machine-learning models, large, well-curated datasets are needed. Results We released a dataset of 1,399 annotated whole-slide images (WSIs) of lymph nodes, both with and without metastases, in 3 terabytes of data in the context of the CAMELYON16 and CAMELYON17 Grand Challenges. Slides were collected from five medical centers to cover a broad range of image appearance and staining variations. Each WSI has a slide-level label indicating whether it contains no metastases, macro-metastases, micro-metastases, or isolated tumor cells. Furthermore, for 209 WSIs, detailed hand-drawn contours for all metastases are provided. Last, open-source software tools to visualize and interact with the data have been made available. Conclusions A unique dataset of annotated, whole-slide digital histopathology images has been provided with high potential for re-use.

The presence of lymph node metastases is one of the most important factors in breast cancer prognosis. The most common strategy to assess the regional lymph node status is the sentinel lymph node procedure. The sentinel lymph node is the most likely lymph node to contain metastasized cancer cells and is excised, histopathologically processed and examined by the pathologist. This tedious examination process is time-consuming and can lead to small metastases being missed. However, recent advances in whole-slide imaging and deep learning have opened an avenue for analysis of digitized lymph node sections with computer algorithms. Convolutional neural networks, a type of deep learning algorithm, are able to automatically detect cancer metastases in lymph nodes with high accuracy. To train deep learning models, large, well-curated datasets are needed.

Results
We released a dataset of 1399 annotated whole-slide images of lymph nodes, both with and without metastases, in total three terabytes of data. Slides were collected from five different medical centers to cover a broad range of image appearance and staining variations. Each whole-slide image has a slide-level label indicating whether it contains no metastases, macro-metastases, micro-metastases or isolated tumor cells. Furthermore, for 209 whole-slide images, detailed hand-drawn contours for all metastases are provided. Last, open-source software tools to visualize and interact with the data have been made available.

Conclusions
A unique dataset of annotated, whole-slide digital histopathology images has been provided with high potential for re-use.
Reviewer #1, question 1: The manuscript describes a dataset of H&E stained slides for breast cancer pathology, and is made available for the primary purpose of computer-based diagnostics and prognosis of breast cancer. Open datasets and benchmarks are very important tools with proven success in advancing different fields, especially related to pattern recognition, and it is likely that a clean and open dataset will be used by many, as the dataset is already being used and already making impact. The paper itself is a short well-written piece that describes the work well, and can be used as a base reference to this project. This reviewer believes that the work is useful and justifies publication, but would like to make several suggestions before the work is published. I made all efforts to submit this report in a timely manner, and will be quick to respond should further discussion be required.
Answer 1: We are happy the reviewer agrees with our assessment that the CAMELYON dataset can be a highly useful benchmark for pattern recognition and machine learning techniques. We have addressed all the comments provided by the reviewer below.
Question 2: For some reason the paper, and especially the abstract, gives the impression that the dataset was created specifically for deep learning. I suggest to make it more general for computer-based diagnostics, as the data itself has very little to do with deep learning, and in fact any method can be tested using these data. Such methods can include also automatic model-driven methods that mimic the work of the pathologist, rather than the data-driven deep learning and other related approaches. Deep learning might be a "buzzword" in 2018, but five years from now there might be another buzzword, but the data will probably still be useful and relevant (H&E has been used for many years). Similar statements are also made in the Background section: "To train deep learning models, large, well-curated datasets are needed to both train these models and accurately evaluate their performance". The sentence is logically correct, but such data are required for training any machine learning model, not just deep learning.
Answer 2: We agree with the reviewer that the usefulness of the dataset is not limited to deep learning algorithms. As such we have generalized the text to focus on machine learning and pattern recognition models in general.

Question 3:
The claim that "deep learning have opened an avenue…" is an overstatement, as algorithms that are not based on deep learning demonstrated good recognition accuracy in pathology, in fact as early as the 1990's, without using deep learning. That whole sentence gives the impression that automatic classification of H&E slides for pathology is a new field, while it clearly isn't. I therefore recommend to weaken the statement or make it more general to machine learning. It seems to me that the term "deep learning" is confused with the term "machine learning".

Answer 3:
The reviewer is correct to point out that the analysis of H&E images with machine learning and image analysis methods has been around for several decades. We have updated the text to acknowledge this. We agree with the reviewer that there could be variability between pathologists in assessing H&E slides. However, when constructing the reference standard for CAMELYON, in case of uncertainty, the additional immunohistochemistry stain was always available. As indicated in the paper with reference 23, the observer variability in this stain is limited. We have added the following sentence to the paper to further clarify the annotations: Furthermore, this stain was also used to aid in drawing the outlines in both CAMELYON16 and CAMELYON17, which helps limit observer-variability. As both the H&E and IHC slides are digital, they can be viewed simultaneously, allowing observers to easily identify the same areas in both slides.
Sadly, during the construction of the dataset we did not monitor how often a correction was made by the experienced pathologists. After consulting with them they indicate that this was very rare. To give some number on the strength of the reference standard and potential observer variability, we can give two examples: Google hired a pathologist to check the CAMELYON16 dataset to assess false-positives they had in the challenge. This led to a correction of the reference standard in only 2 out of 399 cases. For CAMELYON17 we had the slides rechecked again by another pathology resident after receiving the reviews. The resident had access to all immunohistochemically-stained slides as well which led to a correction of 2 slides out of the 1000. So in total 4 slides were relabeled out of 1399 after subsequent extra inspections (< 0.3%), which we think shows that there is limited variability within the reference standard. Apparently, such algorithms can identify the imaging device, and in some cases even the technician acquiring the images, sometimes leading to good prediction accuracy achieved without solving the original problem (as shown in the links above). Therefore, it is not uncommon that models show good accuracy when using the same dataset separated to training and test data, but much lower accuracy when tested with data from a different set. That can even happen with images collected from the internet: http://ieeexplore.ieee.org/document/5995347/ The dataset described in this paper combines data from multiple medical centers and using different imaging devices, which is good. However, the dataset is still based on a fixed number of sources, and therefore algorithms showing good performance might still be limited to the specific data used in the dataset, and there is no guarantee that the same algorithm performs well also on data from sources it had not "seen" and trained with. 
As I proposed in the past, one way of solving a problem of this kind is to use data acquired from one center for training, and data from a different center for testing. Good results achieved using this experimental design indicate that the algorithm is not limited to a certain dataset. From the paper it seems that data from all centers were used for both training and testing, and therefore the current design does not test whether a model trained with the dataset can also annotate data coming from other centers that are not included in the dataset. I understand that after the grand challenge has already started and teams have already submitted their results it will be difficult to make a change in the design. However, a clear discussion about that limitation should be added. My understanding is that even with the current data, if researchers are aware of the issue they can separate the data into different centers and perform such experiment, testing how their algorithm performs on data from a center not used for training data.

Answer 6:
The reviewer is indeed completely right, we also have experience with algorithms learning unexpected things (like recognizing a software version of a scanner) when using a non-representative dataset. We hope to have mitigated that in CAMELYON17 by including data from five different centers with different scanners and staining protocols. We have added a section to the discussion covering this topic. We also indicate there that authors can conduct robustness experiments themselves as they know which center the training slides are from (and can thus omit one). The following text was added: A key example of implementation issues with respect to machine learning algorithms in medical imaging is generalization to different centers. In pathology centers can differ in tissue preparation, staining protocol and scanning equipment which each can have a profound impact on image appearance. In the CAMELYON dataset we included data from five centers and three different scanners. We are confident algorithms trained with this data will generalize well. Users of the dataset can even explicitly evaluate this as we have indicated for each image from which center it was obtained. By leaving out one center and evaluating performance on that center specifically the participants can assess the robustness of their algorithms.
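The leave-one-center-out experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide identifiers are hypothetical placeholders, and the center of each slide is assumed to be available as a simple (slide, center) pair; only the center abbreviations are taken from the dataset description.

```python
def leave_one_center_out(slides):
    """Yield (held_out_center, train_ids, test_ids) for each center.

    `slides` is a list of (slide_id, center) pairs. The slide_id values
    used below are hypothetical; only the center codes come from the paper.
    """
    centers = sorted({center for _, center in slides})
    for held_out in centers:
        # Train on every center except the held-out one.
        train_ids = [sid for sid, center in slides if center != held_out]
        # Evaluate only on slides from the held-out center.
        test_ids = [sid for sid, center in slides if center == held_out]
        yield held_out, train_ids, test_ids


# Hypothetical example: two slides per center.
example_slides = [
    ("slide_a", "CWZ"), ("slide_b", "CWZ"),
    ("slide_c", "LPON"), ("slide_d", "LPON"),
    ("slide_e", "RST"), ("slide_f", "RST"),
    ("slide_g", "RUMC"), ("slide_h", "RUMC"),
    ("slide_i", "UMCU"), ("slide_j", "UMCU"),
]

for center, train_ids, test_ids in leave_one_center_out(example_slides):
    print(center, len(train_ids), len(test_ids))
```

Each fold trains on four centers and tests on the fifth, so a large drop in performance on the held-out center would point to limited generalization across staining protocols or scanners.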

Question 7:
The dataset is organized in the form of a grand challenge (like Kaggle, for instance), in which the authors do not release the annotation of the test data, but serve as the judges for teams that submit their results. The evaluation is done on the backend, and without the participation of the teams. The scientific motivation behind that practice should be discussed and explained. Kaggle is a very good service, and the practice of a competition is common in pattern recognition (e.g., ImageNet), but in the context of cancer diagnostics the impact and optimization of scientific return through the form of a grand challenge should be explained. The fact that it is a grand challenge should also be mentioned in the abstract.

Answer 7:
We have addressed this comment within the abstract and in the introduction with the following text: The concept of challenges in medical imaging and computer vision has been around for nearly a decade. In medical imaging it mostly started with the liver segmentation challenge at the annual MICCAI conference in 2007 and in computer vision the ImageNet Challenge is most widely known. The main goal of challenges, both in medical imaging and in computer vision, is to allow a meaningful comparison of algorithms. In scientific literature, this was often not the case as authors present results on their own, often proprietary, datasets with their own choice of evaluation metrics. In medical imaging this was specifically a problem as sharing medical data is often difficult. Challenges change this by making available datasets and enforcing standardized evaluation. Furthermore, challenges have the added benefit of opening up meaningful research questions to a large community who normally might not have access to the necessary datasets.

Question 8
In the context of that grand challenge, I was looking to find some description of how the results are evaluated, but did not find any information. There is indeed some information in the web site, but the information should also be given in the paper.

Answer 8:
We have added this information to the paper in the re-use potential section: Within CAMELYON we evaluate the algorithms based on a weighted Cohen's kappa at the pN-stage level. This statistic measures the categorical agreement between the algorithm and the reference standard where a value of 0 indicates agreement at the level of chance and 1 is perfect agreement. The quadratic weighting penalizes deviations of more than one category more severely.
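To make the metric concrete, the following is a minimal, dependency-free sketch of a quadratic-weighted Cohen's kappa. It is not the challenge's official evaluation script; the function name, the category ordering, and the example labels are assumptions chosen for illustration.

```python
def quadratic_weighted_kappa(reference, predicted, categories):
    """Quadratic-weighted Cohen's kappa for ordered categories.

    `categories` lists the labels in their ordinal order (assumed here to
    be the pN stages from least to most advanced).
    """
    k = len(categories)
    n = len(reference)
    if k < 2 or n == 0 or n != len(predicted):
        raise ValueError("need >= 2 categories and equally long, non-empty inputs")
    index = {label: i for i, label in enumerate(categories)}

    # Observed confusion matrix: rows = reference, columns = prediction.
    observed = [[0] * k for _ in range(k)]
    for ref, pred in zip(reference, predicted):
        observed[index[ref]][index[pred]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the extremes.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        # Both raters used a single identical category; treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


stages = ["pN0", "pN0(i+)", "pN1mi", "pN1", "pN2"]
reference = ["pN0", "pN1", "pN2", "pN1mi"]
print(quadratic_weighted_kappa(reference, reference, stages))
```

Because the weight grows with the square of the distance between categories, confusing pN0 with pN2 costs far more than confusing pN1 with pN2.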

Question 9
Page 4, line 52. The paragraph is a repetition of the previous section.

Answer 9
This paragraph specifically focusses on the quality of the scan. Scanning of slides can potentially fail due to dust on the slide or mechanical defects and as such, as a quality control measure, all slides were checked for these issues. We understand that this might have been unclear from this paragraph and have slightly rewritten it. Now it states: All glass slides included in the CAMELYON-dataset were part of routine clinical care and are thus of diagnostic quality. However, during the acquisition process scanning can fail or result in out-of-focus images. As a quality control measure, all slides were inspected manually after scanning. The inspection was performed by an experienced technician (Q.M and N.S. for center UMCU, M.H. or R.vd.L. for the other centers) to assess the quality of the scan and when in doubt a pathologist was consulted whether scanning issues might affect diagnosis.
Question 10 Page 6: "The dataset has also been used by companies experienced in machine learning application to be a first foray into digital pathology, for example Google [22]." How is reference 22 related to Google?
Answer 10: We made a mistake with the reference in LaTeX; we have updated it to refer to the correct paper.
Reviewer #2, question 1: In this Data Note, the authors describe a large morphological study of digitised lymph node sections that could be used for exploring the ability of machine-learning algorithms to identify metastases on tissue sections. The lymph node specimens were collected from 5 different medical centres and the histopathological status was scored using TNM staging criteria. In the first study (CAMELYON16), a lab technician and a PhD student performed staging and expert pathologists confirmed the annotations. In a second study (CAMELYON17), a general pathologist staged the lymph node specimens, and detailed annotations were validated by one of two pathology residents. In addition, the authors describe the publicly available Automated Slide Analysis Platform (ASAP) software package that can be used to view whole-slide images, annotations and algorithmic results. The manuscript is well-written and I consider the CAMELYON dataset of great interest to the machine-learning community.
Answer 1: We thank the reviewer for his kind assessment of both the dataset and the paper. We have tried to address his comments below.

Question 2:
The CAMELYON dataset is available under Creative Commons License CC-BY-NC-ND. This implies that the data is free to share for non-commercial use. However, with this current license agreement the CAMELYON dataset may not be used for commercial purposes. Furthermore, the CC-BY-NC-ND license agreement implies that derivatives from this material, which could include segmentations of the original image data, may not be distributed commercially or non-commercially. This severely impinges on the utility of this dataset for machine-learning. The authors should consider changing the Creative Commons License agreement for the CAMELYON dataset so that re-use is encouraged.

Answer 2:
We agree with the reviewer and have contacted our partners and have agreed on licensing the dataset under CC-0. This is now also correctly reflected in the text.
Question 3: I would like more detail on how the polygon tool was used to manually delineate metastases. In particular, could the authors provide details of whether the immunohistochemically-labelled slides stained with anti-cytokeratin were used as a guide for annotating the adjacent H&E sections? Alternatively, were the H&E sections labelled directly without first inspecting the cytokeratin-labelled sections?
Answer 3: The immunohistochemically-stained slides were indeed used to guide annotations, but annotations were directly made on the H&E. Essentially the annotators used a 'mental registration' to identify the corresponding areas, which is usually not difficult as sections are adjacent. We have added the following sentence to the Data collection section to clarify this: Furthermore, this stain was also used to aid in drawing the outlines in both CAMELYON16 and CAMELYON17, which helps limit observer-variability. As both the H&E and IHC slides are digital, they can be viewed simultaneously, allowing observers to easily identify the same areas in both slides.

Question 4:
In addition, it would be good to know whether a consensus was reached between multiple pathologists in validating the hand-drawn annotations as this may impact on the ability of machine-learning algorithms to computationally identify metastases. Was there a consensus between multiple pathologists for all 399 hand-drawn contours produced from the CAMELYON16 dataset? Similarly, was there a consensus between multiple pathologists for all 50 hand-drawn contours that were produced from the CAMELYON17 dataset?
Answer 4: No, we did not obtain consensus annotations from multiple pathologists as this would be prohibitively costly in terms of time and available pathologists, given the size of the dataset. However, annotations were guided by immunohistochemically-stained slides and we know there is limited observer variability in those cases. Furthermore, all slides were double-checked by a pathologist or pathology resident with significant experience to prevent any accidental mistakes.
To give some number on the strength of the reference standard and potential observer variability, we can give two examples: Google hired a pathologist to check the CAMELYON16 dataset to assess false-positives they had in the challenge. This led to a correction of the reference standard in only 2 out of 399 cases. For CAMELYON17 we had the slides rechecked again by another pathology resident after receiving the GigaScience reviews. The resident had access to all immunohistochemically-stained slides as well which led to a correction of 2 slides out of the 1000. So in total 4 slides were relabeled out of 1399 after subsequent extra inspections (< 0.3%), which we think shows that there is limited variability within the reference standard.
Question 5: Details of the primary and secondary antibodies used to stain for pan-cytokeratin have not been provided. If the various different medical centres used different antibodies, then this should be clearly stated in the manuscript as it may impact on the ability of machine-learning algorithms to process the immunohistochemically-labelled image data.
Answer 5: We have collected the information on the antibodies, which we have attached here. However, as the immunohistochemical slides are not part of the CAMELYON dataset, but were only used for the reference standard, we have not added this information to the paper. However, if the reviewer feels this is still valuable we would be happy to add it.
Question 6: Figure 4 shows the tissue mask overlay at low-resolution and it is very difficult to see how accurate the mask overlays the lymph node tissue. The authors should consider revising this figure to include higher-resolution images so that the mask overlay is clearly seen.

Answer 6:
We have added a higher resolution image. However, please note that the goal of that example is not to provide a very good tissue segmentation, but to show that a coarse segmentation can be created in only a few lines of code thanks to the library and visualized in the provided viewer.

[2]. While localized breast cancer has a five-year survival rate of 99%, this drops to 85% in the case of regional (lymph node) metastases and only 26% in case of distant metastases. As such, it is of the utmost importance to establish whether metastases are present to allow adequate treatment and the best chance of survival. This is formally captured in the TNM staging criteria [3].
The first step in determining the presence of metastases is the examination of the regional lymph nodes. Not only is the presence of metastases in these lymph nodes a poor prognostic factor by itself, it is also an important predictive factor for the presence of distant metastases [4]. In breast cancer the most common strategy to assess the regional lymph node status is the sentinel lymph node procedure [5,6]. Within this procedure a blue dye and/or radioactive tracer is injected near the tumor. The lymph node reached first by the injected substance, the sentinel node, is most likely to contain the metastasized cancer cells and is excised. Subsequently, it is submitted for histopathological processing and examination by the pathologist. Pathologists examine a glass slide containing a tissue section of the lymph node stained with hematoxylin and eosin (H&E). Based on solitary tumor cells or the diameter of clusters of tumor cells, metastases can be divided into one of three categories: macro-metastases, micro-metastases or isolated tumor cells (ITC). The size criteria for each of these categories are shown in Table 1. Based on the presence or absence of one or more of these metastases an initial pathological N-stage (pN) is assigned to a patient. Based on this initial stage, in combination with characteristics of the main tumor, further lymph node dissection or axillary radiotherapy may be performed. These axillary lymph nodes are then also pathologically assessed to come to a final pN-stage. pN categorization is mostly based on metastasis size and the number of lymph nodes involved, but also on the anatomical location of the lymph nodes. A small excerpt of the pN stage is shown in Table 2; for a full listing we refer to the 7th edition of the TNM staging criteria for breast cancer [7].
A key challenge for pathologists in assessing lymph node status is the large area of tissue that has to be examined to identify metastases that can be as small as single cells. Examples of a macro-metastasis, micro-metastasis, and ITC are shown in Figure 2. For sentinel lymph nodes at least three sections at different levels through the lymph node have to be examined and for non-sentinel lymph nodes one section of at least ten lymph nodes has to be examined [8,9]. This tedious examination process is time-consuming and pathologists may miss small metastases [10]. In the Netherlands, a secondary examination using an immunohistochemical staining for cytokeratin has to be performed if inspection of the H&E-slide identifies no metastases. However, even in this secondary examination, metastases can still be missed [11].

Table 2. Selection of N-stages for staging of breast cancer based on the 7th edition of the TNM-criteria.

N0       Cancer has not spread to nearby lymph nodes.
N0(i+)   The lymph nodes only contain ITCs.
N1mi     Micro-metastases in 1 to 3 axillary lymph nodes.
N1a      Cancer has spread to 1 to 3 axillary lymph nodes with at least one macro-metastasis.
N1b      Cancer has spread to internal mammary lymph nodes, but this spread could only be found on sentinel lymph node biopsy.
N1c      Both N1a and N1b apply.
N2a      Cancer has spread to 4 to 9 lymph nodes under the arm, with at least one macro-metastasis.
N2b      Metastases in clinically detected internal mammary lymph nodes in the absence of axillary lymph node metastases.
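As an illustration of how slide-level findings combine into a patient-level stage, the sketch below maps per-node labels to a simplified pN stage following only the axillary rows of Table 2. It rests on several assumptions that are not stated in the text: each label represents one lymph node, nodes containing only ITCs are not counted as involved nodes, internal mammary involvement (N1b, N1c, N2b) is ignored, and the function and label names are hypothetical.

```python
def simplified_pn_stage(node_labels):
    """Map per-node labels to a simplified pN stage.

    Each entry of `node_labels` is one of "negative", "itc", "micro" or
    "macro". Simplified sketch only; see the assumptions in the lead-in.
    """
    allowed = {"negative", "itc", "micro", "macro"}
    unknown = set(node_labels) - allowed
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    n_macro = node_labels.count("macro")
    n_micro = node_labels.count("micro")
    n_itc = node_labels.count("itc")
    # Assumption: only nodes with micro- or macro-metastases count as involved.
    n_involved = n_macro + n_micro

    if n_involved == 0:
        return "pN0(i+)" if n_itc > 0 else "pN0"
    if n_macro == 0:
        return "pN1mi"
    if n_involved <= 3:
        return "pN1"
    if n_involved <= 9:
        return "pN2"
    raise ValueError("more than 9 involved nodes is outside the excerpt in Table 2")


print(simplified_pn_stage(["macro", "micro", "negative", "negative", "itc"]))
```

The ordering of the checks mirrors the table: ITCs alone never raise the stage above pN0(i+), and a macro-metastasis is required before the number of involved nodes matters.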
Nowadays, advances in whole-slide imaging and machine learning have opened an avenue for analysis of digitized lymph node sections with computer algorithms. Whole-slide imaging is a technique where high-speed slide scanners digitize glass slides at very high resolution (e.g. 240 nm per pixel). This results in images with a size in the order of 10 gigapixels, typically called whole-slide images (WSI). This large amount of data makes WSIs ideally suited for analysis with machine learning algorithms. Although applications of machine learning algorithms to digitized pathology data appeared as early as 1994 [12], whole-slide images have only been available since the early 2000s. Since then, many papers have described the use of machine learning algorithms in whole-slide images, for example for breast or prostate cancer classification [13,14]. Over the past five years, so-called deep learning algorithms, like convolutional neural networks (CNNs), have become incredibly popular. For example, we were the first to show that training CNNs to detect cancer metastases in lymph nodes was possible and potentially could result in improved efficiency and accuracy of histopathologic diagnostics [15].
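A quick back-of-the-envelope calculation shows where the figure of roughly 10 gigapixels comes from. The 24 mm by 24 mm scanned area below is an assumed example value, not a number from the dataset; only the 240 nm pixel size is taken from the text.

```python
def pixels_along(length_mm, nm_per_pixel):
    """Number of pixels needed to cover `length_mm` at the given pixel size."""
    return round(length_mm * 1_000_000 / nm_per_pixel)  # 1 mm = 1,000,000 nm


def gigapixels(width_mm, height_mm, nm_per_pixel):
    """Total image size in gigapixels for a scanned rectangle."""
    width_px = pixels_along(width_mm, nm_per_pixel)
    height_px = pixels_along(height_mm, nm_per_pixel)
    return width_px * height_px / 1e9


# Assumed scanned area of 24 mm x 24 mm at 240 nm per pixel.
side_px = pixels_along(24, 240)
total_gp = gigapixels(24, 24, 240)
print(side_px, total_gp)
```

Under these assumptions each side is 100,000 pixels, giving 10 gigapixels in total, which is why WSIs are usually stored compressed and read region by region rather than loaded in full.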
To train machine learning models, large, well-curated datasets are needed to both train these models and accurately evaluate their performance. To allow the broader computer vision community to replicate and build on our results, we publicly released a large dataset of annotated whole-slide images of lymph nodes, both with and without metastases in the context of the CAMELYON16 and CAMELYON17 challenges (CAncer MEtastases in LYmph nOdes challeNge) [16,17].
The concept of challenges in medical imaging and computer vision has been around for nearly a decade. In medical imaging it mostly started with the liver segmentation challenge at the annual MICCAI conference in 2007 [18] and in computer vision the ImageNet Challenge is most widely known [19]. The main goal of challenges, both in medical imaging and in computer vision, is to allow a meaningful comparison of algorithms. In scientific literature, this was often not the case as authors present results on their own, often proprietary, datasets with their own choice of evaluation metrics. In medical imaging this was specifically a problem as sharing medical data is often difficult. Challenges change this by making available datasets and enforcing standardized evaluation. Furthermore, challenges have the added benefit of opening up meaningful research questions to a large community who normally might not have access to the necessary datasets.
The CAMELYON dataset was collected at different Dutch medical centers to cover the heterogeneity encountered in clinical practice. It contains a total of 1399 WSIs, resulting in approximately three terabytes of image data. We released a part of the dataset with the reference standard (i.e. the training set) to allow other groups to build algorithms to detect metastases. Subsequently, the rest of the dataset was released without reference standard (i.e. the test set). Participating teams could submit their algorithm output on the test set to us, after which we evaluated their performance on a predefined set of metrics to allow fair and standardized comparison to other teams. To enable participation of teams that are not familiar with whole-slide images, we released a publicly available software package for viewing WSIs, annotations and algorithmic results, dubbed the Automated Slide Analysis Platform (ASAP) [20]. This paper describes the CAMELYON dataset in detail, and covers the following topics:
• Sample collection
• Slide digitization and conversion
• Challenge dataset construction and statistics
• Instructions on the use of ASAP to view and analyze slides
• Suggestions for data re-use

Data Description
The CAMELYON dataset is a combination of the WSIs of sentinel lymph node tissue sections collected for the CAMELYON16 and CAMELYON17 challenges, which contained 399 WSIs and 1000 WSIs, respectively. This resulted in a total of 1399 unique WSIs and a total data size of 2.95 terabytes. The dataset is currently publicly available after registration via the CAMELYON17 website [17]. At the time of writing it has been accessed by over 1000 registered users worldwide. It has been licensed under the Creative Commons CC0 license.

Data collection
Collection of the data was approved by the local ethical committee of the Radboud University Medical Center (RUMC) under 2016-2761 and the need for informed consent was waived. Data was collected at five different medical centers in the Netherlands: the RUMC, the Utrecht University Medical Center (UMCU), the Rijnstate Hospital (RST), the Canisius-Wilhelmina Hospital (CWZ), and LabPON (LPON). An example of digitized slides from these centers can be seen in Figure 1. Initial identification of cases eligible for inclusion was based on local pathology reports of sentinel lymph node procedures between 2006 and 2016. The exact years included varied from center to center, but did not affect data distribution or quality. After the lists of sentinel node procedures and the corresponding glass slides containing H&E-stained tissue sections were obtained, slides were randomly selected for inclusion. As the vast majority of sentinel lymph nodes are negative for metastases, selection was stratified for the presence of macro-metastases, micro-metastases and ITCs based on the original pathology reports. This was done to obtain a good representation of differing metastasis appearance without the need for an excessively large dataset.

Table 5. Patient-level characteristics for the CAMELYON17 part of the dataset.

Center   Total patients   Stages (Train)
         Train   Test     pN0   pN0(i+)   pN1mi   pN1   pN2
CWZ      20      20       4     3         5       6     2
LPON     20      20       5     2         3       5     5
RST      20      20       4     2         5       5     4
RUMC     20      20       3     3         3       6     5
UMCU     20      20       8     1         5       3     3
Total    100     100      24    12        20      25    19

Data was acquired in two stages, corresponding to the time periods for organization of the CAMELYON16 and CAMELYON17 challenges. Within the CAMELYON16 challenge, only data from the RUMC and UMCU was acquired and no slides containing only ITCs were included. For CAMELYON17 data was included from all five centers and glass slides containing only ITCs were obtained as well. A categorization of the slides can be found in Tables 3 and 4.
After selection of the glass slides, they were digitized with different slide scanners such that scan variability across centers was captured in addition to H&E-staining procedure variability. The slides from RUMC, CWZ and RST were scanned with the 3DHistech Pannoramic Flash II 250 scanner at the RUMC. At the UMCU slides were scanned with a Hamamatsu NanoZoomer-XR C12000-01 scanner and at LPON with a Philips Ultrafast Scanner.
As all slides are initially stored in an original vendor format, which makes re-use challenging, slides were converted to a common, generic TIFF (Tagged Image File Format) using an open-source file converter, part of the ASAP package [20]. As there are no open-source tools to convert the iSyntax format produced by the Philips Ultrafast Scanner, a proprietary converter was used to convert these files to a special TIFF format [21], which can be read by the open-source package OpenSlide [22] and the ASAP package [20]. Some basic descriptors are shown in Table 6.
After digitization, the reference standard for each slide needed to be established. The reference standard for each WSI consisted of a slide-level label indicating the largest metastasis within a slide (i.e. no metastasis, macro-metastasis, micro-metastasis or ITC). Furthermore, for all 399 WSIs which were part of the CAMELYON16 challenge and an additional 50 WSIs from the CAMELYON17 challenge, detailed contours were drawn along the boundaries of metastases within the WSI. For the 50 slides of the CAMELYON17 challenge, 10 slides from each center were used to allow users of the dataset to analyze differences in metastasis appearance across centers.
Initial slide-level labels were assigned based on the pathology reports obtained from clinical routine. For the CAMELYON16 part of the dataset, all slides were subsequently examined and metastases outlined by an experienced lab technician (M.H.) and a clinical PhD student (Q.M.). Afterwards, all annotations were inspected by one of two expert breast pathologists (P.B. or P.v.D.). Some slides contained two consecutive tissue sections of the same lymph node, in which case only one of the two sections was annotated, as this did not affect the slide-level label. In total, 15 slides may contain unlabeled metastatic areas; these are indicated via a descriptive text file which is part of the dataset. For the entire dataset, when the slide-level label was unclear during the inspection of the H&E-stained slide, an additional WSI with a consecutive tissue section, immunohistochemically (IHC) stained for cytokeratin, was used to confirm the classification. Furthermore, this stain was also used to aid in drawing the outlines in both CAMELYON16 and CAMELYON17, which helps limit observer variability. As both the H&E and IHC slides are digital, they can be viewed simultaneously, allowing observers to easily identify the same areas in both slides. This stain is also used in daily clinical pathology practice to resolve diagnosis in the case of metastasis-negative H&E [23,24]. An example of an H&E WSI and the corresponding consecutive cytokeratin immunohistochemical section is shown in Figure 3.
In the CAMELYON17 dataset, after establishing the reference standard, slides were divided into artificial patients, covering the different pN-stages (see Table 2). Each artificial patient only had WSIs from one center. For each artificial patient in the training part of the dataset, the pN-stage and the slide-level labels were provided. This was done to assess the potential of participating algorithms within the challenge to perform automated pN-staging. However, all WSIs can be used independently of their patient-level labels.
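A minimal sketch of how a patient-level pN-stage could be derived from the slide-level labels of an artificial patient is given below. It assumes the simplified staging rules referred to in Table 2 read as follows: pN0 when nothing is found, pN0(i+) when only ITCs are found, pN1mi when micro-metastases but no macro-metastases are found, and pN1 or pN2 when at least one macro-metastasis is present and metastases are found in 1 to 3 or in 4 or more nodes, respectively. The label strings and the function name `pn_stage` are illustrative and are not part of the dataset or of ASAP.

```python
from collections import Counter

# Illustrative label names (an assumption, not the dataset's own encoding).
NEGATIVE, ITC, MICRO, MACRO = "negative", "itc", "micro", "macro"


def pn_stage(slide_labels):
    """Derive a pN-stage from the slide-level labels of one patient.

    Assumed rules: ITCs do not count as metastases; pN1 and pN2 require
    at least one macro-metastasis and differ in the number of positive nodes.
    """
    counts = Counter(slide_labels)
    n_macro = counts[MACRO]
    n_micro = counts[MICRO]
    n_positive_nodes = n_macro + n_micro  # nodes with micro- or macro-metastases

    if n_macro == 0:
        if n_micro > 0:
            return "pN1mi"
        if counts[ITC] > 0:
            return "pN0(i+)"
        return "pN0"

    # At least one macro-metastasis is present from here on.
    if n_positive_nodes <= 3:
        return "pN1"
    return "pN2"  # assumption: 4 or more positive nodes are all treated as pN2


example_patient = [MACRO, MICRO, NEGATIVE, ITC, NEGATIVE]
example_stage = pn_stage(example_patient)
```

Because each slide is treated as one lymph node, the stage depends only on the counts of the four slide-level labels, which is also why the WSIs can be used independently of the patient grouping.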
After the dataset and reference standard were established, we uploaded the entire dataset to Google Drive and to BaiduPan. These two options were chosen to reach as wide an audience as possible, given that Google Drive is not accessible everywhere (e.g. in the People's Republic of China). A link to the data was shared with participants after registration at the CAMELYON websites [16,17].

Data validation and quality control
All glass slides included in the CAMELYON dataset were part of routine clinical care and are thus of diagnostic quality. However, during the acquisition process scanning can fail or result in out-of-focus images. As a quality control measure, all slides were inspected manually after scanning. The inspection was performed by an experienced technician (Q.M. and N.S. for center UMCU, M.H. or R.vd.L. for the other centers) to assess the quality of the scan; when in doubt, a pathologist was consulted on whether scanning issues might affect diagnosis.
Due to the inclusion of IHC for establishing the reference standard the chance of errors being made can be considered limited, as pathologists make few mistakes in identifying metastases with IHC [25]. Furthermore, all slides were checked twice. However, to further ensure the quality of the reference standard we looked at algorithmic results submitted to the challenge to identify slides where the best performing algorithms disagreed with the reference standard. This led to a correction of the reference standard in 3 of the 1399 slides.

Tools for data use
Several tools are available to visualize and interact with the CAMELYON-dataset. Here we will present examples of how to use the data with an open-source package developed by us, called ASAP (Automated Slide Analysis Platform) [20]. Other open-source packages are also available, such as OpenSlide [26], but those do not contain functionality for reading annotations or storing image analysis results.
• Project name: Automated Slide Analysis Platform (ASAP)
• Project home page: https://github.com/GeertLitjens/ASAP
• Operating system(s): Linux, Windows
• Programming language: C++, Python
• Other requirements: CMake (www.cmake.org)
• License: GNU GPL v2.0
ASAP contains several components, one of which is a viewer/annotation application (Figure 4). This can be started via the ASAP executable within the installation folder of the package. After opening an image file from the CAMELYON dataset, one can explore the data via a 'Google Maps'-like interface. The provided reference standard can be loaded via the annotation plugin. Furthermore, new annotations can be made with the provided annotation tools. Last, the viewer is not limited to files from the CAMELYON dataset but can visualize most WSI formats.
In addition to the viewer application and C++ library to read and write WSI images, we also provide Python-wrapped modules. To access the data via Python the following code snippet can be used.

    [...]

The annotations are provided in human-readable XML format and can be parsed using the ASAP package. However, other XML reading libraries can also be used. Annotations are stored as polygons. Each polygon consists of a list of (x, y) coordinates at the highest resolution level of the image. Annotations can be converted to binary images via the following code snippet.

    [...]
    print(annotation.getArea())
    print(annotation.getNumberOfPoints())
    print(annotation.getCoordinate(0).getX())

    # Convert the annotations to an indexed image
    annotation_mask = mir.AnnotationToMask()
    label_map = {'metastases': 1, 'normal': 2}
    output_path = 'patient_010_node_4_labels.tif'
    annotation_mask.convert(annotation_list, output_path, image.getDimensions(), image.getSpacing(), label_map)

The Python package can also be used to perform image processing or machine learning tasks on the data and write out an image result. The code snippet below performs some basic thresholding to generate a background mask. These results can then subsequently be visualized using the viewer component of ASAP, which also supports floating point images. An example of the code snippet result can be seen in Figure 4c.

    [...], (512,512), order=0, mode="constant", preserve_range=True).astype("ubyte")
    writer.writeBaseImagePart(res_tl.flatten())
    writer.finishImage()

The ASAP package also supports writing your own image processing routines and integrating them as plugins into the viewer component. Some existing examples like color deconvolution and nuclei detection are provided.
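Since the annotations are plain XML polygons, they can also be read without ASAP. The sketch below uses only the Python standard library. The element and attribute names (`Annotation`, `Coordinate`, `Order`, `X`, `Y`, `PartOfGroup`) are an assumption about the file layout and should be checked against an actual annotation file. The example polygon, the helper names and the downsample factor of 16 are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed layout of an annotation file; the coordinates are made up.
EXAMPLE_XML = """
<ASAP_Annotations>
  <Annotations>
    <Annotation Name="_0" Type="Polygon" PartOfGroup="metastases">
      <Coordinates>
        <Coordinate Order="0" X="1000" Y="1000" />
        <Coordinate Order="1" X="1400" Y="1000" />
        <Coordinate Order="2" X="1400" Y="1200" />
        <Coordinate Order="3" X="1000" Y="1200" />
      </Coordinates>
    </Annotation>
  </Annotations>
</ASAP_Annotations>
"""


def read_polygons(xml_text):
    """Return a list of (name, group, [(x, y), ...]) tuples, one per annotation."""
    root = ET.fromstring(xml_text.strip())
    polygons = []
    for annotation in root.iter("Annotation"):
        # Sort by the Order attribute so the outline is traversed consistently.
        coords = sorted(annotation.iter("Coordinate"),
                        key=lambda c: int(c.get("Order")))
        points = [(float(c.get("X")), float(c.get("Y"))) for c in coords]
        polygons.append((annotation.get("Name"),
                         annotation.get("PartOfGroup"), points))
    return polygons


def polygon_area(points):
    """Shoelace formula: area of a simple polygon in squared pixels."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def to_level(points, downsample):
    """Scale highest-resolution coordinates to a lower-resolution level."""
    return [(x / downsample, y / downsample) for x, y in points]


polygons = read_polygons(EXAMPLE_XML)
name, group, points = polygons[0]
area_level0 = polygon_area(points)        # area at the highest resolution
points_low_res = to_level(points, 16.0)   # downsample factor is an assumption
```

Because the coordinates refer to the highest resolution level, they have to be divided by the downsample factor of a level before they are overlaid on image data read at that level.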

Re-use potential
The CAMELYON dataset is currently still being used within the CAMELYON17 challenge, which is open for new participants and submissions. In this context, the dataset enables testing new machine learning and image analysis strategies against the current state-of-the-art. Within CAMELYON we evaluate the algorithms based on a weighted Cohen's kappa at the pN-stage level [27]. This statistic measures the categorical agreement between the algorithm and the reference standard, where a value of 0 indicates agreement at the level of chance and 1 is perfect agreement. The quadratic weighting penalizes deviations of more than one category more severely.
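To make the metric concrete, the sketch below computes a quadratic weighted Cohen's kappa in plain Python. It is an illustration of the statistic, not the challenge's evaluation script. Encoding the five pN-stages as integers 0 to 4 in order of severity is an assumption made for the example.

```python
def quadratic_weighted_kappa(reference, predicted, n_categories):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_categories-1."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("at least two categories are required")
    n = len(reference)
    k = n_categories

    # Observed confusion matrix: rows are reference labels, columns predictions.
    observed = [[0] * k for _ in range(k)]
    for r, p in zip(reference, predicted):
        observed[r][p] += 1

    ref_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2         # quadratic disagreement weight
            expected = ref_hist[i] * pred_hist[j] / n     # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # All labels fall into one category; kappa is undefined in that case.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Assumed encoding: pN0=0, pN0(i+)=1, pN1mi=2, pN1=3, pN2=4.
reference = [0, 1, 2, 3, 4]
off_by_one = [1, 1, 2, 3, 4]   # one patient misstaged by one category
off_by_two = [2, 1, 2, 3, 4]   # the same patient misstaged by two categories

kappa_perfect = quadratic_weighted_kappa(reference, reference, 5)
kappa_one = quadratic_weighted_kappa(reference, off_by_one, 5)
kappa_two = quadratic_weighted_kappa(reference, off_by_two, 5)
```

In this example a single error of two categories lowers the score more than a single error of one category, which is the effect of the quadratic weighting.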
Conclusions arising from such experiments may have significance for the broader field of computational pathology, rather than being restricted to this particular application. For example, experiments with weakly supervised machine learning in histopathology may benefit from the CAMELYON dataset, with an established baseline based on fully supervised machine learning.
The dataset has also been used by companies experienced in machine learning applications as a first foray into digital pathology, for example Google [28]. Because of its extent, observer experiments with pathologists may be performed to assess the value of algorithms within a diagnostic setting. For example, a comparison of algorithms competing in the CAMELYON16 challenge to pathologists in clinical practice was recently published [29]. Experiments with the dataset may serve to identify relevant issues with implementation, validation and regulatory affairs with respect to computational pathology.
A key example of implementation issues with respect to machine learning algorithms in medical imaging is generalization to different centers. In pathology, centers can differ in tissue preparation, staining protocol and scanning equipment, each of which can have a profound impact on image appearance. In the CAMELYON dataset we included data from five centers and three different scanners. We are confident algorithms trained with this data will generalize well. Users of the dataset can even explicitly evaluate this, as we have indicated for each image from which center it was obtained. By leaving out one center and evaluating performance on that center specifically, the participants can assess the robustness of their algorithms.
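The leave-one-center-out evaluation described above can be sketched as follows. The slide identifiers and the `(slide_id, center)` pairing are hypothetical; in practice the center of each image would be taken from the information provided with the dataset.

```python
def leave_one_center_out(slides):
    """Yield (held_out_center, train_slides, test_slides) for every center.

    `slides` is a list of (slide_id, center) pairs.
    """
    centers = sorted({center for _, center in slides})
    for held_out in centers:
        train = [slide for slide, center in slides if center != held_out]
        test = [slide for slide, center in slides if center == held_out]
        yield held_out, train, test


# Hypothetical slide identifiers, two per center.
slides = [
    ("slide_01", "CWZ"), ("slide_02", "CWZ"),
    ("slide_03", "LPON"), ("slide_04", "LPON"),
    ("slide_05", "RST"), ("slide_06", "RST"),
    ("slide_07", "RUMC"), ("slide_08", "RUMC"),
    ("slide_09", "UMCU"), ("slide_10", "UMCU"),
]

splits = list(leave_one_center_out(slides))
for held_out, train, test in splits:
    # A model would be trained on `train` and evaluated on `test` here.
    print(held_out, len(train), len(test))
```

Comparing the per-center results with those of a random split gives an indication of how much performance depends on center-specific staining and scanning characteristics.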
We believe the usefulness of the dataset also extends beyond its initial use within the CAMELYON challenge. For example, it can be used for evaluation of color normalization algorithms, and for cell detection/segmentation algorithms.

Declarations
List of abbreviations
ASAP: Automated Slide Analysis Platform
H&E: Hematoxylin and eosin
IHC: Immunohistochemistry
ITC: Isolated tumor cells
WSI: Whole-slide image