Open Pan-Cancer Histology Dataset
In this work we present an experimental setup for semi-automatically obtaining exhaustive nuclei labels across 19 different tissue types, and use it to construct a large pan-cancer dataset for nuclei instance segmentation and classification with minimal sampling bias. The dataset consists of 455 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications and from multiple data sources. In total, the dataset contains 216.4K labeled nuclei, each with an instance segmentation mask. We pursue three independent streams to create the dataset (detection, classification, and instance segmentation) by ensembling a total of 34 models trained on existing public datasets, thereby showing that learnt knowledge can be efficiently transferred to create new datasets. Each of the three streams is validated either on existing public benchmarks or by expert pathologists; the streams are then merged and validated once more to create PanNuke, a large, comprehensive pan-cancer dataset for nuclei segmentation and detection.