In a recent study published in the journal Nature, researchers developed and evaluated the Providence Gigapixel Pathology Model (Prov-GigaPath), a whole-slide pathology foundation model that achieves state-of-the-art performance on digital pathology tasks by combining large-scale real-world data with a novel vision transformer architecture.
Study: A whole-slide foundation model for digital pathology from real-world data.
Background
Computational pathology can revolutionize cancer diagnostics through applications in subtyping, staging, and prognostic prediction. However, current methods require extensive annotated data, which is costly and time-consuming to produce. Self-supervised learning shows promise by using unlabeled data to pretrain models, reducing this need. Remaining challenges include the limited and variable quality of available data, the difficulty of capturing both local and global patterns, and restricted access to pre-trained models. Foundation models provide strong generalizability, which is essential for biomedical fields with abundant unlabeled data. Further research is necessary to improve these models’ generalizability and clinical applicability across diverse datasets and real-world settings.
About the study
The present study’s preprocessing of whole-slide images (WSIs) involved a pipeline for 171,189 Hematoxylin and Eosin (H&E)-stained and immunohistochemistry slides. Tissue segmentation filtered out background regions using Otsu image thresholding. WSIs were resized to 0.5 μm per pixel and cropped into 256×256-pixel tiles, with tiles containing less than 10% tissue coverage discarded. Prov-GigaPath’s tile encoder, a Vision Transformer (ViT), was pre-trained with the DINOv2 (self-distillation with no labels, version 2) settings on 1,384,860,229 tiles. The slide encoder used the LongNet long-sequence transformer architecture. Pre-training, which involved grid discretization, augmentations, and a masked autoencoder, ran on 16 nodes with 4×80 GB A100 GPUs each and completed in two days.
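To make the tiling step concrete, the sketch below implements the described filter: Otsu thresholding to separate tissue from the bright glass background, 256×256-pixel tiling, and rejection of tiles with less than 10% tissue. This is a minimal illustration, not the authors' pipeline; it assumes the slide has already been resized to 0.5 μm per pixel and loaded as an RGB array, whereas a real pipeline would read pyramidal slide files with a library such as OpenSlide.

```python
# Hypothetical sketch of the tiling step described above (not the study's code).
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

TILE = 256          # tile side in pixels
MIN_TISSUE = 0.10   # discard tiles with <10% tissue coverage

def tile_wsi(wsi_rgb: np.ndarray):
    """Yield (row, col, tile) for tiles passing the tissue-coverage filter."""
    gray = rgb2gray(wsi_rgb)
    # Otsu thresholding: tissue is darker than the bright glass background.
    tissue_mask = gray < threshold_otsu(gray)
    h, w = gray.shape
    for r in range(0, h - TILE + 1, TILE):
        for c in range(0, w - TILE + 1, TILE):
            if tissue_mask[r:r + TILE, c:c + TILE].mean() >= MIN_TISSUE:
                yield r // TILE, c // TILE, wsi_rgb[r:r + TILE, c:c + TILE]
```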
Prov-GigaPath was compared with the Hierarchical Image Pyramid Transformer (HIPT), CTransPath, and REMEDIS (Robust and Data-Efficient Generalization of Self-Supervised Machine Learning for Diagnostic Imaging). HIPT, pre-trained on The Cancer Genome Atlas (TCGA) slides, uses a hierarchical image pyramid transformer architecture, while CTransPath combines a Convolutional Neural Network (CNN) with a Swin Transformer. REMEDIS uses a ResNet backbone with the Simple Framework for Contrastive Learning of Visual Representations (SimCLR) approach. Prov-GigaPath and these baselines were fine-tuned on diverse downstream tasks, using Attention-Based Multiple Instance Learning (ABMIL) to aggregate tile embeddings into slide-level representations.
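The ABMIL aggregation named above can be summarized in a few lines: an attention network scores each tile embedding, and the softmax-weighted sum of embeddings forms the slide representation fed to a classifier. The PyTorch sketch below follows the original ABMIL formulation (Ilse et al., 2018); the dimensions are illustrative, not the study's actual hyperparameters.

```python
# Minimal PyTorch sketch of attention-based MIL pooling (Ilse et al., 2018).
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    def __init__(self, embed_dim: int = 768, attn_dim: int = 128, n_classes: int = 2):
        super().__init__()
        # Attention network assigns one score to each tile embedding.
        self.attention = nn.Sequential(
            nn.Linear(embed_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (n_tiles, embed_dim) -- one pre-computed embedding per tile.
        scores = self.attention(tiles)                  # (n_tiles, 1)
        weights = torch.softmax(scores, dim=0)          # attention over tiles
        slide_embedding = (weights * tiles).sum(dim=0)  # (embed_dim,)
        return self.classifier(slide_embedding)         # slide-level logits
```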
For mutation prediction, Providence Pathology (Prov-Path) data were used to construct tasks, including pan-cancer biomarker and gene-mutation prediction, evaluated with the Area Under the Receiver Operating Characteristic Curve (AUROC) and Area Under the Precision-Recall Curve (AUPRC) under 10-fold cross-validation. Cancer subtyping evaluations covered nine cancer types, with models fine-tuned for 20 epochs.
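For readers unfamiliar with this evaluation protocol, the sketch below shows a 10-fold cross-validated AUROC/AUPRC computation. It is purely illustrative: a simple logistic-regression probe over per-slide embeddings stands in for the study's fine-tuned models, and `slide_embeddings` and `labels` are hypothetical placeholders.

```python
# Illustrative 10-fold cross-validated AUROC/AUPRC evaluation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(slide_embeddings: np.ndarray, labels: np.ndarray):
    """Return mean AUROC and AUPRC over 10 stratified folds."""
    aurocs, auprcs = [], []
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(slide_embeddings, labels):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(slide_embeddings[train_idx], labels[train_idx])
        probs = clf.predict_proba(slide_embeddings[test_idx])[:, 1]
        aurocs.append(roc_auc_score(labels[test_idx], probs))
        auprcs.append(average_precision_score(labels[test_idx], probs))
    return float(np.mean(aurocs)), float(np.mean(auprcs))
```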
Vision-language alignment involved creating 17,383 pathology WSI-report pairs, processed with the open-source Contrastive Language-Image Pre-training (OpenCLIP) codebase. Reports were cleaned using Generative Pre-trained Transformer (GPT)-3.5, and text embeddings were computed with OpenAI’s text-embedding-ada-002 model. Zero-shot prediction tasks compared the model against Multiple Instance Learning Zero-shot Transfer (MI-Zero), Biomedical Contrastive Language-Image Pre-training (BiomedCLIP), and Pathology-specific Language-Image Pre-training (PLIP) on subtyping and mutation-status prediction, using the settings and prompt templates from MI-Zero.
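Zero-shot prediction in this CLIP-style setup reduces to a similarity search: embed one text prompt per class, embed the slide, and pick the class whose prompt is closest in the shared space. The sketch below shows the idea; `encode_slide`, `encode_text`, and the example prompt are hypothetical stand-ins for the aligned encoders, not the study's API.

```python
# Schematic CLIP-style zero-shot subtyping over a shared embedding space.
import numpy as np

def zero_shot_predict(slide, class_prompts, encode_slide, encode_text) -> int:
    """Return the index of the class whose prompt best matches the slide."""
    v = encode_slide(slide)                       # slide embedding
    v = v / np.linalg.norm(v)
    sims = []
    for prompt in class_prompts:                  # e.g. "an H&E image of LUAD"
        t = encode_text(prompt)
        sims.append(v @ (t / np.linalg.norm(t)))  # cosine similarity
    return int(np.argmax(sims))
```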
Figure: (a) Flow chart showing the model architecture of Prov-GigaPath. Prov-GigaPath first serializes each input WSI into a sequence of 256 × 256 image tiles in row-major order and uses an image tile-level encoder to convert each image tile into a visual embedding. Prov-GigaPath then applies a slide-level encoder based on the LongNet architecture to generate contextualized embeddings, which can serve as the basis for various downstream applications. (b) Image tile-level pretraining using DINOv2. (c) Slide-level pretraining with LongNet using a masked autoencoder. [CLS] is the classification token.
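The two-stage flow in panel (a) can be expressed schematically: a tile encoder embeds each 256×256 tile, then a sequence model contextualizes the tile embeddings across the whole slide. In the sketch below, a standard transformer encoder stands in for LongNet's dilated attention purely for illustration, and `tile_encoder` is assumed to map image tiles to fixed-size embeddings.

```python
# Conceptual sketch of the tile-then-slide encoding described in the caption.
import torch
import torch.nn as nn

class TwoStageEncoder(nn.Module):
    def __init__(self, tile_encoder: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.tile_encoder = tile_encoder  # e.g. a ViT trained with DINOv2
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        # Standard self-attention here; the real model uses LongNet's
        # dilated attention to scale to slide-length tile sequences.
        self.slide_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (n_tiles, 3, 256, 256), serialized in row-major order.
        embeddings = self.tile_encoder(tiles)             # (n_tiles, embed_dim)
        contextual = self.slide_encoder(embeddings.unsqueeze(0))
        return contextual.squeeze(0)                      # (n_tiles, embed_dim)
```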
Study results
The study demonstrated that Prov-GigaPath achieves superior performance across various digital pathology tasks compared to existing methods. Prov-GigaPath was pre-trained on Prov-Path, a large dataset derived from the Providence health system. This dataset includes 1,384,860,229 image tiles from 171,189 whole pathology slides collected from around 30,000 patients. The model employs the GigaPath architecture, leveraging the LongNet method for ultra-large-context modeling of gigapixel WSIs.
Prov-GigaPath demonstrated significant improvements in mutation prediction and cancer subtyping tasks. For instance, on the Lung Adenocarcinoma (LUAD)-specific five-gene mutation prediction task using TCGA data, Prov-GigaPath outperformed competing models with higher AUROC and AUPRC scores. Similar results were observed in pan-cancer 18-biomarker prediction and pan-cancer Tumor Mutation Burden (TMB) prediction tasks, showcasing the model’s robustness and generalizability across different datasets.
In addition to mutation prediction, Prov-GigaPath excelled in cancer subtyping tasks, outperforming state-of-the-art models in subtyping nine major cancer types. The substantial performance improvements underscore the effectiveness of combining local tile embeddings with global slide-level contextual information using LongNet.
Prov-GigaPath also explored vision-language processing by aligning pathology images with associated textual reports. The model achieved the best zero-shot classification results on Non-Small Cell Lung Cancer (NSCLC) and Colorectal Adenocarcinoma (COADREAD) subtyping tasks compared to three state-of-the-art pathology vision–language models. This indicates the advantage of the slide-level alignment enabled by LongNet and of pre-training on real-world clinical data rather than web-sourced data such as Twitter (now X) posts.
Conclusions
The study highlighted Prov-GigaPath’s potential to enhance clinical diagnostics and decision support in digital pathology. Its scalability and adaptability make it a promising tool for broader biomedical applications, facilitating efficient self-supervised learning from high-resolution images. Prov-Path, with 1,384,860,229 image tiles from 171,189 pathology slides of approximately 30,000 patients, is significantly larger than TCGA, and GigaPath uses LongNet for ultra-large-context modeling of gigapixel WSIs. Prov-GigaPath demonstrated state-of-the-art performance on pathomics, cancer subtyping, and vision-language processing tasks on both the Providence and TCGA datasets, suggesting its applicability to broader biomedical domains.