30  Image analysis

30.1 Preamble

30.1.1 Introduction

Biomedical image analysis encompasses a wide range of imaging modalities, including computed tomography (CT) scans, magnetic resonance imaging (MRI), immunofluorescence, and histological staining such as hematoxylin and eosin (H&E). In this chapter, we focus specifically on the analysis of H&E-stained histopathological images, which are routinely used in clinical diagnostics due to their low cost and ability to reveal rich morphological details.

The R and Bioconductor ecosystems offer several tools and workflows to work with digital pathology data. However, a central question remains: what constitutes the most valuable information in these images? Is it the image itself, or the biological and clinical insights that can be computationally extracted from it?

Previous chapters discussed spatial transcriptomics and its power to link gene expression to tissue architecture. Here, we continue along this line by exploring how histopathological images can be leveraged to extract meaningful features that serve as input for integrative analyses in cancer research and beyond.

30.2 Background

30.2.1 H&E images

Histology is the study of normal tissue structure, whereas pathology focuses on identifying abnormalities in diseased tissue. Both commonly rely on hematoxylin and eosin (H&E) staining to visualize cellular and tissue morphology, and the term histopathology describes the microscopic examination of diseased tissue.

H&E staining is one of the most widely used and cost-effective techniques in histopathology. Hematoxylin stains cell nuclei and eosin stains cytoplasmic and extracellular components, which together provide essential morphological information and a clear view of tissue architecture. Due to its low cost, high availability, and compatibility with routine clinical workflows, H&E staining is the standard first step in pathological diagnosis.

In recent years, digital pathology has enabled the large-scale acquisition and analysis of H&E-stained whole-slide images (WSIs), fostering the development of computational methods to extract quantitative features and support data-driven research in cancer and other diseases. Recent studies have demonstrated how histopathological images can be used to predict genomic alterations, transcriptional states, or even patient outcomes using machine learning (Madabhushi and Lee 2016; Schmauch et al. 2020; Bergstrom et al. 2024). Examples include HE2RNA, which predicts RNA-Seq profiles from images (Schmauch et al. 2020), and models that infer spatial transcriptomics data from histology (Pizurica et al. 2024).

Digital pathology workflows rely on high-resolution whole-slide images (WSIs) generated by proprietary scanners from different vendors. These WSIs are saved in specific file formats, each corresponding to a particular scanner type. Understanding these formats is essential for designing interoperable and reproducible computational pipelines.

The scanner brands and their respective file formats commonly encountered in digital pathology include:

  • Aperio: .svs, .tif
  • DICOM-compatible scanners: .dcm
  • Hamamatsu: .vms, .vmu, .ndpi
  • Leica: .scn
  • MIRAX: .mrxs
  • Philips: .tiff
  • Sakura: .svslide
  • Trestle: .tif
  • Ventana: .bif, .tif
  • Zeiss: .czi
  • Generic tiled TIFF: .tif
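Because several vendors share the same extension, a file suffix alone does not always identify the scanner. The sketch below encodes the list above as a lookup table; the helper name `guess_wsi_vendor` is hypothetical and is meant only to illustrate why format detection in a pipeline should not rely on the extension by itself.

```r
# Lookup table built from the list above: file extension -> candidate vendors.
# Several vendors share ".tif", so one extension can map to more than one entry.
wsi_formats <- list(
  svs     = "Aperio",
  tif     = c("Aperio", "Trestle", "Ventana", "Generic tiled TIFF"),
  dcm     = "DICOM-compatible scanners",
  vms     = "Hamamatsu",
  vmu     = "Hamamatsu",
  ndpi    = "Hamamatsu",
  scn     = "Leica",
  mrxs    = "MIRAX",
  tiff    = "Philips",
  svslide = "Sakura",
  bif     = "Ventana",
  czi     = "Zeiss"
)

# Hypothetical helper: return the candidate vendors for each file path,
# or NA when the extension is not in the table.
guess_wsi_vendor <- function(paths) {
  ext <- tolower(tools::file_ext(paths))
  out <- lapply(ext, function(e) {
    if (e %in% names(wsi_formats)) wsi_formats[[e]] else NA_character_
  })
  stats::setNames(out, basename(paths))
}

guess_wsi_vendor(c("slide_A.svs", "slide_B.ndpi", "slide_C.tif"))
```

For an ambiguous extension such as .tif, the file's internal metadata would have to be inspected to settle the vendor.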

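Each format uses its own tiling scheme and metadata structure to support rapid visualization and efficient storage. For instance, Aperio’s .svs format uses a pyramidal tiling strategy: the same image is stored at multiple resolutions within a single file, so a viewer can load a low-resolution overview quickly and read full-resolution tiles only for the region being inspected.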

Figure 30.1: The pyramidal structure of a WSI, resulting from different levels of resolution
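To make the pyramid concrete, the following sketch computes the pixel dimensions of each level from a full-resolution size and a set of downsample factors. The helper `pyramid_levels`, the slide size, and the factors (1, 4, 16, 64) are illustrative assumptions; real files record their own level dimensions in the metadata.

```r
# Hypothetical helper: dimensions of each pyramid level, given the
# full-resolution size and per-level downsample factors.
pyramid_levels <- function(width, height, downsample = c(1, 4, 16, 64)) {
  data.frame(
    level      = seq_along(downsample) - 1L,
    downsample = downsample,
    width      = ceiling(width / downsample),
    height     = ceiling(height / downsample)
  )
}

# Illustrative slide size (made up, not read from a real file).
lv <- pyramid_levels(width = 100000, height = 80000)

# Approximate uncompressed RGB size per level in gigabytes (3 bytes per pixel).
lv$gb_uncompressed <- lv$width * lv$height * 3 / 1e9
lv
```

The steep drop in size between levels is what makes it practical to run tissue detection or previews on a coarse level and reserve the full-resolution level for tile extraction.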

Several publicly available repositories, such as The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA), provide free access to large-scale genomic and imaging datasets. In the following sections, we will explore these resources in more detail.

30.2.2 TCGA Data

The Cancer Genome Atlas (TCGA) includes a collection of 11,765 diagnostic whole-slide images from 9,640 patients across 33 cancer types (Tomczak, Czerwińska, and Wiznerowicz 2015). These histopathological images represent only one component of TCGA’s broader multi-omics repository. Alongside WSIs, TCGA provides a rich array of molecular and clinical data, including gene expression (RNA-Seq), somatic mutation profiles (whole-exome sequencing), DNA methylation, copy number alterations, protein expression (RPPA), and comprehensive clinical annotations. This multidimensional dataset facilitates integrative analyses that connect tissue morphology with molecular alterations and clinical outcomes.

TCGA includes two main types of histological slides: flash-frozen and formalin-fixed paraffin-embedded (FFPE). Flash-frozen slides are typically produced intraoperatively in a cryolab to help surgeons assess tumor margin status. While this method ensures close proximity to the tissue used for genomic extraction, it often introduces morphological artifacts such as tissue cracking and holes due to freezing, resulting in a “Swiss cheese” appearance that limits the utility of these slides for computational analysis.

Conversely, FFPE slides, considered the gold standard in diagnostic histopathology, are created by chemically fixing tissue in formalin and embedding it in paraffin wax before slicing. These slides preserve fine tissue architecture and provide visually high-quality samples, making them more suitable for algorithmic analysis. However, because of spatial heterogeneity in tumors, FFPE samples may not precisely correspond to the regions used for genomic profiling.

Tissue submitted to TCGA undergoes a structured workflow at the Biospecimen Core Resource (BCR). Two slides – designated top-section (TS) and bottom-section (BS) – are reviewed to evaluate tumor content and necrosis percentage. The central portion of the sample is reserved for RNA and DNA extraction. Additionally, one or more diagnostic FFPE slides are submitted to confirm histopathological diagnosis. These diagnostic slides originate from the same tumor, but the spatial and molecular correspondence to the genomics-extracted tissue is often uncertain. Thus, researchers must consider a tradeoff between image quality and genomic adjacency when designing image-based studies using TCGA data (Cooper et al. 2018).

30.2.3 TCIA Data

The Cancer Imaging Archive (TCIA) is a large-scale open-access repository that provides a comprehensive collection of medical images of cancer, including radiological scans (e.g., CT, MRI, PET) and histopathological images. TCIA is a critical resource for cancer imaging research as it includes richly annotated datasets with accompanying clinical, genomic, and pathological metadata. It supports a wide range of applications, including image-based biomarker discovery, radiogenomics, and multi-modal integration studies. Researchers can access TCIA datasets through its user interface or programmatically via APIs, which facilitate the retrieval and processing of large volumes of image data in a reproducible and automated manner.

30.3 Feature extraction

Histopathological images contain a vast amount of information, but to make them usable in computational analysis, this information needs to be translated into numerical features. Broadly speaking, two types of features can be extracted: human-interpretable features, which capture biologically meaningful descriptors, and latent embeddings, which encode complex patterns through deep learning. Together, these approaches enable both biological interpretation and powerful predictive modeling.

30.3.1 Human-interpretable features

Human-interpretable features are designed to capture descriptors that pathologists can relate to established morphological concepts. They are typically obtained after cell or nucleus segmentation (see Chapter 17) and can be extracted using image analysis libraries such as scikit-image or Squidpy. Examples include:

  • Morphological features (e.g., area, perimeter, eccentricity, solidity) that describe nuclear and cellular shapes.
  • Intensity features (e.g., mean, variance, minimum, maximum) computed on grayscale or color channels of the image.
  • Spatial features, such as nearest-neighbor distances between nuclei, which help characterize the cellular microenvironment.

These features are particularly useful when the goal is to connect image-derived measurements with biological mechanisms, as they provide an interpretable bridge between raw image data and pathology expertise.
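As a minimal sketch of the three feature families listed above, the base R code below computes per-object area, centroid, intensity summaries, and nearest-neighbor distances from a toy label mask. The mask, the intensity image, and the helper `object_features` are made up for illustration; in practice the mask would come from a segmentation step and a dedicated image analysis library would supply many more descriptors.

```r
# Toy inputs (made up): a 6 x 6 label mask with two "nuclei" and a matching
# grayscale intensity image. 0 marks background.
mask <- matrix(0L, nrow = 6, ncol = 6)
mask[1:2, 1:2] <- 1L   # nucleus 1: 2 x 2 pixels
mask[4:6, 4:5] <- 2L   # nucleus 2: 3 x 2 pixels
intensity <- matrix(seq(0, 1, length.out = 36), nrow = 6, ncol = 6)

# Hypothetical helper: per-object morphological and intensity features.
object_features <- function(mask, intensity) {
  labels <- sort(setdiff(unique(as.vector(mask)), 0L))
  rows <- lapply(labels, function(l) {
    idx  <- which(mask == l, arr.ind = TRUE)  # row/col index of each pixel
    vals <- intensity[mask == l]
    data.frame(
      label          = l,
      area           = nrow(idx),             # pixel count
      centroid_row   = mean(idx[, "row"]),
      centroid_col   = mean(idx[, "col"]),
      mean_intensity = mean(vals),
      var_intensity  = stats::var(vals)
    )
  })
  do.call(rbind, rows)
}

feats <- object_features(mask, intensity)

# Spatial feature: distance from each centroid to its nearest neighbor.
d <- as.matrix(stats::dist(feats[, c("centroid_row", "centroid_col")]))
diag(d) <- Inf
feats$nn_distance <- unname(apply(d, 1, min))
feats
```

Each row of the resulting table describes one segmented object, which is the shape most downstream statistical tools expect.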

30.3.2 Latent embeddings

A complementary strategy involves the use of embeddings generated by deep learning models. These models are often trained on millions of histological image tiles using self-supervised or contrastive learning, and they produce high-dimensional feature vectors that capture subtle morphological patterns not easily recognized by the human eye.

Such embeddings have been successfully applied to tasks like unsupervised clustering, patient stratification, prediction of genomic alterations, and survival analysis. For example, Prov-GigaPath (Xu et al. 2024) is a foundation model specifically developed for histopathology that provides robust and generalizable embeddings. These embeddings can be aggregated across image tiles or at the slide level and then integrated into downstream analyses.
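One simple way to aggregate tile embeddings to the slide level is mean pooling, sketched below in base R. The embedding matrix is filled with random numbers rather than real model output, and mean pooling is only one of several possible aggregation schemes; it is shown here because it is easy to verify, not because the models cited above necessarily use it.

```r
set.seed(1)

# Toy tile-level embeddings (random numbers, not real model output):
# 5 tiles from slide "S1" and 3 tiles from slide "S2", 8 dimensions each.
tile_emb <- matrix(rnorm(8 * 8), nrow = 8, ncol = 8)
slide_id <- c(rep("S1", 5), rep("S2", 3))

# Mean pooling: sum the tile vectors per slide, then divide by the tile count.
# rowsum() and table() both order the slides alphabetically, so rows align.
tile_counts <- as.vector(table(slide_id))
slide_emb <- rowsum(tile_emb, group = slide_id) / tile_counts
slide_emb
```

The result has one row per slide and can be joined to clinical or molecular data by slide identifier.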

By combining interpretable features with latent embeddings, researchers can achieve a comprehensive representation of tissue morphology: interpretable features anchor findings in biological relevance, while embeddings capture rich high-dimensional structure that enhances predictive performance.

30.4 Interfaces in R

30.4.1 imageTCGA

imageTCGA is an R/Bioconductor package designed to provide an interactive Shiny application for exploring the TCGA Diagnostic Image Database. This application allows users to filter and visualize metadata, geographic distribution, and other relevant statistics related to TCGA diagnostic images.

Future updates to the package (currently under development) will also provide direct access to the features described in the previous section, allowing users to download them without additional preprocessing steps.

After installing the package from Bioconductor, you can run the Shiny application by executing the following command in R:

imageTCGA::imageTCGA()

Figure 30.2: Graphical interface of the imageTCGA Shiny app

This will open the application in your default web browser, where you can explore 11,765 diagnostic images from 9,640 patients, filtering them based on various clinical and pathological parameters.

The Shiny application allows filtering by any of the available columns in the dataset. For instance, you can filter for a specific tumor type, such as ovarian cancer (107 diagnostic images).

Figure 30.3: Filtering images in imageTCGA Shiny application

You can generate R code to download the selected images to your local machine by clicking the blue “Generate R Code” button. The generated code uses the GenomicDataCommons package.

In the example below, ovarian cancer images have been selected:

Figure 30.4: Generate R code in imageTCGA
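For orientation, a download through GenomicDataCommons could look roughly like the sketch below. This is not the code the Shiny application generates. The helper name `download_ov_slides` is hypothetical, and the field values `"Slide Image"` and `"Diagnostic Slide"` are assumptions about how the GDC labels diagnostic slides. The query is wrapped in a function so that nothing is downloaded until it is called, since SVS files can be very large.

```r
# Hypothetical helper, NOT the code emitted by the Shiny app.
# Requires the GenomicDataCommons package and network access when called.
download_ov_slides <- function(n = 1L) {
  stopifnot(requireNamespace("GenomicDataCommons", quietly = TRUE))

  # Query the GDC 'files' endpoint for TCGA-OV diagnostic slide images.
  # The field values below are assumptions about the GDC metadata.
  query <- GenomicDataCommons::filter(
    GenomicDataCommons::files(),
    ~ cases.project.project_id == "TCGA-OV" &
      data_type == "Slide Image" &
      experimental_strategy == "Diagnostic Slide"
  )

  # The manifest lists one row per matching file, including its UUID.
  man <- GenomicDataCommons::manifest(query)

  # Download only the first n files into the local GDC cache and
  # return their local paths.
  ids <- utils::head(man$id, n)
  GenomicDataCommons::gdcdata(ids)
}

# Example call (commented out to avoid an unintended large download):
# paths <- download_ov_slides(n = 1L)
```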

The Shiny application provides an interactive geographic visualization, displaying the origin of diagnostic images at the center, country, and state level.

For example, in the image below, glioblastoma multiforme (GBM) tumors have been selected. Summary statistics such as the number of cities and states are reported alongside a bar plot of the state distribution.

Figure 30.5: Geographic distribution of GBM tumors in the imageTCGA Shiny app

30.4.2 TCIAAPI

The TCIAAPI package provides an R interface to the Cancer Imaging Archive (TCIA) API, which gives programmatic access to TCIA data. The package includes functions to obtain an access token, download SVS images, and retrieve metadata from the TCIA API.

The TCIA API requires an access token to access the data. The tcia_access_token function retrieves the access token from the TCIA API. By default, it is configured to obtain a public token. Note that the token expires after a certain period of time and must be refreshed.

tcia_access_token() |> httr2::obfuscate()

Note that we use httr2::obfuscate to hide the token from the output.
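Because the token expires, long-running scripts may want to reuse a token while it is still valid and request a new one only when needed. The following sketch shows one generic way to do that with a closure. The helper `make_token_cache` and the `max_age` of 3600 seconds are assumptions for illustration; the actual lifetime of a TCIA token is not stated here, and a stand-in fetcher is used so that the example runs offline.

```r
# Hypothetical helper: wrap any token-fetching function so that the token is
# reused until it is older than `max_age` seconds, then fetched again.
make_token_cache <- function(fetch, max_age = 3600, now = Sys.time) {
  token <- NULL
  fetched_at <- NULL
  function() {
    expired <- is.null(token) ||
      as.numeric(difftime(now(), fetched_at, units = "secs")) >= max_age
    if (expired) {
      token <<- fetch()
      fetched_at <<- now()
    }
    token
  }
}

# Stand-in fetcher so the sketch runs offline. With TCIAAPI installed, one
# would pass tcia_access_token (or a wrapper around it) instead.
n_calls <- 0L
fake_fetch <- function() {
  n_calls <<- n_calls + 1L
  paste0("token-", n_calls)
}

get_token <- make_token_cache(fake_fetch, max_age = 3600)
get_token()  # first call fetches a token
get_token()  # second call reuses the cached token
```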

The tcia_svs_info function retrieves metadata on SVS images from the TCIA API. The function requires a camic_id, which is obtained from the ‘TCIA Histopathology Custom Dataset Builder.json’ file. The JSON file can be obtained by navigating to the TCIA website under ‘Access The Data’, ‘Search Histopathology Portal’ and clicking on the ‘TCIA Histopathology Custom Dataset Builder’ link.

svsinfo <- tcia_svs_info("311781") 
svsinfo |> head(3L)

The tcia_svs_info function returns a list containing the metadata of the SVS image, including its download URL. This URL can be used to download the image.

svsinfo[["field_wsiimage"]][[1L]][["url"]]

Note that the package does not currently provide a function to download the ~150 MB ‘TCIA Histopathology Custom Dataset Builder.json’ file programmatically.

The tcia_svs_download function downloads SVS images from the TCIA API. Like tcia_svs_info, the function requires a camic_id, which can be obtained from the ‘TCIA Histopathology Custom Dataset Builder.json’ file.

tcia_svs_download("311781")

The function downloads the SVS images to the temporary directory by default. The destdir argument can be used to specify a different directory.

30.5 Appendix

References

Bergstrom, Erik N., Ammal Abbasi, Marcos Díaz-Gay, Loïck Galland, Sylvain Ladoire, Scott M. Lippman, and Ludmil B. Alexandrov. 2024. “Deep Learning Artificial Intelligence Predicts Homologous Recombination Deficiency and Platinum Response from Histologic Slides.” Journal of Clinical Oncology 42. https://doi.org/10.1200/JCO.23.02641.
Cooper, Lee AD, Elizabeth G Demicco, Joel H Saltz, Reid T Powell, Arvind Rao, and Alexander J Lazar. 2018. “PanCancer Insights from the Cancer Genome Atlas: The Pathologist’s Perspective.” Journal of Pathology 244: 512–24. https://doi.org/10.1002/path.5028.
Madabhushi, Anant, and George Lee. 2016. “Image Analysis and Machine Learning in Digital Pathology: Challenges and Opportunities.” Medical Image Analysis 33: 170–75. https://doi.org/10.1016/j.media.2016.06.037.
Pizurica, Marija, Yuanning Zheng, Francisco Carrillo-Perez, Humaira Noor, Wei Yao, Christian Wohlfart, Antoaneta Vladimirova, Kathleen Marchal, and Olivier Gevaert. 2024. “Digital Profiling of Gene Expression from Histology Images with Linearized Attention.” Nature Communications 15 (9886). https://doi.org/10.1038/s41467-024-54182-5.
Schmauch, Benoît, Alberto Romagnoni, Elodie Pronier, Charlie Saillard, Pascale Maillé, Julien Calderaro, Aurélie Kamoun, et al. 2020. “A Deep Learning Model to Predict RNA-Seq Expression of Tumours from Whole Slide Images.” Nature Communications 11 (3877). https://doi.org/10.1038/s41467-020-17678-4.
Tomczak, Katarzyna, Patrycja Czerwińska, and Maciej Wiznerowicz. 2015. “The Cancer Genome Atlas (TCGA): An Immeasurable Source of Knowledge.” Contemporary Oncology/Współczesna Onkologia 19: A68–77. https://doi.org/10.5114/wo.2014.47136.
Xu, Hanwen, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, et al. 2024. “A Whole-Slide Foundation Model for Digital Pathology from Real-World Data.” Nature 630: 181–88. https://doi.org/10.1038/s41586-024-07441-w.