4 Data infrastructure

4.1 Introduction

Bioconductor provides several data classes for storing and manipulating spatial (transcript)omics datasets. By relying on these consistent data structures, we can easily connect methods and packages developed by different research groups to build analysis workflows that include the latest state-of-the-art methods.

Below, we describe the Bioconductor data classes used in this book.

4.2 File formats

Spatial (transcript)omics assays and the data acquired through them are diverse. In addition, vendors distribute their data in different file formats. Here, we give an overview of frequently encountered file formats and how to handle them in R.

4.2.1 Count data

Data from sequencing-based assays typically include (cell) barcodes and metadata, and a matrix where rows/columns correspond to features/observations. These are typically provided as a set of .csv and .mtx files, or compressed versions thereof (e.g. .gz). For data from 10x Genomics, count data can be read into R using DropletUtils’s read10xCounts() function; data from other providers can be imported using standard R readers. For large-scale datasets (say, 100,000s of cells), .h5 files allow for out-of-memory representation of count matrices, represented as DelayedArrays in R (see HDF5Array).
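To make the layout of such files concrete, the following is a minimal, language-agnostic sketch in plain Python of how a sparse count matrix in Matrix Market (.mtx) coordinate format maps onto a features-by-observations matrix. It is not the DropletUtils implementation; the feature names, barcodes, and counts are made up for illustration, and in practice one would rely on the R readers mentioned above.

```python
import io

def read_mtx(handle):
    """Parse a Matrix Market coordinate file into (shape, {(row, col): value})."""
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("expected a sparse (coordinate) Matrix Market file")
    line = handle.readline()
    while line.startswith("%"):  # skip comment lines
        line = handle.readline()
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    counts = {}
    for _ in range(n_entries):
        i, j, value = handle.readline().split()
        counts[(int(i) - 1, int(j) - 1)] = int(value)  # file indices are 1-based
    return (n_rows, n_cols), counts

# Hypothetical toy data: 3 features x 2 barcodes, 3 nonzero counts.
mtx_text = """%%MatrixMarket matrix coordinate integer general
% toy example
3 2 3
1 1 5
3 1 2
2 2 7
"""
features = ["GeneA", "GeneB", "GeneC"]  # would come from a features file
barcodes = ["AAAC-1", "AAAG-1"]         # would come from a barcodes file

shape, counts = read_mtx(io.StringIO(mtx_text))
# Expand to a dense features-by-barcodes matrix (only sensible for toy data).
dense = [[counts.get((i, j), 0) for j in range(shape[1])] for i in range(shape[0])]
```

Only nonzero entries are stored in the file, which is why this format suits count matrices that are mostly zeros.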

4.2.2 .parquet

Tabular data (e.g. segmentation boundaries, molecule locations) may arrive in the form of .parquet files. These can be accessed using arrow. Notably, arrow supports delayed dplyr-style operations such as filter() and select() (e.g. on a dataset opened with open_dataset(), or on a table read with read_parquet(as_data_frame = FALSE)), so that the data can be queried lazily and only the relevant parts are pulled into R with collect().
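The idea of delayed evaluation can be illustrated with a toy sketch in plain Python. This is not arrow's API and it does not read Parquet; a small CSV string stands in for the file, and the LazyTable class and its methods are hypothetical names chosen to mirror the filter/select/collect pattern. Real Parquet readers can additionally skip unneeded columns and row groups on disk, which this sketch does not attempt.

```python
import csv
import io

# Hypothetical molecule table; a CSV string stands in for a .parquet file.
TABLE = """gene,x,y,cell
GeneA,1.0,2.0,cell1
GeneB,4.0,0.5,cell2
GeneA,4.2,0.7,cell2
"""
stats = {"scans": 0}  # how many times the underlying data were actually read

def scan_rows():
    stats["scans"] += 1
    yield from csv.DictReader(io.StringIO(TABLE))

class LazyTable:
    """Toy lazy table: records operations and only reads rows on collect()."""

    def __init__(self, open_rows, predicates=(), columns=None):
        self._open_rows = open_rows  # callable returning an iterator of dict rows
        self._predicates = tuple(predicates)
        self._columns = columns

    def filter(self, predicate):
        return LazyTable(self._open_rows, self._predicates + (predicate,), self._columns)

    def select(self, *columns):
        return LazyTable(self._open_rows, self._predicates, columns)

    def collect(self):
        out = []
        for row in self._open_rows():
            if all(p(row) for p in self._predicates):
                out.append({k: row[k] for k in (self._columns or row)})
        return out

query = LazyTable(scan_rows).filter(lambda r: r["gene"] == "GeneA").select("x", "y")
scans_before_collect = stats["scans"]  # nothing has been read yet
result = query.collect()               # the data are read once, here
```

Building the query is cheap; the cost of reading is paid only when the result is requested, and only matching rows and selected columns are kept.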

4.2.3 .zarr

.zarr stores hold N-dimensional arrays as a grid of “chunks” that can be read and written independently, which enables parallel access. For (bio)imaging data, different image scales (or resolutions) can be stored as different layers of a “pyramid”, where the base holds the full-resolution image and the tip the lowest-resolution one. R interfaces to .zarr are provided through Rarr (Bioconductor) and pizzarr (CRAN).
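The chunk grid and pyramid layout come down to simple index arithmetic, sketched here in plain Python. This is an illustration of the concept only, not the Rarr or pizzarr API; the image size, the chunk size, and the downsampling factor of 2 between pyramid levels are assumptions made for the example.

```python
import math

def chunk_grid(shape, chunks):
    """Number of chunks along each axis (edge chunks may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

def locate(index, chunks):
    """Map an array index to (chunk index, offset within that chunk)."""
    chunk_index = tuple(i // c for i, c in zip(index, chunks))
    offset = tuple(i % c for i, c in zip(index, chunks))
    return chunk_index, offset

def pyramid_shapes(shape, n_levels, factor=2):
    """Shapes of successive pyramid levels, from full to lowest resolution."""
    shapes = [tuple(shape)]
    for _ in range(n_levels - 1):
        shapes.append(tuple(math.ceil(s / factor) for s in shapes[-1]))
    return shapes

# Hypothetical 1000 x 1500 pixel image stored in 256 x 256 chunks.
grid = chunk_grid((1000, 1500), (256, 256))
where = locate((300, 1030), (256, 256))
levels = pyramid_shapes((1000, 1500), n_levels=3)
```

Reading a small region therefore only requires the chunks that overlap it, and a viewer can pick the coarsest pyramid level that still fills the screen.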

4.3 Data classes

In sequencing-based ST data, measurements come in the form of a transcripts-by-spots count matrix, where each spot is additionally associated with spatial coordinates.

By contrast, imaging-based ST technologies yield molecule-level data that are typically provided as long-format tables where each row corresponds to an observation, and columns contain information about transcript identity, spatial location, and experimental metadata (e.g. sample of origin). Upon segmentation of cell boundaries and subsequent transcript-to-cell mapping, these data can be reshaped into a transcripts-by-cells count matrix that is analogous to data from single-cell omics technologies.
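The reshaping step described above can be sketched in a few lines of plain Python. The table below is hypothetical (column names and values are made up), and it assumes that transcript-to-cell mapping has already been done, with unassigned molecules marked as None.

```python
from collections import Counter

# Hypothetical long-format molecule table: one row per detected molecule.
molecules = [
    {"gene": "GeneA", "x": 1.0, "y": 2.0, "cell": "cell1"},
    {"gene": "GeneA", "x": 1.5, "y": 2.2, "cell": "cell1"},
    {"gene": "GeneB", "x": 4.0, "y": 0.5, "cell": "cell2"},
    {"gene": "GeneA", "x": 4.2, "y": 0.7, "cell": "cell2"},
    {"gene": "GeneB", "x": 9.0, "y": 9.0, "cell": None},  # not assigned to a cell
]

# Tally molecules per (gene, cell) pair, dropping unassigned molecules.
tally = Counter((m["gene"], m["cell"]) for m in molecules if m["cell"] is not None)

# Reshape into a transcripts-by-cells matrix with fixed row and column order.
genes = sorted({g for g, _ in tally})
cells = sorted({c for _, c in tally})
matrix = [[tally[(g, c)] for c in cells] for g in genes]
```

Note that the molecule coordinates are discarded in this step; only the per-cell totals remain, which is what makes the result analogous to single-cell count data.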

For both types of data, observations are associated with additional metadata such as the area of spots or segmented cells and, for the latter, centroid locations and polygonal boundaries from segmentation.

4.3.1 Bioconductor-based

4.3.1.1 SingleCellExperiment

Single-cell RNA-seq and analogous technologies quantify transcripts at single-cell resolution, yielding a transcripts-by-cells count matrix. In Bioconductor, the primary class for data from single-cell experiments is SingleCellExperiment (Amezquita et al. 2020) (SCE).

SCE extends the SummarizedExperiment (SE) class with a series of features specific to single-cell data. For instance, reducedDims stores low-dimensional embeddings of observations such as PCA, t-SNE, and UMAP; rowPairs and colPairs store relationships between genes (e.g. gene-to-gene correlations) and cells (e.g. cell-to-cell distances), respectively; and altExps (for ‘alternative experiments’) stores data on alternative features from the same cells, such as those obtained via multi-modal assays.

4.3.1.2 SpatialExperiment

SpatialExperiment (SPE) (Righelli et al. 2022) is the core data class used in this book. This class allows us to store datasets at the spot or cell level, i.e. data from sequencing-based platforms at the spot level, or data from imaging-based platforms aggregated to the cell level.

SPE extends SCE with additional customizations to store spatial information, such as spatial coordinates and image files. A schematic of the SpatialExperiment object structure is shown in Figure 4.1. Briefly, an SPE object consists of (i) assays containing expression counts, (ii) rowData containing information on features, i.e. genes, (iii) colData containing information on spots or cells, including non-spatial and spatial metadata, (iv) spatialCoords containing spatial coordinates, and (v) imgData containing image data. For spot-based data, a single assay named counts is used.
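The way these components line up can be mimicked with a toy container in plain Python. This is a conceptual sketch only, not the SpatialExperiment class: the class name, field names, and subset_cols() method are hypothetical, and the data are made up. The point it illustrates is that all components share the same feature and observation dimensions, so that subsetting observations keeps counts, metadata, and coordinates in sync.

```python
from dataclasses import dataclass, field

@dataclass
class ToySpatialExperiment:
    assays: dict          # name -> features-by-observations matrix (list of rows)
    row_data: list        # one record per feature
    col_data: list        # one record per spot or cell
    spatial_coords: list  # one (x, y) pair per spot or cell
    img_data: list = field(default_factory=list)

    def __post_init__(self):
        # All components must agree on the number of features and observations.
        n_features, n_obs = len(self.row_data), len(self.col_data)
        if len(self.spatial_coords) != n_obs:
            raise ValueError("spatial_coords must have one entry per observation")
        for name, mat in self.assays.items():
            if len(mat) != n_features or any(len(r) != n_obs for r in mat):
                raise ValueError(f"assay '{name}' has the wrong dimensions")

    def subset_cols(self, keep):
        """Keep the observations at the given positions in every component."""
        return ToySpatialExperiment(
            assays={k: [[row[j] for j in keep] for row in m]
                    for k, m in self.assays.items()},
            row_data=self.row_data,
            col_data=[self.col_data[j] for j in keep],
            spatial_coords=[self.spatial_coords[j] for j in keep],
            img_data=self.img_data,
        )

# Hypothetical spot-level data: 2 genes x 3 spots.
spe = ToySpatialExperiment(
    assays={"counts": [[5, 0, 1], [0, 7, 2]]},
    row_data=[{"gene": "GeneA"}, {"gene": "GeneB"}],
    col_data=[{"barcode": "s1", "in_tissue": True},
              {"barcode": "s2", "in_tissue": False},
              {"barcode": "s3", "in_tissue": True}],
    spatial_coords=[(0, 0), (0, 1), (1, 0)],
)
in_tissue = [j for j, cd in enumerate(spe.col_data) if cd["in_tissue"]]
sub = spe.subset_cols(in_tissue)
```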

Figure 4.1: Overview of the `SpatialExperiment` data class for storing and manipulating spatial transcriptomics datasets within the Bioconductor framework.

4.3.1.3 SpatialFeatureExperiment

SPE has been extended through SpatialFeatureExperiment (SFE) (Moses et al. 2023), which can additionally accommodate observation- and feature-level graphs (e.g. of cell/spot neighborhoods) and geometries (e.g. segmentation and tissue boundaries, or histological regions annotated by a pathologist). Because these are represented as sf (geometries) and spdep (graphs) objects, SFE directly gives access to a range of geometry operations (e.g. intersecting and buffering) and spatial dependency calculations (e.g. Moran’s I and Geary’s C).
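As a worked example of a spatial dependency calculation, the sketch below computes global Moran's I in plain Python from a vector of values and a spatial weights matrix. It is not the spdep or SFE implementation; the binary, symmetric neighbor weights and the four-spot "chain" layout are assumptions made for the example.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and spatial weights w_ij.

    I = (N / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2,
    where N is the number of observations and W is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / w_total) * cross / sum(d * d for d in dev)

# Hypothetical layout: four spots in a row, each adjacent to its neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
gradient = morans_i([1, 2, 3, 4], chain)     # neighbors have similar values
alternating = morans_i([1, 4, 1, 4], chain)  # neighbors have dissimilar values
```

A smooth gradient gives a positive value (here 1/3), while an alternating pattern gives a negative one (here -1), reflecting positive and negative spatial autocorrelation, respectively.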

4.3.1.4 MoleculeExperiment

MoleculeExperiment (ME) (Peters Couto et al. 2023) is designed for imaging-based spatial transcriptomics data. For each sample, ME stores a list of molecules (e.g. transcript identities and coordinates), and boundaries (e.g. cell identities and polygon coordinates). The latter can, in principle, contain alternative segmentations that may stem from, e.g., cell membrane, body, or nucleus stainings. In this way, different count matrices may be obtained by allocating molecules to a given set of boundaries. Analyses at the aggregated cell level may, in turn, be carried out using an ME-derived SPE (the ME package provides a wrapper for this).
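How different boundaries lead to different count matrices can be sketched in plain Python with a simple point-in-polygon test (ray casting). This is an illustration under stated assumptions, not the MoleculeExperiment implementation: the function names, the square "cell" and "nucleus" outlines, and the molecule coordinates are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a horizontal ray starting at (x, y)."""
    inside = False
    n = len(polygon)
    for k in range(n):
        (x1, y1), (x2, y2) = polygon[k], polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_molecules(molecules, boundaries):
    """Tally molecules per (gene, boundary id); unassigned molecules are dropped."""
    counts = {}
    for gene, x, y in molecules:
        for cell_id, polygon in boundaries.items():
            if point_in_polygon(x, y, polygon):
                counts[(gene, cell_id)] = counts.get((gene, cell_id), 0) + 1
                break  # assign each molecule to at most one boundary
    return counts

# Hypothetical molecules (gene, x, y) and two alternative segmentations.
molecules = [("GeneA", 1.0, 1.0), ("GeneA", 3.0, 3.0),
             ("GeneB", 1.5, 1.5), ("GeneB", 9.0, 9.0)]
cell_boundaries = {"cell1": [(0, 0), (4, 0), (4, 4), (0, 4)]}
nucleus_boundaries = {"cell1": [(0.5, 0.5), (2, 0.5), (2, 2), (0.5, 2)]}

by_cell = count_molecules(molecules, cell_boundaries)
by_nucleus = count_molecules(molecules, nucleus_boundaries)
```

The same molecules yield two GeneA counts for the whole-cell outline but only one for the nucleus, which is why keeping molecule-level data alongside alternative boundaries is useful.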

4.3.2 Non-Bioconductor-based

There are several other frameworks outside Bioconductor that support spatially-aware analysis for both sequencing- and imaging-based platforms. Seurat and scanpy (Wolf, Angerer, and Theis 2018) provide comprehensive single-cell analysis pipelines in R and Python, respectively, and incorporate features to visualize and analyze spatial omics datasets. Packages including Giotto (R) (Chen et al. 2023), VoltRon (R) (Manukyan et al. 2023), and squidpy (Python) (Palla et al. 2022) provide all-in-one frameworks for analyzing spatial omics data and contain extensive sets of spatially-aware algorithms.

4.3.2.1 Giotto

Giotto (Chen et al. 2023) (or Giotto Suite) provides tools to process, analyze, and visualize spatial multi-omics data at multiple scales and resolutions. The package supports a wide range of sequencing- and imaging-based platforms spanning transcriptomics and proteomics modalities, such as Xenium, Visium HD, and CODEX (Akoya). Giotto provides utilities to manipulate spatial objects and images and to detect spatial patterns and spatially-aware clusters, and it supports database-backed data representations via dbverse for scalability.

4.3.2.2 VoltRon

The VoltRon package provides a framework for spatially-aware alignment between datasets whose observations span regions of interest (ROIs) and image tiles in addition to cells, spots, and molecules. Shiny applications are provided for both automated and manual spatial alignment of adjacent tissue sections, where users can interactively manage images and choose landmark points for co-registration. VoltRon is also an end-to-end spatial omics analysis package, and supports on-disk representations of spatially-resolved datasets using .zarr- and .h5-backed DelayedArray objects.

4.3.2.3 SpatialData

The SpatialData (Marconato et al. 2025) framework (Figure 4.2) offers a suite of modules for loading, handling, and (static and interactive) visualization of diverse spatial omics data (spanning imaging- and sequencing-based data and different modalities). In essence, the container relies on five types of elements (images, labels, shapes, points, and tables), which are represented on-disk using standardized, OME-NGFF compliant file formats (.zarr for images, labels, and tables; .parquet for shapes and points), enabling unified and memory-efficient data representation and handling.

Figure 4.2: Overview of the scverse/Python `SpatialData` framework.

4.4 Commercial solutions

Bruker’s AtoMx is a cloud-based platform for both interactive visualization and GUI-based analysis (“No coding experience is required to use AtoMx SIP.”). AtoMx can execute pipelines built from both custom and pre-configured modules, and incorporates Bruker’s multi-modal segmentation algorithm. The platform is integrated with CosMx SMI instruments, i.e. data are available through AtoMx upon acquisition.

10x Genomics’s Loupe Browser is a visualization application for Visium, Visium HD, and other single-cell and multiome data from 10x Genomics. The complementary loupeR package allows for coercion of Seurat objects (and, in turn, SummarizedExperiment-derived classes) into Loupe Browser-compliant files. Histopathological annotation of CytAssist images is also supported.

Vizgen’s Visualizer is included with the MERSCOPE instrument purchase. It allows for interactive visualization of segmentation boundaries, transcripts, and proteins; can incorporate external analysis results (e.g. clustering, dimensionality reduction); and supports annotating and exporting regions of interest for downstream analysis.

10x Genomics’s Xenium Explorer is a desktop application (for Mac and Windows) that enables interactive visualization; can incorporate third-party results (e.g. cell metadata, other segmentations); supports annotating and exporting regions of interest; and allows registering post-Xenium images. The application is limited to data generated by the Xenium Analyzer instrument.

4.5 Appendix

References

Amezquita, Robert A., Aaron T. L. Lun, Etienne Becht, Vince J. Carey, Lindsay N. Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17: 137–45. https://doi.org/10.1038/s41592-019-0654-x.

Chen, Jiaji George, Joselyn Cristina Chávez-Fuentes, Matthew O’Brien, Junxiang Xu, Edward Ruiz, Wen Wang, Iqra Amin, et al. 2023. “Giotto Suite: A Multi-Scale and Technology-Agnostic Spatial Multi-Omics Analysis Ecosystem.” bioRxiv. https://doi.org/10.1101/2023.11.26.568752.

Manukyan, Artür, Ella Bahry, Emanuel Wyler, Erik Becher, Anna Pascual-Reguant, Izabela Plumbom, Hasan Onur Dikmen, et al. 2023. “VoltRon: A Spatial Omics Analysis Platform for Multi-Resolution and Multi-Omics Integration Using Image Registration.” bioRxiv. https://doi.org/10.1101/2023.12.15.571667.

Marconato, Luca, Giovanni Palla, Kevin A. Yamauchi, Isaac Virshup, Elyas Heidari, Tim Treis, Wouter-Michiel Vierdag, et al. 2025. “SpatialData: An Open and Universal Data Framework for Spatial Omics.” Nature Methods 22 (1): 58–62. https://doi.org/10.1038/s41592-024-02212-x.

Moses, Lambda, Pétur Helgi Einarsson, Kayla Jackson, Laura Luebbert, A. Sina Booeshaghi, Sindri Antonsson, Nicolas Bray, Páll Melsted, and Lior Pachter. 2023. “Voyager: Exploratory Single-Cell Genomics Data Analysis with Geospatial Statistics.” bioRxiv. https://doi.org/10.1101/2023.07.20.549945.

Palla, Giovanni, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, et al. 2022. “Squidpy: A Scalable Framework for Spatial Omics Analysis.” Nature Methods 19 (2): 171–78.

Peters Couto, Bárbara Zita, Nicholas Robertson, Ellis Patrick, and Shila Ghazanfar. 2023. “MoleculeExperiment Enables Consistent Infrastructure for Molecule-Resolved Spatial Omics Data in Bioconductor.” Bioinformatics 39: btad550. https://doi.org/10.1093/bioinformatics/btad550.

Righelli, Dario, Lukas M. Weber, Helena L. Crowell, Brenda Pardo, Leonardo Collado-Torres, Shila Ghazanfar, Aaron T. L. Lun, Stephanie C. Hicks, and Davide Risso. 2022. “SpatialExperiment: Infrastructure for Spatially-Resolved Transcriptomics Data in R Using Bioconductor.” Bioinformatics 38 (11): 3128–31. https://doi.org/10.1093/bioinformatics/btac299.

Wolf, F. Alexander, Philipp Angerer, and Fabian J. Theis. 2018. “SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis.” Genome Biology 19 (1): 15.