22 Sub-cellular analysis
22.1 Preamble
22.1.1 Introduction
Sub-cellular analysis aims to identify intra-cellular compartmentalization of transcripts, e.g., in the nucleus vs cytoplasm of the cell. Sub-cellular data can capture real biology, and analysis of sub-cellular data could help identify biological mechanisms involved in transcript localisation (Cassella and Ephrussi 2022). Variation in intra-cellular localisation of transcripts generates gene expression gradients within the cell that can affect biological processes, such as cell communication and post-transcriptional regulation.
22.1.2 Data structure
In spatially-resolved transcriptomics (SRT), sub-cellular analysis can be performed with molecule-resolved data generated by imaging-based SRT technologies, such as MERFISH and Xenium, where targeted probes are used to capture specific transcripts and their locations. High-resolution sequencing-based SRT data, such as VisiumHD, have also been employed for sub-cellular analysis (Novoselsky et al. 2025).
Cell segmentation is required, prior to sub-cellular analysis, to generate cell boundaries within which the transcripts can be compartmentalized and assessed. MoleculeExperiment (Peters Couto et al. 2023) was built for storing transcript locations and cell boundaries, and is compatible with raw data generated by most existing molecule-resolved SRT technologies. It contains additional functions, such as countMolecules(), for conversion of molecule-resolved information to cell-level expression by counting transcripts within a cell boundary.
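The conversion from molecule-level to cell-level data described above can be sketched in a few lines of Python. This is only an illustration of the idea (counting transcripts that fall inside a boundary polygon), not the MoleculeExperiment implementation; the function names, the data layout, and the toy coordinates are all assumptions made for this example.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside `polygon`,
    given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_molecules(molecules, boundaries):
    """Return {cell_id: Counter({gene: count})} for molecules inside each boundary.

    `molecules` is a list of (gene, x, y); `boundaries` maps cell_id -> polygon.
    """
    counts = {cell_id: Counter() for cell_id in boundaries}
    for gene, x, y in molecules:
        for cell_id, polygon in boundaries.items():
            if point_in_polygon(x, y, polygon):
                counts[cell_id][gene] += 1
                break  # assumes cell boundaries do not overlap
    return counts


# Hypothetical toy data: two square "cells" and five detected transcripts.
boundaries = {
    "cell_1": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "cell_2": [(5, 0), (9, 0), (9, 4), (5, 4)],
}
molecules = [
    ("GeneA", 1.0, 1.0),
    ("GeneA", 2.0, 3.0),
    ("GeneB", 6.0, 2.0),
    ("GeneB", 3.5, 0.5),
    ("GeneA", 10.0, 10.0),  # lies outside every cell and is not counted
]

counts = count_molecules(molecules, boundaries)
for cell_id, gene_counts in counts.items():
    print(cell_id, dict(gene_counts))
```

The nested loop is fine for a toy example; real datasets with millions of molecules would normally rely on a spatial index or a dedicated geometry library.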
22.2 Quantifying cell compartments
22.2.1 Nucleus versus cell
Many molecule-resolved SRT technologies, such as MERFISH, SeqFISH, and Stereo-seq, provide nucleus- and cell-level data, generally using DAPI images to generate nuclear masks. These masks can be used to identify transcripts present in the nucleus vs cytoplasm of a cell, and the resulting sub-cellular data can be used to address a number of research questions. Examples include the transport dynamics (from nucleus to cytoplasm) and localisation of transcripts within cells, the impact of sub-cellular localisation of transcripts on cell function, and the comparison of sub-cellular transcript locations between cells from different conditions.
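As a minimal sketch of how masks translate into a nucleus vs cytoplasm assignment, the Python example below computes the fraction of each gene's transcripts found in the nucleus of a single cell. For simplicity it assumes that both the nucleus and the cell are circles sharing one centre; real masks are arbitrary polygons or pixel masks. The function names and toy coordinates are hypothetical.

```python
from collections import defaultdict
from math import hypot


def assign_compartment(x, y, centre, nucleus_radius, cell_radius):
    """Label a transcript as 'nucleus', 'cytoplasm', or None (outside the cell).

    Assumption: nucleus and cell are concentric circles around `centre`.
    """
    d = hypot(x - centre[0], y - centre[1])
    if d <= nucleus_radius:
        return "nucleus"
    if d <= cell_radius:
        return "cytoplasm"
    return None


def nuclear_fraction(molecules, centre, nucleus_radius, cell_radius):
    """Return {gene: fraction of in-cell transcripts located in the nucleus}."""
    tallies = defaultdict(lambda: {"nucleus": 0, "cytoplasm": 0})
    for gene, x, y in molecules:
        compartment = assign_compartment(x, y, centre, nucleus_radius, cell_radius)
        if compartment is not None:
            tallies[gene][compartment] += 1
    return {
        gene: t["nucleus"] / (t["nucleus"] + t["cytoplasm"])
        for gene, t in tallies.items()
    }


# Hypothetical toy cell: centre (0, 0), nucleus radius 2, cell radius 5.
molecules = [
    ("GeneA", 0.5, 0.5),
    ("GeneA", 1.0, -1.0),
    ("GeneA", 3.0, 0.0),
    ("GeneA", 0.0, -1.5),
    ("GeneA", 6.0, 6.0),  # outside the cell, ignored
    ("GeneB", 4.0, 1.0),
    ("GeneB", -3.0, 3.0),
    ("GeneB", 1.0, 0.0),
]

fractions = nuclear_fraction(molecules, centre=(0.0, 0.0),
                             nucleus_radius=2.0, cell_radius=5.0)
print(fractions)
```

Per-gene nuclear fractions of this kind can then be compared between genes, cells, or conditions.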
22.2.2 Bento
Bento (Mah et al. 2024) is a Python-based toolkit for sub-cellular analysis of molecule-resolved SRT data that takes transcript locations and boundaries (cell, nucleus, region of interest) as input and stores them in AnnData format for downstream analyses. Bento includes 3 main approaches for processing molecule-resolved data:
- It generates spatial summary statistics for each gene-cell pair and feeds these into the RNAforest model to predict the transcript localisation pattern (cytoplasmic, nuclear, nuclear edge, cell edge, or none) in each cell.
- It identifies the compartment of each transcript (nucleus or cytoplasm) from nucleus and cell boundaries, generating a cell x gene x compartment tensor. The RNAcoloc approach then uses this tensor to assess co-localisation of gene pairs in each compartment.
- It generates RNAflux embeddings of local neighborhoods in each cell, which are used for unsupervised segmentation of sub-cellular domains.
These approaches supplement other downstream analyses, such as differential expression analysis between compartments or sub-cellular domains or enrichment analysis of these compartments/domains.
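To make the cell x gene x compartment tensor mentioned above concrete, the sketch below builds one from transcripts that have already been labelled with a cell and a compartment. It uses plain nested lists and is not Bento's API or storage layout; the function name `build_compartment_tensor` and the toy records are assumptions for illustration only.

```python
def build_compartment_tensor(records, cells, genes, compartments):
    """Count transcripts into a nested list indexed as tensor[cell][gene][compartment].

    `records` is an iterable of (cell_id, gene, compartment) labels, one per transcript.
    """
    cell_idx = {c: i for i, c in enumerate(cells)}
    gene_idx = {g: i for i, g in enumerate(genes)}
    comp_idx = {k: i for i, k in enumerate(compartments)}
    tensor = [[[0] * len(compartments) for _ in genes] for _ in cells]
    for cell_id, gene, compartment in records:
        tensor[cell_idx[cell_id]][gene_idx[gene]][comp_idx[compartment]] += 1
    return tensor


# Hypothetical labelled transcripts for two cells and two genes.
cells = ["c1", "c2"]
genes = ["GeneA", "GeneB"]
compartments = ["nucleus", "cytoplasm"]
records = [
    ("c1", "GeneA", "nucleus"),
    ("c1", "GeneA", "nucleus"),
    ("c1", "GeneA", "cytoplasm"),
    ("c1", "GeneB", "cytoplasm"),
    ("c1", "GeneB", "cytoplasm"),
    ("c2", "GeneA", "nucleus"),
    ("c2", "GeneB", "nucleus"),
    ("c2", "GeneB", "cytoplasm"),
]

tensor = build_compartment_tensor(records, cells, genes, compartments)

# Slicing one compartment gives an ordinary cell x gene count matrix.
nucleus_matrix = [[tensor[c][g][0] for g in range(len(genes))]
                  for c in range(len(cells))]
print(nucleus_matrix)
```

Each compartment slice can be passed to standard cell-level tools, which is what makes analyses such as differential expression between compartments straightforward.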
22.2.3 SpatialFeatures
SpatialFeatures is an R package that uses MoleculeExperiment to store transcript locations and boundaries (cells, nuclei, and/or regions of interest), and to perform sub-cellular and extra-cellular analyses. A SpatialFeatures run involves 3 main steps:
- Generate new sub-cellular (sub-concentric, sub-sector) and extra-cellular (super-concentric, super-sector) boundaries (loadBoundaries()).
- Calculate entropy-based metrics (EntropyMatrix()) for each cell across the boundary feature types.
- Combine this information into a SingleCellExperiment object (EntropySingleCellExperiment()) containing a cell x feature matrix assay for each boundary feature type (sub-concentric, sub-sector, super-concentric, super-sector).
These features can be used to cluster and identify cells with similar sub-cellular and/or extra-cellular expression patterns.
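The idea behind an entropy-based feature can be illustrated with a short Python sketch. The exact metric computed by EntropyMatrix() is not described here, so this example assumes Shannon entropy of a gene's transcript counts across concentric rings, and it assumes a circular cell with equal-width rings. It is a conceptual illustration in Python, not the SpatialFeatures R code, and all names and coordinates are hypothetical.

```python
from math import hypot, log


def ring_counts(points, centre, cell_radius, n_rings):
    """Count points in each of `n_rings` equal-width concentric rings.

    Assumption: the cell is a circle; points at or beyond the radius are ignored.
    """
    counts = [0] * n_rings
    for x, y in points:
        d = hypot(x - centre[0], y - centre[1])
        if d >= cell_radius:
            continue
        counts[int(d / cell_radius * n_rings)] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (natural log) of the distribution implied by `counts`."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


# Hypothetical cell of radius 4 split into 4 rings of width 1.
transcripts = {
    "central": [(0.2, 0.0), (0.0, 0.5), (-0.8, 0.0), (0.3, 0.3)],
    "spread": [(0.5, 0.0), (0.0, 1.5), (-2.5, 0.0), (0.0, -3.5)],
}

entropies = {
    gene: shannon_entropy(ring_counts(points, (0.0, 0.0), 4.0, 4))
    for gene, points in transcripts.items()
}
print(entropies)
```

A gene confined to one ring has entropy 0, whereas a gene spread evenly over all four rings reaches the maximum of log(4). Equal-width rings have unequal areas, so a real analysis may need to account for ring area.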
22.3 Factor modelling
22.3.1 FISHFactor
FISHFactor (Walter, Stegle, and Velten 2023) is another Python-based method for sub-cellular data analysis. It uses spatial Poisson point processes to model the location of each transcript within each cell, and spatially-aware Gaussian processes to identify sub-cellular localisation patterns.
This approach focuses on transcript sub-cellular patterns or domains, which can be investigated further to gain biological insights (Walter, Stegle, and Velten 2023).
22.4 Subcellular clustering
22.4.1 ClusterMap
ClusterMap (He et al. 2021) is a Python-based method for segmentation-free spatial clustering of transcripts at multiple scales. It can identify sub-cellular domains that might represent sub-cellular structures, cell bodies, or cell-level clusters representing cell types and tissue domains. It can also perform cell segmentation and sub-cellular compartmentalization using only transcript locations. Depending on the radius used to define the neighborhood around a transcript, it identifies clusters at the sub-cellular (sub-cellular domains/compartments), cell (cell types), or tissue (domain) level.
22.5 Testing for subcellular localisation and co-localisation
22.5.1 CellSP
CellSP (Aggarwal and Sinha 2025) is a Python-based workflow for identifying sub-cellular patterns of transcripts that can be further used to identify and characterize gene-cell modules. It provides tools for visualizing the gene-cell modules and examining their functional significance.
CellSP uses AnnData format for single cell and spatial transcriptomics analysis. Internally, CellSP uses other tools, such as MAGIC (Dijk et al. 2018) for denoising, Tangram (Biancalani et al. 2021) for imputation, InSTAnT (Kumar et al. 2024) for identifying gene-pair sub-cellular co-localisation patterns, and SPRAWL (Bierman et al. 2024) for identifying gene sub-cellular localisation patterns.
CellSP outputs a set of gene-cell modules for each sub-cellular pattern type (peripheral, radial, punctate, central, co-localisation). A gene-cell module represents a set of genes or gene pairs that have the same sub-cellular pattern across the same set of cells. The statistical significance of grouping is estimated using a Bonferroni-based score. In each module, genes and cells are characterized using gene ontology (GO) enrichment tests and cell type composition (if available), respectively.
Some statistical tests for gene(s) localisation/co-localisation that can be performed using CellSP include:
- Testing per cell: does a cell have significant sub-cellular localisation of a set of genes or gene-pairs?
- Testing across multiple cells: do cells belonging to a cell type/cluster have significant sub-cellular localisation of a gene or gene pair?
- Testing across multiple groups/samples: are specific genes differentially localized/co-localized in two cell types/clusters from one sample or a cell type/cluster from two different conditions?
22.5.2 SpaGNN
SpaGNN (Fang et al. 2023) is another Python-based pipeline for sub-cellular analyses, including:
- Sub-cellular clustering of transcript locations into sub-cellular patches with high transcript density using Leiden graph clustering.
- Sub-cellular patch analysis, where transcripts of each gene in each patch are summed to generate a patch x gene count matrix. This is used to calculate Pearson’s correlation between genes across all patches, to identify gene pairs that often co-localize. The statistical significance of these correlation values can be assessed using a t-test.
- Sub-cellular local neighborhood analysis, where local neighborhoods are detected within a patch by identifying the 9 nearest neighbors of each transcript in the patch. Transcript counts are then summed within the local neighborhoods and Pearson’s correlation is calculated between genes. Through permutation analysis, a proximity score is calculated for each gene pair in the patch.
Depending on availability of cell type data, additional questions can be asked. For example, how similar are the patch correlations between same/different cell types, or how consistent is the sub-cellular co-localisation of a gene pair in same/different cell types?
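The patch-level correlation step described above can be sketched as follows. This is a generic illustration of Pearson's correlation and its t statistic on a patch x gene count matrix, not SpaGNN's code; the function names and the toy counts are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n observations (n - 2 degrees of freedom)."""
    return r * sqrt((n - 2) / (1 - r * r))


# Hypothetical patch x gene counts: one list per gene, one entry per patch.
patch_counts = {
    "GeneA": [5, 3, 0, 8, 1, 6],
    "GeneB": [4, 2, 1, 7, 0, 5],
    "GeneC": [0, 6, 4, 1, 7, 2],
}
n_patches = 6

results = {}
for gene_1, gene_2 in combinations(patch_counts, 2):
    r = pearson_r(patch_counts[gene_1], patch_counts[gene_2])
    results[(gene_1, gene_2)] = (r, t_statistic(r, n_patches))

for pair, (r, t) in results.items():
    print(pair, round(r, 3), round(t, 3))
```

In this toy example GeneA and GeneB rise and fall together across patches, while GeneA and GeneC are negatively correlated. A p-value would be obtained by comparing each t statistic with a t distribution with n - 2 degrees of freedom, and testing many gene pairs calls for multiple-testing correction.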
22.6 Considerations
22.6.1 2D versus 3D
Most molecule-resolved SRT data are 2D projections of 3D structures, such as cells, organelles, or even processes like RNA transport. A few imaging-based SRT technologies are capable of measuring tissue depth as a z-axis, generating 3D coordinates with a sparse z-axis. However, working with 2D vs 3D data is not very different, since spatial relations are often captured as neighborhoods defined by Euclidean or other distance measures between locations, irrespective of coordinate dimension. Therefore, many tools are able to use n-dimensional coordinates for measuring spatial relationships. For example, ClusterMap can use 3D coordinates by using z_radius for z-axis data, if it is available, whereas for 2D coordinates it sets z_radius = 0 (He et al. 2021).
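The point that distance-based neighborhoods are independent of coordinate dimension can be shown with a small Python sketch. The same function handles 2D and 3D coordinates because Euclidean distance is defined for any number of dimensions. This is a generic illustration, not ClusterMap's z_radius mechanism; the function name and toy coordinates are hypothetical.

```python
from math import dist


def neighbours_within(points, radius):
    """For each point index, list the indices of the other points lying within
    `radius` (Euclidean distance). Works for coordinates of any dimension."""
    result = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        for j in range(i + 1, len(points)):
            if dist(p, points[j]) <= radius:
                result[i].append(j)
                result[j].append(i)
    return result


# Hypothetical transcript coordinates (x, y, z); the third point sits 3 units deeper.
points_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 3.0), (5.0, 5.0, 0.0)]
# The 2D projection simply drops the z coordinate.
points_2d = [p[:2] for p in points_3d]

neighbours_3d = neighbours_within(points_3d, radius=1.5)
neighbours_2d = neighbours_within(points_2d, radius=1.5)
print(neighbours_3d)
print(neighbours_2d)
```

In the 2D projection the third transcript appears to neighbor the first two, whereas in 3D it is separated from them by depth. This is one way in which projecting onto 2D can change apparent sub-cellular proximity.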