12 Deconvolution
12.1 Introduction
Sequencing-based ST data can contain zero to multiple cells per spot, which might be fully or only partially covered by cells, depending on the spatial resolution of the platform and the tissue cell density (see also Chapter 8 and the schematic Figure below). This aspect of the data implies that there may be a mixture of cell types in a spot and thus a mixture of transcriptional programs.
To help understand these mixtures, at least 20 deconvolution techniques have been proposed for spot-level ST data. Some methods require borrowing insights from a scRNA-seq reference dataset, while others can be reference-free. Based on their underlying algorithms, Li et al. (2023) have grouped methods into five categories:
-
probabilistic-based: use Bayesian inference, likelihood estimation, or probabilistic modeling to estimate cell type compositions while incorporating uncertainty. Available tools include spacexr (i.e.
RCTD
), CARD, SpatialDecon, and STdeconvolve in R, and std-poisson, stereoscope, scvi-tools (i.e.DestVI
), STRIDE, and cell2location in Python. -
non-negative matrix factorization (NMF)-based: decompose gene expression data into latent components representing different cell types. Available tools include Giotto (i.e.
SpatialDWLS
) and SPOTlight in R, and NMFReg in Python. - graph-based: use graph neural networks or graph-based optimization to model spatial relationships. Available tools include SD2 in R, and DSTG and SpiceMIx in Python.
- optimal transport (OT)-based: infer spatial gene expression distributions by mapping scRNA-seq and ST data. Available tools include SpaOTsc and novosparc in Python.
- deep learning-based: align and integrate single-cell and spatial transcriptomics data with neural networks. For example, Tangram in Python.
Among these, std-poisson
, STdeconvolve
, and SpiceMix
are reference-free methods. Methods that incorporate spatial location information are CARD
, DSTG
, SD2
, Tangram
, cell2location
, DestVI
, std-poisson
, and SpiceMix
.
In this section, we will demonstrate deconvolution of cell types per spot, using RCTD
on Visium and VisiumHD datasets.
12.2 Dependencies
In this example of Visium breast cancer data (Janesick et al. 2023), we perform cell type deconvolution without a single-cell (Chromium) reference and compare the concordance with the provided Visium annotation provided by 10x Genomics.
Code
# retrieve dataset from OSF repository
id <- "Visium_HumanBreast_Janesick"
pa <- OSTA.data_load(id)
dir.create(td <- tempfile())
unzip(pa, exdir=td)
# read into 'SpatialExperiment'
vis <- TENxVisium(
spacerangerOut=file.path(td, "outs"),
processing="filtered",
format="h5",
images="lowres") |>
import()
# retrieve spot annotations & add as metadata
df <- read.csv(file.path(td, "annotation.csv"))
cs <- match(colnames(vis), df$Barcode)
vis$anno <- factor(df$Annotation[cs])
# set gene symbols as feature names
rownames(vis) <- make.unique(rowData(vis)$Symbol)
vis
## class: SpatialExperiment
## dim: 18085 4992
## metadata(2): resources spatialList
## assays(1): counts
## rownames(18085): SAMD11 NOC2L ... MT-ND6 MT-CYB
## rowData names(3): ID Symbol Type
## colnames(4992): AACACCTACTATCGAA-1 AACACGTGCATCGCAC-1 ...
## TGTTGGCCAGACCTAC-1 TGTTGGCCTACACGTG-1
## colData names(5): in_tissue array_row array_col sample_id anno
## reducedDimNames(0):
## mainExpName: Gene Expression
## altExpNames(0):
## spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
## imgData names(4): sample_id image_id data scaleFactor
Code
xy <- spatialCoords(vis) * scaleFactors(vis)
ys <- nrow(imgRaster(vis)) - range(xy[, 2])
xs <- range(xy[, 1])
box <- geom_rect(
xmin=xs[1], xmax=xs[2], ymin=ys[1], ymax=ys[2],
col="black", fill=NA, linetype=2, linewidth=2/3)
plotVisium(vis, spots=FALSE, point_size=1) + box +
plotVisium(vis, point_size=1, zoom=TRUE) +
plot_layout(nrow=1) & facet_null()
Deconvolution is performed after quality control, as detailed in Chapter 10, and is usually performed on unnormalized and untransformed (i.e. raw) counts. Here, we quickly check some typically spot-level metrics.
Code
sub <- list(mt=grep("^MT-", rownames(vis)))
vis <- addPerCellQCMetrics(vis, subsets=sub)
Code
vis$log_sum <- log1p(vis$sum)
plotCoords(vis,
annotate="log_sum") +
ggtitle("log library size") +
plotCoords(vis,
annotate="subsets_mt_percent") +
ggtitle("% mitochondrial") +
ggplot(
data.frame(colData(vis)),
aes(x=sum, y=subsets_mt_percent)) +
geom_point() + geom_density_2d() +
scale_x_log10() + scale_y_sqrt() +
theme(aspect.ratio=2/3) +
plot_layout(nrow=1) & theme(
legend.key.width=unit(0.5, "lines"),
legend.key.height=unit(1, "lines")) &
scale_color_gradientn(colors=pals::jet())
A few spots have low library sizes, and can be removed.
Code
vis <- vis[, vis$sum > 1000]
We first visualize the spot-level cell type annotation provided by 10x Genomics.
Code
plotCoords(vis,
annotate="anno", point_size=1,
pal=unname(pals::trubetskoy())) +
theme(legend.key.size=unit(0, "lines"))
Now, we load the single-cell (Chromium) reference data for the Visium dataset. To streamline the demonstration, we consolidate some of the cell type annotations provided by 10x Genomics (i.e. Annotation
) into more generalized categories (i.e. Annogrp
).
Code
# retrieve dataset from OSF repo
id <- "Chromium_HumanBreast_Janesick"
pa <- OSTA.data_load(id)
dir.create(td <- tempfile())
unzip(pa, exdir=td)
# read into 'SingleCellExperiment'
fs <- list.files(td, full.names=TRUE)
h5 <- grep("h5$", fs, value=TRUE)
sce <- read10xCounts(h5, col.names=TRUE)
# use gene symbols as feature names
rownames(sce) <- make.unique(rowData(sce)$Symbol)
# retrieve cell type labels
csv <- grep("csv$", fs, value=TRUE)
cd <- read.csv(csv, row.names=1)
# ignore mixtures
lab <- cd$Annotation
lab[grepl("Hyb", lab)] <- NA
# simplify annotations
pat <- c(
"B Cell"="B", "T Cell"="T", "Mac"="macro", "Mast"="mast",
"DCs"="dendritic", "Peri"="perivas", "End"="endo",
"Str"="stromal", "Inv"="tumor", "Myo"="myoepi")
for (. in names(pat))
lab[grep(., lab)] <- pat[.]
lab <- gsub("\\s", "", lab)
# add as cell metadata
table(cd$Annogrp <- lab)
##
## B DCIS1 DCIS2 T dendritic endo macro
## 1463 1863 2159 4742 313 1055 3724
## mast myoepi perivas stromal tumor
## 92 1839 285 2611 5897
We only keep the Chromium data with an annotation and are not labeled as “Hybrid”, as these correspond to mixed subpopulations.
12.3 RCTD
Next, we perform deconvolution with spacexr (also known as RCTD)(Cable et al. 2022). By default, runRctd()
’s rctd_mode = "doublet"
specifies at most two subpopulations coexist in a data unit (i.e. within a spot); here, we set rctd_mode = "full"
in order to allow for an arbitrary number of subpopulations to be fit instead.
rctd_mode = "doublet"
, as demonstrated by (de Oliveira et al. 2025) and Chapter 15.Code
rctd_data <- createRctd(vis, sce, cell_type_col="Annogrp")
(res <- runRctd(rctd_data, max_cores=4, rctd_mode="full"))
## class: SpatialExperiment
## dim: 12 4962
## metadata(4): spatial_rna config cell_type_info internal_vars
## assays(1): weights
## rownames(12): B DCIS1 ... stromal tumor
## rowData names(0):
## colnames(4962): AACACCTACTATCGAA-1 AACACGTGCATCGCAC-1 ...
## TGTTGGCCAGACCTAC-1 TGTTGGCCTACACGTG-1
## colData names(1): sample_id
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : x y
## imgData names(0):
Weights inferred by RCTD
should be normalized such that proportions of cell types sum to 1 for each spot:
Code
## B DCIS1 DCIS2 T dendritic
## AACACCTACTATCGAA-1 0.00 0.00 0.00 0.00 0.01
## AACACGTGCATCGCAC-1 0.04 0.01 0.00 0.04 0.00
## AACACTTGGCAAGGAA-1 0.00 0.00 0.03 0.01 0.01
## AACAGGAAGAGCATAG-1 0.03 0.00 0.00 0.08 0.03
## AACAGGATTCATAGTT-1 0.00 0.00 0.01 0.00 0.00
12.4 CARD
Another method that can be used is CARD
. First, we rename the columns of spatial coordinates for CARD
.
Next, we perform the CARD
deconvolution. Here, we demonstrate CARD
’s interoperability with SingleCellExperiment
and SpatialExperiment
. The deconvolution result matrix is already normalized such that the sum of cell type proportions for each spot is equal to 1.
CARD
can also take a reference matrix, a reference cell type annotation column, a spatial count matrix, and a spatial coordinates data frame as separate items in sc_count
, sc_meta
, spatial_count
, and spatial_location
, respectively. However, we encourage simplifying the process by using existing Bioconductor classes.Code
set.seed(2025)
CARD_obj <- CARD_deconvolution(
spe=vis,
sce=sce,
sc_count=NULL,
sc_meta=NULL,
spatial_count=NULL,
spatial_location=NULL,
ct_varname="Annogrp",
ct_select=NULL, # use all 'sce$Annogrp' cell types
sample_varname=NULL, # use all 'sce' as one 'ref' sample
mincountgene=100,
mincountspot=5)
ws_card <- CARD_obj$Proportion_CARD
# order cell type names alphabetically, as for RCTD
ws_card <- data.frame(ws_card[, colnames(ws_rctd)])
round(ws_card[1:5, 1:5], 2)
## B DCIS1 DCIS2 T dendritic
## AACACCTACTATCGAA-1 0 0 0 0.00 0
## AACACGTGCATCGCAC-1 0 0 0 0.00 0
## AACACTTGGCAAGGAA-1 0 0 0 0.01 0
## AACAGGAAGAGCATAG-1 0 0 0 0.02 0
## AACAGGATTCATAGTT-1 0 0 0 0.00 0
12.5 Visualization
First, we define a couple accessory functions.
Code
.plt_xy <- \(ws, vis, col, point_size) {
xy <- spatialCoords(vis)[rownames(ws), ]
colnames(xy) <- c("x", "y")
df <- cbind(ws, xy)
ggplot(df, aes(x, y, col=.data[[col]])) +
coord_equal() + theme_void() +
geom_point(size=point_size)
}
.plt_decon <- \(ws, vis) {
ps <- lapply(names(ws), \(.) .plt_xy(ws, vis, col=., point_size=0.3))
ps |> wrap_plots(nrow=3) & theme(
legend.key.width=unit(0.5, "lines"),
legend.key.height=unit(1, "lines")) &
scale_color_gradientn(colors=pals::jet())
}
We can visualize deconvolution weights in x-y space, i.e., coloring by the proportion of a given cell type estimated to fall within a given spot:
Code
.plt_decon(ws=ws_rctd, vis)
Code
.plt_decon(ws=ws_card, vis)
The deconvolution results can also be viewed as a heatmap, where rows = cells and columns = clusters:
Code
In both methods, we see that more than half of the spots are estimated to have a stromal proportion of more than 50%. Few spots have an intense and distinct signal for cancerous subpopulations, DCIS1 and DCIS2. For the following analysis, we focus on RCTD
as an example.
For comparison with spot annotations provided by 10x Genomics, we include majority voted cell type from deconvolution by RCTD
. Note that, because stromal cells show broad signals across the entire tissue, to better investigate immune cell signals, we remove stromal from the majority vote calculation for an alternative label: RCTD_no_stroma
.
Code
ws <- ws_rctd
# derive majority vote label
ids <- names(ws)[apply(ws, 1, which.max)]
names(ids) <- rownames(ws)
vis$RCTD <- factor(ids[colnames(vis)])
# derive majority vote excluding stromal cells
ws_no_stroma <- ws[, colnames(ws) != "stromal"]
ids_no_stroma <- names(ws_no_stroma)[apply(ws_no_stroma, 1, which.max)]
names(ids_no_stroma) <- rownames(ws)
vis$RCTD_no_stroma <- factor(ids_no_stroma[colnames(vis)])
We can visualize these three annotations spatially:
Code
lapply(
c("anno", "RCTD", "RCTD_no_stroma"),
\(.) plotCoords(vis, annotate=.)) |>
wrap_plots(nrow=1) &
theme(legend.key.size=unit(0, "lines")) &
scale_color_manual(values=unname(pals::trubetskoy()))
Note the strong stromal signals and macrophages being the second most common cell type for stromal cells. To help characterize subpopulations from deconvolution, we can view their distribution against the provided annotation:
Code
cd <- data.frame(colData(vis))
df <- as.data.frame(with(cd, table(RCTD, anno)))
fd <- as.data.frame(with(cd, table(RCTD_no_stroma, anno)))
ggplot(df,
aes(Freq, RCTD, fill=anno)) +
ggtitle("RCTD") +
ggplot(fd,
aes(Freq, RCTD_no_stroma, fill=anno)) +
ggtitle("RCTD_no_stroma") +
plot_layout(nrow=1, guides="collect") &
labs(x="Proportion", y=NULL) &
coord_cartesian(expand=FALSE) &
geom_col(width=1, col="white", position="fill") &
scale_fill_manual(values=unname(pals::trubetskoy())) &
theme_minimal() & theme(
aspect.ratio=1,
legend.key.size=unit(2/3, "lines"),
plot.title=element_text(hjust=0.5))
Next, we can investigate the agreement between the provided annotation against the two deconvolution majority vote labels.
Code
hm <- \(mat, string) pheatmap(
mat, show_rownames=TRUE, show_colnames=TRUE, main=string,
cellwidth=10, cellheight=10, treeheight_row=5, treeheight_col=5)
hm(prop.table(table(vis$anno, vis$RCTD), 2), string="RCTD")
hm(prop.table(table(vis$anno, vis$RCTD_no_stroma), 2), string="RCTD_no_stroma")
Overall, we observe agreement between the provided spot labels and the RCTD
deconvolution derived annotations. Before cleaning up stromal, some immune cell types, such as dendritic and mast, never had a chance to have the highest cell type proportion. On the left panel, among among all the spots annotated by RCTD
as T cells, nearly all of them are from the “immune” type in the provided annotation. Strong agreements are also observed for spots with cell type of “DCIS1”, “DCIS2”, and “Invasive tumor”.
Next, we prepare the principal components (PCs) needed to perform PC regression:
Code
# log-library size normalization
vis <- logNormCounts(vis)
# feature selection
dec <- modelGeneVar(vis)
hvg <- getTopHVGs(dec, prop=0.1)
# dimension reduction
set.seed(1234)
vis <- runPCA(vis, ncomponents=20, subset_row=hvg)
We fit the deconvolution result of each cell type against the first 10 PCs to obtain 10 regressions.
Here we plot the coefficient of determination of the first 10 PCs for each cell type.
Code
pcr$id <- factor(pcr$id, ids)
pal <- pals::trubetskoy()
ggplot(pcr, aes(pc, r2, col=id)) +
geom_line(show.legend=FALSE) + geom_point() +
scale_color_manual("predictor", values=unname(pal)) +
scale_x_continuous(breaks=c(1, seq(5, 20, 5))) +
scale_y_continuous(limits=c(0, 1), breaks=seq(0, 1, 0.2)) +
labs(x="principal component", y="coeff. of determination") +
guides(col=guide_legend(override.aes=list(size=2))) +
coord_cartesian(xlim=c(1, 10)) +
theme_minimal() + theme(
panel.grid.minor=element_blank(),
legend.key.size=unit(0, "lines"))
Let’s inspect the key drivers of (expression) variability in terms of PCs. Considering deconvolution results from above, we can see that, e.g.:
- PC1 distinguishes stromal, tumor, macrophage from the rest of the tissue
- PC3, PC4 and PC5 separate DCIS1, T and endothelial cells, respectively
Code
# retrieve top-10 PCs
pcs <- reducedDim(vis, "PCA")
pcs <- pcs[rownames(ws), seq_len(10)]
# specify subpopulations & PCs to visualize
var <- c("DCIS1", "T", "endo")
var <- c(var, colnames(pcs)[3:5])
# visualize deconvolution weights alongside PCs
lapply(var, \(.) {
.plt_xy(
cbind(ws, pcs), vis, col=., point_size=0.3) +
scale_color_gradientn(., colors=pals::jet())
}) |>
wrap_plots(nrow=2) & theme(
plot.title=element_blank(),
legend.key.width=unit(0.5, "lines"),
legend.key.height=unit(1, "lines"))
Note that the direction of each PC is irrelevant from how much variation it explains.
In conclusion, deconvolution-based cell type proportion estimates are able to recapitulate PCs and, in turn, expression variability.
RCTD
can be used as a label transfer tool to annotate imaging-based ST data, such as for Xenium and MERSCOPE. For this, the default doublet_mode = "doublet"
should be used, and a certainty score would be returned to indicate doublets with two predicted cell types.12.6 Appendix
Benchmarks
Benchmarking studies of deconvolution methods often require generating synthetic spots to establish a ground truth for cell type proportions. This process involves either simulating artificial tissue patterns or aggregating counts from scRNA-seq or imaging-based spatial transcriptomics data into spots. However, it is equally important to evaluate how these methods perform in tissues with highly spatially-heterogeneous cell type compositions, such as real cancer samples. Below are three comprehensive benchmarking studies, two of which that incorporate both artificial and real datasets.
Sang-aram et al. (2023) developed a pipeline to benchmark 11 deconvolution methods, including
RCTD
, across 63 synthetic, 3 binned, and 2 real datasets.RCTD
andcell2location
were the most recommended methods. Figure 2 gives an overview of the benchmarking results.Li et al. (2023) benchmarked 18 deconvolution methods, including
RCTD
, across 50 simulated and real datasets. Among these methods,CARD
,cell2location
, andTangram
are highly recommended. Figure 1 summarizes the method performance, and Figure 4 gives a flowchart of how to decide on which method to use.Gaspard-Boulinc et al. (2025) review and compare available cell-type deconvolution methods, and provide a continuously updated web-based summary table.