10  Deconvolution

10.1 Introduction

Seq-based ST data can contain zero to multiple cells per spot, which might be fully covered or fractured, depending on the spatial resolution of the platform and the tissue cell density (c.f. Chapter 7). This aspect of the data necessitates a computational estimation of cell type composition for each spot to analyze the spatial distribution of biological signals. In addition, the spatial coordinates of ST data sets it apart from single-cell RNA sequencing data, making it sub-optimal to directly apply existing single-cell label transfer methods.

Schematic illustrating multiple cells in a 10x Visium spot (diameter of 55um; purple line) overlaying the H&E image. The distance between spots is 100um (yellow line), and an average immune cells’ diameter is around 10um (cyan line).

There has been around twenty deconvolution techniques proposed for spot-level ST. Some methods require borrowing insights from a scRNAseq reference, while others can be reference-free. Based on their underlying algorithms, Li et al. (2023) have grouped methods into five categories:

  • probabilistic-based: use Bayesian inference, likelihood estimation, or probabilistic modeling to estimate cell-type compositions while incorporating uncertainty. Available tools are spacexr (i.e. RCTD), CARD, SpatialDecon, and STdeconvolve in R, and std-poisson (i.e. Berglund et al.), stereoscope, scvi-tools (i.e. DestVI), STRIDE, and cell2location in Python.
  • non-negative matrix factorization (NMF)-based: decompose gene expression data into latent components representing different cell types. Available tools are Giotto (i.e. SpatialDWLS) and SPOTlight in R, and NMFReg in Python.
  • graph-based: use graph neural networks or graph-based optimization to model spatial relationships. Available tools are SD2 in R, and DSTG and SpiceMIx in Python.
  • optimal transport (OT)-based: infer spatial gene expression distributions by mapping scRNAseq and ST data. Available tools are SpaOTsc and novosparc in Python.
  • deep learning-based: align and integrate single-cell and spatial transcriptomic data with neural networks. For example, Tangram in Python.

Among these, Berglund et al., STdeconvolve, and SpiceMix are reference-free methods. Methods that incorporate spatial location information are CARD, DSTG, SD2, Tangram, cell2location, DestVI, Berglund et al., and SpiceMix.

In this section, we will demonstrate deconvolve cell types per spot, using RCTD on Visium and VisiumHD datasets.

10.2 Dependencies

In this example of Visium breast cancer data (Janesick et al. 2023), we perform Visium cell type deconvolution without Chromium reference and compare the concordance with the provided Visium annotation by 10x Genomics.

Code
# retrieve dataset from OSF repository
id <- "Visium_HumanBreast_Janesick"
pa <- OSTA.data_load(id)
dir.create(td <- tempfile())
unzip(pa, exdir=td)

# read into 'SpatialExperiment'
vis <- TENxVisium(
  spacerangerOut=td, 
  processing = "filtered", 
  format="h5", 
  images="lowres") |> 
  import()

# retrieve spot annotations & add as metadata
df <- read.csv(file.path(td, "annotation.csv"))
cs <- match(colnames(vis), df$Barcode)
vis$anno <- factor(df$Annotation[cs])

# set gene symbols as feature names
rownames(vis) <- make.unique(rowData(vis)$Symbol)
vis
##  class: SpatialExperiment 
##  dim: 18085 4992 
##  metadata(2): resources spatialList
##  assays(1): counts
##  rownames(18085): SAMD11 NOC2L ... MT-ND6 MT-CYB
##  rowData names(3): ID Symbol Type
##  colnames(4992): AACACCTACTATCGAA-1 AACACGTGCATCGCAC-1 ...
##    TGTTGGCCAGACCTAC-1 TGTTGGCCTACACGTG-1
##  colData names(5): in_tissue array_row array_col sample_id anno
##  reducedDimNames(0):
##  mainExpName: Gene Expression
##  altExpNames(0):
##  spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
##  imgData names(4): sample_id image_id data scaleFactor

Deconvolution comes after proper quality control, as detailed in Chapter 8, and is usually performed on unnormalized and untransformed (i.e., raw) counts. Here, we quickly check if any spot has a count of 0 and remove it.

Code
sub <- list(mt=grep("^MT-", rownames(vis)))
vis <- addPerCellQCMetrics(vis, subsets=sub)

All spots have non-zero library size, so we can proceed without subsetting. We first visualize the spot-level cell type annotation provided by 10x Genomics.

Code
plotSpots(vis, 
  annotate = "anno", point_size = 1, 
  pal = unname(pals::trubetskoy())) +
  theme(legend.key.size = unit(0, "lines"))

Now, we load the Chromium reference data for the Visium dataset. To streamline the demonstration, we consolidate some of the cell type annotations provided by 10x Genomics (i.e. Annotation) into more generalized categories (i.e. Annogrp).

Code
# retrieve dataset from OSF repo
id <- "Chromium_HumanBreast_Janesick"
pa <- OSTA.data_load(id)
dir.create(td <- tempfile())
unzip(pa, exdir=td)

# read into 'SingleCellExperiment'
fs <- list.files(td, full.names=TRUE)
h5 <- grep("h5$", fs, value=TRUE)
sce <- read10xCounts(h5, col.names=TRUE)

# use gene symbols as feature names
rownames(sce) <- make.unique(rowData(sce)$Symbol)

# retrieve cell type labels
csv <- grep("csv$", fs, value=TRUE)
cd <- read.csv(csv, row.names=1)

# simplify annotations
pat <- c(
    "B Cell"="B", "T Cell"="T", "Mac"="macro", "Mast"="mast", "DCs"="dendritic", 
    "Peri"="perivas", "End"="endo", "Str"="stromal", "Inv"="tumor", "Myo"="myoepi")
lab <- cd$Annotation
for (. in names(pat)) 
    lab[grep(., lab)] <- pat[.]
lab[grepl("Hyb", lab)] <- NA
lab <- gsub("\\s", "", lab)

# add as cell metadata
table(cd$Annogrp <- lab)
##  
##          B     DCIS1     DCIS2         T dendritic      endo     macro 
##       1463      1863      2159      6171       313      1055      3724 
##       mast    myoepi   perivas   stromal     tumor 
##         92      1839       285      2611      5897
Code
colData(sce)[names(cd)] <- cd[colnames(sce), ]

We only keep the Chromium data with an annotation and are not labeled as “Hybrid”, as these correspond to mixed subpopulations.

Code
sce <- sce[, !is.na(sce$Annogrp)]
dim(sce)
##  [1] 18082 27460

10.3 RCTD

Next, we perform deconvolution with spacexr (f.k.a. RCTD)(Cable et al. 2022). By default, runRctd()’s rctd_mode="doublet", specifies at most two subpopulations coexist in a data unit, such as a “spot”; here, we set rctd_mode="full" in order to allow for an arbitrary number of subpopulations to be fit instead.

Note that RCTD can also be adapted to VisiumHD data with rctd_mode="doublet", as demonstrated by (Oliveira et al. 2024) and Chapter 13.
Code
obj <- createRctd(vis, sce, cell_type_col="Annogrp", max_cores=10)
(res <- runRctd(obj, rctd_mode="full"))
##  class: SpatialExperiment 
##  dim: 12 4988 
##  metadata(6): spatial_rna original_spatial_rna ... cell_type_info
##    internal_vars
##  assays(1): weights_full
##  rownames(12): B DCIS1 ... stromal tumor
##  rowData names(0):
##  colnames(4988): AACACCTACTATCGAA-1 AACACGTGCATCGCAC-1 ...
##    TGTTGGCCAGACCTAC-1 TGTTGGCCTACACGTG-1
##  colData names(1): sample_id
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  spatialCoords names(2) : x y
##  imgData names(0):

Weights inferred by RCDT should be normalized such that proportions of cell types sum to 1 in each spot:

Code
# scale weights such that they sum to 1
ws <- assay(res)
ws <- sweep(ws, 2, colSums(ws), `/`)
# add proportion estimates as metadata
ws <- data.frame(t(as.matrix(ws)))
colData(vis)[names(ws)] <- ws[colnames(vis), ]

Let’s visualize deconvolution weights in space, i.e., coloring by the proportion of a given cell type estimated to fall within a given spot:

Code
lapply(names(ws), \(.) 
    plotSpots(vis, annotate=., point_size = 0.05)) |>
    wrap_plots(nrow=3) & theme(
    legend.key.width=unit(0.5, "lines"),
    legend.key.height=unit(1, "lines")) &
    scale_color_gradientn(colors=pals::jet())

The deconvolution result can also be viewed as a heatmap, where rows = cells and columns = clusters:

Code
pheatmap(ws, show_rownames=FALSE, show_colnames=TRUE,
    cellwidth=12, treeheight_row=5, treeheight_col=5)

We see that more than half of the spots are estimated to have a stromal proportion of more than 50%. Few spots have an intense and distinct signal for cancerous subpopulations, DCIS1 and DCIS2.

For comparison with spot annotations provided by 10x Genomics, we include majority voted cell type from deconvolution as RCTD. Note that, because stromal cells show broad signals across the entire tissue, to better investigate immune cell signals, we remove stromal from the majority vote calculation for an alternative label: RCTD_no_stroma.

Code
# derive majority vote label
ids <- names(ws)[apply(ws, 1, which.max)]
names(ids) <- rownames(ws)
vis$RCTD <- factor(ids[colnames(vis)])

# derive majority vote nouding stromal cells
ws_no_stroma <- ws[, colnames(ws) != "stromal"]
ids_no_stroma <- names(ws_no_stroma)[apply(ws_no_stroma, 1, which.max)]
names(ids_no_stroma) <- rownames(ws)
vis$RCTD_no_stroma <- factor(ids_no_stroma[colnames(vis)])

We can visualize these three annotations spatially:

Code
lapply(c("anno", "RCTD", "RCTD_no_stroma"), 
    \(.) plotSpots(vis, annotate=.)) |>
    wrap_plots(nrow=1) &
    theme(legend.key.size=unit(0, "lines")) &
    scale_color_manual(values=unname(pals::trubetskoy()))

Note the strong stromal signals and macrophage being the second common cell type for stromal cells. To help characterize subpopulations from deconvolution, we can view their distribution against the provided annotation:

Code
cd <- data.frame(colData(vis))
df <- as.data.frame(with(cd, table(RCTD, anno)))
fd <- as.data.frame(with(cd, table(RCTD_no_stroma, anno)))
ggplot(df, aes(Freq, RCTD, fill=anno)) + ggtitle("RCTD") +
ggplot(fd, aes(Freq, RCTD_no_stroma, fill=anno)) + ggtitle("RCTD_no_stroma") +
plot_layout(nrow=1, guides="collect") &
    labs(x="Proportion", y=NULL) &
    coord_cartesian(expand=FALSE) &
    geom_col(width=1, col="white", position="fill") &
    scale_fill_manual(values=unname(pals::trubetskoy())) &
    theme_minimal() & theme(aspect.ratio=1,
        legend.key.size=unit(2/3, "lines"),
        plot.title=element_text(hjust=0.5))

Next, we can investigate the agreement between provided annotation against the two deconvolution majority vote labels.

Code
hm <- \(.) pheatmap(., show_rownames=TRUE, show_colnames=TRUE,
    cellwidth=10, cellheight=10, treeheight_row=5, treeheight_col=5)
hm(prop.table(table(vis$anno, vis$RCTD), 2))
hm(prop.table(table(vis$anno, vis$RCTD_no_stroma), 2))

Overall, we observe agreement between the provided spot-labels and RCTD deconvolution derived annotations. Before cleaning up stromal, some immune cell types, such as dendritic and mast, never had a chance to have the highest cell type proportion. On the left panel, among among all the spots annotated by RCTD as T cells, nearly all of them are in “immune” type in the provided annotation. Strong agreements are also observed for spots with cell type of “DCIS1”, “DCIS2”, and “Invasive tumor”.

Next, we prepare the principal components (PCs) needed to perform PC regression:

Code
vis <- logNormCounts(vis)
# Feature selection 
dec <- modelGeneVar(vis)
hvg <- getTopHVGs(dec, prop = 0.1)

# Dimension reduction 
set.seed(1234)
vis <- runPCA(vis, ncomponents = 20, subset_row = hvg)

We fit the deconvolution result of each cell type against the first 10 PCs to obtain 10 regressions.

Code
idx <- rownames(ws)
ids <- colnames(ws)
pcs <- reducedDim(vis, "PCA")
pcs <- pcs[idx, seq_len(10)]
pcr <- lapply(ids, \(id) {
    fit <- summary(lm(pcs ~ vis[, idx][[id]]))
    r2 <- sapply(fit, \(.) .$adj.r.squared)
    data.frame(id, pc=seq_along(r2), r2)
}) |> do.call(what=rbind)

Here we plot the coefficient of determination of the first 10 PCs for each cell type.

Code
pcr$id <- factor(pcr$id, ids)
pal <- pals::trubetskoy()
ggplot(pcr, aes(pc, r2, col=id)) +
    geom_line(show.legend=FALSE) + geom_point() +
    scale_color_manual("predictor", values=unname(pal)) +
    scale_x_continuous(breaks=c(1, seq(5, 20, 5))) +
    scale_y_continuous(limits=c(0, 1), breaks=seq(0, 1, 0.2)) +
    labs(x="principal component", y="coeff. of determination") +
    guides(col=guide_legend(override.aes=list(size=2))) +
    coord_cartesian(xlim=c(1, 10)) +
    theme_minimal() + theme(
        panel.grid.minor=element_blank(),
        legend.key.size=unit(0, "lines"))

Let’s inspect the key drivers of (expression) variability in terms of PCs. Considering deconvolution results from above, we can see that

  • PC1 distinguishes stromal, tumor, macrophage from the rest of the tissue;
  • PC3 clearly separates DCIS1;
  • PC4 captures T cells region;
  • PC8 separates DCIS2.
Code
pcs <- reducedDim(vis, "PCA")
colData(vis)[colnames(pcs)] <- pcs
lapply(c("DCIS1", "T", "DCIS2", colnames(pcs)[c(3,4,8)]), 
    \(.) plotSpots(vis, annotate=.) +
    scale_color_gradientn(., colors=pals::jet())) |>
    wrap_plots(nrow=2) & theme(
        plot.title=element_blank(),
        legend.key.width=unit(0.5, "lines"),
        legend.key.height=unit(1, "lines"))

Note that the direction of each PC is irrelevant from how much variation it explains.

In conclusion, deconvolution-based cell type proportion estimates are able to recapitulate PCs and, in turn, expression variability.

Apart from being a tool for spot deconvolution, RCTD can be used as a label transferring tool to annotate imaging-based ST data, such as for Xenium and MERSCOPE. In such scenarios, the default doublet_mode="doublet" should be used and a certainty score would be returned to indicate doublets with two predicted cell types.

10.4 Appendix

10.4.1 Benchmarks

Benchmarking studies of deconvolution methods often requires generating synthetic spots to establish a ground truth for cell type proportions. This process involves either simulating artificial tissue patterns or aggregating counts from scRNAseq or imaging-based spatial transcriptomics data into spots. However, it is equally important to evaluate how these methods perform in tissues with spatially highly heterogeneous cell type compositions, such as real cancer samples. Below are two comprehensive benchmarking studies that incorporate both artificial and real datasets.

  • Chananchida et al. (2023) developed a pipeline to benchmark 11 deconvolution methods, including RCTD, across 63 synthetic, 3 binned, and 2 real datasets. RCTD and cell2location were the most recommended methods. Figure 2 gives an overview of the benchmarking results.

  • Li et al. (2023) benchmarked 18 deconvolution methods, including RCTD, across 50 simulated and real datasets. Among these methods, CARD, cell2location, and Tangram are highly recommended. Figure 1 summarizes the methods performance, and Figure 4 gives a flowchart of how to decide on which method to use.

References

Cable, Dylan M., Evan Murray, Luli S. Zou, Aleksandrina Goeva, Evan Z. Macosko, Fei Chen, and Rafael A. Irizarry. 2022. “Robust Decomposition of Cell Type Mixtures in Spatial Transcriptomics.” Nature Biotechnology 40 (4): 517–26. https://doi.org/10.1038/s41587-021-00830-w.
Chananchida, S, B Robin, S Ruth, and S Yvan. 2023. “Spotless: A Reproducible Pipeline for Benchmarking Cell Type Deconvolution in Spatial Transcriptomics. eLife 12.” RP88431 10.
Janesick, Amanda, Robert Shelansky, Andrew D. Gottscho, Florian Wagner, Stephen R. Williams, Morgane Rouault, Ghezal Beliakoff, et al. 2023. “High Resolution Mapping of the Tumor Microenvironment Using Integrated Single-Cell, Spatial and in Situ Analysis.” Nature Communications 14 (1): 8353.
Li, Haoyang, Juexiao Zhou, Zhongxiao Li, Siyuan Chen, Xingyu Liao, Bin Zhang, Ruochi Zhang, Yu Wang, Shiwei Sun, and Xin Gao. 2023. “A Comprehensive Benchmarking with Practical Guidelines for Cellular Deconvolution of Spatial Transcriptomics.” Nature Communications 14 (1): 1548.
Oliveira, Michelli F., Juan P. Romero, Meii Chung, Stephen Williams, Andrew D. Gottscho, Anushka Gupta, Susan E. Pilipauskas, et al. 2024. “Characterization of Immune Cell Populations in the Tumor Microenvironment of Colorectal Cancer Using High Definition Spatial Profiling.” bioRxiv. https://doi.org/10.1101/2024.06.04.597233.
Back to top