9  Processing

9.1 Preamble

9.1.1 Introduction

This chapter demonstrates methods for several data processing steps – normalization, feature selection, and dimensionality reduction – that are required before downstream analysis methods can be applied.

We use methods from the scater (McCarthy et al. 2017) and scran (Lun, McCarthy, and Marioni 2016) packages, which were originally developed for scRNA-seq data, with the (simplified) assumption that these methods can be applied to spot-based ST data by treating spots as equivalent to cells. In addition, we demonstrate alternative spatially-aware methods.

9.1.2 Dependencies

Here, we import a 10x Genomics Visium dataset that will be used in several of the following chapters. This dataset has previously been preprocessed using data preprocessing procedures with tools outside R and saved in SpatialExperiment format, and is available for download from the STexampleData package.

This dataset consists of one sample (Visium capture area) from one donor, consisting of postmortem human brain tissue from the dorsolateral prefrontal cortex (DLPFC) brain region, measured with the 10x Genomics Visium platform. The dataset is described in Maynard et al. (2021).

Code
spe <- Visium_humanDLPFC()
imgData(spe) <- NULL # tmp fix
spe
##  class: SpatialExperiment 
##  dim: 33538 4992 
##  metadata(0):
##  assays(1): counts
##  rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##    ENSG00000268674
##  rowData names(3): gene_id gene_name feature_type
##  colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
##    TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
##  colData names(8): barcode_id sample_id ... reference cell_count
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
##  imgData names(0):

9.2 Normalization

9.2.1 Non-spatial

The simplest approach is to calculate log-transformed normalized counts (“logcounts”) using library size factors, treating each spot as equivalent to a single cell. This approach is simple and fast, and generally provides a good baseline. However, in some datasets, the (over)-simplified assumption of each spot being equivalent to a single cell will not be appropriate. This example uses the scater (McCarthy et al. 2017) and scran (Lun, McCarthy, and Marioni 2016) packages.

Note that some alternative non-spatial methods from single-cell workflows (e.g. normalization by deconvolution) are less appropriate for spot-based spatial transcriptomics data, since spots can contain multiple cells from multiple cell types, and datasets can contain multiple samples (e.g. multiple tissue sections, resulting in sample-specific clustering).

Library size-based normalization was originally developed for scRNA-seq data, and makes the assumption that spots are equivalent to single cells; this has been shown to be problematic because “library size confounds [spatial] biology” (Bhuva et al. 2024).
Code
# calculate library size factors
spe <- computeLibraryFactors(spe)
summary(sizeFactors(spe))
##     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.0194  0.4216  0.8880  1.0000  1.3995  4.7655
Code
hist(sizeFactors(spe), breaks=20)

Code
# calculate 'logcounts'; these will
# be added as an additional 'assay'
spe <- logNormCounts(spe)
assayNames(spe)
##  [1] "counts"    "logcounts"

9.2.2 Spatially-aware

SpaNorm (Salim et al. 2024) was developed for spatial transcriptomics data specifically, and uses a gene-wise model (e.g., negative binomial) in which variation is decomposed into a library size-dependent (technical) and -independent (biological) component.

9.3 Feature selection

9.3.1 Non-spatial: Highly variable genes (HVGs)

Here, we re-use feature selection methods from scRNA-seq workflows to identify a set of ‘highly variable’ genes (HVGs). These will be used as the input for subsequent steps, such as dimensionality reduction and clustering. In this example, we use the scran package (Lun, McCarthy, and Marioni 2016) to identify HVGs. As above, this makes the simplified assumption that spots can be treated as equivalent to single cells.

To identify HVGs, we first remove mitochondrial genes, since these are very highly expressed in this dataset and are not of main biological interest.

Note that HVGs are defined based only on molecular features (i.e., gene expression), and do not take any spatial information into account. If the biologically meaningful spatial information in this dataset mainly reflects spatial distributions of cell types (defined by gene expression), then relying on HVGs for downstream analyses may be sufficient. However, many datasets contain further spatial structure that is not captured in this way, which may be investigated using spatially-aware methods such as identifying spatially variable genes (SVGs).
Code
# identify mitochondrial genes
nm <- rowData(spe)$gene_name
mt <- grepl("^MT-", nm, ignore.case=TRUE)
table(mt)
##  mt
##  FALSE  TRUE 
##  33525    13
Code
# remove
spe <- spe[!mt, ]
nrow(spe)
##  [1] 33525

Next, apply methods from scran. This gives a list of HVGs, which can be used as the input for subsequent steps. The parameter prop defines the proportion of HVGs to select; for example, prop=0.1 returns the top 10%.

Alternatively, argument n can be used to selected a fixed number of genes, say 2,000.
Code
# fit mean-variance relationship, decomposing 
# variance into technical and biological components
dec <- modelGeneVar(spe)
# select top HVGs (i.e. variance above trend)
sel <- getTopHVGs(dec, prop=0.1)
length(sel)
##  [1] 1281
Code
# visualize mean-variance relationship
par(mar=c(4, 4, 0, 0))
fit <- metadata(dec)
plot(fit$mean, fit$var, cex=0.5, xlab="mean expression", ylab="variance")
points(dec[sel, "mean"], dec[sel, "total"], cex=0.5, col="dodgerblue")
curve(fit$trend(x), add=TRUE, lwd=2, col="tomato")

9.3.2 Spatially-aware: Spatially variable genes (SVGs)

Alternatively, spatially-aware methods can instead be used to identify ‘spatially variable genes’ (SVGs), either instead of or complementary to HVGs.

A number of methods have been developed to identify SVGs. These methods take the spatial information in spatial transcriptomics measurements into account, which can provide a more biologically informative ranking of genes.

For more details and examples, see Chapter 27.

9.4 Dimension reduction (DR)

A standard next step is to perform PCA to the set of top HVGs; see Chapter 25. In addition, we’ll perform non-linear DR technique, UMAP, on a subset of PCs (default 50).

Note that, the algorithm underlying UMAP is stochastic, so that UMAP ‘shapes’ will change from one run to the next. Thus, a random seed should be set in order to make the visualization reproducible.
Code
spe <- runPCA(spe, subset_row=sel)
spe <- runUMAP(spe, dimred="PCA")

9.5 Clustering

Next we’ll run a spatially aware clustering, BayesSpace (Zhao et al. 2021), developed specifically for seq-based ST data; see also Chapter 26.

Code
.spe <- spatialPreprocess(spe, skip.PCA=TRUE)
.spe <- spatialCluster(.spe, nrep=1e3, burn.in=100, q=10, d=20)
table(spe$BayesSpace <- factor(.spe$spatial.cluster))
##  
##     1    2    3    4    5    6    7    8    9   10 
##   727  271  155  703  359  232  316  799  287 1143

9.6 Visualizations

Let’s visualize cluster predictions in space, here alongside ground_truth spot labels from manual annotation:

Code
plotSpots(spe, annotate="ground_truth") +
plotSpots(spe, annotate="BayesSpace") +
    plot_layout() &
    theme(legend.key.size=unit(0, "lines")) &
    scale_color_manual(values=unname(pals::trubetskoy())) 

Inspecting the spatial distribution of the top PCs, we can observe that the main driver of expression variability is the distinction between white matter (WM) and non-WM brain layers:

Code
pcs <- reducedDim(spe, "PCA")
pcs <- pcs[, seq_len(4)]
lapply(colnames(pcs), \(.) {
    spe[[.]] <- pcs[, .]
    plotSpots(spe, annotate=.)
}) |>
    wrap_plots(nrow=1) &
    geom_point(shape=16, stroke=0, size=0.2) &
    scale_color_gradientn(colors=pals::jet()) &
    coord_equal() & theme_void() & theme(
        legend.position="bottom",
        plot.title=element_text(hjust=0.5),
        legend.key.width=unit(0.8, "lines"),
        legend.key.height=unit(0.4, "lines"))
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Coordinate system already present. Adding new coordinate system, which will
##  replace the existing one.
##  Coordinate system already present. Adding new coordinate system, which will
##  replace the existing one.
##  Coordinate system already present. Adding new coordinate system, which will
##  replace the existing one.
##  Coordinate system already present. Adding new coordinate system, which will
##  replace the existing one.

9.7 Marker genes

Having clustered our spots, we can test for genes that are differentially expressed (DE) between domains (here, using pairwise t-tests and specifically testing for upregulation, i.e., expression should be higher in the cluster for which a gene is reported to be a marker); see Chapter 26.

We could also test for spatially variable genes (SVGs) within domains to identify spatial patterns in gene expression that may be independent of spatial domains; see Chapter 27.
Code
mgs <- findMarkers(spe, groups=spe$BayesSpace, direction="up")
top <- lapply(mgs, \(df) rownames(df)[df$Top <= 3])
length(top <- unique(unlist(top)))
##  [1] 67

We can visualize selected markers as a heatmap where bins represent the average expression (here, log-transformed library size-normalize counts) of a given gene (= columns) in a given cluster (= rows).

To visually amplify differences, we use scale="column". This will, for every gene, subtract the mean and divide by the standard deviation across per-cluster means, bringing genes of potentially very different expression levels to a comparable scale.
Code
pbs <- aggregateAcrossCells(spe,  
    ids=spe$BayesSpace, subset.row=top, 
    use.assay.type="logcounts", statistics="mean")
# use symbols as feature names
mtx <- t(assay(pbs))
colnames(mtx) <- rowData(pbs)$gene_name
pheatmap(mat=mtx, scale="column")

As well as in space, i.e., coloring spots by their expression of a given marker:

Code
ps <- lapply(
    c("MBP", "PLP1", "NRGN", "SNAP25", "NEFL", "HPCAL1"), \(.) 
    plotSpots(spe, annotate=., feature_names="gene_name", assay_name="logcounts"))
wrap_plots(ps, nrow=2) & theme(
    legend.key.width=unit(0.4, "lines"),
    legend.key.height=unit(0.8, "lines")) &
    scale_color_gradientn(colors=rev(hcl.colors(9, "Rocket")))
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.
##  Scale for colour is already present.
##  Adding another scale for colour, which will replace the existing scale.

9.8 Appendix

Saving

Code
colLabels(spe) <- spe$BayesSpace
saveRDS(spe, "seq-spe_cl.rds")

References

Bhuva, Dharmesh D, Chin Wee Tan, Agus Salim, Claire Marceaux, Marie A Pickering, Jinjin Chen, Malvika Kharbanda, et al. 2024. “Library Size Confounds Biology in Spatial Transcriptomics Data.” Genome Biology 25 (1): 99.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-seq Data with Bioconductor.” F1000Research 5 (2122). https://doi.org/10.12688/f1000research.9501.2.
Maynard, Kristen R., Leonardo Collado-Torres, Lukas M. Weber, Cedric Uytingco, Brianna K. Barry, Stephen R. Williams, Joseph L. Catallini II, et al. 2021. “Transcriptome-Scale Spatial Gene Expression in the Human Dorsolateral Prefrontal Cortex.” Nature Neuroscience 24: 425–36. https://doi.org/10.1038/s41593-020-00787-0.
McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Wills. 2017. Scater: Pre-Processing, Quality Control, Normalization and Visualization of Single-Cell RNA-seq Data in R.” Bioinformatics 33 (8): 1179–86. https://doi.org/10.1093/bioinformatics/btw777.
Salim, Agus, Dharmesh D Bhuva, Carissa Chen, Pengyi Yang, Melissa J Davis, and Jean Y H Yang. 2024. SpaNorm: Spatially-Aware Normalisation for Spatial Transcriptomics Data.” bioRxiv, June, 2024.05.31.596908.
Zhao, Edward, Matthew R. Stone, Xing Ren, Jamie Guenthoer, Kimberly S. Smythe, Thomas Pulliam, Stephen R. Williams, et al. 2021. “Spatial Transcriptomics at Subspot Resolution with BayesSpace.” Nature Biotechnology 39: 1375–84. https://doi.org/10.1038/s41587-021-00935-2.
Back to top