9 Processing
9.1 Preamble
9.1.1 Introduction
This chapter demonstrates methods for several data processing steps – normalization, feature selection, and dimensionality reduction – that are required before downstream analysis methods can be applied.
We use methods from the scater (McCarthy et al. 2017) and scran (Lun, McCarthy, and Marioni 2016) packages, which were originally developed for scRNA-seq data, with the (simplified) assumption that these methods can be applied to spot-based ST data by treating spots as equivalent to cells. In addition, we demonstrate alternative spatially-aware methods.
9.1.2 Dependencies
Here, we import a 10x Genomics Visium dataset that will be used in several of the following chapters. This dataset has previously been preprocessed using data preprocessing procedures with tools outside R and saved in SpatialExperiment
format, and is available for download from the STexampleData package.
This dataset consists of one sample (Visium capture area) from one donor, consisting of postmortem human brain tissue from the dorsolateral prefrontal cortex (DLPFC) brain region, measured with the 10x Genomics Visium platform. The dataset is described in Maynard et al. (2021).
Code
spe <- Visium_humanDLPFC()
imgData(spe) <- NULL # tmp fix
spe
## class: SpatialExperiment
## dim: 33538 4992
## metadata(0):
## assays(1): counts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(3): gene_id gene_name feature_type
## colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
## TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
## colData names(8): barcode_id sample_id ... reference cell_count
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
## imgData names(0):
9.2 Normalization
9.2.1 Non-spatial
The simplest approach is to calculate log-transformed normalized counts (“logcounts”) using library size factors, treating each spot as equivalent to a single cell. This approach is simple and fast, and generally provides a good baseline. However, in some datasets, the (over)-simplified assumption of each spot being equivalent to a single cell will not be appropriate. This example uses the scater (McCarthy et al. 2017) and scran (Lun, McCarthy, and Marioni 2016) packages.
Note that some alternative non-spatial methods from single-cell workflows (e.g. normalization by deconvolution) are less appropriate for spot-based spatial transcriptomics data, since spots can contain multiple cells from multiple cell types, and datasets can contain multiple samples (e.g. multiple tissue sections, resulting in sample-specific clustering).
Code
# calculate library size factors
spe <- computeLibraryFactors(spe)
summary(sizeFactors(spe))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0194 0.4216 0.8880 1.0000 1.3995 4.7655
Code
hist(sizeFactors(spe), breaks=20)
Code
# calculate 'logcounts'; these will
# be added as an additional 'assay'
spe <- logNormCounts(spe)
assayNames(spe)
## [1] "counts" "logcounts"
9.2.2 Spatially-aware
SpaNorm (Salim et al. 2024) was developed for spatial transcriptomics data specifically, and uses a gene-wise model (e.g., negative binomial) in which variation is decomposed into a library size-dependent (technical) and -independent (biological) component.
9.3 Feature selection
9.3.1 Non-spatial: Highly variable genes (HVGs)
Here, we re-use feature selection methods from scRNA-seq workflows to identify a set of ‘highly variable’ genes (HVGs). These will be used as the input for subsequent steps, such as dimensionality reduction and clustering. In this example, we use the scran package (Lun, McCarthy, and Marioni 2016) to identify HVGs. As above, this makes the simplified assumption that spots can be treated as equivalent to single cells.
To identify HVGs, we first remove mitochondrial genes, since these are very highly expressed in this dataset and are not of main biological interest.
Next, apply methods from scran. This gives a list of HVGs, which can be used as the input for subsequent steps. The parameter prop
defines the proportion of HVGs to select; for example, prop=0.1
returns the top 10%.
n
can be used to selected a fixed number of genes, say 2,000.Code
# fit mean-variance relationship, decomposing
# variance into technical and biological components
dec <- modelGeneVar(spe)
# select top HVGs (i.e. variance above trend)
sel <- getTopHVGs(dec, prop=0.1)
length(sel)
## [1] 1281
9.3.2 Spatially-aware: Spatially variable genes (SVGs)
Alternatively, spatially-aware methods can instead be used to identify ‘spatially variable genes’ (SVGs), either instead of or complementary to HVGs.
A number of methods have been developed to identify SVGs. These methods take the spatial information in spatial transcriptomics measurements into account, which can provide a more biologically informative ranking of genes.
For more details and examples, see Chapter 27.
9.4 Dimension reduction (DR)
A standard next step is to perform PCA to the set of top HVGs; see Chapter 25. In addition, we’ll perform non-linear DR technique, UMAP, on a subset of PCs (default 50).
9.5 Clustering
Next we’ll run a spatially aware clustering, BayesSpace (Zhao et al. 2021), developed specifically for seq-based ST data; see also Chapter 26.
Code
.spe <- spatialPreprocess(spe, skip.PCA=TRUE)
.spe <- spatialCluster(.spe, nrep=1e3, burn.in=100, q=10, d=20)
table(spe$BayesSpace <- factor(.spe$spatial.cluster))
##
## 1 2 3 4 5 6 7 8 9 10
## 727 271 155 703 359 232 316 799 287 1143
9.6 Visualizations
Let’s visualize cluster predictions in space, here alongside ground_truth
spot labels from manual annotation:
Code
plotSpots(spe, annotate="ground_truth") +
plotSpots(spe, annotate="BayesSpace") +
plot_layout() &
theme(legend.key.size=unit(0, "lines")) &
scale_color_manual(values=unname(pals::trubetskoy()))
Inspecting the spatial distribution of the top PCs, we can observe that the main driver of expression variability is the distinction between white matter (WM) and non-WM brain layers:
Code
pcs <- reducedDim(spe, "PCA")
pcs <- pcs[, seq_len(4)]
lapply(colnames(pcs), \(.) {
spe[[.]] <- pcs[, .]
plotSpots(spe, annotate=.)
}) |>
wrap_plots(nrow=1) &
geom_point(shape=16, stroke=0, size=0.2) &
scale_color_gradientn(colors=pals::jet()) &
coord_equal() & theme_void() & theme(
legend.position="bottom",
plot.title=element_text(hjust=0.5),
legend.key.width=unit(0.8, "lines"),
legend.key.height=unit(0.4, "lines"))
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
9.7 Marker genes
Having clustered our spots, we can test for genes that are differentially expressed (DE) between domains (here, using pairwise t-tests and specifically testing for upregulation, i.e., expression should be higher in the cluster for which a gene is reported to be a marker); see Chapter 26.
Code
## [1] 67
We can visualize selected markers as a heatmap where bins represent the average expression (here, log-transformed library size-normalize counts) of a given gene (= columns) in a given cluster (= rows).
scale="column"
. This will, for every gene, subtract the mean and divide by the standard deviation across per-cluster means, bringing genes of potentially very different expression levels to a comparable scale.Code
pbs <- aggregateAcrossCells(spe,
ids=spe$BayesSpace, subset.row=top,
use.assay.type="logcounts", statistics="mean")
# use symbols as feature names
mtx <- t(assay(pbs))
colnames(mtx) <- rowData(pbs)$gene_name
pheatmap(mat=mtx, scale="column")
As well as in space, i.e., coloring spots by their expression of a given marker:
Code
ps <- lapply(
c("MBP", "PLP1", "NRGN", "SNAP25", "NEFL", "HPCAL1"), \(.)
plotSpots(spe, annotate=., feature_names="gene_name", assay_name="logcounts"))
wrap_plots(ps, nrow=2) & theme(
legend.key.width=unit(0.4, "lines"),
legend.key.height=unit(0.8, "lines")) &
scale_color_gradientn(colors=rev(hcl.colors(9, "Rocket")))
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.