24 Feature selection & testing
24.1 Preamble
24.1.1 Introduction
Unsupervised clustering and label-transfer approaches yield (discrete, non-overlapping) grouping of cells by transcriptional – and presumably functional – similarity. Identifying differentially expressed genes (DEGs), i.e., genes that are up-/down-regulated in one or few subpopulation(s), helps characterize clusters (and, in an unsupervised setting, find biologically meaningful cluster labels).
By contrast, methods to identify spatially variable genes (SVGs) aim to find genes with spatially correlated patterns of expression. Feature selection in terms of SVGs can be used as a spatially-aware alternative to mean-variance relationship-based HVGs (see Chapter 10), or to identify biologically informative genes as candidates for experimental follow-up.
Here, we demonstrate selected methods to identify SVGs de novo (using nnSVG
), and based on pre-computed spatial clusters (using DESpace
). We further compare these to DEGs as well as HVGs, and discuss the conceptual similarities and differences between genes identified through these different approaches.
24.1.2 Dependencies
24.2 Non-spatial
We can also use methods originally developed for bulk and single-cell transcriptomics, such as those for detecting highly variable genes (HVGs) and differentially expressed genes (DEGs), on spatial omics data. Note that HVGs are defined only based on molecular features (i.e., gene expression), and do not incorporate spatial information. In the context of single-cell and ST, differentially expressed (DE) may refer to differences between groups of samples, or to differences between clusters (e.g., cell types, or spatial domains); below, we consider changes across spatial structures.
24.2.1 Highly variable genes (HVGs)
Feature selection is used as a preprocessing step to identify a subset of biologically informative features (e.g. genes), in order to reduce noise (due to both technical and biological factors) and improve computational performance.
Identifying a set of top “highly variable genes” (HVGs) is a standard step for feature selection in many scRNA-seq workflows, which can also be used as a simple and fast baseline approach in spot-based ST data, or high-plex imaging based ST data; for the former, this makes the simplified assumption that spots can be treated as equivalent to cells.
The set of top HVGs can then be used as the input for subsequent steps, such as dimensionality reduction and clustering. For a comprehensive discussion on HVG selection, we refer readers to OSCA.
In this example, we use the scran package (Lun, McCarthy, and Marioni 2016) to identify HVGs in a two-step procedure. We first model the mean-variance relationship, which decomposes variance into a technical component (smooth fit) and biological component (deviation thereof). Secondly, HVGs may be selected based on their rank, using a fixed number n
or proportion p
of genes:
FDR.threshold
on the significance of large variances relative to other genes.Code
# fit mean-variance relationship, decomposing
# variance into technical & biological components
dec <- modelGeneVar(spe)
# select top 2,000 ranked genes (in terms
# of their biological variance component)
hvg <- getTopHVGs(dec, n=2e3)
Code
24.2.2 Differentially expressed genes (DEGs)
Having clustered our spots (or cells), we can test for differentially expressed genes (DEGs) between clusters. These can be interpreted as marker genes and used to interpret clusters in terms of their function, and to help annotate them (i.e. assign biologically meaningful labels).
For details on identifying DE genes between groups of cells from scRNA-seq data, we refer readers to OSCA.
Here, we will use pairwise t-tests and specifically test for upregulation (as opposed to downregulation), i.e. expression should be higher in the cluster for which a gene is reported to be a marker; see Chapter 23 for details.
Code
## [1] 61
We can visualize selected marker genes as a heatmap where bins represent the average expression (here, log-transformed library size-normalized counts) of a given gene (= columns) in a given cluster (= rows).
scale = "column"
. This will, for every gene, subtract the mean and divide by the standard deviation across per-cluster means, bringing genes of potentially very different expression levels to a comparable scale.Code
# compute cluster-wise averages
pbs <- aggregateAcrossCells(spe,
ids = spe$BayesSpace, subset.row = deg,
use.assay.type = "logcounts", statistics = "mean")
# use gene symbols as feature names
mtx <- t(assay(pbs))
colnames(mtx) <- rowData(pbs)$gene_name
# using pheatmap package
pheatmap(mat = mtx, scale = "column")
In addition, we can plot in x-y space, i.e. coloring spots by their expression of a given marker gene:
Code
# select top-3 markers for each cluster & get gene symbols
gs <- unique(unlist(lapply(mgs, \(df) head(rownames(df), 3))))
gs <- rowData(spe)$gene_name[match(gs, rownames(spe))]
# gene-wise spatial plots
ps <- lapply(gs, \(.) {
plotCoords(spe,
annotate = .,
feature_names = "gene_name",
assay_name = "logcounts") })
# figure arrangement
wrap_plots(ps, nrow = 4) &
theme(legend.key.width = unit(0.4, "lines"),
legend.key.height = unit(0.8, "lines")) &
scale_color_gradientn(colors = rev(hcl.colors(9, "Rocket")))
24.3 Spatially-aware
Here, we demonstrate brief examples of how to identify a set of top SVGs using (nnSVG (Weber et al. 2023) and DESpace (Cai, Robinson, and Tiberi 2024)). These methods are available through Bioconductor and can be easily integrated into Bioconductor-based workflows.
24.3.1 Spatially-variable genes (SVGs)
SVGs are usually identified integrating gene expression measurements with spatial coordinates, either with or without predefined spatial domains. SVGs approaches can be broadly categorized into three types: overall SVGs, spatial domain-specific SVGs, and cell type-specific SVGs (Yan, Hua, and Li 2024).
The detection of overall SVGs is sometimes used as a feature selection step for further downstream analyses such as spatially-aware clustering (see Chapter 23). Several methods have been proposed for detecting overall SVGs; below, we report some few notable examples:
- spatialDE (Svensson, Teichmann, and Stegle 2018) and nnSVG (Weber et al. 2023), which are based on a Gaussian process model;
- Moran’s I and Geary’s C, which ranks genes according to their observed spatial autocorrelation (see Chapter 26); and,
- SPARK (Sun, Zhu, and Zhou 2020; Zhu, Sun, and Zhou 2021), which uses a non-parametric test of the covariance matrices of the spatial expression data.
Spatial domain-specific SVGs target changes in gene expression between spatial domains, which are used to summarize the whole spatial information. Genes displaying expression changes across spatial clusters indicate SVGs. Spatial domains can be predefined based on morphology knowledge, or identified through spatially-aware clustering approaches such as BayesSpace (Zhao et al. 2021) and Banksy (Singhal et al. 2024). Methods that belong to this category include DESpace (Cai, Robinson, and Tiberi 2024) in R, and SpaGCN (Hu et al. 2021) in Python.
Cell type-specific SVG methods leverage external cell type annotations to identify SVGs within cell types. They analyze interaction effects between cell types and spatial coordinates (Yan, Hua, and Li 2024). Examplary methods in R include CTSV (J. Yu and Luo 2022), C-SIDE (implemented in spacexr; (Cable et al. 2022)), and spVC (S. Yu and Li 2024).
24.3.2 nnSVG
In this example, we use a small subset of the dataset for faster runtime. We select a subset of the data, by subsampling the set of spots and including stringent filtering for lowly expressed genes.
nnSVG
using all spots for this dataset and default filtering parameters for an individual Visium sample from human brain tissue (available from spatialLIBD) takes around 45 minutes on a standard laptop.Code
# set random seed for number generation
# in order to make results reproducible
set.seed(123)
# sample 100 spots to decrease runtime in this demo
# (note: skip this step in full analysis)
n <- 100
sub <- spe[, sample(ncol(spe), 100)]
# filter lowly expressed genes using stringent
# criteria to decrease runtime in this demo
# (note: use default criteria in full analysis)
sub <- filter_genes(sub, # filter for genes with...
filter_genes_ncounts=10, # at least 10 counts in
filter_genes_pcspots=3) # at least 3% of spots
# re-normalize counts post-filtering
sub <- logNormCounts(sub)
# run nnSVG
set.seed(123)
sub <- nnSVG(sub)
# extract gene-level results
res_nnSVG <- rowData(sub)
# show results
head(res_nnSVG, 3)
## DataFrame with 3 rows and 18 columns
## gene_id gene_name feature_type subsets_mito
## <character> <character> <character> <logical>
## ENSG00000171603 ENSG00000171603 CLSTN1 Gene Expression FALSE
## ENSG00000162545 ENSG00000162545 CAMK2N1 Gene Expression FALSE
## ENSG00000117632 ENSG00000117632 STMN1 Gene Expression FALSE
## sigma.sq tau.sq phi loglik runtime
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000171603 0.141341 7.45588e-01 8.46518 -134.893 0.011
## ENSG00000162545 0.815466 8.45972e-07 40.91220 -129.116 0.017
## ENSG00000117632 0.388235 5.47433e-01 6.11768 -131.196 0.009
## mean var spcov prop_sv loglik_lm LR_stat
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000171603 2.04692 0.890887 0.183668 0.159360 -135.614 1.44197
## ENSG00000162545 2.49986 0.839524 0.361233 0.999999 -132.645 7.05809
## ENSG00000117632 2.65231 0.899943 0.234922 0.414928 -136.120 9.84819
## rank pval padj
## <numeric> <numeric> <numeric>
## ENSG00000171603 111 0.48627310 0.6352216
## ENSG00000162545 54 0.02933290 0.0787643
## ENSG00000117632 43 0.00726929 0.0245127
Code
# count significant SVGs (at 5% FDR significance level)
table(res_nnSVG$padj <= 0.05)
##
## FALSE TRUE
## 96 49
Code
# identify the top-ranked SVGs
res_nnSVG$gene_name[res_nnSVG$rank == 1]
## [1] "MBP"
Code
# store the top 7 SVGs
top_nnSVG <- res_nnSVG$gene_name[res_nnSVG$rank < 7.5]
24.3.3 DESpace
DESapce
relies on pre-computed spatial domains to summarize the primary spatial structures of the data; see Chapter 23 on clustering.
Code
plotCoords(spe,
annotate="BayesSpace") +
theme(legend.key.size=unit(0, "lines")) +
scale_color_manual(values=unname(pals::trubetskoy()))
Code
## gene_id LR logCPM PValue FDR
## ENSG00000104722 ENSG00000104722 1575.194 10.06849 0 0
## ENSG00000277586 ENSG00000277586 2663.015 10.77511 0 0
## ENSG00000139970 ENSG00000139970 2447.830 11.30299 0 0
## ENSG00000014641 ENSG00000014641 1748.432 10.62861 0 0
## ENSG00000170027 ENSG00000170027 1645.794 10.85739 0 0
## ENSG00000123416 ENSG00000123416 2618.592 11.76458 0 0
Code
# count significant SVGs (at 5% FDR significance level)
table(res_DESpace$FDR <= 0.05)
##
## FALSE TRUE
## 4895 10241
24.4 Downstream analyses
The set of top SVGs may be further investigated, e.g. by plotting the spatial expression of several top genes and via gene pathway analyses (i.e., comparing significant SVGs with known gene sets associated with specific biological functions).
24.4.1 Visualization
To visualize the expression levels of selected genes in spatial coordinates on the tissue slide, we can use plotting functions from the ggspavis package.
Code
# get top SVGs from each method
gs <- list(
HVGs=top_HVGs,
DEGs=top_DEGs,
DESpace=top_DESpace)
# get gene symbols from ensembl identifiers
idx <- match(unlist(gs), rowData(spe)$gene_id)
.gs <- rowData(spe)$gene_name[idx]
# expression plots for each top gene
ps <- lapply(seq_along(.gs), \(.) {
plotCoords(spe,
point_size=0,
annotate=.gs[.],
assay_name="logcounts",
feature_names="gene_name") +
if (. %% 6 == 1) list(
ylab(names(gs)[ceiling(./6)]),
theme(axis.title.y=element_text()))
})
wrap_plots(ps, nrow=3) & theme(
legend.key.width=unit(0.4, "lines"),
legend.key.height=unit(0.8, "lines")) &
scale_color_gradientn(colors = pals::parula())
24.4.2 Comparison
We can compare the ranks of the genes detected by each method and compute pairwise correlations, highlighting two known cortical layer-associated SVGs: MOBP and SNAP25. While all four methods - HVGs, DEGs, SVGs identified through nnSVG
and DESpace
- yield similar gene ranks, each method captures distinct aspects of gene expression, with varying pairwise correlations between them.
Code
# subset data for the 113 genes nnSVG is based on
sub_gene <- rowData(sub)$gene_id
sub_DESpace <- res_DESpace[sub_gene, ]
sub_HVG <- as.data.frame(dec[sub_gene, ])
# aggregate DEGs for each spatial domain
subset_mgs <- lapply(mgs, \(x) x[sub_gene, 1:3] |> as.data.frame())
sub_DEG <- do.call(cbind, subset_mgs)
top_cols <- sub_DEG |> select(ends_with(".Top"))
sub_DEG$Top <- do.call(pmin, c(top_cols, na.rm = TRUE))
# compute ranks
sub_DESpace$rank <- rank(sub_DESpace$FDR, ties.method = "first")
sub_HVG$rank <- rank(-1 * sub_HVG$bio, ties.method = "first")
sub_DEG$rank <- rank(sub_DEG$Top, ties.method = "first")
# combine 'rank' column for each method
res_all <- list(
nnSVG = as.data.frame(res_nnSVG),
DESpace = sub_DESpace, HVGs = sub_HVG, DEGs = sub_DEG
)
rank_all <- do.call(cbind, lapply(res_all, `[[`, "rank"))
rank_all <- data.frame(gene_id = sub_gene, rank_all)
Code
# known SVGs for this data
known_genes <- c("MOBP", "SNAP25")
# method names
method_names <- c("nnSVG", "DESpace", "HVGs", "DEGs")
# convert data structure
rank_long <- rank_all |>
left_join(data.frame(rowData(sub)[, c("gene_id", "gene_name")]), by = "gene_id") |>
select(all_of(c("gene_name", method_names))) |>
pivot_longer(cols = c(nnSVG, DESpace, HVGs, DEGs),
names_to = "method",
values_to = "rank")
# all pairwise comparisons
df_pairs <- t(combn(method_names, 2)) |> as.data.frame()
# function to plot each pairwise comparison
plot_pairwise_comparison <- function(m1, m2, df) {
# filter the data for the two methods being compared
df <- df |>
filter(method %in% c(m1, m2)) |>
pivot_wider(names_from = method, values_from = rank)
# compute pearson correlation between methods
cor_val <- cor(df[[m1]], df[[m2]], method = "spearman")
# plot
ggplot(df, aes(x = .data[[m1]], y = .data[[m2]])) +
geom_point() +
geom_text_repel(data = df %>% filter(gene_name %in% known_genes),
aes(label = gene_name), color = "red", size = 3.25,
nudge_x = 50, nudge_y = 10, box.padding = 0.5) +
labs(x = paste(m1, "rank"), y = paste(m2, "rank"),
title = paste(m2, "vs.", m1, ": Cor = ", round(cor_val, 2))) +
#scale_color_manual(values = c("darkorange", "firebrick3", "deepskyblue2")) +
theme_bw() + coord_fixed() +
xlim(c(0, 120)) + ylim(c(0, 120))
}
# generate and display all pairwise comparison plots
plots <- lapply(seq_len(nrow(df_pairs)), \(i) {
plot_pairwise_comparison(df_pairs[i, 2], df_pairs[i, 1],rank_long)
})
wrap_plots(plots, ncol = 3)
24.5 Appendix
Reviews
Benchmarks
- Li et al. (2023) compares 14 methods using 60 simulated datasets, generated using four different strategies, 12 experimental ST datasets, and 3 spatial ATAC-seq datasets.
- Chen, Kim, and Yang (2024) compares 8 methods based on 22 experimental datasets (8 technologies, 6 tissue types) and 9 datasets simulated using scDesign3 (Song et al. 2024).