23 Spatially variable genes

23.1 Preamble

23.1.1 Introduction

Unsupervised clustering and label-transfer approaches yield (discrete, non-overlapping) grouping of cells by transcriptional – and presumably functional – similarity. Identifying differentially expressed genes (DEGs), i.e., genes that are up-/down-regulated in one or few subpopulation(s), helps characterize clusters (and, in an unsupervised setting, find biologically meaningful cluster labels).

By contrast, methods to identify spatially variable genes (SVGs) aim to find genes with spatially correlated patterns of expression. Feature selection in terms of SVGs can be used as a spatially-aware alternative to mean-variance relationship-based HVGs (c.f. Chapter 9), or to identify biologically informative genes as candidates for experimental follow-up.

Here, we demonstrate selected methods to identify SVGs de novo (using nnSVG), and based on pre-computed spatial clusters (using DESpace). We further compare these to DEGs as well as HVGs, and discuss the conceptual similarities and differences between genes identified through these different approaches.

23.1.2 Dependencies

Code

library(DESpace)
library(dplyr)
library(ggspavis)
library(ggrepel)
library(nnSVG)
library(patchwork)
library(scater)
library(scran)
library(SpatialExperiment)
library(tidyr)

Code

# set seed for random number generation
# in order to make results reproducible
set.seed(123)
# load data from previous chapters
# (post quality control & clustering)
spe <- readRDS("seq-spe_cl.rds")

23.2 Non-spatial

We can also use methods originally developed for bulk and single-cell transcriptomics, such as those for detecting highly variable genes (HVGs) and differentially expressed genes (DEGs), on spatial omics data. Note that HVGs are defined only based on molecular features (i.e., gene expression), and do not incorporate spatial information. In the context of single-cell and ST, differentially expressed (DE) may refer to differences between groups of samples, or to differences between clusters (e.g., cell types, or spatial domains); below, we consider changes across spatial structures.

23.2.1 Highly variable genes (HVGs)

For a comprehensive discussion on HVG selection, we refer readers to OSCA. Below, we take a standard approach for selecting HVGs. For more details and examples, see Chapter 9.

Code

# fit mean-variance relationship
dec <- modelGeneVar(spe)

23.2.2 Differentially expressed genes (DEGs)

For details on identifying DE genes between groups of cells from single-cell RNA-seq data, we refer readers to OSCA. Below, we show a potential approach, using marker detection methods on ST data, to identify DE genes between spatial clusters.

Code

# differential gene expression analysis
mgs <- findMarkers(spe, groups=spe$BayesSpace, direction="up")
# select for a few markers per cluster
deg <- lapply(mgs, \(df) rownames(df)[df$Top <= 3])
length(deg <- unique(unlist(deg)))

##  [1] 68

23.3 Spatially-aware

Here, we demonstrate brief examples of how to identify a set of top SVGs using (nnSVG (Weber et al. 2023) and DESpace (Cai, Robinson, and Tiberi 2024)). These methods are available through Bioconductor and can be easily integrated into Bioconductor-based workflows.

23.3.1 Spatially-variable genes (SVGs)

SVGs are usually identified integrating gene expression measurements with spatial coordinates, either with or without predefined spatial domains. SVGs approaches can be broadly categorized into three types: overall SVGs, spatial domain-specific SVGs, and cell type-specific SVGs (Yan, Hua, and Li 2024).

The detection of overall SVGs is sometimes used as a feature selection step for further downstream analyses such as spatially-aware clustering (see Chapter 22). Several methods have been proposed for detecting overall SVGs; below, we report some few notable examples:

spatialDE (Svensson, Teichmann, and Stegle 2018) and nnSVG (Weber et al. 2023), which are based on a Gaussian process model;
Moran’s I and Geary’s C, which ranks genes according to their observed spatial autocorrelation (see Chapter 25); and,
SPARK (Sun, Zhu, and Zhou 2020; Zhu, Sun, and Zhou 2021), which uses a non-parametric test of the covariance matrices of the spatial expression data.

Spatial domain-specific SVGs target changes in gene expression between spatial domains, which are used to summarize the whole spatial information. Genes displaying expression changes across spatial clusters indicate SVGs. Spatial domains can be predefined based on morphology knowledge, or identified through spatially-aware clustering approaches such as BayesSpace (Zhao et al. 2021) and Banksy (Singhal et al. 2024). Methods that belong to this category include DESpace (Cai, Robinson, and Tiberi 2024) in R, and SpaGCN (Hu et al. 2021) in Python.

Cell type-specific SVG methods leverage external cell type annotations to identify SVGs within cell types. They analyze interaction effects between cell types and spatial coordinates (Yan, Hua, and Li 2024). Examplary methods in R include CTSV (J. Yu and Luo 2022), C-SIDE (implemented in spacexr; (Cable et al. 2022)), and spVC (S. Yu and Li 2024).

23.3.2 nnSVG

In this example, we use a small subset of the dataset for faster runtime. We select a subset of the data, by subsampling the set of spots and including stringent filtering for lowly expressed genes.

A full analysis with nnSVG using all spots for this dataset and default filtering parameters for an individual Visium sample from human brain tissue (available from spatialLIBD) takes around 45 minutes on a standard laptop.

Code

# set random seed for number generation
# in order to make results reproducible
set.seed(123)
# sample 100 spots to decrease runtime in this demo
# (note: skip this step in full analysis)
n <- 100
sub <- spe[, sample(ncol(spe), 100)]
# filter lowly expressed genes using stringent
# criteria to decrease runtime in this demo
# (note: use default criteria in full analysis)
sub <- filter_genes(sub,     # filter for genes with...
    filter_genes_ncounts=10, # at least 10 counts in
    filter_genes_pcspots=3)  # at least 3% of spots
# re-normalize counts post-filtering
sub <- logNormCounts(sub)

# run nnSVG
set.seed(123)
sub <- nnSVG(sub)

# extract gene-level results
res_nnSVG <- rowData(sub)
# show results
head(res_nnSVG, 3)

##  DataFrame with 3 rows and 17 columns
##                          gene_id   gene_name    feature_type  sigma.sq
##                      <character> <character>     <character> <numeric>
##  ENSG00000162545 ENSG00000162545     CAMK2N1 Gene Expression  0.554287
##  ENSG00000117632 ENSG00000117632       STMN1 Gene Expression  0.273690
##  ENSG00000142937 ENSG00000142937        RPS8 Gene Expression  1.216073
##                       tau.sq       phi    loglik   runtime      mean
##                    <numeric> <numeric> <numeric> <numeric> <numeric>
##  ENSG00000162545 6.26826e-01   5.62281  -141.482     0.009   1.95696
##  ENSG00000117632 7.12078e-01   5.82105  -137.703     0.011   1.96579
##  ENSG00000142937 3.66418e-05  43.88534  -150.198     0.024   1.70346
##                        var     spcov   prop_sv loglik_lm   LR_stat      rank
##                  <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##  ENSG00000162545  1.125278  0.380440  0.469292  -147.293  11.62212        30
##  ENSG00000117632  0.994145  0.266129  0.277642  -141.098   6.78938        53
##  ENSG00000142937  1.221055  0.647365  0.999970  -151.377   2.35774        80
##                        pval      padj
##                   <numeric> <numeric>
##  ENSG00000162545 0.00299425 0.0108791
##  ENSG00000117632 0.03355091 0.0690009
##  ENSG00000142937 0.30762644 0.4158654

Code

# count significant SVGs (at 5% FDR significance level)
table(res_nnSVG$padj <= 0.05)

##  
##  FALSE  TRUE 
##     59    50

Code

# identify the top-ranked SVGs
res_nnSVG$gene_name[res_nnSVG$rank == 1]

##  [1] "MBP"

Code

# store the top 7 SVGs
top_nnSVG <- res_nnSVG$gene_name[res_nnSVG$rank < 7.5]

23.3.3 `DESpace`

DESapce relies on pre-computed spatial domains to summarize the primary spatial structures of the data; see Chapter 22 on clustering.

Code

plotCoords(spe, 
    annotate="BayesSpace") +
    theme(legend.key.size=unit(0, "lines")) +
    scale_color_manual(values=unname(pals::trubetskoy()))

Code

# run DESpace
res <- svg_test(spe, cluster_col="BayesSpace")
head(res_DESpace <- res$gene_results)

##                          gene_id        LR   logCPM PValue FDR
##  ENSG00000123560 ENSG00000123560 18353.979 11.96618      0   0
##  ENSG00000197971 ENSG00000197971 28398.191 12.63393      0   0
##  ENSG00000150625 ENSG00000150625  4282.059 11.43082      0   0
##  ENSG00000163536 ENSG00000163536  2643.441 10.77555      0   0
##  ENSG00000269028 ENSG00000269028  5911.076 11.75599      0   0
##  ENSG00000143013 ENSG00000143013  2987.734 10.69452      0   0

Code

# count significant SVGs (at 5% FDR significance level)
table(res_DESpace$FDR <= 0.05)

##  
##  FALSE  TRUE 
##   2833 12582

Downstream analyses

23.4 Downstream analyses

The set of top SVGs may be further investigated, e.g. by plotting the spatial expression of several top genes and via gene pathway analyses (i.e., comparing significant SVGs with known gene sets associated with specific biological functions).

Code

top_HVGs <- getTopHVGs(dec, n=(n <- 6))
top_DEGs <- lapply(mgs, \(df) rownames(df)[df$Top == 1])
top_DEGs <- unique(unlist(top_DEGs))[seq_len(6)]
top_DESpace <- res_DESpace$gene_id[seq_len(6)]

23.4.1 Visualization

To visualize the expression levels of selected genes in spatial coordinates on the tissue slide, we can use plotting functions from the ggspavis package.

Code

# get top SVGs from each method
gs <- list(
    HVGs=top_HVGs, 
    DEGs=top_DEGs, 
    DESpace=top_DESpace)
# get gene symbols from ensembl identifiers
idx <- match(unlist(gs), rowData(spe)$gene_id)
.gs <- rowData(spe)$gene_name[idx]
# expression plots for each top gene
ps <- lapply(seq_along(.gs), \(.) {
    plotCoords(spe, 
        point_size=0,
        annotate=.gs[.], 
        assay_name="logcounts", 
        feature_names="gene_name") + 
        if (. %% 6 == 1) list(
            ylab(names(gs)[ceiling(./6)]), 
            theme(axis.title.y=element_text()))
}) 
wrap_plots(ps, nrow=3) & theme(
    legend.key.width=unit(0.4, "lines"),
    legend.key.height=unit(0.8, "lines")) &
    scale_color_gradientn(colors = pals::parula())

23.4.2 Comparison

We can compare the ranks of the genes detected by each method and compute pairwise correlations, highlighting two known cortical layer-associated SVGs: MOBP and SNAP25. While all four methods - HVGs, DEGs, SVGs identified through nnSVG and DESpace - yield similar gene ranks, each method captures distinct aspects of gene expression, with varying pairwise correlations between them.

Code

# subset data for the 113 genes nnSVG is based on
sub_gene <- rowData(sub)$gene_id
sub_DESpace <- res_DESpace[sub_gene, ] 
sub_HVG <- as.data.frame(dec[sub_gene, ])

# aggregate DEGs for each spatial domain
subset_mgs <- lapply(mgs, \(x) x[sub_gene, 1:3] |> as.data.frame())
sub_DEG <- do.call(cbind, subset_mgs) 
top_cols <- sub_DEG |> select(ends_with(".Top"))
sub_DEG$Top <- do.call(pmin, c(top_cols, na.rm = TRUE))

# compute ranks 
sub_DESpace$rank <- rank(sub_DESpace$FDR, ties.method = "first")
sub_HVG$rank <- rank(-1 * sub_HVG$bio, ties.method = "first")
sub_DEG$rank <- rank(sub_DEG$Top, ties.method = "first")

# combine 'rank' column for each method
res_all <- list(
    nnSVG = as.data.frame(res_nnSVG),
    DESpace = sub_DESpace, HVGs = sub_HVG, DEGs = sub_DEG
)
rank_all <- do.call(cbind, lapply(res_all, `[[`, "rank"))
rank_all <- data.frame(gene_id = sub_gene, rank_all)

Code

# known SVGs for this data
known_genes <- c("MOBP", "SNAP25")
# method names
method_names <- c("nnSVG", "DESpace", "HVGs", "DEGs")
# convert data structure
rank_long <- rank_all |>
    left_join(data.frame(rowData(sub)[, c("gene_id", "gene_name")]), by = "gene_id") |>
    select(all_of(c("gene_name", method_names))) |>
    pivot_longer(cols = c(nnSVG, DESpace, HVGs, DEGs), 
                 names_to = "method", 
                 values_to = "rank")
# all pairwise comparisons
df_pairs <- t(combn(method_names, 2)) |> as.data.frame()
# function to plot each pairwise comparison
plot_pairwise_comparison <- function(m1, m2, df) {
  # filter the data for the two methods being compared
  df <- df |>
      filter(method %in% c(m1, m2)) |>
      pivot_wider(names_from = method, values_from = rank)
  # compute pearson correlation between methods
  cor_val <- cor(df[[m1]], df[[m2]], method = "spearman")
  # plot
  ggplot(df, aes(x = .data[[m1]], y = .data[[m2]])) +
      geom_point() + 
      geom_text_repel(data = df %>% filter(gene_name %in% known_genes), 
                      aes(label = gene_name), color = "red", size = 3.25, 
                      nudge_x = 50, nudge_y = 10, box.padding = 0.5) +
      labs(x = paste(m1, "rank"), y = paste(m2, "rank"),
           title = paste(m2, "vs.", m1, ": Cor = ", round(cor_val, 2))) +
      #scale_color_manual(values = c("darkorange", "firebrick3", "deepskyblue2")) +
      theme_bw() + coord_fixed() + 
      xlim(c(0, 120)) + ylim(c(0, 120))
}

# generate and display all pairwise comparison plots
plots <- lapply(seq_len(nrow(df_pairs)), \(i) {
    plot_pairwise_comparison(df_pairs[i, 2], df_pairs[i, 1],rank_long)
})

wrap_plots(plots, ncol = 3)

##  Warning: Removed 1 row containing missing values or values outside the scale range
##  (`geom_text_repel()`).

23.5 Appendix

Reviews

Das Adhikari et al. (2024)
Yan, Hua, and Li (2024)

Benchmarks

Li et al. (2023) compares 14 methods using 60 simulated datasets, generated using four different strategies, 12 experimental ST datasets, and 3 spatial ATAC-seq datasets.
Chen, Kim, and Yang (2024) compares 8 methods based on 22 experimental datasets (8 technologies, 6 tissue types) and 9 datasets simulated using scDesign3 (Song et al. 2024).

References

Cable, Dylan M, Evan Murray, Vignesh Shanmugam, Simon Zhang, Luli S Zou, Michael Diao, Haiqi Chen, Evan Z Macosko, Rafael A Irizarry, and Fei Chen. 2022. “Cell Type-Specific Inference of Differential Expression in Spatial Transcriptomics.” Nature Methods, 1–12.

Cai, Peiying, Mark D Robinson, and Simone Tiberi. 2024. “DESpace: Spatially Variable Gene Detection via Differential Expression Testing of Spatial Clusters.” Bioinformatics 40 (2): btae027.

Chen, Carissa, Hani Jieun Kim, and Pengyi Yang. 2024. “Evaluating Spatially Variable Gene Detection Methods for Spatial Transcriptomics Data.” Genome Biology 25 (1): 18.

Das Adhikari, Sikta, Jiaxin Yang, Jianrong Wang, and Yuehua Cui. 2024. “Recent Advances in Spatially Variable Gene Detection in Spatial Transcriptomics.” Computational and Structural Biotechnology Journal 23: 883–91.

Hu, Jian, Xiangjie Li, Kyle Coleman, Amelia Schroeder, Nan Ma, David J. Irwin, Edward B. Lee, Russell T. Shinohara, and Mingyao Li. 2021. “SpaGCN: Integrating Gene Expression, Spatial Location and Histology to Identify Spatial Domains and Spatially Variable Genes by Graph Convolutional Network.” Nature Methods 18: 1342–51. https://doi.org/10.1038/s41592-021-01255-8.

Li, Zhijian, Zain M Patel, Dongyuan Song, Guanao Yan, Jingyi Jessica Li, and Luca Pinello. 2023. “Benchmarking Computational Methods to Identify Spatially Variable Genes and Peaks.” bioRxiv, 2023.12.02.569717.

Singhal, Vipul, Nigel Chou, Joseph Lee, Yifei Yue, Jinyue Liu, Wan Kee Chock, Li Lin, et al. 2024. “BANKSY Unifies Cell Typing and Tissue Domain Segmentation for Scalable Spatial Omics Data Analysis.” Nature Genetics 56 (3): 431–41.

Song, Dongyuan, Qingyang Wang, Guanao Yan, Tianyang Liu, Tianyi Sun, and Jingyi Jessica Li. 2024. “scDesign3 Generates Realistic in Silico Data for Multimodal Single-Cell and Spatial Omics.” Nature Biotechnology 42 (2): 247–52.

Sun, Shiquan, Jiaqiang Zhu, and Xiang Zhou. 2020. “Statistical Analysis of Spatial Expression Patterns for Spatially Resolved Transcriptomic Studies.” Nature Methods 17: 193–200. https://doi.org/10.1038/s41592-019-0701-7.

Svensson, Valentine, Sarah A. Teichmann, and Oliver Stegle. 2018. “SpatialDE: Identification of Spatially Variable Genes.” Nature Methods 15: 343–46. https://doi.org/10.1038/nmeth.4636.

Weber, Lukas M., Arkajyoti Saha, Abhirup Datta, Kasper D. Hansen, and Stephanie C. Hicks. 2023. “nnSVG for the Scalable Identification of Spatially Variable Genes Using Nearest-Neighbor Gaussian Processes.” Nature Communications 14: 4059. https://doi.org/10.1038/s41467-023-39748-z.

Yan, Guanao, Shuo Harper Hua, and Jingyi Jessica Li. 2024. “Categorization of 33 Computational Methods to Detect Spatially Variable Genes from Spatially Resolved Transcriptomics Data.” arXiv, May.

Yu, Jinge, and Xiangyu Luo. 2022. “Identification of Cell-Type-Specific Spatially Variable Genes Accounting for Excess Zeros.” Bioinformatics 38 (17): 4135–44.

Yu, Shan, and Wei Vivian Li. 2024. “spVC for the Detection and Interpretation of Spatial Gene Expression Variation.” Genome Biology 25 (1): 103.

Zhao, Edward, Matthew R. Stone, Xing Ren, Jamie Guenthoer, Kimberly S. Smythe, Thomas Pulliam, Stephen R. Williams, et al. 2021. “Spatial Transcriptomics at Subspot Resolution with BayesSpace.” Nature Biotechnology 39: 1375–84. https://doi.org/10.1038/s41587-021-00935-2.

Zhu, Jiaqiang, Shiquan Sun, and Xiang Zhou. 2021. “SPARK-X: Non-Parametric Modeling Enables Scalable and Robust Detection of Spatial Expression Patterns for Large Spatial Transcriptomic Studies.” Genome Biology 22: 184. https://doi.org/10.1186/s13059-021-02404-0.

23.1 Preamble

23.1.1 Introduction

23.1.2 Dependencies

23.2 Non-spatial

23.2.1 Highly variable genes (HVGs)

23.2.2 Differentially expressed genes (DEGs)

23.3 Spatially-aware

23.3.1 Spatially-variable genes (SVGs)

23.3.2 nnSVG

23.3.3 DESpace

23.4 Downstream analyses

23.4.1 Visualization

23.4.2 Comparison

23.5 Appendix

Reviews

Benchmarks

References

23.3.3 `DESpace`