library(SpatialExperiment)
library(here)
<- readRDS(here("outputs/spe_logcounts.rds")) spe
8 Feature selection
8.1 Overview
Here we apply feature selection methods to identify highly variable genes (HVGs) or spatially variable genes (SVGs), which can then be investigated individually or used as the input for further downstream analyses such as clustering.
8.2 Load data from previous steps
We start by loading the data object(s) saved after running the analysis steps from the previous chapters. Code to re-run the previous steps is shown in condensed form in Chapter 4.
8.3 Highly variable genes (HVGs)
We use methods from scran
(Lun, McCarthy, and Marioni 2016) to identify a set of top highly variable genes (HVGs), which can be used to define major cell types. These methods were originally developed for single-cell RNA sequencing (scRNA-seq) data, so here we are making the assumption that spots can be treated as equivalent to cells.
Note that HVGs are defined based only on molecular features (i.e. gene expression), and do not take any spatial information into account. If the biologically meaningful spatial information in this dataset mainly reflects spatial distributions of cell types, then relying on HVGs for downstream analyses may be sufficient. However, many datasets contain further spatial structure that is not captured in this way, which may be investigated using spatially-aware methods such as identifying spatially variable genes (SVGs).
To identify HVGs, we first remove mitochondrial genes, since these are very highly expressed in this dataset and are not of main biological interest.
# identify mitochondrial genes
<- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
is_mito table(is_mito)
is_mito
FALSE TRUE
33525 13
# remove mitochondrial genes
<- spe[!is_mito, ]
spe dim(spe)
[1] 33525 3524
Then, we apply methods from scran
. This gives us a list of HVGs, which can be used for further downstream analyses. The parameter prop
defines how many HVGs we want. For example prop = 0.1
returns the top 10% of genes.
library(scran)
# fit mean-variance relationship
<- modelGeneVar(spe)
dec
# visualize mean-variance relationship
<- metadata(dec)
fit plot(fit$mean, fit$var,
xlab = "mean of log-expression", ylab = "variance of log-expression")
curve(fit$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)
# select top HVGs
<- getTopHVGs(dec, prop = 0.1)
top_hvgs length(top_hvgs)
[1] 1438
8.4 Spatially variable genes (SVGs)
Alternatively, we can apply methods to identify spatially variable genes (SVGs) instead of HVGs. Here, we define SVGs as any genes with spatially correlated patterns of expression across the tissue area.
Several methods to identify SVGs in ST data have recently been developed, which each have various methodological and computational tradeoffs. These include:
nnSVG: available as an R package from Bioconductor and described by Weber et al. (2023)
SPARK-X: available as an R package from GitHub and described by Zhu, Sun, and Zhou (2021)
SPARK: available as an R package from GitHub and described by Sun, Zhu, and Zhou (2020)
SpatialDE: available as a Python package from GitHub and described by Svensson, Teichmann, and Stegle (2018)
Alternatively, standard statistical metrics such as Moran’s I statistic or Geary’s C statistic may also be used to rank genes by their observed spatial autocorrelation. However, the methods above tend to be more sensitive, since they have been developed for the specific properties of ST data.
8.4.1 nnSVG
Here, we demonstrate a short example showing how to identify a set of top SVGs using nnSVG (Weber et al. 2023). This method is available in Bioconductor and can be easily integrated into Bioconductor-based workflows.
In this example, we run nnSVG using a small subset of the dataset for faster runtime. We select a subset by subsampling on the set of spots and including stringent filtering for low-expressed genes. A full analysis using all spots for this dataset and default filtering parameters for Visium data from human brain tissue takes around 45 minutes for one Visium sample on a standard laptop.
library(nnSVG)
# subsample spots for faster runtime in this example
# note: skip this step in full analysis
<- 100
n set.seed(123)
<- sample(seq_len(n), n)
ix <- spe[, ix]
spe_nnSVG
# filter low-expressed and mitochondrial genes
# using stringent filtering for faster runtime in this example
# note: use default filtering in full analysis
<- filter_genes(
spe_nnSVG filter_genes_ncounts = 10, filter_genes_pcspots = 3
spe_nnSVG, )
Gene filtering: removing mitochondrial genes
removed 0 mitochondrial genes
Gene filtering: retaining genes with at least 10 counts in at least 3% (n = 3) of spatial locations
removed 33353 out of 33525 genes due to low expression
# re-calculate logcounts after filtering
<- logNormCounts(spe_nnSVG) spe_nnSVG
# run nnSVG
set.seed(123)
<- nnSVG(spe_nnSVG) spe_nnSVG
# investigate results
# show results
head(rowData(spe_nnSVG), 3)
DataFrame with 3 rows and 17 columns
gene_id gene_name feature_type sigma.sq
<character> <character> <character> <numeric>
ENSG00000074800 ENSG00000074800 ENO1 Gene Expression 0.00840777
ENSG00000171603 ENSG00000171603 CLSTN1 Gene Expression 0.43102027
ENSG00000162545 ENSG00000162545 CAMK2N1 Gene Expression 0.15373440
tau.sq phi loglik runtime mean var
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000074800 0.518011 15.82454 -109.808 0.019 1.75820 0.530929
ENSG00000171603 0.298698 18.12310 -123.212 0.021 2.07433 0.741077
ENSG00000162545 0.547736 2.38046 -119.576 0.014 2.63576 0.702846
spcov prop_sv loglik_lm LR_stat rank pval
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000074800 0.0521523 0.0159716 -109.735 -0.145454 167 1.0000000
ENSG00000171603 0.3164987 0.5906665 -126.409 6.394009 86 0.0408845
ENSG00000162545 0.1487576 0.2191603 -123.760 8.368951 70 0.0152302
padj
<numeric>
ENSG00000074800 1.0000000
ENSG00000171603 0.0817690
ENSG00000162545 0.0374227
# number of significant SVGs
table(rowData(spe_nnSVG)$padj <= 0.05)
FALSE TRUE
96 76
# show results for top n SVGs
rowData(spe_nnSVG)[order(rowData(spe_nnSVG)$rank)[1:6], ]
DataFrame with 6 rows and 17 columns
gene_id gene_name feature_type sigma.sq tau.sq
<character> <character> <character> <numeric> <numeric>
ENSG00000197971 ENSG00000197971 MBP Gene Expression 3.92430 0.175336
ENSG00000123560 ENSG00000123560 PLP1 Gene Expression 3.23544 0.461590
ENSG00000109846 ENSG00000109846 CRYAB Gene Expression 1.89968 0.281795
ENSG00000173786 ENSG00000173786 CNP Gene Expression 2.28793 0.402524
ENSG00000131095 ENSG00000131095 GFAP Gene Expression 2.23709 0.461297
ENSG00000160307 ENSG00000160307 S100B Gene Expression 1.24179 0.155737
phi loglik runtime mean var spcov
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000197971 0.93014 -118.372 0.022 3.78073 2.76535 0.523969
ENSG00000123560 1.03983 -140.269 0.018 2.86143 3.02555 0.628613
ENSG00000109846 1.88555 -126.692 0.017 1.86058 1.88123 0.740785
ENSG00000173786 1.02675 -129.206 0.018 1.79558 1.97274 0.842396
ENSG00000131095 2.49083 -149.004 0.018 1.94543 2.71281 0.768823
ENSG00000160307 5.76279 -129.722 0.015 1.82695 1.46946 0.609953
prop_sv loglik_lm LR_stat rank pval padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000197971 0.957231 -192.250 147.7556 1 0.00000e+00 0.00000e+00
ENSG00000123560 0.875146 -196.746 112.9549 2 0.00000e+00 0.00000e+00
ENSG00000109846 0.870824 -172.988 92.5906 3 0.00000e+00 0.00000e+00
ENSG00000173786 0.850388 -175.363 92.3137 4 0.00000e+00 0.00000e+00
ENSG00000131095 0.829047 -191.291 84.5739 5 0.00000e+00 0.00000e+00
ENSG00000160307 0.888562 -160.636 61.8281 6 3.75255e-14 1.07573e-12
# identify top-ranked SVG
rowData(spe_nnSVG)$gene_name[which(rowData(spe_nnSVG)$rank == 1)]
[1] "MBP"
8.4.2 Downstream analyses
The set of top SVGs from nnSVG may then be investigated further, e.g. by plotting the spatial expression of several top genes and by comparing the list of top genes with known gene sets associated with biological processes of interest in the dataset. The set of top SVGs may also be used as the input for further downstream analyses such as spatially-aware clustering to define spatial domains (see Chapter 10).