28 Gene/feature-set signatures
28.1 Preamble
28.1.1 Introduction
Differential gene expression (DGE) analysis between cell subpopulations lets us identify marker genes, i.e., sets of genes that distinguish one group of cells from another. Clustering, however, will always give clusters, and DGE analysis may or may not yield functionally relevant genes (e.g., phenotypic markers).
We can instead quantify sets of genes known to orchestrate biological functions or pathways (e.g., metabolic activity, cell cycle and death), both at level of single cells and subpopulations (i.e., pseudo-bulk profiles). Approaches to do this rely on databases of transcription factor binding sites, gene regulatory networks, or annotated gene sets.
28.1.2 Dependencies
Code
(spe <- readRDS("img-spe_cl.rds"))
## class: SpatialExperiment
## dim: 313 160154
## metadata(6): experiment.xenium transcripts ... technology
## BANKSY_params
## assays(3): counts logcounts H0
## rownames(313): ABCC11 ACTA2 ... ZEB2 ZNF562
## rowData names(3): ID Symbol Type
## colnames(160154): 1 2 ... 167779 167780
## colData names(45): cell_id transcript_counts ... in_tissue label
## reducedDimNames(2): PCA_tx PCA_sp
## mainExpName: NULL
## altExpNames(4): NegControlProbe NegControlCodeword antisense BLANK
## spatialCoords names(2) : x_centroid y_centroid
## imgData names(1): sample_id
28.2 Set scoring
MSigDB provides a programmatic interface to the molecular signatures database (MSigDB), one of the largest collections of molecular signatures and pathways.
Here, we query of human hallmark genes sets:
Code
## [1] 50
## [1] 34 335
Next, we score these with AUCell (Aibar et al. 2017), a rank-based approach that (i) ranks genes for every observation (here, cells), and (ii) compute AUC values for each gene set. These represent the fraction of genes among top-ranked genes (default 5%) that are contained in a given set; i.e., values are in [0,1] and high values = high activity.
Code
# realize (sparse) gene expression matrix
mtx <- as(logcounts(spe), "dgCMatrix")
# use ensembl identifiers as rownames
rownames(mtx) <- rowData(spe)$ID
# filter for genes represented in panel
.gs <- lapply(gs, intersect, rownames(mtx))
# keep only those with at least 5 genes
.gs <- .gs[sapply(.gs, length) >= 5]
# build per-cell gene rankings
rnk <- AUCell_buildRankings(mtx, BPPARAM=bp, plotStats=FALSE, verbose=FALSE)
# calculate AUC for each gene set in each cell
auc <- AUCell_calcAUC(geneSets=.gs, rankings=rnk, nCores=th, verbose=FALSE)
# add results as cell metadata
colData(spe)[rownames(auc)] <- t(assay(auc))
28.2.1 Exploratory
Let’s inspect the percentage of cells with non-zero score across signatures:
## heme_metabolism adipogenesis
## 7.42 11.42
## kras_signaling_dn il2_stat5_signaling
## 14.26 16.58
## inflammatory_response interferon_gamma_response
## 23.29 23.70
Code
tail(fq) # mostly detected
## mtorc1_signaling tnfa_signaling_via_nfkb estrogen_response_late
## 67.33 69.13 80.21
## estrogen_response_early uv_response_dn apoptosis
## 80.86 82.05 93.59
We might also view how signature scores correlated with one another. Besides the underlying biology, this will be influenced by the number of genes overlapping between sets (here, we are only capturing a little over 300 RNA targets!).
Code
Highly correlated sets will exhibit similar spatial patterns. Let’s visualize some examples; here, hypoxia and myogenesis are spatially exclusive (they are negatively correlated and occupy separated ‘blocks’ in the correlation matrix displayed above), while apoptosis is ubiquitous (detected in >90% of cells).
Code
spe$in_tissue <- 1
lapply(c("hypoxia", "apoptosis", "myogenesis"), \(.) {
q <- quantile(x <- spe[[.]], c(0.01, 0.99))
x <- (x-q[1])/diff(q) # 01-quantile scaling
x[x < 0] <- 0; x[x > 1] <- 1; spe[[.]] <- x
# TODO: fix, cuz 'ggspaviz' is messing sth up...
xy <- spatialCoordsNames(spe)
xy <- match(xy, names(colData(spe)), nomatch=0)
if (any(xy != 0)) colData(spe) <- colData(spe)[-xy]
plt <- plotSpots(spe, annotate=.)
plt$layers[[1]]$aes_params$stroke <- 0
plt$layers[[1]]$aes_params$size <- 0.2
plt
}) |>
wrap_plots(nrow=1, guides="collect") &
scale_color_gradientn(
"q-scaled\nAUCell",
colors=hcl.colors(9, "Plasma")) &
theme(
legend.key.height=unit(1, "lines"),
legend.key.width=unit(0.5, "lines"),
panel.background=element_rect(fill="black"))
Lastly, we can aggregate signature scores by cell subpopulations. This can help us functionally characterize subpopulations that stem from, e.g., unsupervised approaches or in cases where the cells under study are poorly characterized.
Code
# aggregate AUC values by cluster
mu <- aggregateAcrossCells(auc, ids=colLabels(spe), use.assay.type="AUC", statistics="mean")
# visualize as (set x cluster) heatmap
pheatmap(mat=assay(mu), scale="row", col=pals::coolwarm())
In a multi-sample setting - say, where sections across different stages of development, treatments, or health and disease - signatures may be compared between experimental conditions. Ideally, this’d be done at the subpopulation level, since compositional differences are deemed to drive differences (e.g., sections richer in CD8+ T cells are, by chance, more likely to score higher for cytotoxicity than sections where CD8+ T cells are absent, or similar).