In this demo, we will be analyzing Visium data on a human colorectal cancer biopsy from Oliveira et al. (2024) (Chapter 6). Rather than recapitulating all possible analyses, our goal is to highlight those that might be of particular interest in the context of these data.
# determine outliers via thresholding on MAD from the medianol<-perCellQCFilters(spe, sub.fields="subsets_mt_percent")# add results as cell metadatacolData(spe)[names(ol)]<-ol# tabulate # and % of cells that'd # be discarded for different reasonsdata.frame( check.names=FALSE, `#`=apply(ol, 2, sum), `%`=round(100*apply(ol, 2, mean), 2))
It seems like low-quality spots are highly spatially organized, so that might might decide to refrain from removing them, for now. We will see further below how the quality control metrics used here, and spots deemed to be discarded, are distributed across (transcription-based) clusters.
In a complementary approach, we deconvolute spot measurements using (annotated) reference single-cell data provided by the authors. Let’s first retrieve these data, alongside corresponding cell metadata, which includes low- (Level1) and high-resolution (Level2) annotations into 10 and 32 subpopulations, respectively.
Next, we perform deconvolution with spacexr (f.k.a. RCTD)(Cable et al. 2022). By default, runRctd()’s rctd_mode="doublet", i.e., at most two subpopulations are fit per pixel; here, we set rctd_mode="full" in order to allow for an arbitrary number of subpopulations to be fit instead.
Here, we filter the reference data to contain only cells from the same patient, and downsample to retain a limited number of cells per subpopulation. This is not strictly necessary, assuming that clusters are transcriptionally stable across patients, but helps with reducing runtime and memory consumption here.
Code
# prep reference data (Chromium);# subset cells from same patient.sce<-sce[, grepl("P2", sce$Patient)]# downsample to at most 4,000 cells per clustercs<-split(seq_len(ncol(.sce)), .sce$Level1)cs<-lapply(cs, \(.)sample(., min(length(.), 4e3))).sce<-.sce[, unlist(cs)]# run 'RCTD' deconvolutionobj<-createRctd(spe, .sce, cell_type_col="Level1", max_cores=10)(res<-runRctd(obj, rctd_mode="full"))
Weights inferred by RCDT should be normalized such that proportions of cell types sum to 1 in each spot:
Code
# scale weights such that they sum to 1ws<-assay(res)ws<-sweep(ws, 2, colSums(ws), `/`)# add proportion estimates as metadataws<-data.frame(t(as.matrix(ws)))colData(spe)[names(ws)]<-ws[colnames(spe), ]
For comparison with unsupervised clustering (SNN-based Leiden), we also include assignments we would obtain if we were to assign spots the most frequent label (in terms of deconvolution estimates):
To help characterize subpopulations from unsupervised clustering, we can view their distribution across deconvolution-based clusters and broad domains; e.g., tumor spots are quite divers, while smooth muscle spots and (normal) epithelia map almost completely to a single cluster:
Let’s inspect the key drivers of (expression) variability in terms of PCs. Considering clustering and deconvolution results from above, we can see that
PC1 distinguishes stromal from both normal and malignant epithelia;
PC2 clearly separates (normal) intestinal epithelium from all else;
PC3 captures a fibroblast-rich region, and normal epithelia;
PC5 separates fibroblasts and smooth muscle cells; etc.
Quality control metrics tend to be low for specific clusters. Their patch-like pattern, in turn, explains the clustering of low-quality spots seen earlier.
Rather than investigating single genes, we can also evaluate the expression of sets of genes (e.g., pathway signatures); e.g., malignant tissue may differ in metabolic activity such as glycolysis and fatty acid metabolism, or exhibit increased apoptosis (cell death) etc. Here, we retrieve hallmark gene sets for some biological phenomena from MSigDB, using the msigdbr package:
Note that such gene lists like these may come from many different places - e.g., in-house analyses, other publications befitting the research question etc. As such, they may also be read into are from a .csv file, or stem from upstream analyses. In any case, they should be curated well and interpreted with caution.
Code
# retrieve hallmark gene sets from 'MSigDB'db<-msigdbr(species="Homo sapiens", category="H")# get list of gene symbols, one element per setgs<-split(db$ensembl_gene, db$gs_name)# simplify set identifiers (drop prefix, use lower case)names(gs)<-tolower(gsub("HALLMARK_", "", names(gs)))# how many sets?length(gs)
## [1] 50
Code
# how many genes in each?range(sapply(gs, length))
## [1] 34 335
Next, we will score these using AUCell(Aibar et al. 2017), which works in two steps: (i) rank genes for every observation (here, spots), and (ii) compute AUC values for each gene set. In essence, these represent the fraction of genes (within top-ranked genes; default 5%) that are in a given set; i.e., high values correspond to high activity (in terms of coordinated gene expression).
By definition, AUCell yields values in [0,1]. Larger sets (more genes) are more likely to achieve higher scores by chance (e.g., a gene set of all genes would score 1 in any dataset). It is thus unfair to compare them directly. However, spatial distribution, correlation, and relative comparisons between subpopulations etc. are still meaningful.
Code
# realize (sparse) gene expression matrixmtx<-as(logcounts(spe), "dgCMatrix")# use ensembl identifiers as feature namesrownames(mtx)<-rowData(spe)$ID# build per-spot gene rankingsrnk<-AUCell_buildRankings(mtx, BPPARAM=bp, plotStats=FALSE, verbose=FALSE)# calculate AUC for each gene set in each spotauc<-AUCell_calcAUC(geneSets=gs, rankings=rnk, nCores=th, verbose=FALSE)# add results as spot metadatacolData(spe)[rownames(auc)]<-res<-t(assay(auc))
For simplicity, we’ll continue investigating only those signatures with the highest score variability across spots:
Code
var<-colVars(res)# variance across spotstop<-names(tail(sort(var), 8))# top sets)
To summarize, MYC signalling is absent in stromal regions; the fibroblast ring surrounding a cancerous patch exhibits EMT, angiogenesis, etc.; INFa response and TNFa signalling is patch-like in both stroma and malignant epithelia.
To ease interpretability, we can stratify AUCell scores by spot labels; these may stem from an unsupervised or deconvolution-based approach:
Code
for(.inc("Leiden", "RCTD")){# aggregate AUC values by clustermu<-aggregateAcrossCells(auc[top, ], spe[[.]], use.assay.type="AUC", statistics="mean")# visualize as (cluster x set) heatmappheatmap( mat=t(assay(mu)), scale="column", col=pals::coolwarm(), main=., cellwidth=10, cellheight=10, treeheight_row=5, treeheight_col=5)}
For the latter, we may instead correlate set scores with proportion estimates (rather than discretizing labels according to the dominant subpopulation):
Code
# correlate 'AUCell' signature scores with subpopulation# proportion estimates from deconvolution with 'RCTD'cm<-cor(as.matrix(ws), t(assay(auc[top, ])))
Aibar, Sara, Carmen Bravo González-Blas, Thomas Moerman, Vân Anh Huynh-Thu, Hana Imrichova, Gert Hulselmans, Florian Rambow, et al. 2017. “SCENIC: Single-Cell Regulatory Network Inference and Clustering.”Nature Methods 14 (11): 1083–86.
Cable, Dylan M., Evan Murray, Luli S. Zou, Aleksandrina Goeva, Evan Z. Macosko, Fei Chen, and Rafael A. Irizarry. 2022. “Robust Decomposition of Cell Type Mixtures in Spatial Transcriptomics.”Nature Biotechnology 40 (4): 517–26. https://doi.org/10.1038/s41587-021-00830-w.
Oliveira, Michelli F., Juan P. Romero, Meii Chung, Stephen Williams, Andrew D. Gottscho, Anushka Gupta, Susan E. Pilipauskas, et al. 2024. “Characterization of Immune Cell Populations in the Tumor Microenvironment of Colorectal Cancer Using High Definition Spatial Profiling.”bioRxiv. https://doi.org/10.1101/2024.06.04.597233.