17  Human DLPFC workflow

17.1 Overview

This workflow analyzes one sample of human brain from the dorsolateral prefrontal cortex (DLPFC) region, measured using the 10x Genomics Visium platform. This is a condensed version of the analyses shown in the individual analysis chapters in the previous part. For more details on the individual steps, see the previous chapters.

17.2 Description of dataset

This is a 10x Genomics Visium dataset generated from healthy human brain samples from the dorsolateral prefrontal cortex (DLPFC) region.

In the full dataset, there are 12 samples in total, from 3 individuals, with 2 pairs of spatially adjacent replicates (serial sections) per individual (4 samples per individual). The individuals and spatially adjacent replicates can be used as blocking factors. Each sample spans the six layers of the cortex plus white matter in a perpendicular tissue section.

For the examples in this workflow and the analysis chapters, we use a single sample from this dataset (sample 151673), to keep the computational requirements to compile the book manageable.

For more details on the dataset, see Maynard et al. (2021). The full dataset is publicly available through the spatialLIBD Bioconductor package. The dataset can also be explored interactively through the spatialLIBD Shiny web app.

17.3 Load data

Here, we load a single sample from this dataset (sample 151673), which is available as a SpatialExperiment object from the STexampleData package.

library(SpatialExperiment)
library(STexampleData)

# load object
spe <- Visium_humanDLPFC()
spe
class: SpatialExperiment 
dim: 33538 4992 
metadata(0):
assays(1): counts
rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
  ENSG00000268674
rowData names(3): gene_id gene_name feature_type
colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
  TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
colData names(7): barcode_id sample_id ... ground_truth cell_count
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

17.4 Plot data

As an initial check, plot the spatial coordinates (spots) in x-y dimensions on the tissue slide, to check that the object has loaded correctly and that the orientation is as expected.

We use visualization functions from the ggspavis package to generate plots.

library(ggspavis)
# plot spatial coordinates (spots)
plotSpots(spe)

17.5 Quality control (QC)

First, we subset the object to keep only spots over tissue. The remaining spots are background spots, which we exclude.

# subset to keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)
[1] 33538  3639

Next, calculate spot-level QC metrics using the scater package (McCarthy et al. 2017), and store the QC metrics in colData. See Chapter 6 for more details, including explanations of the QC metrics.

library(scater)
# identify mitochondrial genes
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
table(is_mito)
is_mito
FALSE  TRUE 
33525    13 
rowData(spe)$gene_name[is_mito]
 [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
 [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB" 
# calculate per-spot QC metrics and store in colData
spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
head(colData(spe), 3)
DataFrame with 3 rows and 13 columns
                           barcode_id     sample_id in_tissue array_row
                          <character>   <character> <integer> <integer>
AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673         1        50
AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673         1         3
AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673         1        59
                   array_col ground_truth cell_count       sum  detected
                   <integer>  <character>  <integer> <numeric> <numeric>
AAACAAGTATCTCCCA-1       102       Layer3          6      8458      3586
AAACAATCTACTAGCA-1        43       Layer1         16      1667      1150
AAACACCAATAACTGC-1        19           WM          5      3769      1960
                   subsets_mito_sum subsets_mito_detected subsets_mito_percent
                          <numeric>             <numeric>            <numeric>
AAACAAGTATCTCCCA-1             1407                    13              16.6351
AAACAATCTACTAGCA-1              204                    11              12.2376
AAACACCAATAACTGC-1              430                    13              11.4089
                       total
                   <numeric>
AAACAAGTATCTCCCA-1      8458
AAACAATCTACTAGCA-1      1667
AAACACCAATAACTGC-1      3769

Select filtering thresholds for the QC metrics by examining distributions using histograms. For additional details, including further exploratory visualizations to select the thresholds, see Chapter 6. Here, we use relatively relaxed thresholds, since the additional exploratory visualizations showed that more stringent thresholds tended to remove groups of spots corresponding to biologically meaningful regions.

# histograms of QC metrics
par(mfrow = c(1, 4))
hist(colData(spe)$sum, xlab = "sum", main = "UMIs per spot")
hist(colData(spe)$detected, xlab = "detected", main = "Genes per spot")
hist(colData(spe)$subsets_mito_percent, xlab = "percent mitochondrial", main = "Percent mito UMIs")
hist(colData(spe)$cell_count, xlab = "number of cells", main = "No. cells per spot")

par(mfrow = c(1, 1))

# select QC thresholds
qc_lib_size <- colData(spe)$sum < 600
qc_detected <- colData(spe)$detected < 400
qc_mito <- colData(spe)$subsets_mito_percent > 28
qc_cell_count <- colData(spe)$cell_count > 10

# number of discarded spots for each metric
apply(cbind(qc_lib_size, qc_detected, qc_mito, qc_cell_count), 2, sum)
  qc_lib_size   qc_detected       qc_mito qc_cell_count 
            8             7            17            90 
# combined set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito | qc_cell_count
table(discard)
discard
FALSE  TRUE 
 3524   115 
# store in object
colData(spe)$discard <- discard

Plot the set of discarded spots in the spatial x-y coordinates, to confirm that the spatial distribution of the discarded spots does not correspond to any biologically meaningful regions, which would indicate that we are removing biologically informative spots.

# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot", annotate = "discard")

There is some concentration of discarded spots at the edge of the tissue region, which may be due to tissue damage. Importantly, the discarded spots do not correspond to any of the cortical layers of interest.

We filter out the low-quality spots from the object.

# filter low-quality spots
spe <- spe[, !colData(spe)$discard]
dim(spe)
[1] 33538  3524

17.6 Normalization

Calculate log-transformed normalized counts (logcounts) with the library size factors methodology, using methods from scater (McCarthy et al. 2017) and scran (Lun, McCarthy, and Marioni 2016), making the assumption that spots can be treated as equivalent to cells. For more details, see Chapter 7.

library(scran)
# calculate library size factors
spe <- computeLibraryFactors(spe)

summary(sizeFactors(spe))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1321  0.6312  0.9000  1.0000  1.2849  3.7582 
hist(sizeFactors(spe), breaks = 20)

# calculate logcounts and store in object
spe <- logNormCounts(spe)

assayNames(spe)
[1] "counts"    "logcounts"

17.7 Feature selection

Identify a set of top highly variable genes (HVGs), which will be used to define cell types. We use methods from scran (Lun, McCarthy, and Marioni 2016), treating spots as equivalent to single cells, and considering only molecular features (gene expression) as described in Chapter 8. We also first filter out mitochondrial genes, since these are very highly expressed and not of main biological interest here.

# remove mitochondrial genes
spe <- spe[!is_mito, ]
dim(spe)
[1] 33525  3524
# fit mean-variance relationship
dec <- modelGeneVar(spe)

# visualize mean-variance relationship
fit <- metadata(dec)
plot(fit$mean, fit$var, 
     xlab = "mean of log-expression", ylab = "variance of log-expression")
curve(fit$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

# select top HVGs
top_hvgs <- getTopHVGs(dec, prop = 0.1)
length(top_hvgs)
[1] 1438

17.8 Spatially-aware feature selection

Alternatively, run nnSVG (Weber et al. 2023) to identify a set of top spatially variable genes (SVGs) instead of HVGs.

Here, we run nnSVG using a small subset of the dataset for faster runtime. We select a subset of the data by subsampling on the set of spots and including stringent filtering for low-expressed genes. For a full analysis, we recommend running nnSVG on all spots and using default filtering parameters (for Visium data from human brain tissue), which takes around 45 minutes for one Visium slide on a standard laptop using multiple cores.

library(nnSVG)
# subsample spots
n <- 100
set.seed(123)
ix <- sample(seq_len(n), n)

spe_nnSVG <- spe[, ix]

# filter low-expressed and mitochondrial genes
# using very stringent filtering parameters for faster runtime in this example
# note: for a full analysis, use alternative filtering parameters (e.g. defaults)
spe_nnSVG <- filter_genes(
  spe_nnSVG, filter_genes_ncounts = 10, filter_genes_pcspots = 3
)
Gene filtering: removing mitochondrial genes
removed 0 mitochondrial genes
Gene filtering: retaining genes with at least 10 counts in at least 3% (n = 3) of spatial locations
removed 33353 out of 33525 genes due to low expression
# re-calculate logcounts after filtering
# using library size factors
spe_nnSVG <- logNormCounts(spe_nnSVG)

# run nnSVG
# using a single core for compatibility on build system
# note: for a full analysis, use multiple cores
set.seed(123)
spe_nnSVG <- nnSVG(spe_nnSVG, n_threads = 1)

# investigate results

# show results
head(rowData(spe_nnSVG), 3)
DataFrame with 3 rows and 17 columns
                        gene_id   gene_name    feature_type   sigma.sq
                    <character> <character>     <character>  <numeric>
ENSG00000074800 ENSG00000074800        ENO1 Gene Expression 0.00840777
ENSG00000171603 ENSG00000171603      CLSTN1 Gene Expression 0.43102027
ENSG00000162545 ENSG00000162545     CAMK2N1 Gene Expression 0.15373440
                   tau.sq       phi    loglik   runtime      mean       var
                <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000074800  0.518011  15.82454  -109.808     0.018   1.75820  0.530929
ENSG00000171603  0.298698  18.12310  -123.212     0.023   2.07433  0.741077
ENSG00000162545  0.547736   2.38046  -119.576     0.016   2.63576  0.702846
                    spcov   prop_sv loglik_lm   LR_stat      rank      pval
                <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000074800 0.0521523 0.0159716  -109.735 -0.145454       167 1.0000000
ENSG00000171603 0.3164987 0.5906665  -126.409  6.394009        86 0.0408845
ENSG00000162545 0.1487576 0.2191603  -123.760  8.368951        70 0.0152302
                     padj
                <numeric>
ENSG00000074800 1.0000000
ENSG00000171603 0.0817690
ENSG00000162545 0.0374227
# number of significant SVGs
table(rowData(spe_nnSVG)$padj <= 0.05)

FALSE  TRUE 
   96    76 
# show results for top n SVGs
rowData(spe_nnSVG)[order(rowData(spe_nnSVG)$rank)[1:6], ]
DataFrame with 6 rows and 17 columns
                        gene_id   gene_name    feature_type  sigma.sq    tau.sq
                    <character> <character>     <character> <numeric> <numeric>
ENSG00000197971 ENSG00000197971         MBP Gene Expression   3.92430  0.175336
ENSG00000123560 ENSG00000123560        PLP1 Gene Expression   3.23544  0.461590
ENSG00000109846 ENSG00000109846       CRYAB Gene Expression   1.89968  0.281795
ENSG00000173786 ENSG00000173786         CNP Gene Expression   2.28793  0.402524
ENSG00000131095 ENSG00000131095        GFAP Gene Expression   2.23709  0.461297
ENSG00000160307 ENSG00000160307       S100B Gene Expression   1.24179  0.155737
                      phi    loglik   runtime      mean       var     spcov
                <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000197971   0.93014  -118.372     0.022   3.78073   2.76535  0.523969
ENSG00000123560   1.03983  -140.269     0.018   2.86143   3.02555  0.628613
ENSG00000109846   1.88555  -126.692     0.017   1.86058   1.88123  0.740785
ENSG00000173786   1.02675  -129.206     0.018   1.79558   1.97274  0.842396
ENSG00000131095   2.49083  -149.004     0.019   1.94543   2.71281  0.768823
ENSG00000160307   5.76279  -129.722     0.016   1.82695   1.46946  0.609953
                  prop_sv loglik_lm   LR_stat      rank        pval        padj
                <numeric> <numeric> <numeric> <numeric>   <numeric>   <numeric>
ENSG00000197971  0.957231  -192.250  147.7556         1 0.00000e+00 0.00000e+00
ENSG00000123560  0.875146  -196.746  112.9549         2 0.00000e+00 0.00000e+00
ENSG00000109846  0.870824  -172.988   92.5906         3 0.00000e+00 0.00000e+00
ENSG00000173786  0.850388  -175.363   92.3137         4 0.00000e+00 0.00000e+00
ENSG00000131095  0.829047  -191.291   84.5739         5 0.00000e+00 0.00000e+00
ENSG00000160307  0.888562  -160.636   61.8281         6 3.75255e-14 1.07573e-12
# identify top-ranked SVG
rowData(spe_nnSVG)$gene_name[which(rowData(spe_nnSVG)$rank == 1)]
[1] "MBP"

17.9 Dimensionality reduction

Run principal component analysis (PCA) on the set of top HVGs, and retain the top 50 principal components (PCs) for further downstream analyses. This is done both to reduce noise and to improve computational efficiency. We also run UMAP on the set of top 50 PCs and retain the top 2 UMAP components for visualization purposes.

We use the computationally efficient implementation of PCA available in scater (McCarthy et al. 2017), which uses randomization, and therefore requires setting a random seed for reproducibility.

# compute PCA
set.seed(123)
spe <- runPCA(spe, subset_row = top_hvgs)

reducedDimNames(spe)
[1] "PCA"
dim(reducedDim(spe, "PCA"))
[1] 3524   50
# compute UMAP on top 50 PCs
set.seed(123)
spe <- runUMAP(spe, dimred = "PCA")

reducedDimNames(spe)
[1] "PCA"  "UMAP"
dim(reducedDim(spe, "UMAP"))
[1] 3524    2
# update column names for easier plotting
colnames(reducedDim(spe, "UMAP")) <- paste0("UMAP", 1:2)

17.10 Clustering

Next, we perform clustering to define cell types. Here, we use molecular features (gene expression) only, as described in Chapter 10. We apply graph-based clustering using the Walktrap method implemented in scran (Lun, McCarthy, and Marioni 2016), applied to the top 50 PCs calculated on the set of top HVGs.

# graph-based clustering
set.seed(123)
k <- 10
g <- buildSNNGraph(spe, k = k, use.dimred = "PCA")
g_walk <- igraph::cluster_walktrap(g)
clus <- g_walk$membership
table(clus)
clus
   1    2    3    4    5    6    7 
 338  312 1146  978  274  116  360 
# store cluster labels in column 'label' in colData
colLabels(spe) <- factor(clus)

Visualize the clusters by plotting in spatial (x-y) coordinates on the tissue slide, and in UMAP dimensions.

From the visualizations, we can see that the clustering reproduces the known biological structure (cortical layers), although not perfectly. The clusters are also separated in UMAP space, but again not perfectly.

# plot clusters in spatial x-y coordinates
plotSpots(spe, annotate = "label", 
          pal = "libd_layer_colors")

# plot ground truth labels in spatial coordinates
plotSpots(spe, annotate = "ground_truth", 
          pal = "libd_layer_colors")

# plot clusters in UMAP reduced dimensions
plotDimRed(spe, plot_type = "UMAP", 
           annotate = "label", pal = "libd_layer_colors")

17.11 Differential expression

Identify marker genes by testing for differential gene expression between clusters. We use the findMarkers implementation in scran (Lun, McCarthy, and Marioni 2016), using a binomial test, which tests for genes that differ in the proportion expressed vs. not expressed between clusters. This is a more stringent test than the default t-tests, and tends to select genes that are easier to interpret and validate experimentally.

# set gene names as row names for easier plotting
rownames(spe) <- rowData(spe)$gene_name

# test for marker genes
markers <- findMarkers(spe, test = "binom", direction = "up")

# returns a list with one DataFrame per cluster
markers
List of length 7
names(7): 1 2 3 4 5 6 7
library(pheatmap)
# plot log-fold changes for one cluster over all other clusters
# selecting cluster 1
interesting <- markers[[1]]
best_set <- interesting[interesting$Top <= 5, ]
logFCs <- getMarkerEffects(best_set)

pheatmap(logFCs, breaks = seq(-5, 5, length.out = 101))

# plot log-transformed normalized expression of top genes for one cluster
top_genes <- head(rownames(interesting))

plotExpression(spe, x = "label", features = top_genes)