15 Workflow: Human DLPFC

15.1 Introduction

This workflow analyzes one sample of human brain from the dorsolateral prefrontal cortex (DLPFC) region, measured using the 10x Genomics Visium platform. This is a condensed version of the analyses shown in the individual analysis chapters in the previous part. For more details on the individual steps, see the previous chapters.

15.2 Description of dataset

This is a 10x Genomics Visium dataset generated from healthy human brain samples from the dorsolateral prefrontal cortex (DLPFC) region.

In the full dataset, there are 12 samples in total, from 3 individuals, with 2 pairs of spatially adjacent replicates (serial sections) per individual (4 samples per individual). The individuals and spatially adjacent replicates can be used as blocking factors. Each sample spans the six layers of the cortex plus white matter in a perpendicular tissue section.

For the examples in this workflow and the analysis chapters, we use a single sample from this dataset (sample 151673), to keep the computational requirements to compile the book manageable.

For more details on the dataset, see Maynard et al. (2021). The full dataset is publicly available through the spatialLIBD Bioconductor package. The dataset can also be explored interactively through the spatialLIBD Shiny web app.

15.3 Load data

Here, we load a single sample from this dataset (sample 151673), which is available as a SpatialExperiment object from the STexampleData package.

library(SpatialExperiment)
library(STexampleData)

# load object
spe <- Visium_humanDLPFC()
spe

##  class: SpatialExperiment 
##  dim: 33538 4992 
##  metadata(0):
##  assays(1): counts
##  rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##    ENSG00000268674
##  rowData names(3): gene_id gene_name feature_type
##  colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
##    TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
##  colData names(8): barcode_id sample_id ... reference cell_count
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
##  imgData names(4): sample_id image_id data scaleFactor

15.4 Plot data

As an initial check, plot the spatial coordinates (spots) in x-y dimensions on the tissue slide, to check that the object has loaded correctly and that the orientation is as expected.

We use visualization functions from the ggspavis package to generate plots.

library(ggspavis)

# plot spatial coordinates (spots)
plotSpots(spe)

15.5 Quality control (QC)

First, we subset the object to keep only spots over tissue. The remaining spots are background spots, which we exclude.

# subset to keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)

##  [1] 33538  3639

Next, calculate spot-level QC metrics using the scater package (McCarthy et al. 2017), and store the QC metrics in colData. See Chapter 6 for more details, including explanations of the QC metrics.

library(scater)

# identify mitochondrial genes
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
table(is_mito)

##  is_mito
##  FALSE  TRUE 
##  33525    13

rowData(spe)$gene_name[is_mito]

##   [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##   [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

# calculate per-spot QC metrics and store in colData
spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
head(colData(spe), 3)

##  DataFrame with 3 rows and 14 columns
##                             barcode_id     sample_id in_tissue array_row
##                            <character>   <character> <integer> <integer>
##  AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673         1        50
##  AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673         1         3
##  AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673         1        59
##                     array_col ground_truth   reference cell_count       sum
##                     <integer>  <character> <character>  <integer> <numeric>
##  AAACAAGTATCTCCCA-1       102       Layer3      Layer3          6      8458
##  AAACAATCTACTAGCA-1        43       Layer1      Layer1         16      1667
##  AAACACCAATAACTGC-1        19           WM          WM          5      3769
##                      detected subsets_mito_sum subsets_mito_detected
##                     <numeric>        <numeric>             <numeric>
##  AAACAAGTATCTCCCA-1      3586             1407                    13
##  AAACAATCTACTAGCA-1      1150              204                    11
##  AAACACCAATAACTGC-1      1960              430                    13
##                     subsets_mito_percent     total
##                                <numeric> <numeric>
##  AAACAAGTATCTCCCA-1              16.6351      8458
##  AAACAATCTACTAGCA-1              12.2376      1667
##  AAACACCAATAACTGC-1              11.4089      3769

Select filtering thresholds for the QC metrics by examining distributions using histograms. For additional details, including further exploratory visualizations to select the thresholds, see Chapter 6. Here, we use relatively relaxed thresholds, since the additional exploratory visualizations showed that more stringent thresholds tended to remove groups of spots corresponding to biologically meaningful regions.

# histograms of QC metrics
par(mfrow = c(1, 4))
hist(colData(spe)$sum, xlab = "sum", main = "UMIs per spot")
hist(colData(spe)$detected, xlab = "detected", main = "Genes per spot")
hist(colData(spe)$subsets_mito_percent, xlab = "percent mitochondrial", main = "Percent mito UMIs")
hist(colData(spe)$cell_count, xlab = "number of cells", main = "No. cells per spot")

par(mfrow = c(1, 1))

# select QC thresholds
qc_lib_size <- colData(spe)$sum < 600
qc_detected <- colData(spe)$detected < 400
qc_mito <- colData(spe)$subsets_mito_percent > 28
qc_cell_count <- colData(spe)$cell_count > 10

# number of discarded spots for each metric
apply(cbind(qc_lib_size, qc_detected, qc_mito, qc_cell_count), 2, sum)

##    qc_lib_size   qc_detected       qc_mito qc_cell_count 
##              8             7            17            90

# combined set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito | qc_cell_count
table(discard)

##  discard
##  FALSE  TRUE 
##   3524   115

# store in object
colData(spe)$discard <- discard

Plot the set of discarded spots in the spatial x-y coordinates, to confirm that the spatial distribution of the discarded spots does not correspond to any biologically meaningful regions, which would indicate that we are removing biologically informative spots.

# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot", annotate = "discard")

There is some concentration of discarded spots at the edge of the tissue region, which may be due to tissue damage. Importantly, the discarded spots do not correspond to any of the cortical layers of interest.

We filter out the low-quality spots from the object.

# filter low-quality spots
spe <- spe[, !colData(spe)$discard]
dim(spe)

##  [1] 33538  3524

15.6 Normalization

Calculate log-transformed normalized counts (logcounts) with the library size factors methodology, using methods from scater (McCarthy et al. 2017) and scran (Lun, McCarthy, and Marioni 2016), making the assumption that spots can be treated as equivalent to cells. For more details, see Chapter 7.

library(scran)

# calculate library size factors
spe <- computeLibraryFactors(spe)

summary(sizeFactors(spe))

##     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.1321  0.6312  0.9000  1.0000  1.2849  3.7582

hist(sizeFactors(spe), breaks = 20)

# calculate logcounts and store in object
spe <- logNormCounts(spe)

assayNames(spe)

##  [1] "counts"    "logcounts"

15.7 Feature selection

Identify a set of top highly variable genes (HVGs), which will be used to define cell types. We use methods from scran (Lun, McCarthy, and Marioni 2016), treating spots as equivalent to single cells, and considering only molecular features (gene expression) as described in Chapter 8. We also first filter out mitochondrial genes, since these are very highly expressed and not of main biological interest here.

# remove mitochondrial genes
spe <- spe[!is_mito, ]
dim(spe)

##  [1] 33525  3524

# fit mean-variance relationship
dec <- modelGeneVar(spe)

# visualize mean-variance relationship
fit <- metadata(dec)
plot(fit$mean, fit$var, 
     xlab = "mean of log-expression", ylab = "variance of log-expression")
curve(fit$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

# select top HVGs
top_hvgs <- getTopHVGs(dec, prop = 0.1)
length(top_hvgs)

##  [1] 1438

15.8 Spatially-aware feature selection

Alternatively, run nnSVG (Weber et al. 2023) to identify a set of top spatially variable genes (SVGs) instead of HVGs.

Here, we run nnSVG using a small subset of the dataset for faster runtime. We select a subset of the data by subsampling on the set of spots and including stringent filtering for low-expressed genes. For a full analysis, we recommend running nnSVG on all spots and using default filtering parameters (for Visium data from human brain tissue), which takes around 45 minutes for one Visium slide on a standard laptop using multiple cores.

library(nnSVG)

# subsample spots
n <- 100
set.seed(123)
ix <- sample(seq_len(n), n)

spe_nnSVG <- spe[, ix]

# filter low-expressed and mitochondrial genes
# using very stringent filtering parameters for faster runtime in this example
# note: for a full analysis, use alternative filtering parameters (e.g. defaults)
spe_nnSVG <- filter_genes(
  spe_nnSVG, filter_genes_ncounts = 10, filter_genes_pcspots = 3
)

##  Gene filtering: removing mitochondrial genes

##  removed 0 mitochondrial genes

##  Gene filtering: retaining genes with at least 10 counts in at least 3% (n = 3) of spatial locations

##  removed 33353 out of 33525 genes due to low expression

# re-calculate logcounts after filtering
# using library size factors
spe_nnSVG <- logNormCounts(spe_nnSVG)

# run nnSVG
# using a single core for compatibility on build system
# note: for a full analysis, use multiple cores
set.seed(123)
spe_nnSVG <- nnSVG(spe_nnSVG, n_threads = 1)

# investigate results

# show results
head(rowData(spe_nnSVG), 3)

##  DataFrame with 3 rows and 18 columns
##                          gene_id   gene_name    feature_type subsets_mito
##                      <character> <character>     <character>    <logical>
##  ENSG00000074800 ENSG00000074800        ENO1 Gene Expression        FALSE
##  ENSG00000171603 ENSG00000171603      CLSTN1 Gene Expression        FALSE
##  ENSG00000162545 ENSG00000162545     CAMK2N1 Gene Expression        FALSE
##                    sigma.sq    tau.sq       phi    loglik   runtime      mean
##                   <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##  ENSG00000074800 0.00840777  0.518011  15.82454  -109.808     0.015   1.75820
##  ENSG00000171603 0.43102027  0.298698  18.12310  -123.212     0.018   2.07433
##  ENSG00000162545 0.15373440  0.547736   2.38046  -119.576     0.012   2.63576
##                        var     spcov   prop_sv loglik_lm   LR_stat      rank
##                  <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##  ENSG00000074800  0.530929 0.0521523 0.0159716  -109.735 -0.145454       167
##  ENSG00000171603  0.741077 0.3164987 0.5906665  -126.409  6.394009        86
##  ENSG00000162545  0.702846 0.1487576 0.2191603  -123.760  8.368951        70
##                       pval      padj
##                  <numeric> <numeric>
##  ENSG00000074800 1.0000000 1.0000000
##  ENSG00000171603 0.0408845 0.0817690
##  ENSG00000162545 0.0152302 0.0374227

# number of significant SVGs
table(rowData(spe_nnSVG)$padj <= 0.05)

##  
##  FALSE  TRUE 
##     96    76

# show results for top n SVGs
rowData(spe_nnSVG)[order(rowData(spe_nnSVG)$rank)[1:6], ]

##  DataFrame with 6 rows and 18 columns
##                          gene_id   gene_name    feature_type subsets_mito
##                      <character> <character>     <character>    <logical>
##  ENSG00000197971 ENSG00000197971         MBP Gene Expression        FALSE
##  ENSG00000123560 ENSG00000123560        PLP1 Gene Expression        FALSE
##  ENSG00000109846 ENSG00000109846       CRYAB Gene Expression        FALSE
##  ENSG00000173786 ENSG00000173786         CNP Gene Expression        FALSE
##  ENSG00000131095 ENSG00000131095        GFAP Gene Expression        FALSE
##  ENSG00000160307 ENSG00000160307       S100B Gene Expression        FALSE
##                   sigma.sq    tau.sq       phi    loglik   runtime      mean
##                  <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##  ENSG00000197971   3.92430  0.175336   0.93014  -118.372     0.019   3.78073
##  ENSG00000123560   3.23544  0.461590   1.03983  -140.269     0.016   2.86143
##  ENSG00000109846   1.89968  0.281795   1.88555  -126.692     0.014   1.86058
##  ENSG00000173786   2.28793  0.402524   1.02675  -129.206     0.014   1.79558
##  ENSG00000131095   2.23709  0.461297   2.49083  -149.004     0.021   1.94543
##  ENSG00000160307   1.24179  0.155737   5.76279  -129.722     0.013   1.82695
##                        var     spcov   prop_sv loglik_lm   LR_stat      rank
##                  <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##  ENSG00000197971   2.76535  0.523969  0.957231  -192.250  147.7556         1
##  ENSG00000123560   3.02555  0.628613  0.875146  -196.746  112.9549         2
##  ENSG00000109846   1.88123  0.740785  0.870824  -172.988   92.5906         3
##  ENSG00000173786   1.97274  0.842396  0.850388  -175.363   92.3137         4
##  ENSG00000131095   2.71281  0.768823  0.829047  -191.291   84.5739         5
##  ENSG00000160307   1.46946  0.609953  0.888562  -160.636   61.8281         6
##                         pval        padj
##                    <numeric>   <numeric>
##  ENSG00000197971 0.00000e+00 0.00000e+00
##  ENSG00000123560 0.00000e+00 0.00000e+00
##  ENSG00000109846 0.00000e+00 0.00000e+00
##  ENSG00000173786 0.00000e+00 0.00000e+00
##  ENSG00000131095 0.00000e+00 0.00000e+00
##  ENSG00000160307 3.75255e-14 1.07573e-12

# identify top-ranked SVG
rowData(spe_nnSVG)$gene_name[which(rowData(spe_nnSVG)$rank == 1)]

##  [1] "MBP"

15.9 Dimensionality reduction

Run principal component analysis (PCA) on the set of top HVGs, and retain the top 50 principal components (PCs) for further downstream analyses. This is done both to reduce noise and to improve computational efficiency. We also run UMAP on the set of top 50 PCs and retain the top 2 UMAP components for visualization purposes.

We use the computationally efficient implementation of PCA available in scater (McCarthy et al. 2017), which uses randomization, and therefore requires setting a random seed for reproducibility.

# compute PCA
set.seed(123)
spe <- runPCA(spe, subset_row = top_hvgs)

reducedDimNames(spe)

##  [1] "PCA"

dim(reducedDim(spe, "PCA"))

##  [1] 3524   50

# compute UMAP on top 50 PCs
set.seed(123)
spe <- runUMAP(spe, dimred = "PCA")

reducedDimNames(spe)

##  [1] "PCA"  "UMAP"

dim(reducedDim(spe, "UMAP"))

##  [1] 3524    2

# update column names for easier plotting
colnames(reducedDim(spe, "UMAP")) <- paste0("UMAP", 1:2)

15.10 Clustering

Next, we perform clustering to define cell types. Here, we use molecular features (gene expression) only, as described in Chapter 10. We apply graph-based clustering using the Walktrap method implemented in scran (Lun, McCarthy, and Marioni 2016), applied to the top 50 PCs calculated on the set of top HVGs.

# graph-based clustering
set.seed(123)
k <- 10
g <- buildSNNGraph(spe, k = k, use.dimred = "PCA")
g_walk <- igraph::cluster_walktrap(g)
clus <- g_walk$membership
table(clus)

##  clus
##     1    2    3    4    5    6 
##   359 1187  447  291  693  547

# store cluster labels in column 'label' in colData
colLabels(spe) <- factor(clus)

Visualize the clusters by plotting in spatial (x-y) coordinates on the tissue slide, and in UMAP dimensions.

From the visualizations, we can see that the clustering reproduces the known biological structure (cortical layers), although not perfectly. The clusters are also separated in UMAP space, but again not perfectly.

# plot clusters in spatial x-y coordinates
plotSpots(spe, annotate = "label", 
          pal = "libd_layer_colors")

# plot ground truth labels in spatial coordinates
plotSpots(spe, annotate = "ground_truth", 
          pal = "libd_layer_colors")

# plot clusters in UMAP reduced dimensions
plotDimRed(spe, plot_type = "UMAP", 
           annotate = "label", pal = "libd_layer_colors")

15.11 Differential expression

Identify marker genes by testing for differential gene expression between clusters. We use the findMarkers implementation in scran (Lun, McCarthy, and Marioni 2016), using a binomial test, which tests for genes that differ in the proportion expressed vs. not expressed between clusters. This is a more stringent test than the default t-tests, and tends to select genes that are easier to interpret and validate experimentally.

# set gene names as row names for easier plotting
rownames(spe) <- rowData(spe)$gene_name

# test for marker genes
markers <- findMarkers(spe, test = "binom", direction = "up")

# returns a list with one DataFrame per cluster
markers

##  List of length 6
##  names(6): 1 2 3 4 5 6

library(pheatmap)

# plot log-fold changes for one cluster over all other clusters
# selecting cluster 1
interesting <- markers[[1]]
best_set <- interesting[interesting$Top <= 5, ]
logFCs <- getMarkerEffects(best_set)

pheatmap(logFCs, breaks = seq(-5, 5, length.out = 101))

# plot log-transformed normalized expression of top genes for one cluster
top_genes <- head(rownames(interesting))

plotExpression(spe, x = "label", features = top_genes)

References

Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-seq Data with Bioconductor.” F1000Research 5 (2122). https://doi.org/10.12688/f1000research.9501.2.

Maynard, Kristen R., Leonardo Collado-Torres, Lukas M. Weber, Cedric Uytingco, Brianna K. Barry, Stephen R. Williams, Joseph L. Catallini II, et al. 2021. “Transcriptome-Scale Spatial Gene Expression in the Human Dorsolateral Prefrontal Cortex.” Nature Neuroscience 24: 425–36. https://doi.org/10.1038/s41593-020-00787-0.

McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Wills. 2017. “Scater: Pre-Processing, Quality Control, Normalization and Visualization of Single-Cell RNA-seq Data in R.” Bioinformatics 33 (8): 1179–86. https://doi.org/10.1093/bioinformatics/btw777.

Weber, Lukas M., Arkajyoti Saha, Abhirup Datta, Kasper D. Hansen, and Stephanie C. Hicks. 2023. “nnSVG for the Scalable Identification of Spatially Variable Genes Using Nearest-Neighbor Gaussian Processes.” Nature Communications 14: 4059. https://doi.org/10.1038/s41467-023-39748-z.