17 Quality control

17.1 Preamble

17.1.1 Introduction

The analyses demonstrated here will use a 313-plex Xenium data (10x Genomics) of a human breast cancer biopsy section (Janesick et al. 2023) and a 1k-plex CosMx data (NanoString) of human non-small cell lung cancer (NSCLC).

In the first section, we demonstrate QC procedures applied to the Xenium dataset, and save the post-QC dataset for use in later chapters. In the second section, we demonstrate additional CosMx-specific QC procedures on the CosMx dataset.

These datasets can be accessed through the STexampleData package as part of Bioconductor’s ExperimentHub.

17.1.2 Dependencies

Code

library(SpatialExperiment)
library(STexampleData)
library(scater)
library(dplyr)
library(tidyr)
library(ggplot2)
library(patchwork)

17.1.3 Load Xenium dataset

Load the Xenium dataset from the STexampleData package.

Code

xen <- Janesick_breastCancer_Xenium_rep1()

Plot the x-y coordinates.

17.1.4 Load CosMx dataset

Load the CosMx dataset from the STexampleData package.

Code

cos <- CosMx_lungCancer()

17.2 QC using Xenium dataset

17.2.1 Exploratory analyses

Calculate QC metrics using scater package.

Code

spe <- xen

Code

spe <- addPerCellQCMetrics(spe, use.altexps = TRUE)
df <- data.frame(colData(spe), spatialCoords(spe))

Imaging-based ST data relies on imaging a 2D tissue section. Cells thus represent a slice through a 3D system, and we expect a tight relationship between a cell’s counts and area.

When we inspect total counts and cell areas as separate (univariate) distributions, both appear to follow approximate log-normal distributions.

Code

fd <- pivot_longer(df, c("cell_area", "total_counts"))
mu <- summarise_at(group_by(fd, name), "value", median)

ggplot(fd, aes(value)) + facet_grid(~ name) + 
  geom_histogram(bins = 50, linewidth = 0.1, fill = "gray") + 
  geom_vline(data = mu, aes(xintercept = value), col = "blue") + 
  geom_text(
    hjust = -0.1, size = 3, col = "blue", 
    data = mu, aes(value, 0, label=round(value))) + 
  scale_x_continuous(NULL, trans = "log10") + ylab("# cells") + 
  theme_minimal() + 
  theme(panel.grid.minor = element_blank())

##  Warning in scale_x_continuous(NULL, trans = "log10"): log-10 transformation
##  introduced infinite values.

##  Warning: Removed 1415 rows containing non-finite outside the scale range
##  (`stat_bin()`).

Similarly, we can visualize the distribution of these metrics in the tissue by plotting cell centroids colored by total counts and cell area, respectively.

Code

fd <- df |>
  pivot_longer(c(cell_area, total_counts)) |> 
  arrange(desc(value)) |> 
  # crop low / high values for clearer visualization
  mutate(value = case_when(value < 10 ~ 10, value > 1e3 ~ 1e3, TRUE ~ value))

ggplot(fd, aes(x_centroid, y_centroid, col = value)) + 
  geom_point(shape = 19, stroke = 0, size = 0.2) + 
  facet_grid(~ name) + 
  scale_color_gradientn(
    colors = hcl.colors(9, "Roma"), 
    limits = c(10, 1e3), trans = "log10") + 
  coord_equal() + 
  theme_void()

In a bivariate setting, however, we observe a bimodal distribution with subsets of cells having more or less than 1 count per \(\mu m^2\) (diagonal line).

Code

df <- data.frame(colData(spe), spatialCoords(spe))

ggplot(df, aes(cell_area, total_counts)) + 
  geom_point(shape = 16, size = 0.1, alpha = 0.1) + 
  geom_abline(intercept = 0, slope = 1, col = "blue") + 
  scale_x_continuous(trans = "log10") + 
  scale_y_continuous(trans = "log10") + 
  theme_minimal() + 
  theme(aspect.ratio = 1, 
        panel.grid.minor = element_blank())

##  Warning in scale_y_continuous(trans = "log10"): log-10 transformation
##  introduced infinite values.

In some cases, a peculiar count-area distribution might be indicative of segmentation faults. When groups of cells are merged together, for example, we might observe fewer counts than expected for a given area (when they are far apart, i.e., empty space is being segmented).

Considering the spatial organization of cells falling above / below 1 count per \(\mu m^2\) here, it is reasonable to assume that they represent distinct transcriptional states (not technical artefacts).

Code

hl <- df$total_counts / df$cell_area > 1
df$hl <- c("gray", "blue")[hl + 1]

ggplot(df, aes(x_centroid, y_centroid, col = hl)) + 
  geom_point(shape = 19, stroke = 0, size = 0.2, show.legend = FALSE) + 
  scale_color_identity() + 
  coord_equal(expand=FALSE) + 
  theme_void()

17.2.2 Filtering

For this demo, we will employ rather stringent filtering criteria to:

exclude cells with too few counts per area (thresholding on MADs)
exclude cells with any negative probe or system control counts

Code

nc <- spe$total_counts / spe$cell_area
ol <- isOutlier(nc, log = TRUE, type = "lower", nmads = 3)
(th <- attr(ol, "threshold")[1])

##      lower 
##  0.1714161

Code

par(mar = c(4, 4, 0, 0))
hist(log(nc), n = 1e3, 
     main = NULL, ylab = "# cells", 
     xlab = "log(total_counts / cell_area)")
abline(v = log(th), col = "blue")

Code

mean(ex <- ol | 
       spe$control_probe_counts > 0 | 
       spe$control_codeword_counts > 0)

##  [1] 0.1639766

Before finalizing to exclude any cells, it is recommended to inspect where these fall in the tissue. We would expect low-quality cells to be distributed randomly, and otherwise to accumulate at the tissue borders or in regions were tissue was, for example, smudged, detached, necrotic, etc.

In general, spatially organized clusters of excluded cells might indicate a bias towards excluding specific types of cells (e.g. some cells tend to be smaller or might otherwise contain fewer counts due to the panel design).

Whenever possible, we advise readers to cross-check visualizations such as the following with a corresponding H&E stain of the tissue, considering possible experimental faults as well as their prior knowledge on the tissue pathology.

17.2.3 Save data

Code

saveRDS(spe[, !ex], "img-spe_qc.rds")

17.3 CosMx-specific QC

To not confuse RNA targets with negative probe counts, we will move these from the main data into an altExp, and compute basic cell-level QC metrics for both RNA targets and negative probes.

For simplicity, we also relabel spatial coordinates.

Code

spe <- cos

Code

neg <- grep("NegPrb", rownames(spe))
altExp(spe, "NegPrb") <- spe[neg, ]
spatialCoordsNames(spe) <- c("x", "y")

# calculate QC metrics
(spe <- addPerCellQC(spe[-neg, ], use_altexps = TRUE))

##  class: SpatialExperiment 
##  dim: 960 91972 
##  metadata(0):
##  assays(1): counts
##  rownames(960): AATK ABL1 ... YES1 ZFP36
##  rowData names(0):
##  colnames(91972): 1 2 ... 91971 91972
##  colData names(25): fov cell_ID ... altexps_NegPrb_percent total
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(1): NegPrb
##  spatialCoords names(2) : x y
##  imgData names(1): sample_id

17.3.1 FOV effects

Imaging-based ST data are acquired through iterative imaging of predefined rectangular regions, so-called fields of view (FOV).

Code

xy <- spatialCoordsNames(spe)
df <- data.frame(colData(spe), spatialCoords(spe))
fd <- df |> 
  group_by(fov) |> 
  summarise(across(xy, median))

ggplot(df, aes(x, y, col = factor(fov))) + 
  geom_point(shape = 16, stroke = 0, size = 0.2) + 
  geom_text(aes(label = fov), fd, col = "black", size = 2) + 
  coord_equal(expand = FALSE) + 
  theme_void() + 
  theme(legend.position = "none")

Assuming FOV placement was done with the aim of capturing biologically interesting regions of the tissue, we would expect every FOV to capture a reasonable number of cells.

Issues in image registration or IF staining, however, may affect spot calling and cell segmentation, and can yield FOVs with virtually no cells. We thus advise inspecting the number of cells across FOVs.

Code

ns <- as.data.frame(table(fov = spe$fov), responseName = "n_cells")

ggplot(ns, aes(fov, n_cells)) + 
  geom_col() + 
  scale_x_discrete(breaks = c(1, seq(5, max(spe$fov), 5))) + 
  scale_y_continuous(limits = c(0, 6e3), n.breaks = 4) + 
  labs(x = "field of view (FOV)", y = "# cells") + 
  coord_cartesian(expand = FALSE) + 
  theme_minimal()

It is generally also worth checking that basic QC metrics do not vary across FOVs.

Such effects will also be driven by differences in subpopulation composition across FOVs. However, gross differences in detection efficacy, IF staining, spot calling, segmentation, etc, often still manifest in FOV-level shifts in IF stains and/or RNA target, negative probe, system control counts.

A simple spot-check is the following type of visualization, but more thorough investigation is advisable in a real analysis, especially in challenging types of tissue.

Code

vs <- "Mean|sum|detected|percent"
vs <- grep(vs, names(df), value = TRUE)
fd <- pivot_longer(df, vs) |> 
  mutate(value = case_when(
    grepl("Mean", name) ~ asinh(value), 
    grepl("^sum", name) ~ log10(value), 
    TRUE ~ value))

ggplot(fd, aes(factor(fov), value)) + 
  facet_wrap(~ name, nrow = 2, scales = "free_y") + 
  geom_boxplot(linewidth = 0.2, outlier.size = 0.2) + 
  scale_x_discrete(breaks = c(1, seq(5, max(spe$fov), 5))) + 
  labs(x = "field of view (FOV)", y = NULL) + 
  theme_minimal()

In our example, DAPI stains and RNA target counts (sum) are fairly constant across FOVs.

FOVs 1-4 have decreased signal for CD3 and CD45, however, this might be related to fewer immune cells being present in these regions rather than technical artefacts.

Taken together, we cannot deem any FOVs as being obviously problematic here.

17.3.2 Border effects

Particularly in CosMx, lack of FOV stitching during cell segmentation can result in fractured and possibly duplicated cells.

To investigate potential artefacts related to this, we can estimate each cell’s distance to every FOV border (precise distances would require considering the exact FOV placement coordinates).

Code

# compute distance to FOV borders
x <- spe$CenterX_local_px
y <- spe$CenterY_local_px
d <- cbind(
  bottom = y - min(y), top = max(y) - y, 
  left = x - min(x), right = max(x) - x)

We can roughly estimate cell radii as \(A=\pi r^2\) \(\leftrightarrow r=\sqrt{A/\pi}\), and use this as a threshold on FOV border distances.

In other words, we exclude a cell when its centroid is closer to any FOV border than the radius of an average circular cell.

Code

(r <- sqrt(mean(spe$Area)/pi))

##  [1] 32.55155

Next, let us visualize total RNA counts against distance to FOV borders, setting the above threshold on the latter. We also include rolling means to better highlight global trends.

Code

df <- data.frame(colData(spe), spatialCoords(spe), d)
fd <- df |> 
  pivot_longer(colnames(d), values_to = "d")
mu <- fd |> 
  group_by(name) |> 
  arrange(d) |> 
  mutate(across(c(d, sum), 
                zoo::rollmean, k = 100, align = "right", fill = NA))

ggplot(fd, aes(d, sum)) + 
  facet_grid(~ name) + 
  geom_point(shape = 16, size = 0.4, alpha = 0.2, col = "navy") + 
  geom_line(data = mu, col = "blue", linewidth = 0.4) + 
  geom_vline(xintercept = r, col = "gold") + 
  labs(x = "distance to FOV border (px)", y = "total counts (RNA)") + 
  scale_y_continuous(limits = c(10, 1e3), trans = "log10") + 
  coord_cartesian(xlim = c(0, 200)) + 
  theme_bw() + 
  theme(aspect.ratio = 1, 
        legend.position = "none", 
        panel.grid.minor = element_blank())

As expected, we observe a decline in counts near FOV borders, with about 3.5% of cells falling below \(r\).

To mitigate potential artefacts in downstream analyses, we may choose to filter out these cells, or otherwise flag them as potentially problematic events to be kept a cautionary eye on.

Code

# drop cells too close to FOV borders
mean(ex <- rowAnys(d < r))

##  [1] 0.03511938

Code

ggplot(df, aes(x, y, col = ex)) + 
  geom_point(shape = 16, stroke = 0, size = 0.2) + 
  scale_color_manual(values = c("gray", "blue")) + 
  coord_equal(expand=FALSE) + 
  theme_void() + 
  theme(legend.position = "none")

Tissue plot with cells that lie too close to FOV borders highlighted in color.

17.4 Appendix

References

Janesick, Amanda, Robert Shelansky, Andrew D. Gottscho, Florian Wagner, Stephen R. Williams, Morgane Rouault, Ghezal Beliakoff, et al. 2023. “High Resolution Mapping of the Tumor Microenvironment Using Integrated Single-Cell, Spatial and in Situ Analysis.” Nature Communications 14 (1): 8353.