17 Quality control
17.1 Preamble
17.1.1 Introduction
The analyses demonstrated here will use 313-plex Xenium data (10x Genomics) of a human breast cancer biopsy section (Janesick et al. 2023) and the 1k-plex CosMx data (Nanostring) of a mouse brain sample.
For the remainder of this Chapter, we will show the two examples in parallel, highlighting when needed the differences between the platforms.
17.1.2 Dependencies
17.2 Data import
SpatialExperimentIO provides functions to import CosMx, Xenium, and MERSCOPE datasets in Bioconductor, starting from the output provided by the three platforms.
17.2.1 CosMx
First, we load the CosMx dataset:
Code
# retrieve CosMx dataset from OSF repo
id <- "CosMx1k_MouseBrain2"
pa <- OSTA.data_load(id)
dir.create(td <- tempfile())
unzip(pa, exdir=td)
# TODO: should use 'SpaExpIO' here...
cos <- SpaceTrooper::readCosmxSPE(td)
# add cell boundary polygons
fnm <- file.path(td, "polygons.csv.gz")
pol <- readPolygonsCosmx(fnm)
idx <- sprintf("f%s_c%s", cos$fov, cos$cellID)
cos$polygons <- pol[match(idx, pol$cell_id), ]
# add info for 'SpaceTrooper'
metadata(cos)$technology <- "Nanostring_CosMx"
cos$sample_id <- "CosMx"
cos
## class: SpatialExperiment
## dim: 960 48556
## metadata(4): fov_positions fov_dim polygons technology
## assays(1): counts
## rownames(960): Chrna4 Slc6a1 ... NegPrb9 NegPrb10
## rowData names(0):
## colnames(48556): f1_c1 f1_c2 ... f130_c314 f130_c315
## colData names(23): fov cellID ... CenterY_global_px polygons
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : CenterX_global_px CenterY_global_px
## imgData names(1): sample_id
17.2.2 Xenium
We now load Xenium data:
Code
# retrieve Xenium dataset from OSF repo
id <- "Xenium_HumanBreast1_Janesick"
pa <- OSTA.data_load(id)
dir.create(td <- tempfile())
unzip(pa, exdir=td)
xen <- readXeniumSXE(td)
# add cell boundary polygons
fnm <- metadata(xen)$cell_boundaries
pol <- readPolygonsXenium(fnm, type="parquet")
xen <- addPolygonsToSPE(xen, pol)
# add info for 'SpaceTrooper'
metadata(xen)$technology <- "10x_Xenium"
xen$sample_id <- "Xenium"
xen
## class: SpatialExperiment
## dim: 313 167780
## metadata(5): experiment.xenium transcripts cell_boundaries
## nucleus_boundaries technology
## assays(1): counts
## rownames(313): ABCC11 ACTA2 ... ZEB2 ZNF562
## rowData names(3): ID Symbol Type
## colnames(167780): 1 2 ... 167779 167780
## colData names(9): cell_id transcript_counts ... sample_id polygons
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(4): NegControlProbe NegControlCodeword antisense BLANK
## spatialCoords names(2) : x_centroid y_centroid
## imgData names(1): sample_id
17.3 Visualization
Imaging-based ST data relies on imaging a 2D tissue section. Hence, an advantage compared to single-cell RNA-seq is that we can readily visualize a 2D map of the cells and directly plot cell features onto the map.
The easiest way to visualize the data is by representing the cells by its centroid. This is useful to have a global bird’s-eye-view of the tissue.
Code
plotCentroids(cos) +
plotCentroids(xen)
It may be useful to visualize the centroids color-coded by cell features: for instance the CosMx output includes the fluorescence intensities of a set of antibodies that mark the nuclei and cell surfaces.
Here, we show the distribution of DAPI across the CosMx sample.
Code
plotCentroids(cos, colour_by="Mean.DAPI")
While plotting centroids is good for a overview of the macro tissue structure, when zooming into a specific area, the centroids do not provide the full information. For instance, looking at one FOV of the CosMx data it is hard to distinguish between areas with low cell densities and areas with large cells.
Code
plotCentroids(cos[,cos$fov == 99], size=1, colour_by="Mean.DAPI")
We can improve the visualization by leveraging the polygons stored in the object from the cell segmentation step.
Code
plotPolygons(cos[,cos$fov == 99], colour_by="Mean.DAPI")
The same function can be applied to Xenium data, but here we do not have independent FOVs. Hence, we define a zoom area based on the slide’s global coordinates.
Code
box <- (spatialCoords(xen)[,1] >= 4000 & spatialCoords(xen)[,1] <= 6000) &
(spatialCoords(xen)[,2] >= 0 & spatialCoords(xen)[,2] <= 2000)
plotPolygons(xen[,box], colour_by="transcript_counts")
Note that while it is possible to use plotPolygonsSPE
on the entire slide, this may be slow and require large amounts of RAM. Since the polygons at that zoom level will be very small and practically indistinguishable to points, we advise against generating a whole-slide polygons plot.
17.3.1 CosMx-specific visualizations
Unlike, Xenium and MERSCOPE, CosMx does not stitch fields of view (FOVs) in a single image, but rather treats each FOV independently. The colData
of the SpatialExperiment
object resulting from a CosMx import will include the FOV of each cell, but it is sometimes tricky to understand (or remember) the relative position of each FOV within the slide.
The plotCellsFovs
function provides an easy way to map the FOV within the slide.
Code
plotCellsFovs(cos)
In addition, the plotZoomFovsMap
provides a way to zoom on a subset of FOVs, while also visualizing the global structure of the tissue.
Code
plotZoomFovsMap(cos, fovs=99, colour_by="Mean.DAPI")
17.4 Simple QC Metrics
Similarly to single-cell RNA-seq, quality control (QC) is an important step to remove low-quality cells and highlights specific biases in the data at the sample, FOV, or slide area level.
Unlike single-cell RNA-seq and sequencing-based ST, most imaging-based ST platforms do not measure the whole transcriptome, nor an “unbiased” sample of it, but rely on the design of gene panels.
Depending on the platform and the application, such gene panels may be pan-tissue or targeting specific tissues or cell types. Moreover, they vary from a few hundreds (e.g., 313 in the first Xenium panel) to several thousands (e.g., the 5k Xenium panel or the 6k and 18k CosMx panel). Hence, unlike in sequencing-based platforms, we may not have enough ribosomal or mitochondrial transcripts to use for computing QC metrics. However, imaging-based platforms provide a set of negative probes that can be used to check for aspecific RNA captures.
Finally, imaging-based platforms include, in addition to transcripts data, morphology information (such as the area and eccentricity of the segmented cells) and intensity information from antibody stains (such as those targeting the nuclei or the cell membrane).
SpaceTrooper provides a function to compute and add to the SpatialData object a set of QC metrics. Specifically:
-
Area_um
: the area in microns; -
log2AspectRatio
: the log2 of the aspect ratio (x-axis/y-axis) of each cell; -
ctrl_total_ratio
: the proportion of control probe counts over the total cell counts; -
log2CountArea
: the log2 of the count density, i.e., the number of transcripts divided by the cell area.
Code
## These lines are needed temporarily because we are working with a STexampleData object
## rather than one imported from the 10X Genomics output
xen$AspectRatio <- computeAspectRatioFromPolygons(xen$polygons)
## Warning in computeAspectRatioFromPolygons(xen$polygons): Found 2
## multi-poligons: returning NA aspect ratio for them.
Code
xen$Area_um <- xen$cell_area
xen <- spatialPerCellQC(xen)
colData(xen)[,c("Area_um", "log2AspectRatio", "log2CountArea")]
## DataFrame with 167780 rows and 3 columns
## Area_um log2AspectRatio log2CountArea
## <numeric> <numeric> <numeric>
## 1 58.3870 -0.321928 -1.060221
## 2 197.0167 -0.540571 -1.067585
## 3 16.2563 -0.514575 -0.852998
## 4 42.3114 -0.156115 -1.943543
## 5 107.6525 0.535040 -1.165276
## ... ... ... ...
## 167776 220.4528 1.1212767 0.0548777
## 167777 37.3894 -0.2824356 1.0792243
## 167778 287.0583 0.0647054 0.4677953
## 167779 235.3544 0.1927043 -1.0083261
## 167780 270.0795 0.9896628 0.4850019
In addition, for CosMx data, it is important to calculate the distance from the horizontal, vertical, and closest FOV border: dist_border_x
, dist_border_y
, dist_border
, respectively.
Code
cos <- spatialPerCellQC(cos)
colData(cos)[, grep("^dist", names(colData(cos)))]
## DataFrame with 48556 rows and 3 columns
## dist_border_x dist_border_y dist_border
## <numeric> <numeric> <numeric>
## f1_c1 34 190 34
## f1_c2 74 318 74
## f1_c3 2100 392 392
## f1_c4 148 648 148
## f1_c5 94 750 94
## ... ... ... ...
## f130_c311 748 1588 748
## f130_c312 1407 1384 1384
## f130_c313 1466 883 883
## f130_c314 1398 871 871
## f130_c315 1438 818 818
Imaging-based ST data relies on imaging a 2D tissue section. Cells thus represent a slice through a 3D system, and we expect a tight relationship between a cell’s counts and its area. When we inspect total counts and cell areas as separate (univariate) distributions, both appear to follow approximate log-normal distributions, both in CosMx:
Code
df <- data.frame(colData(cos))
fd <- pivot_longer(df, c("Area_um", "total"))
mu <- summarise_at(group_by(fd, name), "value", median)
ggplot(fd, aes(value)) + facet_grid(~name) +
geom_histogram(bins=50, linewidth=0.1, fill="grey") +
geom_vline(data=mu, aes(xintercept=value), col="blue") +
geom_text(
hjust=-0.1, size=3, col="blue",
data=mu, aes(value, 0, label=round(value))) +
scale_x_continuous(NULL, trans="log10") + ylab("# cells") +
theme_minimal() + theme(panel.grid.minor=element_blank())
And in Xenium:
Code
dfx <- data.frame(colData(xen))
fd <- pivot_longer(dfx, c("cell_area", "total_counts"))
mu <- summarise_at(group_by(fd, name), "value", median)
ggplot(fd, aes(value)) + facet_grid(~name) +
geom_histogram(bins=50, linewidth=0.1, fill="grey") +
geom_vline(data=mu, aes(xintercept=value), col="blue") +
geom_text(
hjust=-0.1, size=3, col="blue",
data=mu, aes(value, 0, label=round(value))) +
scale_x_continuous(NULL, trans="log10") + ylab("# cells") +
theme_minimal() + theme(panel.grid.minor=element_blank())
Note that the cell area is much bigger in Xenium, mostly because of the different cell segmentation algorithms used.
We can also view the distribution of these metrics across the tissue slide by plotting cell centroids colored by total counts and cell area, respectively:
Code
p1 <- plotCentroids(cos, colour_by="total")
p2 <- plotCentroids(cos, colour_by="Area_um")
p3 <- plotCentroids(xen, colour_by="total_counts")
p4 <- plotCentroids(xen, colour_by="cell_area")
(p1/p2 | p3/p4) & coord_equal()
In a bivariate setting, we observe a positive relationship between counts and area (as expected). However, this dependency is not always linear and may exhibit a bimodal distribution.
Code
.p <- \(df, x) {
ggplot(df, aes(.data[[x]], total)) +
geom_point(shape=16, size=0.1, alpha=0.1) +
geom_smooth(method="lm", color="blue") +
scale_x_log10() + scale_y_log10() +
theme_minimal() + theme(
aspect.ratio=1,
panel.grid.minor=element_blank())
}
.p(df, "Area_um") + ggtitle("CosMx") +
.p(dfx, "cell_area") + ggtitle("Xenium")
In some cases, a peculiar count-area distribution might be indicative of cell segmentation issues. When groups of cells are merged together, for example, we might observe fewer counts than expected for a given area (when they are far apart, i.e., empty space is being segmented).
An useful metric to explore this issue is the log2CountArea
, which estimates the transcript density for each cell.
Code
plotCentroids(cos, colour_by="log2CountArea") +
plotCentroids(xen, colour_by="log2CountArea") +
plot_layout() & coord_equal()
Considering the spatial distribution of the transcript density, it is reasonable to assume that they represent distinct transcriptional states (not technical artefacts), with the exception of a small area at the bottom-center of the Xenium slide.
17.4.1 CosMx-specific considerations
Recall that the CosMx platform allows the user to place independent FOV across the tissue slide and does not stitch together adjacent FOVs.
Assuming FOV placement was done with the aim of capturing biologically interesting regions of the tissue, we would expect every FOV to capture a decent number of cells. Issues in image registration or IF staining, however, may affect spot calling and cell segmentation, and can yield FOVs with virtually no cells. We thus advice inspecting the number of cells across FOVs:
Code
ns <- as.data.frame(table(fov=cos$fov), responseName="n_cells")
ggplot(ns, aes(fov, n_cells)) + geom_col(fill="royalblue") +
scale_x_discrete(breaks=c(1, seq(5, max(cos$fov), 5))) +
scale_y_continuous(limits=c(0, 1e3), n.breaks=4) +
labs(x="field of view (FOV)", y="# cells") +
coord_cartesian(expand=FALSE) + theme_minimal()
It is generally also worth checking that basic QC metrics do not vary across FOVs. Of course, such effects will also be driven by differences in subpopulation composition across FOVs. However, gross differences in detection efficacy, IF staining, spot calling, segmentation etc. often still manifest in FOV-level shifts in IF stains and/or RNA target, negative probe, system control counts. A simple spot-check is the following type of visualization, but more thorough investigation is advisable ‘in the wild’, especially so in challenging types of tissue:
Code
df <- data.frame(colData(cos))
vs <- "Mean"
vs <- grep(vs, names(df), value=TRUE)
vs <- c(vs, "Area_um", "log2CountArea", "detected", "total", "ctrl_total_ratio")
fd <- pivot_longer(df, vs) |>
mutate(value=case_when(
grepl("Mean", name) ~ asinh(value),
grepl("total", name) ~ log10(value),
grepl("^Area", name) ~ log10(value),
TRUE ~ value))
ggplot(fd, aes(factor(fov), value)) +
facet_wrap(~name, nrow=2, scales="free_y") +
geom_boxplot(linewidth=0.2, outlier.size=0.2) +
scale_x_discrete(breaks=c(1, seq(5, max(cos$fov), 5))) +
labs(x="field of view (FOV)", y=NULL) + theme_minimal()
In our example, DAPI stains and RNA target counts (total
) are fairly constant across FOVs, some systematic differences between FOVs seem to reflect morphological differences across FOVs. Taken together, we cannot deem any FOVs as being obviously problematic here.
17.4.2 Border effects
Particularly in CosMx, lack of FOV stitching during cell segmentation can result in fractured and possibly duplicated cells. To investigate potential artefacts related to this, we can use the distance to FOV border calculated above.
We can roughly estimate cell radii as \(A=\pi r^2\) \(\leftrightarrow r=\sqrt{A/\pi}\), and use this as a threshold on FOV border distances. In other words, we flag a cell when its centroid is closer to any FOV border than the radius of an average circular cell:
Next, let us visualize total RNA counts against distance to FOV borders, setting the above threshold on the latter; we also include rolling means to better highlight global trends:
Code
df <- data.frame(colData(cos))
.p <- \(df, x, y) {
ggplot(df, aes(.data[[x]], .data[[y]])) +
geom_point(color="grey", size=0.1) +
geom_smooth() + theme_minimal() +
geom_vline(xintercept=r, linetype="longdash") +
geom_vline(xintercept=50, linetype="dashed")
}
p1 <- .p(df, "dist_border", "total") + xlim(0, 500) + ylim(0, 2000)
p2 <- .p(df, "dist_border", "Area_um") + xlim(0, 500) + ylim(0, 200)
p3 <- .p(df, "dist_border_x", "log2AspectRatio") + xlim(0, 500)
p4 <- .p(df, "dist_border_y", "log2AspectRatio") + xlim(0, 500)
(p1 + p2) / (p3 + p4)
We observe about 6.6% of cells falling below \(r\) (long dashed line) and 5.1% of cells falling below 50 pixels (short dashed line).
These cells exhibit lower total counts and smaller segmented areas, as well as a distorsion in their aspect ratio (thin cells close to the y border and broad cells close to the x boder).
To mitigate potential artefacts in downstream analyses, we may choose to filter out these cells, or otherwise flag them as potentially problematic events to be kept a cautionary eye on. In addition, SpaceTrooper uses the distance to border combined with the aspect ratio in its combined quality score calculations (see below).
Code
cos$ex <- df$dist_border < 50
plotCentroids(cos, colour_by="ex") +
scale_color_manual(values=c("grey", "blue"))
Code
## plotPolygons(cos[,cos$fov %in% c(7, 8, 11, 12)], colour_by="ex")
17.5 Identifying low-quality cells
We can now use the QC metrics calculated above to flag low-quality cells.
First, we can inspect individual metrics looking for outliers in their distribution. Depending on the metric, we are interested in low-end or high-end outliers (or both). For instance, we may want to exclude too small/large cells as potential artifacts.
SpaceTrooper provides a function to compute outliers of the distribution of any QC metric present in the colData
of the object. We can then visualize the distribution and outlier threshold with a histogram.
Code
cos <- computeSpatialOutlier(cos, compute_by="Area_um", method="mc")
cos <- computeSpatialOutlier(cos, compute_by="total", method="mc")
plotMetricHist(cos, metric="Area_um", use_fences="Area_um_outlier_mc") /
plotMetricHist(cos, metric="total", use_fences="total_outlier_mc")
For instance, in the CosMx data, we identify 341 outliers for cell area and 332 outliers for total counts.
Code
xen <- computeSpatialOutlier(xen, compute_by="cell_area", method="mc")
xen <- computeSpatialOutlier(xen, compute_by="total_counts", method="mc")
plotMetricHist(xen, metric="cell_area", use_fences="cell_area_outlier_mc") /
plotMetricHist(xen, metric="total_counts", use_fences="total_counts_outlier_mc")
In the Xenium data, we identify 5472 outliers for cell area and 402 outliers for total counts. However, note that the because of the naive cell segmentation algorithm of Xenium v1, we expect some abnormally large cells in low-cell-density areas, which do not necessarily correspond to low-quality cells.
Finally, we can compute an aggregated quality score, which combines a cell’s transcript density, its aspect ratio and its distance from the FoV border (only for CosMx).
Code
cos <- computeQCScore(cos)
cos <- computeFilterFlags(cos, fs_threshold=0.75)
plotCentroids(cos, colour_by="flag_score") +
plotCentroids(cos, colour_by="is_fscore_outlier") +
plot_layout() & coord_equal()
Code
xen <- computeQCScore(xen)
xen <- computeFilterFlags(xen, fs_threshold=0.1)
plotCentroids(xen, colour_by="flag_score") +
plotCentroids(xen, colour_by="is_fscore_outlier") +
plot_layout() & coord_equal()
We can see how the quality score flags several cells close to the tissue borders in the CosMx data and cells in the bottom-center damaged area in the Xenium data.
In a stringent analysis we may want to remove low-quality cells, but this has to be done with caution: before finalizing to exclude any cells, it is recommendable to inspect where these fall in the tissue. We would expect low-quality cells to be distributed randomly, and otherwise to accumulate at the tissue borders or in regions were tissue was, for example, smudged, detached, necrotic etc. In general, spatially organized clusters of excluded cells might indicate a bias towards exclude specific types of cells (e.g., some immune cells tend to be smaller or might otherwise contain fewer contains due to the panel design). Whenever possible, we advice readers to cross-check visualizations like the following with a corresponding H&E stain of the tissue, considering possible experimental faults as well as their prior knowledge on tissue pathology.