library(SpatialExperiment)
spe <- readRDS("spe_load.rds")
6 Quality control
6.1 Introduction
Quality control (QC) procedures at the spot level aim to remove low-quality spots before further analysis. Low-quality spots can occur due to problems during library preparation or other experimental procedures. Examples include large proportions of dead cells due to cell damage during library preparation, and low mRNA capture efficiency due to inefficient reverse transcription or PCR amplification.
These spots are usually removed prior to further analysis, since otherwise they tend to create problems during downstream analyses such as clustering. For example, problematic spots that are not removed could show up as separate clusters, which may be misidentified as distinct cell types.
Low-quality spots can be identified according to several characteristics, including:
library size (i.e. total unique molecular identifier (UMI) counts per spot)
number of expressed features (i.e. number of genes with non-zero UMI counts per spot)
proportion of reads mapping to mitochondrial genes (a high proportion indicates cell damage)
number of cells per spot (unusually high values can indicate problems)
Low library size or low number of expressed features can indicate poor mRNA capture rates, e.g. due to cell damage and missing mRNAs, or low reaction efficiency. A high proportion of mitochondrial reads indicates cell damage, e.g. partial cell lysis leading to leakage and missing cytoplasmic mRNAs, with the resulting reads therefore concentrated on the remaining mitochondrial mRNAs that are relatively protected inside the mitochondrial membrane. Unusually high numbers of cells per spot can indicate problems during cell segmentation.
The first three characteristics listed above are also used for QC in scRNA-seq data. However, the expected distributions for high-quality spots are different (compared to high-quality cells in scRNA-seq), since spots may contain zero, one, or multiple cells.
6.2 Load previously saved data
We start by loading the previously saved data object(s) (see Section 5.4).
6.3 Plot data
As an initial check, plot the spatial coordinates (spots) in x-y dimensions on the tissue slide, to check that the object has loaded correctly and that the orientation is as expected.
We use visualization functions from the ggspavis package to generate plots.
# plot spatial coordinates (spots)
plotSpots(spe)
6.4 Calculate QC metrics
We calculate the QC metrics described above with a combination of methods from the scater
(McCarthy et al. 2017) package (for metrics that are also used for scRNA-seq data, where we treat spots as equivalent to cells) and our own functions.
The QC metrics from scater
can be calculated and added to the SpatialExperiment
object as follows. Here, we also identify mitochondrial reads using their gene names, and pass these as an argument to scater
.
First, we subset the object to keep only spots over tissue. The remaining spots are background spots, which we are not interested in.
# subset to keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)
## [1] 33538 3639
# identify mitochondrial genes
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
table(is_mito)
## is_mito
## FALSE TRUE
## 33525 13
rowData(spe)$gene_name[is_mito]
## [1] "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" "MT-ATP8" "MT-ATP6" "MT-CO3"
## [8] "MT-ND3" "MT-ND4L" "MT-ND4" "MT-ND5" "MT-ND6" "MT-CYB"
# calculate per-spot QC metrics and store in colData
spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
head(colData(spe))
## DataFrame with 6 rows and 14 columns
## barcode_id sample_id in_tissue array_row
## <character> <character> <integer> <integer>
## AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673 1 50
## AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673 1 3
## AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673 1 59
## AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673 1 14
## AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1 sample_151673 1 43
## AAACAGGGTCTATATT-1 AAACAGGGTCTATATT-1 sample_151673 1 47
## array_col ground_truth reference cell_count sum
## <integer> <character> <character> <integer> <numeric>
## AAACAAGTATCTCCCA-1 102 Layer3 Layer3 6 8458
## AAACAATCTACTAGCA-1 43 Layer1 Layer1 16 1667
## AAACACCAATAACTGC-1 19 WM WM 5 3769
## AAACAGAGCGACTCCT-1 94 Layer3 Layer3 2 5433
## AAACAGCTTTCAGAAG-1 9 Layer5 Layer5 4 4278
## AAACAGGGTCTATATT-1 13 Layer6 Layer6 6 4004
## detected subsets_mito_sum subsets_mito_detected
## <numeric> <numeric> <numeric>
## AAACAAGTATCTCCCA-1 3586 1407 13
## AAACAATCTACTAGCA-1 1150 204 11
## AAACACCAATAACTGC-1 1960 430 13
## AAACAGAGCGACTCCT-1 2424 1316 13
## AAACAGCTTTCAGAAG-1 2264 651 12
## AAACAGGGTCTATATT-1 2178 621 13
## subsets_mito_percent total
## <numeric> <numeric>
## AAACAAGTATCTCCCA-1 16.6351 8458
## AAACAATCTACTAGCA-1 12.2376 1667
## AAACACCAATAACTGC-1 11.4089 3769
## AAACAGAGCGACTCCT-1 24.2223 5433
## AAACAGCTTTCAGAAG-1 15.2174 4278
## AAACAGGGTCTATATT-1 15.5095 4004
6.5 Selecting thresholds
The simplest option to apply the QC metrics is to select thresholds for each metric, and remove any spots that do not meet the thresholds for one or more metrics. Exploratory visualizations can be used to help select appropriate thresholds, which may differ depending on the dataset.
Here, we use visualizations to select thresholds for several QC metrics in our human DLPFC dataset: (i) library size, (ii) number of expressed genes, (iii) proportion of mitochondrial reads, and (iv) number of cells per spot.
6.5.1 Library size
Library size represents the total sum of UMI counts per spot. This is included in the column labeled sum
in the scater
output.
Plot a histogram of the library sizes across spots.
# histogram of library sizes
hist(colData(spe)$sum, breaks = 20)
The distribution is relatively smooth, and there are no obvious issue such as a spike at very low library sizes.
We also plot the library sizes against the number of cells per spot (which is available for this dataset). This is to check that we are not inadvertently removing a biologically meaningful group of spots. The horizontal line (argument threshold
) shows our first guess at a possible filtering threshold for library size based on the histogram.
# plot library size vs. number of cells per spot
plotSpotQC(spe, plot_type = "scatter",
x_metric = "cell_count", y_metric = "sum",
y_threshold = 600)
## `geom_smooth()` using formula = 'y ~ x'
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.
The plot shows that setting a filtering threshold for library size (e.g. at the value shown) does not appear to select for any obvious biologically consistent group of spots.
We set a relatively arbitrary threshold of 600 UMI counts per spot, and then check the number of spots below this threshold.
# select QC threshold for library size
qc_lib_size <- colData(spe)$sum < 600
table(qc_lib_size)
## qc_lib_size
## FALSE TRUE
## 3631 8
colData(spe)$qc_lib_size <- qc_lib_size
Finally, we also check that the discarded spots do not have any obvious spatial pattern that correlates with known biological features. Otherwise, removing these spots could indicate that we have set the threshold too high, and are removing biologically informative spots.
# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_lib_size")
As an aside, here we can also illustrate what happens if we set the threshold too high. For example, if we set the threshold to 2000 UMI counts per spot – which may also seem like a reasonable value based on the histogram and scatterplot – then we see a possible spatial pattern in the discarded spots, matching the cortical layers. This illustrates the importance of interactively checking exploratory visualizations when choosing these thresholds.
# check spatial pattern of discarded spots if threshold is too high
qc_lib_size_2000 <- colData(spe)$sum < 2000
colData(spe)$qc_lib_size_2000 <- qc_lib_size_2000
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_lib_size_2000")
For reference, here are the reference (manually annotated) cortical layers in this dataset.
# plot reference (manually annotated) layers
plotSpots(spe, annotate = "ground_truth",
pal = "libd_layer_colors")
6.5.2 Number of expressed features
The number of expressed features refers to the number of genes with non-zero UMI counts per spot. This is stored in the column detected
in the scater
output.
We use a similar sequence of visualizations to choose a threshold for this QC metric.
# histogram of numbers of expressed genes
hist(colData(spe)$detected, breaks = 20)
# plot number of expressed genes vs. number of cells per spot
plotSpotQC(spe, plot_type = "scatter",
x_metric = "cell_count", y_metric = "detected",
y_threshold = 400)
## `geom_smooth()` using formula = 'y ~ x'
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.
Based on the plots, we select a threshold of 400 expressed genes per spot.
# select QC threshold for number of expressed genes
qc_detected <- colData(spe)$detected < 400
table(qc_detected)
## qc_detected
## FALSE TRUE
## 3632 7
colData(spe)$qc_detected <- qc_detected
# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_detected")
Again, we also check what happens when we set the threshold too high.
# check spatial pattern of discarded spots if threshold is too high
qc_detected_1000 <- colData(spe)$detected < 1000
colData(spe)$qc_detected_1000 <- qc_detected_1000
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_detected_1000")
6.5.3 Proportion of mitochondrial reads
A high proportion of mitochondrial reads indicates cell damage.
We investigate the proportions of mitochondrial reads across spots, and select an appropriate threshold. The proportions of mitochondrial reads per spot are stored in the column subsets_mito_percent
in the scater
output.
# histogram of mitochondrial read proportions
hist(colData(spe)$subsets_mito_percent, breaks = 20)
# plot mitochondrial read proportion vs. number of cells per spot
plotSpotQC(spe, plot_type = "scatter",
x_metric = "cell_count", y_metric = "subsets_mito_percent",
y_threshold = 28)
## `geom_smooth()` using formula = 'y ~ x'
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.
We select a threshold of 28% for the mitochondrial read proportion.
# select QC threshold for mitochondrial read proportion
qc_mito <- colData(spe)$subsets_mito_percent > 28
table(qc_mito)
## qc_mito
## FALSE TRUE
## 3622 17
colData(spe)$qc_mito <- qc_mito
# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_mito")
We also check what happens when we set the threshold too low.
# check spatial pattern of discarded spots if threshold is too high
qc_mito_25 <- colData(spe)$subsets_mito_percent > 25
colData(spe)$qc_mito_25 <- qc_mito_25
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_mito_25")
6.5.4 Number of cells per spot
The number of cells per spot depends on the tissue type and organism.
Here, we check for any outlier values that could indicate problems during cell segmentation.
# histogram of cell counts
hist(colData(spe)$cell_count, breaks = 20)
# distribution of cells per spot
tbl_cells_per_spot <- table(colData(spe)$cell_count)
We see a tail of very high values, which could indicate problems for these spots. These values are also visible on the scatterplots. Here, we again plot the number of expressed genes vs. cell count, with an added trend.
# plot number of expressed genes vs. number of cells per spot
plotSpotQC(spe, plot_type = "scatter",
x_metric = "cell_count", y_metric = "detected",
x_threshold = 10)
## `geom_smooth()` using formula = 'y ~ x'
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.
In particular, we see that the spots with very high cell counts also have low numbers of expressed genes. This indicates that the experiments have failed for these spots, and they should be removed.
We select a threshold of 10 cells per spot. The number of spots above this threshold is relatively small, and there is a clear downward trend in the number of expressed genes above this threshold.
# select QC threshold for number of cells per spot
qc_cell_count <- colData(spe)$cell_count > 10
table(qc_cell_count)
## qc_cell_count
## FALSE TRUE
## 3549 90
colData(spe)$qc_cell_count <- qc_cell_count
# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_cell_count")
While there is a spatial pattern to the discarded spots, it does not appear to be correlated with the known biological features (cortical layers). The discarded spots are all on the edges of the tissue. It seems plausible that something has gone wrong with the cell segmentation on the edges of the images, so it makes sense to remove these spots.
6.5.5 Remove low-quality spots
Now that we have calculated several QC metrics and selected thresholds for each one, we can combine the sets of low-quality spots, and remove them from our object.
We also check again that the combined set of discarded spots does not correspond to any obvious biologically relevant group of spots.
# number of discarded spots for each metric
apply(cbind(qc_lib_size, qc_detected, qc_mito, qc_cell_count), 2, sum)
## qc_lib_size qc_detected qc_mito qc_cell_count
## 8 7 17 90
# combined set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito | qc_cell_count
table(discard)
## discard
## FALSE TRUE
## 3524 115
# store in object
colData(spe)$discard <- discard
# check spatial pattern of combined set of discarded spots
plotSpotQC(spe, plot_type = "spot",
annotate = "discard")
# remove combined set of low-quality spots
spe <- spe[, !colData(spe)$discard]
dim(spe)
## [1] 33538 3524
6.6 Zero-cell and single-cell spots
A particular characteristic of Visium data is that spots can contain zero, one, or multiple cells.
We could also imagine other filtering procedures such as (i) removing spots with zero cells, or (ii) restricting the analysis to spots containing a single cell (which would make the data more similar to scRNA-seq).
However, this would discard a large amount of information. Below, we show the distribution of cells per spot again (up to the filtering threshold of 10 cells per spot from above).
# distribution of cells per spot
tbl_cells_per_spot[1:13]
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 84 211 483 623 617 541 421 287 140 92 50 25 18
# as proportions
prop_cells_per_spot <- round(tbl_cells_per_spot / sum(tbl_cells_per_spot), 2)
prop_cells_per_spot[1:13]
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 0.02 0.06 0.13 0.17 0.17 0.15 0.12 0.08 0.04 0.03 0.01 0.01 0.00
Only 6% of spots contain a single cell. If we restricted the analysis to these spots only, we would be discarding most of the data.
Removing the spots containing zero cells (2% of spots) would also be problematic, since these spots can also contain biologically meaningful information. For example, in this brain dataset, the regions between cell bodies consists of neuropil (dense networks of axons and dendrites). In our paper (Maynard et al. 2021), we explore the information contained in these neuropil spots.
6.7 Quality control at gene level
The sections above consider quality control at the spot level. In some datasets, it may also be appropriate to apply quality control procedures or filtering at the gene level. For example, certain genes may be biologically irrelevant for downstream analyses.
However, here we make a distinction between quality control and feature selection. Removing biologically uninteresting genes (such as mitochondrial genes) may also be considered as part of feature selection, since there is no underlying experimental procedure that has failed. Therefore, we will discuss gene-level filtering in the Chapter 8 chapter.
6.8 Save objects for later chapters
We also save the object(s) in .rds
format for re-use within later chapters to speed up the build time of the book.
# save object(s)
saveRDS(spe, file = "spe_qc.rds")