6 Quality control

6.1 Introduction

Quality control (QC) procedures at the spot level aim to remove low-quality spots before further analysis. Low-quality spots can occur due to problems during library preparation or other experimental procedures. Examples include large proportions of dead cells due to cell damage during library preparation, and low mRNA capture efficiency due to inefficient reverse transcription or PCR amplification.

These spots are usually removed prior to further analysis, since otherwise they tend to create problems during downstream analyses such as clustering. For example, problematic spots that are not removed could show up as separate clusters, which may be misidentified as distinct cell types.

Low-quality spots can be identified according to several characteristics, including:

library size (i.e. total unique molecular identifier (UMI) counts per spot)
number of expressed features (i.e. number of genes with non-zero UMI counts per spot)
proportion of reads mapping to mitochondrial genes (a high proportion indicates cell damage)
number of cells per spot (unusually high values can indicate problems)

Low library size or low number of expressed features can indicate poor mRNA capture rates, e.g. due to cell damage and missing mRNAs, or low reaction efficiency. A high proportion of mitochondrial reads indicates cell damage, e.g. partial cell lysis leading to leakage and missing cytoplasmic mRNAs, with the resulting reads therefore concentrated on the remaining mitochondrial mRNAs that are relatively protected inside the mitochondrial membrane. Unusually high numbers of cells per spot can indicate problems during cell segmentation.

The first three characteristics listed above are also used for QC in scRNA-seq data. However, the expected distributions for high-quality spots are different (compared to high-quality cells in scRNA-seq), since spots may contain zero, one, or multiple cells.

6.2 Load previously saved data

We start by loading the previously saved data object(s) (see Section 5.4).

library(SpatialExperiment)
spe <- readRDS("spe_load.rds")

6.3 Plot data

As an initial check, plot the spatial coordinates (spots) in x-y dimensions on the tissue slide, to check that the object has loaded correctly and that the orientation is as expected.

We use visualization functions from the ggspavis package to generate plots.

library(ggspavis)

# plot spatial coordinates (spots)
plotSpots(spe)

6.4 Calculate QC metrics

We calculate the QC metrics described above with a combination of methods from the scater (McCarthy et al. 2017) package (for metrics that are also used for scRNA-seq data, where we treat spots as equivalent to cells) and our own functions.

The QC metrics from scater can be calculated and added to the SpatialExperiment object as follows. Here, we also identify mitochondrial reads using their gene names, and pass these as an argument to scater.

First, we subset the object to keep only spots over tissue. The remaining spots are background spots, which we are not interested in.

library(scater)

# subset to keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)

##  [1] 33538  3639

# identify mitochondrial genes
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
table(is_mito)

##  is_mito
##  FALSE  TRUE 
##  33525    13

rowData(spe)$gene_name[is_mito]

##   [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##   [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

# calculate per-spot QC metrics and store in colData
spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
head(colData(spe))

##  DataFrame with 6 rows and 14 columns
##                             barcode_id     sample_id in_tissue array_row
##                            <character>   <character> <integer> <integer>
##  AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673         1        50
##  AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673         1         3
##  AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673         1        59
##  AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673         1        14
##  AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1 sample_151673         1        43
##  AAACAGGGTCTATATT-1 AAACAGGGTCTATATT-1 sample_151673         1        47
##                     array_col ground_truth   reference cell_count       sum
##                     <integer>  <character> <character>  <integer> <numeric>
##  AAACAAGTATCTCCCA-1       102       Layer3      Layer3          6      8458
##  AAACAATCTACTAGCA-1        43       Layer1      Layer1         16      1667
##  AAACACCAATAACTGC-1        19           WM          WM          5      3769
##  AAACAGAGCGACTCCT-1        94       Layer3      Layer3          2      5433
##  AAACAGCTTTCAGAAG-1         9       Layer5      Layer5          4      4278
##  AAACAGGGTCTATATT-1        13       Layer6      Layer6          6      4004
##                      detected subsets_mito_sum subsets_mito_detected
##                     <numeric>        <numeric>             <numeric>
##  AAACAAGTATCTCCCA-1      3586             1407                    13
##  AAACAATCTACTAGCA-1      1150              204                    11
##  AAACACCAATAACTGC-1      1960              430                    13
##  AAACAGAGCGACTCCT-1      2424             1316                    13
##  AAACAGCTTTCAGAAG-1      2264              651                    12
##  AAACAGGGTCTATATT-1      2178              621                    13
##                     subsets_mito_percent     total
##                                <numeric> <numeric>
##  AAACAAGTATCTCCCA-1              16.6351      8458
##  AAACAATCTACTAGCA-1              12.2376      1667
##  AAACACCAATAACTGC-1              11.4089      3769
##  AAACAGAGCGACTCCT-1              24.2223      5433
##  AAACAGCTTTCAGAAG-1              15.2174      4278
##  AAACAGGGTCTATATT-1              15.5095      4004

6.5 Selecting thresholds

The simplest option to apply the QC metrics is to select thresholds for each metric, and remove any spots that do not meet the thresholds for one or more metrics. Exploratory visualizations can be used to help select appropriate thresholds, which may differ depending on the dataset.

Here, we use visualizations to select thresholds for several QC metrics in our human DLPFC dataset: (i) library size, (ii) number of expressed genes, (iii) proportion of mitochondrial reads, and (iv) number of cells per spot.

6.5.1 Library size

Library size represents the total sum of UMI counts per spot. This is included in the column labeled sum in the scater output.

Plot a histogram of the library sizes across spots.

# histogram of library sizes
hist(colData(spe)$sum, breaks = 20)

The distribution is relatively smooth, and there are no obvious issue such as a spike at very low library sizes.

We also plot the library sizes against the number of cells per spot (which is available for this dataset). This is to check that we are not inadvertently removing a biologically meaningful group of spots. The horizontal line (argument threshold) shows our first guess at a possible filtering threshold for library size based on the histogram.

# plot library size vs. number of cells per spot
plotSpotQC(spe, plot_type = "scatter", 
           x_metric = "cell_count", y_metric = "sum", 
           y_threshold = 600)

##  `geom_smooth()` using formula = 'y ~ x'
##  `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
##  `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.

The plot shows that setting a filtering threshold for library size (e.g. at the value shown) does not appear to select for any obvious biologically consistent group of spots.

We set a relatively arbitrary threshold of 600 UMI counts per spot, and then check the number of spots below this threshold.

# select QC threshold for library size
qc_lib_size <- colData(spe)$sum < 600
table(qc_lib_size)

##  qc_lib_size
##  FALSE  TRUE 
##   3631     8

colData(spe)$qc_lib_size <- qc_lib_size

Finally, we also check that the discarded spots do not have any obvious spatial pattern that correlates with known biological features. Otherwise, removing these spots could indicate that we have set the threshold too high, and are removing biologically informative spots.

# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_lib_size")

As an aside, here we can also illustrate what happens if we set the threshold too high. For example, if we set the threshold to 2000 UMI counts per spot – which may also seem like a reasonable value based on the histogram and scatterplot – then we see a possible spatial pattern in the discarded spots, matching the cortical layers. This illustrates the importance of interactively checking exploratory visualizations when choosing these thresholds.

# check spatial pattern of discarded spots if threshold is too high
qc_lib_size_2000 <- colData(spe)$sum < 2000
colData(spe)$qc_lib_size_2000 <- qc_lib_size_2000
plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_lib_size_2000")

For reference, here are the reference (manually annotated) cortical layers in this dataset.

# plot reference (manually annotated) layers
plotSpots(spe, annotate = "ground_truth", 
          pal = "libd_layer_colors")

6.5.2 Number of expressed features

The number of expressed features refers to the number of genes with non-zero UMI counts per spot. This is stored in the column detected in the scater output.

We use a similar sequence of visualizations to choose a threshold for this QC metric.

# histogram of numbers of expressed genes
hist(colData(spe)$detected, breaks = 20)

# plot number of expressed genes vs. number of cells per spot
plotSpotQC(spe, plot_type = "scatter", 
           x_metric = "cell_count", y_metric = "detected", 
           y_threshold = 400)

##  `geom_smooth()` using formula = 'y ~ x'
##  `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
##  `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the plots, we select a threshold of 400 expressed genes per spot.

# select QC threshold for number of expressed genes
qc_detected <- colData(spe)$detected < 400
table(qc_detected)

##  qc_detected
##  FALSE  TRUE 
##   3632     7

colData(spe)$qc_detected <- qc_detected

# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_detected")

Again, we also check what happens when we set the threshold too high.

# check spatial pattern of discarded spots if threshold is too high
qc_detected_1000 <- colData(spe)$detected < 1000
colData(spe)$qc_detected_1000 <- qc_detected_1000
plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_detected_1000")

6.5.3 Proportion of mitochondrial reads

A high proportion of mitochondrial reads indicates cell damage.

We investigate the proportions of mitochondrial reads across spots, and select an appropriate threshold. The proportions of mitochondrial reads per spot are stored in the column subsets_mito_percent in the scater output.

# histogram of mitochondrial read proportions
hist(colData(spe)$subsets_mito_percent, breaks = 20)

# plot mitochondrial read proportion vs. number of cells per spot
plotSpotQC(spe, plot_type = "scatter", 
           x_metric = "cell_count", y_metric = "subsets_mito_percent", 
           y_threshold = 28)

##  `geom_smooth()` using formula = 'y ~ x'
##  `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
##  `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.

We select a threshold of 28% for the mitochondrial read proportion.

# select QC threshold for mitochondrial read proportion
qc_mito <- colData(spe)$subsets_mito_percent > 28
table(qc_mito)

##  qc_mito
##  FALSE  TRUE 
##   3622    17

colData(spe)$qc_mito <- qc_mito

# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_mito")

We also check what happens when we set the threshold too low.

# check spatial pattern of discarded spots if threshold is too high
qc_mito_25 <- colData(spe)$subsets_mito_percent > 25
colData(spe)$qc_mito_25 <- qc_mito_25
plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_mito_25")

6.5.4 Number of cells per spot

The number of cells per spot depends on the tissue type and organism.

Here, we check for any outlier values that could indicate problems during cell segmentation.

# histogram of cell counts
hist(colData(spe)$cell_count, breaks = 20)

# distribution of cells per spot
tbl_cells_per_spot <- table(colData(spe)$cell_count)

We see a tail of very high values, which could indicate problems for these spots. These values are also visible on the scatterplots. Here, we again plot the number of expressed genes vs. cell count, with an added trend.

# plot number of expressed genes vs. number of cells per spot
plotSpotQC(spe, plot_type = "scatter", 
           x_metric = "cell_count", y_metric = "detected", 
           x_threshold = 10)

##  `geom_smooth()` using formula = 'y ~ x'
##  `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
##  `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.

In particular, we see that the spots with very high cell counts also have low numbers of expressed genes. This indicates that the experiments have failed for these spots, and they should be removed.

We select a threshold of 10 cells per spot. The number of spots above this threshold is relatively small, and there is a clear downward trend in the number of expressed genes above this threshold.

# select QC threshold for number of cells per spot
qc_cell_count <- colData(spe)$cell_count > 10
table(qc_cell_count)

##  qc_cell_count
##  FALSE  TRUE 
##   3549    90

colData(spe)$qc_cell_count <- qc_cell_count

# check spatial pattern of discarded spots
plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_cell_count")

While there is a spatial pattern to the discarded spots, it does not appear to be correlated with the known biological features (cortical layers). The discarded spots are all on the edges of the tissue. It seems plausible that something has gone wrong with the cell segmentation on the edges of the images, so it makes sense to remove these spots.

6.5.5 Remove low-quality spots

Now that we have calculated several QC metrics and selected thresholds for each one, we can combine the sets of low-quality spots, and remove them from our object.

We also check again that the combined set of discarded spots does not correspond to any obvious biologically relevant group of spots.

# number of discarded spots for each metric
apply(cbind(qc_lib_size, qc_detected, qc_mito, qc_cell_count), 2, sum)

##    qc_lib_size   qc_detected       qc_mito qc_cell_count 
##              8             7            17            90

# combined set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito | qc_cell_count
table(discard)

##  discard
##  FALSE  TRUE 
##   3524   115

# store in object
colData(spe)$discard <- discard

# check spatial pattern of combined set of discarded spots
plotSpotQC(spe, plot_type = "spot", 
           annotate = "discard")

# remove combined set of low-quality spots
spe <- spe[, !colData(spe)$discard]
dim(spe)

##  [1] 33538  3524

6.6 Zero-cell and single-cell spots

A particular characteristic of Visium data is that spots can contain zero, one, or multiple cells.

We could also imagine other filtering procedures such as (i) removing spots with zero cells, or (ii) restricting the analysis to spots containing a single cell (which would make the data more similar to scRNA-seq).

However, this would discard a large amount of information. Below, we show the distribution of cells per spot again (up to the filtering threshold of 10 cells per spot from above).

# distribution of cells per spot
tbl_cells_per_spot[1:13]

##  
##    0   1   2   3   4   5   6   7   8   9  10  11  12 
##   84 211 483 623 617 541 421 287 140  92  50  25  18

# as proportions
prop_cells_per_spot <- round(tbl_cells_per_spot / sum(tbl_cells_per_spot), 2)
prop_cells_per_spot[1:13]

##  
##     0    1    2    3    4    5    6    7    8    9   10   11   12 
##  0.02 0.06 0.13 0.17 0.17 0.15 0.12 0.08 0.04 0.03 0.01 0.01 0.00

Only 6% of spots contain a single cell. If we restricted the analysis to these spots only, we would be discarding most of the data.

Removing the spots containing zero cells (2% of spots) would also be problematic, since these spots can also contain biologically meaningful information. For example, in this brain dataset, the regions between cell bodies consists of neuropil (dense networks of axons and dendrites). In our paper (Maynard et al. 2021), we explore the information contained in these neuropil spots.

6.7 Quality control at gene level

The sections above consider quality control at the spot level. In some datasets, it may also be appropriate to apply quality control procedures or filtering at the gene level. For example, certain genes may be biologically irrelevant for downstream analyses.

However, here we make a distinction between quality control and feature selection. Removing biologically uninteresting genes (such as mitochondrial genes) may also be considered as part of feature selection, since there is no underlying experimental procedure that has failed. Therefore, we will discuss gene-level filtering in the Chapter 8 chapter.

6.8 Save objects for later chapters

We also save the object(s) in .rds format for re-use within later chapters to speed up the build time of the book.

# save object(s)
saveRDS(spe, file = "spe_qc.rds")

References

Maynard, Kristen R., Leonardo Collado-Torres, Lukas M. Weber, Cedric Uytingco, Brianna K. Barry, Stephen R. Williams, Joseph L. Catallini II, et al. 2021. “Transcriptome-Scale Spatial Gene Expression in the Human Dorsolateral Prefrontal Cortex.” Nature Neuroscience 24: 425–36. https://doi.org/10.1038/s41593-020-00787-0.

McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Wills. 2017. “Scater: Pre-Processing, Quality Control, Normalization and Visualization of Single-Cell RNA-seq Data in R.” Bioinformatics 33 (8): 1179–86. https://doi.org/10.1093/bioinformatics/btw777.