3 Quality control

3.1 Background

Quality control (QC) procedures at the spot level aim to remove low-quality spots before further analysis. Low-quality spots can occur due to problems during library preparation or other experimental procedures. Examples include large proportions of dead cells due to cell damage during library preparation, and low mRNA capture efficiency due to inefficient reverse transcription or PCR amplification.

These spots are usually removed prior to further analysis, since otherwise they tend to create problems during downstream analyses such as clustering. For example, problematic spots that are not removed could show up as separate clusters, which may be misidentified as distinct cell types.

Low-quality spots can be identified according to several characteristics, including:

library size (i.e. total UMI counts per spot)
number of expressed features (i.e. number of genes with non-zero UMI counts per spot)
proportion of reads mapping to mitochondrial genes (a high proportion indicates cell damage)
number of cells per spot (unusually high values can indicate problems)

Low library size or low number of expressed features can indicate poor mRNA capture rates, e.g. due to cell damage and missing mRNAs, or low reaction efficiency. A high proportion of mitochondrial reads indicates cell damage, e.g. partial cell lysis leading to leakage and missing cytoplasmic mRNAs, with the resulting reads therefore concentrated on the remaining mitochondrial mRNAs that are relatively protected inside the mitochondrial membrane. Unusually high numbers of cells per spot can indicate problems during cell segmentation.

The first three characteristics listed above are also used for QC in scRNA-seq data. However, the expected distributions for high-quality spots are different (compared to high-quality cells in scRNA-seq), since spots may contain zero, one, or multiple cells.

3.2 Load data

library(SpatialExperiment)
library(STexampleData)

# load object
spe <- Visium_humanDLPFC()

3.3 Plot data

As an initial check, plot the spatial coordinates (spots) in x-y dimensions on the tissue slide, to check that the object has loaded correctly and that the orientation is as expected.

We use visualization functions from the ggspavis package to generate plots.

library(ggspavis)

# plot spatial coordinates (spots)
plotSpots(spe)

3.4 Calculate QC metrics

We calculate the QC metrics described above with a combination of methods from the scater (McCarthy et al. 2017) package (for metrics that are also used for scRNA-seq data, where we treat spots as equivalent to cells) and our own functions.

The QC metrics from scater can be calculated and added to the SpatialExperiment object as follows. Here, we also identify mitochondrial reads using their gene names, and pass these as an argument to scater.

First, we subset the object to keep only spots over tissue. The remaining spots are background spots, which we are not interested in.

library(scater)

# subset to keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)

[1] 33538  3639

# identify mitochondrial genes
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
table(is_mito)

is_mito
FALSE  TRUE 
33525    13

rowData(spe)$gene_name[is_mito]

 [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
 [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

# calculate per-spot QC metrics and store in colData
spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
head(colData(spe))

DataFrame with 6 rows and 13 columns
                           barcode_id     sample_id in_tissue array_row
                          <character>   <character> <integer> <integer>
AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673         1        50
AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673         1         3
AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673         1        59
AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673         1        14
AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1 sample_151673         1        43
AAACAGGGTCTATATT-1 AAACAGGGTCTATATT-1 sample_151673         1        47
                   array_col ground_truth cell_count       sum  detected
                   <integer>  <character>  <integer> <numeric> <numeric>
AAACAAGTATCTCCCA-1       102       Layer3          6      8458      3586
AAACAATCTACTAGCA-1        43       Layer1         16      1667      1150
AAACACCAATAACTGC-1        19           WM          5      3769      1960
AAACAGAGCGACTCCT-1        94       Layer3          2      5433      2424
AAACAGCTTTCAGAAG-1         9       Layer5          4      4278      2264
AAACAGGGTCTATATT-1        13       Layer6          6      4004      2178
                   subsets_mito_sum subsets_mito_detected subsets_mito_percent
                          <numeric>             <numeric>            <numeric>
AAACAAGTATCTCCCA-1             1407                    13              16.6351
AAACAATCTACTAGCA-1              204                    11              12.2376
AAACACCAATAACTGC-1              430                    13              11.4089
AAACAGAGCGACTCCT-1             1316                    13              24.2223
AAACAGCTTTCAGAAG-1              651                    12              15.2174
AAACAGGGTCTATATT-1              621                    13              15.5095
                       total
                   <numeric>
AAACAAGTATCTCCCA-1      8458
AAACAATCTACTAGCA-1      1667
AAACACCAATAACTGC-1      3769
AAACAGAGCGACTCCT-1      5433
AAACAGCTTTCAGAAG-1      4278
AAACAGGGTCTATATT-1      4004

3.5 Selecting thresholds

The simplest option to apply the QC metrics is to select thresholds for each metric, and remove any spots that do not meet the thresholds for one or more metrics. Exploratory visualizations can be used to help select appropriate thresholds, which may differ depending on the dataset.

Here, we use visualizations to select thresholds for several QC metrics in our human DLPFC dataset: (i) library size, (ii) number of expressed genes, (iii) proportion of mitochondrial reads, and (iv) number of cells per spot.

3.5.1 Library size

Library size represents the total sum of UMI counts per spot. This is included in the column labeled sum in the scater output.

Plot a histogram of the library sizes across spots.

# histogram of library sizes
hist(colData(spe)$sum, breaks = 20)

The distribution is relatively smooth, and there are no obvious issue such as a spike at very low library sizes.

We also plot the library sizes against the number of cells per spot (which is available for this dataset). This is to check that we are not inadvertently removing a biologically meaningful group of spots. The horizontal line (argument threshold) shows our first guess at a possible filtering threshold for library size based on the histogram.

# plot library size vs. number of cells per spot
plotQC(spe, type = "scatter", 
       metric_x = "cell_count", metric_y = "sum", 
       threshold_y = 600)

`geom_smooth()` using formula = 'y ~ x'
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plot shows that setting a filtering threshold for library size (e.g. at the value shown) does not appear to select for any obvious biologically consistent group of spots.

We set a relatively arbitrary threshold of 600 UMI counts per spot, and then check the number of spots below this threshold.

# select QC threshold for library size
qc_lib_size <- colData(spe)$sum < 600
table(qc_lib_size)

qc_lib_size
FALSE  TRUE 
 3631     8

colData(spe)$qc_lib_size <- qc_lib_size

Finally, we also check that the discarded spots do not have any obvious spatial pattern that correlates with known biological features. Otherwise, removing these spots could indicate that we have set the threshold too high, and are removing biologically informative spots.

# check spatial pattern of discarded spots
plotQC(spe, type = "spots", 
       discard = "qc_lib_size")

As an aside, here we can also illustrate what happens if we set the threshold too high. For example, if we set the threshold to 2000 UMI counts per spot – which may also seem like a reasonable value based on the histogram and scatterplot – then we see a possible spatial pattern in the discarded spots, matching the cortical layers. This illustrates the importance of interactively checking exploratory visualizations when choosing these thresholds.

# check spatial pattern of discarded spots if threshold is too high
qc_lib_size_2000 <- colData(spe)$sum < 2000
colData(spe)$qc_lib_size_2000 <- qc_lib_size_2000
plotQC(spe, type = "spots", 
       discard = "qc_lib_size_2000")

For reference, here are the ground truth (manually annotated) cortical layers in this dataset.

# plot ground truth (manually annotated) layers
plotSpots(spe, annotate = "ground_truth", 
          palette = "libd_layer_colors")

3.5.2 Number of expressed features

The number of expressed features refers to the number of genes with non-zero UMI counts per spot. This is stored in the column detected in the scater output.

We use a similar sequence of visualizations to choose a threshold for this QC metric.

# histogram of numbers of expressed genes
hist(colData(spe)$detected, breaks = 20)

# plot number of expressed genes vs. number of cells per spot
plotQC(spe, type = "scatter", 
       metric_x = "cell_count", metric_y = "detected", 
       threshold_y = 400)

`geom_smooth()` using formula = 'y ~ x'
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the plots, we select a threshold of 400 expressed genes per spot.

# select QC threshold for number of expressed genes
qc_detected <- colData(spe)$detected < 400
table(qc_detected)

qc_detected
FALSE  TRUE 
 3632     7

colData(spe)$qc_detected <- qc_detected

# check spatial pattern of discarded spots
plotQC(spe, type = "spots", 
       discard = "qc_detected")

Again, we also check what happens when we set the threshold too high.

# check spatial pattern of discarded spots if threshold is too high
qc_detected_1000 <- colData(spe)$detected < 1000
colData(spe)$qc_detected_1000 <- qc_detected_1000
plotQC(spe, type = "spots", 
       discard = "qc_detected_1000")

3.5.3 Proportion of mitochondrial reads

A high proportion of mitochondrial reads indicates cell damage.

We investigate the proportions of mitochondrial reads across spots, and select an appropriate threshold. The proportions of mitochondrial reads per spot are stored in the column subsets_mito_percent in the scater output.

# histogram of mitochondrial read proportions
hist(colData(spe)$subsets_mito_percent, breaks = 20)

# plot mitochondrial read proportion vs. number of cells per spot
plotQC(spe, type = "scatter", 
       metric_x = "cell_count", metric_y = "subsets_mito_percent", 
       threshold_y = 28)

`geom_smooth()` using formula = 'y ~ x'
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We select a threshold of 28% for the mitochondrial read proportion.

# select QC threshold for mitochondrial read proportion
qc_mito <- colData(spe)$subsets_mito_percent > 28
table(qc_mito)

qc_mito
FALSE  TRUE 
 3622    17

colData(spe)$qc_mito <- qc_mito

# check spatial pattern of discarded spots
plotQC(spe, type = "spots", 
       discard = "qc_mito")

We also check what happens when we set the threshold too low.

# check spatial pattern of discarded spots if threshold is too high
qc_mito_25 <- colData(spe)$subsets_mito_percent > 25
colData(spe)$qc_mito_25 <- qc_mito_25
plotQC(spe, type = "spots", 
       discard = "qc_mito_25")

3.5.4 Number of cells per spot

The number of cells per spot depends on the tissue type and organism.

Here, we check for any outlier values that could indicate problems during cell segmentation.

# histogram of cell counts
hist(colData(spe)$cell_count, breaks = 20)

# distribution of cells per spot
tbl_cells_per_spot <- table(colData(spe)$cell_count)

We see a tail of very high values, which could indicate problems for these spots. These values are also visible on the scatterplots. Here, we again plot the number of expressed genes vs. cell count, with an added trend.

# plot number of expressed genes vs. number of cells per spot
plotQC(spe, type = "scatter", 
       metric_x = "cell_count", metric_y = "detected", 
       threshold_x = 10)

`geom_smooth()` using formula = 'y ~ x'
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In particular, we see that the spots with very high cell counts also have low numbers of expressed genes. This indicates that the experiments have failed for these spots, and they should be removed.

We select a threshold of 10 cells per spot. The number of spots above this threshold is relatively small, and there is a clear downward trend in the number of expressed genes above this threshold.

# select QC threshold for number of cells per spot
qc_cell_count <- colData(spe)$cell_count > 10
table(qc_cell_count)

qc_cell_count
FALSE  TRUE 
 3549    90

colData(spe)$qc_cell_count <- qc_cell_count

# check spatial pattern of discarded spots
plotQC(spe, type = "spots", 
       discard = "qc_cell_count")

While there is a spatial pattern to the discarded spots, it does not appear to be correlated with the known biological features (cortical layers). The discarded spots are all on the edges of the tissue. It seems plausible that something has gone wrong with the cell segmentation on the edges of the images, so it makes sense to remove these spots.

3.5.5 Remove low-quality spots

Now that we have calculated several QC metrics and selected thresholds for each one, we can combine the sets of low-quality spots, and remove them from our object.

We also check again that the combined set of discarded spots does not correspond to any obvious biologically relevant group of spots.

# number of discarded spots for each metric
apply(cbind(qc_lib_size, qc_detected, qc_mito, qc_cell_count), 2, sum)

  qc_lib_size   qc_detected       qc_mito qc_cell_count 
            8             7            17            90

# combined set of discarded spots
discard <- qc_lib_size | qc_detected | qc_mito | qc_cell_count
table(discard)

discard
FALSE  TRUE 
 3524   115

# store in object
colData(spe)$discard <- discard

# check spatial pattern of combined set of discarded spots
plotQC(spe, type = "spots", 
       discard = "discard")

# remove combined set of low-quality spots
spe <- spe[, !colData(spe)$discard]
dim(spe)

[1] 33538  3524

3.6 Zero-cell and single-cell spots

A particular characteristic of Visium data is that spots can contain zero, one, or multiple cells.

We could also imagine other filtering procedures such as (i) removing spots with zero cells, or (ii) restricting the analysis to spots containing a single cell (which would make the data more similar to scRNA-seq).

However, this would discard a large amount of information. Below, we show the distribution of cells per spot again (up to the filtering threshold of 10 cells per spot from above).

# distribution of cells per spot
tbl_cells_per_spot[1:13]


  0   1   2   3   4   5   6   7   8   9  10  11  12 
 84 211 483 623 617 541 421 287 140  92  50  25  18

# as proportions
prop_cells_per_spot <- round(tbl_cells_per_spot / sum(tbl_cells_per_spot), 2)
prop_cells_per_spot[1:13]


   0    1    2    3    4    5    6    7    8    9   10   11   12 
0.02 0.06 0.13 0.17 0.17 0.15 0.12 0.08 0.04 0.03 0.01 0.01 0.00

Only 6% of spots contain a single cell. If we restricted the analysis to these spots only, we would be discarding most of the data.

Removing the spots containing zero cells (2% of spots) would also be problematic, since these spots can also contain biologically meaningful information. For example, in this brain dataset, the regions between cell bodies consists of neuropil (dense networks of axons and dendrites). In our paper (Maynard et al. 2021), we explore the information contained in these neuropil spots.

3.7 Quality control at gene level

The sections above consider quality control at the spot level. In some datasets, it may also be appropriate to apply quality control procedures or filtering at the gene level. For example, certain genes may be biologically irrelevant for downstream analyses.

However, here we make a distinction between quality control and feature selection. Removing biologically uninteresting genes (such as mitochondrial genes) may also be considered as part of feature selection, since there is no underlying experimental procedure that has failed. Therefore, we will discuss gene-level filtering in the [Feature selection] chapter.