29  Spatial statistics

29.1 Preamble

29.1.1 Introduction

Spatial statistics builds around the first law of geography of Tobler (1970) that states “everything is related to everything else, but near things are more related than distant things”. This dependence of spatial observations has been studied in the field of (geo)spatial statistics. In general, spatial dependence is estimated by comparing the values at one location with the values at another location that is a given distance (called spatial lag) away (Dale and Fortin 2014; Baddeley, Rubak, and Turner 2015).

The two technological streams, imaging- and sequencing-based assays, are very different in terms of data modalities. In imaging-based assays (e.g. CosMx and Xenium) the observations of interest (e.g. mRNAs) are recorded where they occur natively. This means the locations of the observations are governed by a stochastic process and can be approximated as a point process. Sequencing-based assays (e.g. Visium) however, are recorded along defined grids which is called a lattice. This lattice is not created by a stochastic process and therefore can not be approximated as a point process. Here, lattice data analysis methods have to be used. There are methods however, where cells can be segmented from very fine bins (e.g. Visium HD and IMC). These technologies can be approximated as being generated by a point process after segmentation (Emons et al. 2024; Baddeley, Rubak, and Turner 2015; Pebesma and Bivand 2023) In the following vignette, the two main exploratory spatial statistics streams, point pattern analysis and lattice data analysis will be introduced.

29.1.2 Dependencies

Code
library(dplyr)
library(scran)
library(spdep)
library(tidyr)
library(ggplot2)
library(Voyager)
library(SFEData)
library(spatstat)
library(openxlsx)
library(spatialFDA)
library(BiocParallel)
library(SpatialExperiment)
library(SpatialFeatureExperiment)

bp <- MulticoreParam(10)
set.seed(77)

sfe <- JanesickBreastData()

# load the official 10X annotations 
labels <- read.xlsx("https://cdn.10xgenomics.com/raw/upload/v1695234604/Xenium%20Preview%20Data/Cell_Barcode_Type_Matrices.xlsx", sheet = 4)

# add cell id to label dataframe
labels$Barcode <- rownames(labels)
# add the cell type labels to the spe
matchedDf <- as.data.frame(colData(sfe)) |>
  left_join(as.data.frame(labels), by = join_by("Barcode"))

colData(sfe) <- DataFrame(matchedDf)

sfe <- sfe[, colSums(counts(sfe)) > 0]
sfe <- sfe[, !is.na(sfe$Cluster)]
sfe <- logNormCounts(sfe)

xy <- spatialCoords(sfe)
# basic theme for spatial plots
theme_xy <- list(
    coord_equal(expand=FALSE), 
    theme_void(), theme(
        plot.margin=margin(l=5),
        legend.key=element_blank(),
        panel.background=element_rect(fill="black")))

29.2 Lattice data analysis

29.2.1 Introduction

In this chapter we will use SpatialFeatureExperiment which is a Bioconductor package that integrates the SpatialExperiment with geometric annotations that are represented as Simple Features in the sf.

Again, we will use the human breast cancer Xenium data set (Janesick et al. 2023) in a SpatialFeatureExperiment object available through SFEData.

We will use the Voyager package (Moses et al. 2023) which provides convenient wrappers around classic geospatial R packages such as spdep (Pebesma and Bivand 2023). For more details, please consult the authors’ comprehensive vignettes.

This part is based on the following vignette of the pasta resource (Emons et al. 2024).

29.2.2 The spatial weight matrix

Lattice based spatial methods rely on the concept of neighborhood. The neighborhood defines the spatial dependency between locations. Therefore, the first step of lattice based spatial analysis is the construction of a spatial weight matrix. Here we will use a nearest neighbor-based approach.

Different methods for the construction of the weight matrix exist, such as

  • contiguity-based neighbors (neighbors in direct contact),
  • graph-based neighbours (e.g., k-nearest neighbours),
  • distance-based neighbours,
  • higher order neighbours.

A detailed overview can be found in the documentation of the spdep package.

As different weight matrices influence downstream results, analysts should justify their choice of the weight matrix. A more detailed overview can be found in Pebesma and Bivand (2023).

Code
colGraph(sfe, "knn6") <-
  findSpatialNeighbors(
    sfe,
    type = "cellSeg",
    method = "knearneigh",
    k = 6
  )

plotColGraph(sfe,
  colGraphName = "knn6",
  colGeometryName = "cellSeg"
) + theme_void()

29.2.3 Spatial autocorrelation

Spatial autocorrelation measures similarity between spatial units (e.g., cells) while recognize that the units are not independent due to their spatial context. Spatial autocorrelation metrics can be global (summarizing the entire study area) of view or local (provide statistic for each unit). In addition, there exist methods for univariate and multivariate comparisons of continuous and categorical data Pebesma and Bivand (2023).

29.2.3.1 Univariate measures - Moran’s \(I\)

Moran’s \(I\) can be interpreted as the Pearson correlation between the value at a certain location and the average values of its neigbours. The global value is a weighted average of the respective local values Moran (1950).

Code
# get all gene probes
geneProbes <- rowData(sfe)[rowData(sfe)$Type == "Gene Expression", "Symbol"]

sfe <- runUnivariate(sfe,
  type = "moran",
  features = geneProbes,
  colGraphName = "knn6",
  BPPARAM = MulticoreParam(2)
)

We can output and plot the three genes with highest Moran’s \(I\).

Code
topGenes <- rownames(sfe)[order(rowData(sfe)$moran_sample01, decreasing = TRUE)[1:3]]
plotSpatialFeature(sfe, topGenes, ncol = 3)

We can further visualize this using when calculating and plotting local Moran’s \(I\) values Anselin (1995). The interpretation is analogue to the global counterpart. The higher the value, the more similar the expression among a cell’s neighbors. Negative values indicate local dissimilarity in expression.

Code
sfe <- runUnivariate(sfe,
  type = "localmoran",
  features = topGenes,
  colGraphName = "knn6",
  BPPARAM = MulticoreParam(2)
)

plotLocalResult(sfe,
                name = "localmoran",
                features = topGenes,
                colGeometryName = "centroids",
                divergent = TRUE,
                diverge_center = 0,
                ncol = 3)

During the interpretation of local autocorrelation measures, both the effect size (the value of the statistic) and the significance level should be considered. Because we calculate significance on each spot individually, values should be corrected for multiple testing.

Code
plotLocalResult(sfe,
                name = "localmoran",
                features = topGenes,
                attribute = "-log10p_adj",
                colGeometryName = "centroids",
                divergent = TRUE,
                diverge_center = 0,
                ncol = 3)

Multivariate measures – Lee’s L

Lee’s \(L\) combines the Pearson correlation coefficient and Moran’s \(I\) Lee (2001) This unified metric allows to evaluate how two continuous variables are related while accounting for their spatial dependencies. We will calculate the global metric on the 20 most highly (non-spatial) variable features.

Code
hvgs <- getTopHVGs(sfe, n = 20)

res <- calculateBivariate(sfe,
                          type = "lee",
                          feature1 = geneProbes,
                          colGraphName =  "knn6"
)

We will identify the gene pairs with highest Lee’s \(L\) value and then calculate and visualize the respective local measure.

Code
# select the 20 highest values in the res matrix
genePairs <- which(res >= tail(sort(res), n=20)[1], arr.ind=TRUE)
# take only pair of two genes
genePairs <- genePairs[genePairs[,1] != genePairs[,2],]
# 
data.frame(i = rownames(res)[genePairs[,1]], j = colnames(res)[genePairs[,2]],
           val = res[genePairs])
##        i     j       val
##  1 FOXA1 EPCAM 0.6453008
##  2  KRT7 EPCAM 0.6665447
##  3 FOXA1  FASN 0.6343032
##  4 EPCAM FOXA1 0.6453008
##  5  KRT7 FOXA1 0.6400049
##  6 EPCAM  KRT7 0.6665447
##  7 FOXA1  KRT7 0.6400049
##  8  KRT8  KRT7 0.6498543
##  9  KRT7  KRT8 0.6498543
Code
sfe <- runBivariate(
  sfe,
  "locallee",
  colGraphName =  "knn6",
  feature1 = c("EPCAM", "KRT7")
)

plotLocalResult(
  sfe,
  name = "locallee",
  features = "KRT7__EPCAM",
  colGeometryName = "centroids",
  divergent = TRUE,
  diverge_center = 0
)

29.2.4 Join count statistics

TODO with cluster labels

Code
resJc <- joincount.multi(as.factor(sfe$Cluster), colGraph(sfe, "knn6"))
resJc <- resJc[order(resJc[, "z-value"], decreasing=TRUE), ]
resJc[head(grep("Invasive", rownames(resJc)), 10), ]
##                                                   Joincount    Expected
##  Invasive_Tumor:Invasive_Tumor                 1.047167e+04 3501.509981
##  Prolif_Invasive_Tumor:Invasive_Tumor          2.313583e+03  770.133147
##  Prolif_Invasive_Tumor:Prolif_Invasive_Tumor   1.813333e+02   42.333854
##  T_Cell_&_Tumor_Hybrid:Invasive_Tumor          2.447500e+02  120.455145
##  T_Cell_&_Tumor_Hybrid:Prolif_Invasive_Tumor   2.833333e+01   13.246254
##  Prolif_Invasive_Tumor:Mast_Cells              8.333333e-01    3.723393
##  Prolif_Invasive_Tumor:LAMP3+_DCs              1.666667e+00    6.724674
##  Prolif_Invasive_Tumor:Perivascular-Like       9.750000e+00   19.045721
##  Prolif_Invasive_Tumor:IRF7+_DCs               2.916667e+00   11.079916
##  Stromal_&_T_Cell_Hybrid:Prolif_Invasive_Tumor 4.166667e+00   13.607310
##                                                   Variance    z-value
##  Invasive_Tumor:Invasive_Tumor                 357.2961209 368.746911
##  Prolif_Invasive_Tumor:Invasive_Tumor           95.2151465 158.175639
##  Prolif_Invasive_Tumor:Prolif_Invasive_Tumor     6.2786450  55.472815
##  T_Cell_&_Tumor_Hybrid:Invasive_Tumor           15.0959721  31.990617
##  T_Cell_&_Tumor_Hybrid:Prolif_Invasive_Tumor     1.9995863  10.669279
##  Prolif_Invasive_Tumor:Mast_Cells                0.5633442  -3.850525
##  Prolif_Invasive_Tumor:LAMP3+_DCs                1.0167052  -5.016282
##  Prolif_Invasive_Tumor:Perivascular-Like         2.8710572  -5.486080
##  Prolif_Invasive_Tumor:IRF7+_DCs                 1.6734335  -6.310428
##  Stromal_&_T_Cell_Hybrid:Prolif_Invasive_Tumor   2.0539122  -6.587349

29.3 Point Pattern Analysis

Point pattern analysis is a subfield of spatial statistics that focuses on the representation of spatial locations of events or objects as points (Baddeley, Rubak, and Turner 2015, 3). There are two main ways how to summarize cells as points (Emons et al. 2024):

  • approximate the features (mRNAs) as points.
  • segment the cells and represent the centroids as points.

The central R package to perform point pattern analysis is called spatstat (Baddeley and Turner 2005). The package spatialFDA creates an interface between SpatialExperiment objects and the spatstat library for easy integration into analysis workflows.

Part of this vignette is based on the pasta overview vignette and another vignette Emons et al. (2024)

The central object in spatstat is called ppp. This object contains three attributes:

  • the \(x\) and \(y\) coordinates of the points
  • the observation window of the pattern
  • marks which are associated with each point; this can be, e.g., a discrete cell type mark or a continuous gene expression mark

(Baddeley and Turner 2005)

In the following we will work with a Xenium dataset of breast cancer (Janesick et al. 2023). The representation of cells as points is done via cell centroids.

Code
spe <- sfe
rm(sfe)

In the plot below, the centroids of the cells are attributed with a discrete cell type mark.

Code
df <- data.frame(xy, colData(spe))
ggplot(df, aes(x_centroid, y_centroid, col=Cluster)) +
    guides(col=guide_legend(override.aes=list(size=2))) +
    theme_xy + theme(legend.key.size=ggplot2::unit(0, "pt")) +
    geom_point(size=0.05) 

29.3.1 Intensity

The first property to assess in a point pattern is the intensity. For a window \(W\) and points \(x\), the average intensity \(\bar{\lambda}\) is defined as the number of points \(n(x)\) divided by the area of the window \(|W|\):

\[ \bar{\lambda}=\frac{n(x)}{|W|} \]

The intensity of the points can be uniform in space which is called homogeneous. If the intensity is not uniform in space, it is called inhomogeneous. This distinction has important implications for the choice of the spatial metrics. Most metrics have a correction for inhomogeneous intensity of points (Baddeley, Rubak, and Turner 2015, 157 ff.).

Code
# TODO: temporary fix
spatialCoordsNames(spe) <- c("x", "y")
dfSpe <- .speToDf(spe)
pp <- .dfToppp(dfSpe, marks = "Cluster")

density(x=pp, sigma = bw.diggle) |> plot()

In the plot above we see that the intensity of all points is not uniform. This inhomogeneity of points has to be taken into accounts when interpreting spatial statistics metrics. For example, an indication of clustering in a point pattern can be due solely to an inhomogeneity of points. This is called the confounding between intensity and interaction (Baddeley, Rubak, and Turner 2015, 151 ff.).

29.3.2 Global Analysis

Global analyses summarize a statistic across the entire field of view. This means it reflects an average statistic and might not be reflective of local heterogeneities (Emons et al. 2024).

29.3.2.1 Correlation

One option in point pattern analysis is to analyse the correlation of marks. Like this, e.g., a clustering or spacing of cells can be determined relative to a completely spatially random (CSR) process.

Complete spatial randomness is the null scenario for a point pattern. It is characterized by two key properties:

  • Homogeneity: The intensity of points is homogeneous in space.
  • Independence: The points in one region do not influence the distribution of points in another region.

(Baddeley, Rubak, and Turner 2015, 199 ff.)

29.3.2.1.1 Ripley’s \(K\)

Ripley’s \(K\) is a well established function to assess correlation in a point pattern. It can be calculated within a mark or across marks. In essence, Ripley’s \(K\) quantifies the average number of points that fall in a \(r\)-neighbourhood of a chosen mark (Ripley 1976; Baddeley, Rubak, and Turner 2015, 132 ff.).

Code
resCross <- calcCrossMetricPerFov(
  spe,
  selection = c("DCIS_1", "DCIS_2", "Invasive_Tumor"),
  subsetby = 'sample_id',
  fun = 'Kcross',
  marks = 'Cluster',
  rSeq = seq(0, 500, length.out = 100),
  by = c("sample_id", "Sample")
)
##  [1] "Calculating Kcross from DCIS_1 to DCIS_1"
##  [1] "Calculating Kcross from DCIS_2 to DCIS_1"
##  [1] "Calculating Kcross from Invasive_Tumor to DCIS_1"
##  [1] "Calculating Kcross from DCIS_1 to DCIS_2"
##  [1] "Calculating Kcross from DCIS_2 to DCIS_2"
##  [1] "Calculating Kcross from Invasive_Tumor to DCIS_2"
##  [1] "Calculating Kcross from DCIS_1 to Invasive_Tumor"
##  [1] "Calculating Kcross from DCIS_2 to Invasive_Tumor"
##  [1] "Calculating Kcross from Invasive_Tumor to Invasive_Tumor"
Code
plotCrossMetricPerFov(
  resCross,
  theo = TRUE,
  correction = "border",
  x = "r",
  imageId = 'sample_id'
)
##  [[1]]

This plot shows Ripley’s \(K\) function corrected for inhomogeneities of the chosen marks. The diagonal is Ripley’s \(K\) function among the three cell types themselves and the off diagonal plots show the cross type combinations. We note that all cell types, ductal carcinoma in situ 1 (DCIS 1), ductal carcinoma in situ 2 (DCIS 2) and invasive tumor cells show a clear interaction among themselves, even when correcting for the inhomogeneity. DCIS 1 and invasive tumor cells show spacing, meaning there are less invasive tumor cells found in DCIS 1 than expect under CSR. However, DCIS 2 and invasive tumor cells show a distribution that is completely spatially random in a \(500 µm\) \(r\)-neighbourhood.

29.3.2.2 Spacing

A complementary approach to correlation analysis is spacing. There are three main distance types (Baddeley, Rubak, and Turner 2015, 255):

  • pairwise distances: distances between all pairs of points
  • nearest-neighbor distances: distance to the nearest point of the query point
  • empty-space distances: distance from a reference location to the nearest point
29.3.2.2.1 Nearest-neighbour distance function \(G\)

The nearest neighbor function \(G\) quantifies the average nearest-neighbor distance over a radius range \(r\) (Baddeley, Rubak, and Turner 2015, 262).

Code
resCross <- calcCrossMetricPerFov(
  spe,
  selection = c("DCIS_1", "DCIS_2", "Invasive_Tumor"),
  subsetby = 'sample_id',
  fun = 'Gcross',
  marks = 'Cluster',
  rSeq = seq(0, 100, length.out = 100),
  by = c("sample_id", "Sample")
)
##  [1] "Calculating Gcross from DCIS_1 to DCIS_1"
##  [1] "Calculating Gcross from DCIS_2 to DCIS_1"
##  [1] "Calculating Gcross from Invasive_Tumor to DCIS_1"
##  [1] "Calculating Gcross from DCIS_1 to DCIS_2"
##  [1] "Calculating Gcross from DCIS_2 to DCIS_2"
##  [1] "Calculating Gcross from Invasive_Tumor to DCIS_2"
##  [1] "Calculating Gcross from DCIS_1 to Invasive_Tumor"
##  [1] "Calculating Gcross from DCIS_2 to Invasive_Tumor"
##  [1] "Calculating Gcross from Invasive_Tumor to Invasive_Tumor"
Code
plotCrossMetricPerFov(
  resCross,
  theo = TRUE,
  correction = "km",
  x = "r",
  imageId = 'sample_id'
)
##  [[1]]

In terms of spacing, the interpretation is a bit different. Still, the diagonal shows that all cell types are more clustered than expected if the patterns were completely spatially random. Comparing now DCIS 1 and DCIS 2 with invasive tumor cells, we see that both are more spaced than expect at random. However, the spacing is stronger for DCIS 1 and invasive tumor cells. At small radii, DCIS 2 and invasive tumor are distributed close to random.

The analysis with Ripley’s \(K\) and the \(G\) function are not contradictory. \(G\) functions summarize shorter scale interactions than \(K\) functions (Baddeley, Rubak, and Turner 2015, 295).

29.3.3 Local Analysis

29.3.3.1 Local indicators of spatial association

The metrics shown before are an average over the entire window \(W\). Anselin (1995) proposed an alternative approach which is termed local indicators of spatial association (LISA). Instead of a global average LISA shows the local contributions of each point to the overall metric (Baddeley, Rubak, and Turner 2015, 247). This is a general concept not only for point pattern analysis but for lattice data analysis as well (Anselin 1995, 2019).

Code
# subset to only Invasive tumor cells 
ppSub <- subset(pp, marks %in% "Invasive_Tumor")
# restrict to a smaller window for computational reasons
Window(ppSub) <- owin(c(3500, 7524.087), c(1200, 5475.691))
# plot the point pattern
plot(ppSub)

Code
# calculate LISA K curves
resLocal <- localK(ppSub, verbose=FALSE) 

# code adapted from 
# https://robinsonlabuzh.github.io/
# pasta/01-imaging-univar-ppSOD.html

df <- resLocal |>
    as.data.frame() |>
    pivot_longer(
        iso0001:iso1408, 
        names_to="curve") 

sel <- df %>% 
    filter(r > 700.5630 & r < 702.4388) %>%
    mutate(sel=value) %>% select(curve, sel)

df <- df %>% left_join(sel)

thm <- list(
    theme_light(),
    theme(legend.position="none"),
    scale_colour_viridis_c())

p <- ggplot(df, aes(r, value, group=curve, col=sel)) +
    geom_line() +
    geom_line(aes(y=theo), linetype=2, col="darkgrey") +
    geom_vline(xintercept=700) +
    thm

df <- data.frame(
    x=ppSub$x, y=ppSub$y, 
    sel=unique(sel)$sel)

q <- ggplot(df, aes(x, y, col = sel)) +
    coord_equal(expand=FALSE) +
    geom_point(size = 1) + 
    thm
 
p | q

The LISA Ripley’s \(K\) are colored by their value at \(r = 700\). We note that the LISA Ripley’s \(K\) gives two populations of curves. Those curves that increase at radii \(r<500 µm\) above the CSR line indicated in grey and the other curves that remain either below the grey CSR line or increase after \(r>500µm\).

If the values at \(r=700\) of the curves are projected back into the physical space, we note that the curves above the grey CSR line are the highly clustered cells in the bottom left. The other curves are the more spaced cells.

29.4 Appendix

References

Anselin, Luc. 1995. “Local Indicators of Spatial AssociationLISA.” Geographical Analysis 27 (2): 93–115. https://doi.org/10.1111/j.1538-4632.1995.tb00338.x.
———. 2019. “A Local Indicator of Multivariate Spatial Association: Extending Geary’s c.” Geographical Analysis 51 (2): 133–50. https://doi.org/10.1111/gean.12164.
Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns. 1st ed. CRC Interdisciplinary Statistics Series. CRC Press, Taylor & Francis Group.
Baddeley, Adrian, and Rolf Turner. 2005. “Spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12 (January): 1–42. https://doi.org/10.18637/jss.v012.i06.
Dale, Mark R. T., and Marie-Josée Fortin. 2014. Spatial Analysis: A Guide for Ecologists. Second Edition. Cambridge ; New York: Cambridge University Press. https://doi.org/https://doi.org/10.1111/boj.12325.
Emons, Martin, Samuel Gunz, Helena L Crowell, Izaskun Mallona, Reinhard Furrer, and Mark D Robinson. 2024. “Pasta: Pattern Analysis for Spatial Omics Data.” arXiv Preprint arXiv:2412.01561.
Janesick, Amanda, Robert Shelansky, Andrew D. Gottscho, Florian Wagner, Stephen R. Williams, Morgane Rouault, Ghezal Beliakoff, et al. 2023. “High Resolution Mapping of the Tumor Microenvironment Using Integrated Single-Cell, Spatial and in Situ Analysis.” Nature Communications 14 (1): 8353.
Lee, Sang-Il. 2001. “Developing a Bivariate Spatial Association Measure: An Integration of Pearson’s r and Moran’s I.” Journal of Geographical Systems 3 (4): 369–85. https://doi.org/10.1007/s101090100064.
Moran, P. A. P. 1950. “Notes on Continuous Stochastic Phenomena.” Biometrika 37 (1/2): 17–23. https://doi.org/10.2307/2332142.
Moses, Lambda, Pétur Helgi Einarsson, Kayla Jackson, Laura Luebbert, A. Sina Booeshaghi, Sindri Antonsson, Nicolas Bray, Páll Melsted, and Lior Pachter. 2023. Voyager: Exploratory Single-Cell Genomics Data Analysis with Geospatial Statistics.” bioRxiv. https://doi.org/10.1101/2023.07.20.549945.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. 1st ed. New York: Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
Ripley, Brian D. 1976. “The Second-Order Analysis of Stationary Point Processes.” Journal of Applied Probability 13 (2): 255–66. https://doi.org/10.2307/3212829.
Tobler, W. R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46: 234–40. https://doi.org/10.2307/143141.
Back to top