Skip to contents

A generative scheme for a single-cell count matrix with multiplicative batch effects

For each gene gg and cell jj, the gene expression YgjY_{gj} were sampled from Poisson distribution with the rate parameter:

Ξ»gj=Ξ»gjπ—Žπ—‡π–»π—‚π–Ίπ—Œπ–Ύπ–½Γ—βˆkΞ΄gkXkj,\lambda_{gj} = \lambda_{gj}^{\textsf{unbiased}} \times \prod_{k} \delta_{gk}^{X_{kj}},

affected by the batch effects Ξ΄gk\delta_{gk}. Letting XkjX_{kj} be a batch membership matrix, assigning a cell jj to a batch kk if and only if Xkj=1X_{kj}=1, we assume the average gene expression rates are linearly affected by in the log-transformed space:

𝔼[lnYgj]=ln(βˆ‘tΞ²gtΞΈjt)+βˆ‘klnΞ΄gkXkj.\mathbb{E}\!\left[\ln Y_{gj}\right] = \ln \left( \sum_{t} \beta_{gt} \theta_{jt} \right) + \sum_{k} \ln\delta_{gk} X_{kj}.

library(ggplot2)
set.seed(1331)

m <- 500 # genes
n <- 1000 # cells
nb <- 2 # batches

## 1. batch membership
X <- matrix(0, n, nb)
batch <- sample(nb, n, replace = TRUE)
for(b in 1:nb){
    X[batch == b, b] <- 1
}

## 2. batch effects
W.true <- matrix(rnorm(m*nb), m, nb)
ln.delta <- apply(W.true %*% t(X), 2, scale)

## 3. true effects
K <- 5
.beta <- matrix(rgamma(m * K, 1), m, K)
.theta <- matrix(rgamma(n * K, 1), n, K)
lambda.true <- .beta %*% t(.theta)
kk <- kmeans(log1p(.theta), 3, nstart=100)$cluster

lambda <- lambda.true * exp(ln.delta)
yy <- apply(lambda, 2, function(l) sapply(l, rpois, n=1))
oo <- order(apply(t(.theta), 2, which.max))
library(asapR)
y.data <- lapply(1:nb, function(b) yy[, batch == b, drop = F])
y.data <- lapply(y.data, function(y) Matrix::Matrix(y, sparse = T))

First, we can construct pseudo bulk profiles without considering batch labels, hoping that Poisson Matrix Factorization (PMF) method on the pseudo bulk data could identify meaningful topic-specific frequency patterns.

pb.out <- asap_random_bulk_cbind(y.data, num_factors = 9, do_batch_adj = F)
pmf.out <- asap_fit_pmf(pb.out$PB, maxK=5)

stat0 <- asap_pmf_regression(Matrix::Matrix(yy, sparse=T),
                             log_beta = pmf.out$log.beta,
                             beta_row_names = pb.out$rownames,
                             do_stdize_beta = T, do_stdize_r = T,
                             verbose = T)

We can explicitly incorporate experimental designs to the pseudo bulk data construction so that the we can deconfound batch effects and the resulting average matrix.

pb.out <- asap_random_bulk_cbind(y.data, num_factors = 9)
pmf.out <- asap_fit_pmf(pb.out$PB, maxK=5)

stat1 <- asap_pmf_regression(Matrix::Matrix(yy, sparse=T),
                             log_beta = pmf.out$log.beta,
                             beta_row_names = pb.out$rownames,
                             r_log_delta = pb.out$log.batch.effect,
                             do_stdize_beta = T, do_stdize_r = T,
                             verbose = T)

stat.b <- asap_pmf_regression(Matrix::Matrix(yy, sparse=T),
                              log_beta = pb.out$log.batch.effect,
                              beta_row_names = pb.out$rownames,
                              do_stdize_beta = T, do_stdize_r = T,
                              verbose = T)

Naive PMF analysis

topic0 <- pmf2topic(stat0$beta, stat0$theta)
plot.struct(topic0$prop, oo) +
    ggtitle("sorted by true topics")

When we sorted samples by true topic patterns, block structures should emerge if the PMF algorithm was applied to unbiased data. However, no apparent block patterns emerge.

topic.batch <- pmf2topic(stat.b$beta, stat.b$theta)
plot.struct(topic.batch$prop, order(batch)) +
    ggtitle("sorted by batch labels")

Instead, we found batch-specific patterns dominate the overall PMF results.

plot.struct(topic0$prop, order(batch)) +
    ggtitle("sorted by batch labels")

After adjusting batch-specific effects

On the other hand, we can recapitulate true topic-specific block structures from the deconfounded data. This was clearly demonstrated in the following PMF results.

topic1 <- pmf2topic(stat1$beta, stat1$theta)
plot.struct(topic1$prop, oo) +
    ggtitle("sorted by true topics")

Yet, we do not see any batch-specific effects.

plot.struct(topic1$prop, order(batch)) +
    ggtitle("sorted by batch labels")