Theoretical Background

This section provides the theoretical foundations for the cebmf_torch package.

This document summarizes the key statistical ideas behind the cebmf_torch implementation. It focuses on Empirical Bayes Matrix Factorization (EBMF) and the closely related Empirical Bayes Normal Means (EBNM) problems that are used as building blocks.

Empirical Bayes Normal Means (EBNM)

Before describing EBMF, we first describe the simpler EBNM problem. Suppose we observe independent observations \(y_1, \ldots, y_n\) with

\[y_i \sim \mathcal{N}(\theta_i, \sigma^2), \quad i = 1, \ldots, n.\]

Here, \(\theta_i\) is the unknown mean and \(\sigma^2\) is the known variance. We further assume that the \(\theta_i\) are drawn from some unknown prior distribution

\[\theta_i \sim g \in \mathcal{G},\]

where \(\mathcal{G}\) is some family of prior distributions.

If we want to find estimates of the \(\theta_i\), we could use the maximum likelihood estimates \(\hat{\theta}_i = y_i\). However, in an Empirical Bayes approach, we instead estimate the prior \(g\) from the data, and then use this estimated prior to compute the posterior \(p(\theta_i \mid y_i, \hat{g})\) and use the posterior mean as our estimate of \(\theta_i\).

We thus find the maximum likelihood estimate of \(g\):

\[\hat{g} = \arg \max \prod_i \int p(y_i \mid \theta_i, \sigma) g(\theta_i) {\rm d} \theta_i.\]

and then compute the posterior distribution \(p(\theta_i \mid y_i, \sigma, \hat{g})\).

This can significantly reduce the scatter between a prediction and the truth, when compared to using the maximum likelihood estimate \(\hat{\theta}_i = y_i\). See the Spiked EBNM Example for an example of this.

Empirical Bayes Matrix Factorization 

Now we have seen how the Empirical Bayes approach can help in parameter estimation, we now consider the more complex situation of matrix factorization. Matrix factorization provides a compact, interpretable representation of high-dimensional data. By approximating \(X\) with a small number of latent factors we reduce noise, highlight low-dimensional structure, and often obtain components that are easier to interpret than raw columns. In many applications (e.g., genomics, recommendation systems, or signal processing) the factors capture coherent patterns across samples or variables that correspond to biological states, user preferences, or repeated signals.

Let us suppose we have taken \(n\) observations of \(p\) variables, and stored these in a data matrix \(X \in \mathbb{R}^{n \times p}\). Given this data matrix, EBMF approximates \(X\) by a low-rank factor model

\[X \approx \sum_{k=1}^K \ell_k f_k ^T + E,\]

where \(\ell_k\) (\(n \times 1\)) and \(f_k\) (\(p \times 1\)) are the loading and factor vectors for factor \(k\), and \(E\) is noise. In matrix form, this can be written as \(X \approx L F^T + E\), where \(L\) (\(n \times K\)) and \(F\) (\(p \times K\)) are low-rank factors and are formed from the concatenation of the \(K\) columns of loadings and factors, respectively. In the EBMF factorization the matrix \(L_k\) contains loading columns which describe how strongly each of the \(n\) observations is associated with latent factor \(k\). The matrix \(F_k\) contains factor columns which describe the pattern of the factor across the \(p\) variables.

Empirical Bayes approaches place priors on the values of \(\ell_k\) and \(f_k\) and estimate prior hyperparameters from the data. Explicitly, if \(\ell_{k1}, \ldots, \ell_{kn}\) are the \(n\) entries of \(\ell_k\), and \(f_{k1}, \ldots, f_{kp}\) are the \(p\) entries of \(f_k\), then we say that

\[\begin{split}\ell_{k1}, \ldots, \ell_{kn} \sim g_{k}^{l} \in \mathcal{G}^{l}, \\ f_{k1}, \ldots, f_{kp} \sim g_{k}^{f} \in \mathcal{G}^{f}.\end{split}\]

where \(\mathcal{G}^{l}\) and \(\mathcal{G}^{f}\) are families of prior distributions and \(g_{k}^{l}\) and \(g_{k}^{f}\) are the specific priors for factor \(k\).

Iterative Fitting 

The EBMF model is typically fit using an iterative algorithm that loops over the factors \(k = 1, \ldots, K\). Suppose we have an estimate of all factors except \(\ell_k\) and \(f_k\). Then we can compute the residual matrix

\[R_k = X - \sum_{j \neq k} \ell_j f_j^T = \ell_k f_k^T + E.\]

In this case, \(R_k\) is a noisy observation of the rank-1 matrix \(\ell_k f_k^T\). By right-multiplying by \(f_k\) we obtain

\[R_k f_k = \ell_k (f_k^T f_k) + E f_k.\]

and thus the variable \(y_i\) can be defined to be

\[y_i = \left(\frac{R_k f_k}{f_k^T f_k}\right)_i = \ell_{ki} + e_i, \quad e_i \sim \mathcal{N}(0, s_i^2),\]

where \(s_i^2\) is the variance of the noise term \(E f_k / (f_k^T f_k)\). This is now exactly the EBNM problem described above, and we can use an EBNM solver to estimate \(g_k^l\) and the posterior distribution of \(\ell_{ki}\). The way of estimating \(f_k\) is completely analogous.

As a summary, the EBMF approach does the following:

Initialize \(L\) and \(F\) (for example using SVD).
For each factor \(k = 1, \ldots, K\)
1. Compute the residual matrix \(R_k\).
2. Solve the EBNM problem to estimate \(g_k^l\) and the posterior distribution of \(l_{ki}\).
3. Solve the EBNM problem to estimate \(g_k^f\) and the posterior distribution of \(f_{ki}\).
Repeat step 2 until convergence.

Key properties 

The method uses a variational approximation, where the posterior is factorized as \(q(l, f ) = q(l)q( f )\).
This leads to a well-defined objective function and ensures convergence of the algorithm.
The framework allows for highly flexible prior families; adding a new prior only requires solving the corresponding EBNM problem.
Sparsity is automatically controlled by the data during fitting—no need for cross-validation.
The algorithm is computationally efficient for large-scale problems and does not rely on MCMC.
If the prior family includes a delta function at zero, the model can automatically infer the rank \(K\).
For \(K > 1\), the method extends by iteratively adding or updating factors (deflation/backfitting).

Covariate Empirical Bayes Matrix Factorization (CEBMF)

In many applications, we have additional covariate information about the rows and/or columns of the data matrix \(X\). For example, if our data matrix contains information about the height, weight etc. of individuals, then we may also have information about their age, gender, or other demographic factors, which provides additional context that may help in the matrix factorization. We call this problem Covariate Empirical Bayes Matrix Factorization (CEBMF).

In this case, the parameters of our prior distributions on the factors can depend on the covariates. For example, if we had a simple Gaussian prior on the loadings, we could let the variance depend on the covariates:

\[l_{k1}, \ldots, l_{kn} \sim \mathcal{N}(0, \sigma_k^2(z_i)), \quad i = 1, \ldots, n.\]

where \(z_i\) is the covariate vector for observation \(i\) and \(\sigma_k^2(\cdot)\) is some function that maps covariates to variances. This could be the output of a neural network, or some simpler function such as a linear model. This problem now has the additional challenge of estimating the function \(\sigma_k^2(\cdot)\) from the data.

In the code, we define the covariates for \(L\) to be X_l and for \(F\) to be X_f.

Prior families 

We make use of many prior families in the code, which we define below.

ASH (Adaptive Shrinkage)

In the simplest example, we assume that the prior is a mixture of a point mass at zero and a mixture of zero-mean Gaussians

\[g(\cdot,) = \pi_0 \delta_0 (\cdot) + \sum_{j=1}^J \pi_j \mathcal{N}(\cdot, 0, \sigma_j^2),\]

where \(\pi_j\) are the mixture weights and \(\sigma_j^2\) are a fixed grid of variances. Here we learn the parameters \(\{\pi_j\}\), but the variances are fixed.

CASH (Covariate Adaptive Shrinkage)

The ASH model can be extended to allow the mixture weights to depend on covariates \(z\),

\[g(\cdot, z ) = \pi_0 (z) \delta_0 (\cdot) + \sum_{j=1}^J \pi_j (z) \mathcal{N}(\cdot, 0, \sigma_j^2),\]

where \(\pi_j(z)\) are the mixture weights that depend on covariates \(z\) and \(\sigma_j^2\) are a fixed grid of variances. Here we learn the functions \(\{\pi_j(z)\}\) by fitting neural networks, but the variances are fixed.

EMDN (Empirical Mixture Density Network)

Instead of restricting ourselves to a mixture of zero-mean Gaussians with known variance, we can use a more flexible prior family where the entire prior distribution depends on covariates \(z\) through a Mixture Density Network (MDN)

\[g(\cdot, z_i) = \sum_{j=1}^J \pi_j(z_i) \mathcal{N}(\cdot, \mu_j(z_i), \sigma_j^2(z_i)),\]

where \(\pi_j(z_i)\) are the mixture weights, \(\mu_j(z_i)\) are the means, and \(\sigma_j^2(z_i)\) are the variances of the mixture components.

Spiked-EMDN (Spiked Empirical Mixture Density Network)

The EMDN model can be expanded by including a point mass at zero in the mixture.

\[g(\cdot, z_i) = \pi_0(z_i) \delta_0(\cdot) + \sum_{j=1}^J \pi_j(z_i) \mathcal{N}(\cdot, \mu_j(z_i), \sigma_j^2(z_i)).\]

Although this is a special case of the EMDN model with one of the components having zero variance, it is numerically more stable to treat it separately.

CGB (Covariate Generalized Binary)

The Spiked-EMDN model can be simplified to a two-component mixture of a point mass at zero and a single Gaussian

\[g(\cdot, z_i) = \pi(z_i) \delta_0(\cdot) + (1 - \pi(z_i)) \mathcal{N}(\cdot, \mu, \sigma^2).\]

In this case we learn the function \(\pi(z_i)\) with a neural network, but simplify our task by treating \(\mu\) and \(\sigma^2\) as independent of \(z_i\) and learning them as global parameters. This model is useful when we believe that there are two fundamental populations in the data, and we want to learn the probability that each observation belongs to one of these populations.

Sharp_CGB (Sharp Covariate Generalized Binary)

The CBG model can be further simplified by relating the mean and variance of the Gaussian component to be

\[\sigma = \omega \vert \mu \vert.\]

This parameterization is useful when we believe that the variance of the non-zero component. We fix \(\omega\) to be a small constant and learn the mean \(\mu\) as a global parameter. In this case, we again learn the function \(\pi(z_i)\) with a neural network.

References 

Covariate-moderated Empirical Bayes Matrix Factorization: Denault, et al. 2025 NeuRIPS.
See also:
Intro to EBMF for a video explanation of EBMF.
Empirical Bayes Matrix Factorization: Wang & Stephens 2021 JMLR.
Non-negative matrix factorization: Lee & Seung 1999