"Hierarchical/Multilevel modeling is a generalization of linear and generalized linear modeling in which regression coefficients are themselves given a model, whose parameters are also estimated from data" (Gelman 2006)
A flexible probabilistic programming language for statistical modeling along with a suite of inference tools for fitting models that are robust, scalable, and efficient
DSL for full (HMC, NUTS) & approximate Bayesian inference, written in C++
Interfaces for R, Python, Julia, Matlab & command line (plus MCMC analysis and visual summaries with shinyStan)
Example: Modeling standardized test scores of students, with data on each individual as well as which classroom each individual is in.
student identifier $i$
classroom identifier $k$
standardized test scores $y_{ik}$
Lump everyone together and ignore classroom distinction
Estimate $\hat \theta = \bar y = \frac{ \sum_{i} \sum_{k} y_{ik} }{ \sum_{k} n_{k}}$
Treat each classroom separately
Estimate $\hat \theta_{k} = \bar y_{.k}$
With no pooling, we lose information across classrooms since many classrooms may be similar to each other
With complete pooling, we lose information based on classroom which may be very valuable
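The two extremes above can be made concrete with a few lines of code. This is a minimal sketch on made-up toy data (three hypothetical classrooms of unequal size): complete pooling computes one grand mean, no pooling computes a separate mean per classroom.

```python
import numpy as np

# Toy data (hypothetical): test scores for 3 classrooms of unequal size.
scores = {
    "A": np.array([72.0, 75.0, 78.0]),
    "B": np.array([88.0, 92.0]),
    "C": np.array([60.0, 64.0, 62.0, 66.0]),
}

# Complete pooling: one grand mean, ignoring the classroom label.
all_scores = np.concatenate(list(scores.values()))
theta_pooled = all_scores.mean()

# No pooling: a separate mean for each classroom.
theta_by_class = {k: v.mean() for k, v in scores.items()}

print(theta_pooled)     # grand mean over all 9 students -> 73.0
print(theta_by_class)   # {'A': 75.0, 'B': 90.0, 'C': 63.0}
```

Classroom B's estimate (90.0) sits far from the grand mean (73.0) but rests on only two observations, which is exactly the tension partial pooling resolves.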
Natural hierarchy/structure makes HLM ideal
Partial Pooling
Amount of pooling determined by similarity of data
$P(A, B) = P(A) \times P(B | A)$
$P(A, B) = P(B) \times P(A | B)$
$P(A) \times P(B | A) = P(B) \times P(A | B)$
$\textit{Bayes' Rule}$: $P(B | A) = \frac{P(B) \times P(A | B)}{P(A)}$
$P(B | A) \propto P(B) \times P(A | B)$
"$posterior$" $\propto$ "$prior$" $\times$ "$likelihood$"
Back to our example...
prior: $\theta_{k} \sim N( \mu, \tau^2)$
likelihood: $\bar y_{.k} \sim N(\theta_{k}, \frac{\sigma^2}{n_{k}})$
posterior: $N( \mu, \tau^2) \times N(\theta_{k}, \frac{\sigma^2}{n_{k}}) \propto N(*mean, *variance)$ (the normal prior is conjugate, so the posterior is again normal)
Estimate $\hat \theta_{k} = *mean$
$\hat \theta_{k} = \frac{ \frac{n_{k} \bar y_{.k}}{\sigma^2} + \frac{\mu}{\tau^2} }{ \frac{n_{k}}{\sigma^2} + \frac{1}{\tau^2} } = (1 - B_{k}) \bar y_{.k} + B_{k} \mu$
where $B_{k} = \frac{ \frac{1}{\tau^2} }{ \frac{n_{k}}{\sigma^2} + \frac{1}{\tau^2} } = \frac{ \frac{\sigma^2}{n_{k}} }{ \tau^2 + \frac{\sigma^2}{n_{k}} }$ is called the $\textit{shrinkage factor}$
If $\tau^2 \gg \frac{\sigma^2}{n_{k}}$, then $B_{k} \to 0$ and $\hat \theta_{k} \to \bar y_{.k}$ (No Pooling)
If $\tau^2 \ll \frac{\sigma^2}{n_{k}}$, then $B_{k} \to 1$ and $\hat \theta_{k} \to \mu$ (Complete Pooling)
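The partial-pooling estimator and its two limits can be verified directly. A sketch with hypothetical values ($\bar y_{.k} = 80$, $\mu = 70$, $n_k = 10$, $\tau^2 = 4$, $\sigma^2 = 25$):

```python
def partial_pool(ybar_k, n_k, mu, tau2, sigma2):
    """Posterior-mean estimate of theta_k under the normal-normal model."""
    B_k = (sigma2 / n_k) / (tau2 + sigma2 / n_k)   # shrinkage factor
    return (1.0 - B_k) * ybar_k + B_k * mu, B_k

# Hypothetical values: classroom mean 80, overall mean 70.
est, B = partial_pool(ybar_k=80.0, n_k=10, mu=70.0, tau2=4.0, sigma2=25.0)
# sigma2/n_k = 2.5, so B = 2.5/6.5 and est is pulled partway toward mu

# Limiting behaviour: huge tau2 -> no pooling; tiny tau2 -> complete pooling.
est_no_pool, _ = partial_pool(80.0, 10, 70.0, tau2=1e9, sigma2=25.0)
est_pool, _ = partial_pool(80.0, 10, 70.0, tau2=1e-9, sigma2=25.0)
print(est, est_no_pool, est_pool)
```

Note that the shrinkage factor depends on $n_k$: small classrooms are shrunk toward $\mu$ more aggressively than large ones.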
Example: Predict --------- ---
'complete pooling' won't work
not enough data for 'no pooling'
must use HLM
linear model $\implies$ $\hat \theta = \alpha + X\beta$
$X$ is predictor/covariate matrix
$\beta$ is vector of regression coefficients
$\alpha$ is intercept term
identifiers:
country $j$
region $l$
vertical $k$
industry $m$
$y_{ijklm}$ = --------- --- for advertisable $i$
$y_{ijklm} \sim \mathrm{Student}\text{-}t(\nu, \alpha_{jklm} + X_{i}\beta_{jklm}, \sigma)$
$\nu$ and $\sigma$ are the degrees-of-freedom and scale parameters, both estimated in Stan
coefficients (and intercepts) vary by country/vertical
coefficient = 'country effect' + 'vertical effect'
$\beta_{jklm} = \gamma_{jl} + \gamma_{km}$
and in turn, the 'country effect' is made up of a 'country' and 'region' component
likewise the 'vertical effect' is made up of a 'vertical' and 'industry' component
$\gamma_{jl} = \delta_{j} + \delta_{l}$
$\gamma_{km} = \delta_{k} + \delta_{m}$
Country ($j$):
$\delta_{j} \sim MVN(0, \sigma_{j})$
Vertical ($k$):
$\delta_{k} \sim MVN(0, \sigma_{k})$
Region ($l$):
$\delta_{l} = \kappa_{l} + \mu_{l}$ where $\mu_{l} \sim MVN(0, \sigma_{l})$
Industry ($m$):
$\delta_{m} = \kappa_{m} + \mu_{m}$ where $\mu_{m} \sim MVN(0, \sigma_{m})$
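The additive decomposition above is just sums of component vectors. A minimal sketch with made-up draws standing in for the $\delta$/$\kappa$ terms (the constants and sizes are hypothetical, not from the fitted model):

```python
import numpy as np

rng = np.random.default_rng(0)
P = 3  # number of covariates (hypothetical)

# Component draws, standing in for the delta/kappa terms in the model.
delta_country = rng.normal(0.0, 1.0, size=P)                      # delta_j
delta_region = np.full(P, 0.5) + rng.normal(0.0, 0.2, size=P)     # kappa_l + mu_l
delta_vertical = rng.normal(0.0, 1.0, size=P)                     # delta_k
delta_industry = np.full(P, -0.3) + rng.normal(0.0, 0.2, size=P)  # kappa_m + mu_m

# gamma_jl = delta_j + delta_l ;  gamma_km = delta_k + delta_m
gamma_country_effect = delta_country + delta_region
gamma_vertical_effect = delta_vertical + delta_industry

# beta_jklm = gamma_jl + gamma_km
beta = gamma_country_effect + gamma_vertical_effect
print(beta)
```

Because every level enters additively, a coefficient for a rarely seen country/vertical pair still gets a sensible value from the better-estimated region and industry components.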
Transformed Parameters:
$\alpha_{jklm}$, $\beta_{jklm}$, $\gamma_{jl}$, $\gamma_{km}$, $\delta_{l}$, $\delta_{m}$
Parameters:
$\delta_{j}$, $\delta_{k}$, $\kappa_{l}$, $\kappa_{m}$, $\mu_{l}$, $\mu_{m}$, $\sigma$, $\nu$
Hyperparameters:
$\sigma_{j}$, $\sigma_{k}$, $\sigma_{l}$, $\sigma_{m}$
Everything estimated in Stan
Weakly informative priors/hyperpriors on parameters/hyperparameters:
$\nu \sim \Gamma(2, 0.1)$
$\sigma, \sigma_{j}, \sigma_{k}, \sigma_{l}, \sigma_{m} \sim \mathrm{half\text{-}Cauchy}(0, 2.5)$
Almost perfect transliteration of model diagram:
can predict at different spend levels by varying the --------- spend covariate (holding all others constant)
full posterior distribution of predicted values provides a credible range at each level
result is a spend- ------ curve for each ------
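Getting a credible range from posterior draws is a one-liner once the model is fit. A sketch using simulated draws in place of real output from the fitted Stan model (the location and scale here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior draws of a predicted value at one spend level;
# in practice these would come from the fitted Stan model.
draws = rng.normal(loc=100.0, scale=15.0, size=4000)

point = np.median(draws)                     # point prediction
lo, hi = np.percentile(draws, [2.5, 97.5])   # 95% credible interval
print(point, lo, hi)
```

Repeating this at each spend level, with all other covariates held constant, traces out the curve together with its uncertainty band.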
Advantages:
improved predictive accuracy over traditional linear regression
works well with sparse data (having more parameters than data points is okay)
range of credible values given by posterior
Disadvantages:
MCMC is computationally expensive and not embarrassingly parallel (only individual chains can be run in parallel)
not "off the shelf"