
Bayesian Inference with Linear Model

At the time of this writing, Nim does not have any libraries for conducting Bayesian inference. However, it is quite easy for us to write all of the necessary code ourselves. This is a good exercise for learning about Bayesian inference, and the syntax and speed of Nim make it an excellent choice for this task. In this tutorial we will walk through the different parts needed to perform Bayesian inference with a linear model. We will assume that you have some basic understanding of Bayesian inference already. There are many excellent introductions available in books and online.

Bayes theorem states:

$$ P(\theta|\text{Data}) = \frac{P(\text{Data}|\theta)P(\theta)}{P(\text{Data})}$$

Each of these terms is referred to as:

$$ \text{Posterior} = \frac{\text{Likelihood} \cdot \text{Prior}}{\text{Marginal Likelihood}}$$

The marginal probability $P(\text{Data})$ can be a sum over a discrete parameter space, where $P(\text{Data}) = \sum_{\theta}P(\text{Data}|\theta)P(\theta)$, so the posterior probability distribution is:

$$ P(\theta|\text{Data}) = \frac{P(\text{Data}|\theta)P(\theta)}{\sum_{\theta}P(\text{Data}|\theta)P(\theta)}$$

Or it can be an integral over a continuous parameter space, where $P(\text{Data}) = \int d\theta P(\text{Data}|\theta)P(\theta)$, so the posterior probability distribution is:

$$ P(\theta|\text{Data}) = \frac{P(\text{Data}|\theta)P(\theta)}{\int d\theta P(\text{Data}|\theta)P(\theta)}$$
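
To make the discrete case concrete, here is a small, self-contained sketch (a hypothetical coin-flip example, separate from the linear model below): the probability of heads $\theta$ is assumed to take only the values 0.25, 0.5, or 0.75 with equal prior probability, we observe 6 heads in 9 flips, and the posterior is the likelihood times the prior normalized by the marginal likelihood.

import std/math

let
  thetas = @[0.25, 0.5, 0.75]   # discrete candidate values of theta
  prior = @[1/3, 1/3, 1/3]      # equal prior probability for each
  heads = 6
  flips = 9

var unnormalized = newSeq[float](thetas.len)
for i, theta in thetas:
  # binomial likelihood of the observed flips for this candidate theta
  let likelihood = binom(flips, heads).float *
    theta.pow(heads.float) * (1.0 - theta).pow((flips - heads).float)
  unnormalized[i] = likelihood * prior[i]

let marginal = sum(unnormalized)  # P(Data) = sum over theta of likelihood * prior
for i, theta in thetas:
  echo "P(theta = ", theta, " | Data) = ", unnormalized[i] / marginal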

In this tutorial we will consider a simple linear model $ y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i} $ where $\epsilon \sim N(0, \tau)$. Under this model, the parameters $\beta_0$ (y intercept) and $\beta_1$ (slope), describe the relationship between a predictor variable $x$ and a response variable $y$, with some unaccounted for residual random error ($\epsilon$) which is normally distributed with a mean of $0$ and standard deviation $\tau$.

We will estimate the values of the y-intercept ($\beta_{0}$), the slope ($\beta_{1}$), and the standard deviation of the residual random error ($\tau$) from observed $x$ and $y$ data.

Expressing this with Bayes rule gives us:

$$ \displaystyle P(\beta_{0}, \beta_{1}, \tau | Data) = \frac{P(Data|\beta_{0}, \beta_{1}, \tau) P(\beta_{0}, \beta_{1}, \tau)} {\iiint d\beta_{0} d\beta_{1} d\tau P(Data|\beta_{0}, \beta_{1}, \tau) P(\beta_{0}, \beta_{1}, \tau)} $$

Generate Data

We need some data to work with. Let's simulate data under the model with $\beta_{0}=0$, $\beta_{1}=1$, and $\tau=1$:

$ y = 0 + 1x + \epsilon$ where $\epsilon \sim N(0, 1) $

import std/sequtils, std/random, std/stats
var
  n = 100
  b0 = 0.0
  b1 = 1.0
  sd = 1.0
  x = newSeq[float](n)
  y = newSeq[float](n)
for i in 0 ..< n:
  x[i] = rand(0.0..100.0)
  y[i] = b0 + (b1 * x[i]) + gauss(0.0, sd)

We can use ggplotnim to see what these data look like.

import datamancer, ggplotnim
var sim = toDf(x, y)
ggplot(sim, aes("x", "y")) +
  geom_point() +
  ggsave("images/simulated-data.png")

Priors

We need to choose prior probability distributions for each of the model parameters that we are estimating. Let's use a normal distribution for the priors on $\beta_{0}$ and $\beta_{1}$. The $\tau$ parameter must be a positive value greater than 0 so let's use the gamma distribution as the prior on $\tau$.

$$ \beta_{0} \sim Normal(\mu_{0}, \sigma^{2})$$ $$ \beta_{1} \sim Normal(\mu_{1}, \sigma^{2})$$ $$ \tau \sim Gamma(\kappa_{0}, \theta_{0})$$

To calculate the prior probabilities of proposed model parameter values, we will need the probability density functions for these distributions.

Normal PDF

$$ p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}} $$

Gamma PDF

$$ p(x) = \frac{1}{\Gamma(k)\theta^{k}} x^{k-1} e^{-\frac{x}{\theta}} $$

import std/math

proc normalPdf(x, m, s: float): float =
  result = E.pow((-0.5 * ((x - m) / s)^2)) / (s * sqrt(2.0 * PI))

proc gammaPdf(x, k, t: float): float =
  result = x.pow(k - 1.0) * E.pow(-(x / t)) / (gamma(k) * t.pow(k))

We now need to decide how to parameterize these prior probability densities. Since this is for the purpose of demonstration, let's use fairly simple priors that comfortably cover the true values so that we can quickly get a good sample from the posterior (the normal priors are given as mean and standard deviation, and the gamma prior as shape and scale):

$$ \beta_{0} \sim Normal(0, 10)$$ $$ \beta_{1} \sim Normal(1, 10)$$ $$ \tau \sim Gamma(1, 10)$$

In a real analysis, the priors should encompass all values that we consider possible and it may also be a good idea to examine how sensitive the analysis is to the choice of priors.

type
  Normal = object
    mu: float
    sigma: float

  Gamma = object
    k: float
    sigma: float

  Priors = object
    b0: Normal
    b1: Normal
    sd: Gamma

var priors = Priors(
  b0: Normal(mu:0.0, sigma:10.0),
  b1: Normal(mu:1.0, sigma:10.0),
  sd: Gamma(k:1.0, sigma:10.0))

Now that we have prior probability density functions for each model parameter that we are estimating, we need to be able to compute the joint prior probability for all of the parameters of the model. We will actually be using the $ln$ of the probabilities to reduce rounding error since these values can be quite small.

proc logPrior(priors: Priors, b0, b1, sd: float ): float =
  let
    b0Prior = ln(normalPdf(b0, priors.b0.mu, priors.b0.sigma))
    b1Prior = ln(normalPdf(b1, priors.b1.mu, priors.b1.sigma))
    sdPrior = ln(gammaPdf(sd, priors.sd.k, priors.sd.sigma))
  result = b0Prior + b1Prior + sdPrior
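
As a quick aside on why the log scale matters here, this small sketch (not part of the model code above) multiplies many small densities directly and also sums their logs; the direct product underflows to zero while the log-scale sum remains a usable number.

import std/math

var
  product = 1.0
  logSum = 0.0
for i in 0 ..< 1000:
  product *= 1e-5    # direct product of 1000 small probability densities
  logSum += ln(1e-5) # the same computation carried out on the log scale
echo product  # 0.0 -- the product underflows to zero
echo logSum   # roughly -11512.9 -- still perfectly representable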

Likelihood

We need to be able to calculate the likelihood of the observed $y_{i}$ values given the observed $x_{i}$ values and the model parameter values, $\beta_{0}$, $\beta_{1}$, and $\tau$.

We can write the model in a slightly different way:

$$\mu = \beta_{0} +\beta_{1} x$$ $$ y \sim N(\mu, \tau) $$

Then to compute the likelihood for a given set of $\beta_{0}$, $\beta_{1}$, $\tau$ parameters and data values $x_{i}$ and $y_{i}$ we use the normal probability density function which we wrote before to compute our prior probabilities. We will work with the $ln$ of the likelihood as we did with the priors.

proc logLikelihood(x, y: seq[float], b0, b1, sd: float): float =
  var likelihoods = newSeq[float](y.len)
  for i in 0 ..< y.len:
    let pred = b0 + (b1 * x[i])
    likelihoods[i] = ln(normalPdf(y[i], pred, sd))
  result = sum(likelihoods)

Posterior

We cannot analytically solve the posterior probability distribution of our linear model as the integration of the marginal likelihood is intractable.

But we can approximate it with Markov chain Monte Carlo (MCMC) thanks to this property of Bayes rule:

$$ \displaystyle P(\beta_{0}, \beta_{1}, \tau | Data) \propto P(Data|\beta_{0}, \beta_{1}, \tau) P(\beta_{0}, \beta_{1}, \tau) $$

The marginal likelihood is a normalizing constant. Without it, we no longer have a probability density function. But the relative probability of a given set of parameter values to another set is the same. We only care about which parameter values are most probable so this is enough for us. We can determine which values have higher probability by randomly walking through parameter space while accepting or rejecting new values based on how probable they are.
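
To see why dropping the marginal likelihood does not change these comparisons, write out the ratio of the posterior densities for two candidate parameter sets $\theta_{1}$ and $\theta_{2}$; the normalizing constant cancels:

$$ \frac{P(\theta_{1}|\text{Data})}{P(\theta_{2}|\text{Data})} = \frac{P(\text{Data}|\theta_{1})P(\theta_{1}) / P(\text{Data})}{P(\text{Data}|\theta_{2})P(\theta_{2}) / P(\text{Data})} = \frac{P(\text{Data}|\theta_{1})P(\theta_{1})}{P(\text{Data}|\theta_{2})P(\theta_{2})} $$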

proc logPosterior(x, y: seq[float], priors: Priors, b0, b1, sd: float): float =
  let
    like = logLikelihood(x, y, b0, b1, sd)
    prior = logPrior(priors, b0, b1, sd)
  result = like + prior

MCMC

We will use a Metropolis-Hastings algorithm to approximate the unnormalized posterior. The steps of this algorithm are as follows:

  1. Initialize arbitrary starting values for each model parameter.

  2. Compute the unnormalized posterior probability density for the initialized model parameter values given the observed data.

  3. Propose a new value for one of the model parameters by randomly drawing from a symmetrical distribution centered on the present value. We will use a normal distribution.

  4. Compute the unnormalized posterior probability density for the proposed model parameter values given the observed data.

  5. Compute the ratio of the proposed probability density to the previous probability density, $ \alpha = \frac{P(\beta_{0}\prime, \beta_{1}\prime, \tau\prime | Data)}{P(\beta_{0}, \beta_{1}, \tau | Data)} $. This is called the acceptance ratio. Since we work with log probabilities, in practice it is computed as the exponential of the difference of the log posterior densities.

  6. All proposals with greater probability than the current state are accepted, so a proposal is accepted whenever the acceptance ratio is greater than 1. If the acceptance ratio is less than 1, the proposal is accepted with probability $\alpha$. In practice we can make a random draw from a uniform distribution ranging from 0 to 1 and accept the proposal whenever that uniform variate is less than the acceptance ratio, $ r < \alpha $ where $ r \sim Uniform(0, 1) $.

  7. If a proposal is accepted then we update the parameter in our model with the new state. Otherwise the state remains the same as before.

  8. We then repeat steps 3-7 until reaching a desired number of iterations that we believe give us an adequate sample from the unnormalized posterior distribution.

Note: when proposing values for $\tau$ from a normal distribution, it is possible for proposals to be zero or negative. However, the gamma prior probability distribution is only defined for values greater than 0, meaning such proposals are not valid and must be rejected. Fortunately, this approach still gives us a proper sample of the targeted unnormalized posterior probability distribution.

type
  MCMC = object
    x: seq[float]
    y: seq[float]
    nSamples: int
    priors: Priors
    propSd: float

  Samples = object
    n: int
    b0: seq[float]
    b1: seq[float]
    sd: seq[float]

proc run(m: MCMC, b0Start, b1Start, sdStart: float): Samples =
  var
    b0Samples = newSeq[float](m.nSamples)
    b1Samples = newSeq[float](m.nSamples)
    sdSamples = newSeq[float](m.nSamples)
    logProbs = newSeq[float](m.nSamples)
  b0Samples[0] = b0Start
  b1Samples[0] = b1Start
  sdSamples[0] = sdStart
  logProbs[0] = logPosterior(m.x, m.y, m.priors, b0Start, b1Start, sdStart)

  for i in 1..<m.nSamples:
    let
      b0Proposed = gauss(b0Samples[i-1], m.propSd)
      b1Proposed = gauss(b1Samples[i-1], m.propSd)
      sdProposed = gauss(sdSamples[i-1], m.propSd)
    if sdProposed > 0.0:
      var
        logProbProposed = logPosterior(m.x, m.y, m.priors, b0Proposed, b1Proposed, sdProposed)
        ratio = exp(logProbProposed - logProbs[i-1])
      if rand(1.0) < ratio:
        b0Samples[i] = b0Proposed
        b1Samples[i] = b1Proposed
        sdSamples[i] = sdProposed
        logProbs[i] = logProbProposed
      else:
        b0Samples[i] = b0Samples[i-1]
        b1Samples[i] = b1Samples[i-1]
        sdSamples[i] = sdSamples[i-1]
        logProbs[i] = logProbs[i-1]
    else:
      b0Samples[i] = b0Samples[i-1]
      b1Samples[i] = b1Samples[i-1]
      sdSamples[i] = sdSamples[i-1]
      logProbs[i] = logProbs[i-1]
  result = Samples(n:m.nSamples, b0:b0Samples, b1:b1Samples, sd:sdSamples)

Let's use this code to generate 100,000 samples. We'll cheat a bit and use the parameters that the data were simulated under as starting values to speed things up. A standard deviation of 0.1 seems to work pretty well for the proposal distribution of each parameter.

var
  mcmc = MCMC(x:x, y:y, nSamples:100000, priors:priors, propSd:0.1)
  samples = mcmc.run(0.0, 1.0, 1.0)

In some analyses, the posterior probability distribution could be multimodal which might make the MCMC chain sensitive to the starting value as it may get stuck in a local optimum and not sample from the full posterior distribution. It is therefore a good idea to acquire multiple samples with different starting values to see if they converge on the same parameter estimates. Here we will run our MCMC one more time with some different starting values.

var samples2 = mcmc.run(0.2, 1.01, 1.1)

Trace plots

We can get a sense for how well our MCMC performed, and therefore some sense for how good our estimates might be, by looking at the trace plots. Trace plots show the parameter value stored at each step in the MCMC chain: either the accepted proposal value, or the previous value if the proposal was rejected. Trace plots can be an unreliable indicator of MCMC performance, so it is a good idea to assess it with other strategies as well.

import std/strformat

proc plotTraces(samples: seq[Samples], prefix: string) =
  var
    sample = newSeq[seq[int]](samples.len)
    chain = newSeq[seq[int]](samples.len)
  for i in 0 ..< samples.len:
    sample[i] = toSeq(1 .. samples[i].n)
    chain[i] = repeat(i+1, samples[i].n )
  var
    df = toDf({
      "sample": concat(sample),
      "chain": concat(chain),
      "b0": concat(map(samples, proc(x: Samples): seq[float] = x.b0)),
      "b1": concat(map(samples, proc(x: Samples): seq[float] = x.b1)),
      "sd": concat(map(samples, proc(x: Samples): seq[float] = x.sd))
    })
  ggplot(df, aes(x="sample", y="b0")) +
    geom_line(aes(color="chain")) +
    ggsave(fmt"{prefix}b0.png")

  ggplot(df, aes(x="sample", y="b1")) +
    geom_line(aes(color="chain")) +
    ggsave(fmt"{prefix}b1.png")

  ggplot(df, aes(x="sample", y="sd")) +
    geom_line(aes(color="chain")) +
    ggsave(fmt"{prefix}sd.png")

plotTraces(@[samples, samples2], "images/trace-")

Burnin

Initially, the MCMC chain may spend time exploring unlikely regions of parameter space. We can get a better approximation of the posterior if we exclude these early steps in the chain. These excluded samples are referred to as the burnin. A burnin of 10% seems to be more than enough with our informative priors and starting values.

proc burn(samples: Samples, p: float): Samples =
  var
    burnIx = (samples.n.float * p).ceil.int
    n = samples.n - burnIx
    b0 = samples.b0[burnIx..^1]
    b1 = samples.b1[burnIx..^1]
    sd = samples.sd[burnIx..^1]
  result = Samples(n:n, b0:b0, b1:b1, sd:sd)

var
  burnin1 = burn(samples, 0.1)
  burnin2 = burn(samples2, 0.1)

Histograms

Now that we have our post burnin samples, let's see what our posterior probability distributions look like for each model parameter.

proc plotHistograms(samples: seq[Samples], prefix: string) =
  var chain = newSeq[seq[int]](samples.len)
  for i in 0 ..< samples.len:
    chain[i] = repeat(i+1, samples[i].n )
  var
    df = toDf({
      "chain": concat(chain),
      "b0": concat(map(samples, proc(x: Samples): seq[float] = x.b0)),
      "b1": concat(map(samples, proc(x: Samples): seq[float] = x.b1)),
      "sd": concat(map(samples, proc(x: Samples): seq[float] = x.sd))
    })
  ggplot(df, aes(x="b0", fill="chain")) +
    geom_histogram(position="identity", alpha=some(0.5)) +
    ggsave(fmt"{prefix}b0.png")

  ggplot(df, aes(x="b1", fill="chain")) +
    geom_histogram(position="identity", alpha=some(0.5)) +
    ggsave(fmt"{prefix}b1.png")

  ggplot(df, aes(x="sd", fill="chain")) +
    geom_histogram(position="identity", alpha=some(0.5)) +
    ggsave(fmt"{prefix}sd.png")

plotHistograms(@[burnin1, burnin2], "images/hist-")

Posterior Summarization

It's always great to visualize your results. But it's also useful to calculate some summary statistics for the posterior. There are several different ways you could do this.

Combine Chains

Before we summarize the posterior, let's combine the samples into one. Since we have determined that our MCMC is performing well, we can be confident that each post burnin sample is a sample from the same distribution.

proc concat(samples: seq[Samples]): Samples =
  var
    n =  concat(map(samples, proc(x: Samples): int = x.n))
    b0 = concat(map(samples, proc(x: Samples): seq[float] = x.b0))
    b1 = concat(map(samples, proc(x: Samples): seq[float] = x.b1))
    sd = concat(map(samples, proc(x: Samples): seq[float] = x.sd))
  result = Samples(n:sum(n), b0:b0, b1:b1, sd:sd)

var burnin = concat(@[burnin1, burnin2])

Posterior means

One way to summarize the estimates of the posterior distributions is to simply calculate the mean. Let's see how close these values are to the true values of the parameters.

import stats

var
  meanB0 = mean(burnin.b0).round(3)
  meanB1 = mean(burnin.b1).round(3)
  meanSd = mean(burnin.sd).round(3)

echo "Mean b0: ", meanB0
echo "Mean b1: ", meanB1
echo "Mean sd: ", meanSd
Mean b0: 0.172
Mean b1: 0.997
Mean sd: 1.124

Credible Intervals

The means give us a point estimate for our parameter values but they tell us nothing about the uncertainty of our estimates. We can get a sense for that by looking at credible intervals. There are two widely used approaches for this: equal-tailed intervals and highest density intervals. These will often match each other closely when the target distribution is unimodal and symmetric. We will calculate the 89% interval for each of these below. Why 89%? Why not? Credible interval threshold values are completely arbitrary.

Equal Tailed Interval

In this interval, the probability of being below the interval is equal to the probability of being above the interval.

import algorithm

proc quantile(samples: seq[float], interval: float): float =
  let
    s = sorted(samples, system.cmp[float])
    k = float(s.len - 1) * interval
    f = floor(k)
    c = ceil(k)
  if f == c:
    result = s[int(k)]
  else:
    let
      d0 = s[int(f)] * (c - k)
      d1 = s[int(c)] * (k - f)
    result = d0 + d1

proc eti(samples: seq[float], interval: float): (float, float) =
  let
    p = (1 - interval) / 2
  let
    q0 = quantile(samples, p)
    q1 = quantile(samples, 1 - p)
  result = (q0, q1)

var
  (b0EtiMin, b0EtiMax) = eti(burnin.b0, 0.89)
  (b1EtiMin, b1EtiMax) = eti(burnin.b1, 0.89)
  (sdEtiMin, sdEtiMax) = eti(burnin.sd, 0.89)

echo "Eti b0: ", b0EtiMin.round(3), " - ", b0EtiMax.round(3)
echo "Eti b1: ", b1EtiMin.round(3), " - ", b1EtiMax.round(3)
echo "Eti sd: ", sdEtiMin.round(3), " - ", sdEtiMax.round(3)
Eti b0: -0.178 - 0.527
Eti b1: 0.991 - 1.003
Eti sd: 1.005 - 1.26

Highest Posterior Density Interval

In this interval, all values inside of the interval have a higher probability density than values outside of the interval.

proc hdi(samples: seq[float], credMass: float): (float, float) =
  let
    sortedSamples = sorted(samples, system.cmp[float])
    ciIdxInc = int(floor(credMass * float(sortedSamples.len)))
    nCIs = sortedSamples.len - ciIdxInc
  var ciWidth = newSeq[float](nCIs)
  for i in 0..<nCIs:
    ciWidth[i] = sortedSamples[i + ciIdxInc] - sortedSamples[i]
  let
    minCiWidthIx = minIndex(ciWidth)
    hdiMin = sortedSamples[minCiWidthIx]
    hdiMax = sortedSamples[minCiWidthIx + ciIdxInc]
  result = (hdiMin, hdiMax)

var
  (b0HdiMin, b0HdiMax) = hdi(burnin.b0, 0.89)
  (b1HdiMin, b1HdiMax) = hdi(burnin.b1, 0.89)
  (sdHdiMin, sdHdiMax) = hdi(burnin.sd, 0.89)

echo "Hdi b0: ", b0HdiMin.round(3), " - ", b0HdiMax.round(3)
echo "Hdi b1: ", b1HdiMin.round(3), " - ", b1HdiMax.round(3)
echo "Hdi sd: ", sdHdiMin.round(3), " - ", sdHdiMax.round(3)
Hdi b0: -0.172 - 0.533
Hdi b1: 0.991 - 1.003
Hdi sd: 1.001 - 1.25

Standardize Data

The $\beta_{0}$ (intercept) and $\beta_{1}$ (slope) parameters of a linear model present a bit of a challenge for MCMC because believable values for them are tightly correlated. This means that a lot of proposed values will be rejected and the chain will not move efficiently. An easy way to get around this problem is to standardize our data by rescaling them relative to their mean and standard deviation.

$$ \zeta_{x_{i}} = \frac{x_{i} - \bar{x}}{SD_{x}} $$ $$ \zeta_{y_{i}} = \frac{y_{i} - \bar{y}}{SD_{y}} $$

var
  xSt = newSeq[float](n)
  ySt = newSeq[float](n)
  xMean = mean(x)
  xSd = standardDeviation(x)
  yMean = mean(y)
  ySd = standardDeviation(y)
for i in 0 ..< n:
  xSt[i] = (x[i] - xMean) / xSd
  ySt[i] = (y[i] - yMean) / ySd

We can see that both the $y$ and $x$ values are now centered around zero and have the same scale.

var standardized = toDf(xSt, ySt)
ggplot(standardized, aes(x="xSt", y="ySt")) +
  geom_point() +
  ggsave("images/st-simulated-data.png")

We can now run the MCMC just as before with some minor changes. The standard deviation $\tau$ for the standardized data is going to be much smaller. Let's make the prior more informative. The standard deviation of the proposal distribution used before would result in most proposals being rejected so we should use a smaller value this time. Finally we should use some different starting values than before.

var
  st_priors = Priors(
    b0: Normal(mu:0.0, sigma:1.0),
    b1: Normal(mu:1.0, sigma:1.0),
    sd: Gamma(k:0.0035, sigma:1.0))
  st_mcmc = MCMC(x:xSt, y:ySt, nSamples:100000, priors:st_priors, propSd:0.001)
  st_samples1 = st_mcmc.run(0.0, 1.0, 0.0034)
  st_samples2 = st_mcmc.run(0.1, 1.01, 0.0036)

Convert back to original scale

To interpret these estimates we need to convert back to the original scale. $$ \beta_{0} = \zeta_{0} SD_{y} + \bar{y} - \zeta_{1} SD_{y} \bar{x} / SD_{x} $$ $$ \beta_{1} = \zeta_{1} SD_{y} / SD_{x} $$ $$ \tau = \zeta_{\tau} * SD_{y} $$
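
These back-transformations follow from substituting the standardization into the standardized model. If $\frac{y - \bar{y}}{SD_{y}} = \zeta_{0} + \zeta_{1} \frac{x - \bar{x}}{SD_{x}} + \epsilon^{\prime}$ with $\epsilon^{\prime} \sim N(0, \zeta_{\tau})$, then solving for $y$ gives:

$$ y = \left(\zeta_{0} SD_{y} + \bar{y} - \zeta_{1} SD_{y} \frac{\bar{x}}{SD_{x}}\right) + \left(\zeta_{1} \frac{SD_{y}}{SD_{x}}\right) x + SD_{y} \epsilon^{\prime} $$

so the first term in parentheses is $\beta_{0}$, the second is $\beta_{1}$, and the residual standard deviation on the original scale is $\zeta_{\tau} SD_{y}$.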

proc backTransform(samples: Samples, xMean, xSd, yMean, ySd: float): Samples =
  var
    b0 = newSeq[float](samples.n)
    b1 = newSeq[float](samples.n)
    sd = newSeq[float](samples.n)
  for i in 0 ..< samples.n:
    b0[i] = samples.b0[i] * ySd + yMean - samples.b1[i] * ySd * xMean / xSd
    b1[i] = samples.b1[i] * ySd / xSd
    sd[i] = samples.sd[i] * ySd
  result = Samples(n:samples.n, b0:b0, b1:b1, sd:sd)

var
  st_samples_trans1 = backTransform(st_samples1, xMean, xSd, yMean, ySd)
  st_samples_trans2 = backTransform(st_samples2, xMean, xSd, yMean, ySd)

Traceplots

Let's have a look at the trace plots.

plotTraces(@[st_samples_trans1, st_samples_trans2], "images/trace-st-")

It looks like more of the proposals were accepted and we have a much better sample from the posterior.

Burnin

Let's get a post burnin sample as we did before.

var
  st_burnin1 = burn(st_samples_trans1, 0.1)
  st_burnin2 = burn(st_samples_trans2, 0.1)

Histograms

Now we can have a look at the posterior distribution of our estimates.

plotHistograms(@[st_burnin1, st_burnin2], "images/hist-st-")

These look pretty good but it's hard to know how they compare to the previous MCMC estimates. Let's summarize the posterior to get a better idea.

var st_burnin = concat(@[st_burnin1, st_burnin2])

var
  st_meanB0 = mean(st_burnin.b0).round(3)
  st_meanB1 = mean(st_burnin.b1).round(3)
  st_meanSd = mean(st_burnin.sd).round(3)

echo "Mean b0: ", meanB0
echo "Mean b1: ", meanB1
echo "Mean sd: ", meanSd

var
  (st_b0EtiMin, st_b0EtiMax) = eti(st_burnin.b0, 0.89)
  (st_b1EtiMin, st_b1EtiMax) = eti(st_burnin.b1, 0.89)
  (st_sdEtiMin, st_sdEtiMax) = eti(st_burnin.sd, 0.89)

echo "Eti b0: ", st_b0EtiMin.round(3), " - ", st_b0EtiMax.round(3)
echo "Eti b1: ", st_b1EtiMin.round(3), " - ", st_b1EtiMax.round(3)
echo "Eti sd: ", st_sdEtiMin.round(3), " - ", st_sdEtiMax.round(3)

var
  (st_b0HdiMin, st_b0HdiMax) = hdi(st_burnin.b0, 0.89)
  (st_b1HdiMin, st_b1HdiMax) = hdi(st_burnin.b1, 0.89)
  (st_sdHdiMin, st_sdHdiMax) = hdi(st_burnin.sd, 0.89)

echo "Hdi b0: ", st_b0HdiMin.round(3), " - ", st_b0HdiMax.round(3)
echo "Hdi b1: ", st_b1HdiMin.round(3), " - ", st_b1HdiMax.round(3)
echo "Hdi sd: ", st_sdHdiMin.round(3), " - ", st_sdHdiMax.round(3)
Mean b0: 0.172
Mean b1: 0.997
Mean sd: 1.124
Eti b0: -0.167 - 0.516
Eti b1: 0.991 - 1.003
Eti sd: 0.993 - 1.25
Hdi b0: -0.17 - 0.513
Hdi b1: 0.991 - 1.003
Hdi sd: 0.985 - 1.24

In comparison to the previous MCMC results shown below, the standardized analysis produced very similar estimates. But we can have greater confidence in them because the standardized analysis gave us a better posterior sample.

echo "Mean b0: ", meanB0
echo "Mean b1: ", meanB1
echo "Mean sd: ", meanSd

echo "Eti b0: ", b0EtiMin.round(3), " - ", b0EtiMax.round(3)
echo "Eti b1: ", b1EtiMin.round(3), " - ", b1EtiMax.round(3)
echo "Eti sd: ", sdEtiMin.round(3), " - ", sdEtiMax.round(3)

echo "Hdi b0: ", b0HdiMin.round(3), " - ", b0HdiMax.round(3)
echo "Hdi b1: ", b1HdiMin.round(3), " - ", b1HdiMax.round(3)
echo "Hdi sd: ", sdHdiMin.round(3), " - ", sdHdiMax.round(3)
Mean b0: 0.172
Mean b1: 0.997
Mean sd: 1.124
Eti b0: -0.178 - 0.527
Eti b1: 0.991 - 1.003
Eti sd: 1.005 - 1.26
Hdi b0: -0.172 - 0.533
Hdi b1: 0.991 - 1.003
Hdi sd: 1.001 - 1.25

Final Note

We could have simply and more efficiently done all of this using least squares regression. However, the Bayesian approach allows us to very easily and intuitively express uncertainty about our estimates and can be easily extended to much more complex models for which there are not such simple solutions. We could also incorporate prior knowledge or assumptions in a way not possible with frequentist approaches. The exercise above provides a starting point for this and shows just how easy it is to do with Nim!
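
As a rough sanity check on that first claim, the sketch below (an illustration added here, not part of the tutorial's code above) computes the ordinary least squares estimates for the same simulated x and y using the closed-form formulas for simple linear regression; the slope and intercept it prints should land close to the posterior means reported earlier.

import std/math, std/stats

# Closed-form simple linear regression:
#   b1 = cov(x, y) / var(x)
#   b0 = mean(y) - b1 * mean(x)
proc leastSquares(xs, ys: seq[float]): (float, float) =
  let
    xBar = mean(xs)
    yBar = mean(ys)
  var
    covXY = 0.0
    varX = 0.0
  for i in 0 ..< xs.len:
    covXY += (xs[i] - xBar) * (ys[i] - yBar)
    varX += (xs[i] - xBar) ^ 2
  let slope = covXY / varX
  result = (yBar - slope * xBar, slope)

let (olsB0, olsB1) = leastSquares(x, y)
echo "OLS b0: ", olsB0.round(3)
echo "OLS b1: ", olsB1.round(3)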

import nimib

nbInit
nbDoc.useLatex

nbText: md"""
# Bayesian Inference with Linear Model

At the time of this writing, Nim does not have any libraries for
conducting Bayesian inference. However, it is quite easy for us to
write all of the necessary code ourselves. This is a good exercise for learning
about Bayesian inference, and the syntax and speed of Nim make it an excellent
choice for this task. In this tutorial we will walk through
the different parts needed to perform Bayesian inference with a linear model.
We will assume that you have some basic understanding of Bayesian
inference already. There are many excellent introductions available in books
and online.

Bayes theorem states:

$$ P(\theta|\text{Data}) = \frac{P(\text{Data}|\theta)P(\theta)}{P(\text{Data})}$$

Each of these terms is referred to as:

$$ \text{Posterior} = \frac{\text{Likelihood} \cdot \text{Prior}}{\text{Marginal Likelihood}}$$

The marginal probability $P(\text{Data})$ can be a sum over a discrete parameter space,
where $P(\text{Data}) = \sum_{\theta}P(\text{Data}|\theta)P(\theta)$ so the posterior
probability distribution is:

$$ P(\theta|\text{Data}) = \frac{P(\text{Data}|\theta)P(\theta)}{\sum_{\theta}P(\text{Data}|\theta)P(\theta)}$$

Or it can be an integral over a continuous parameter space,
where $P(\text{Data}) = \int d\theta P(\text{Data}|\theta)P(\theta)$ so the posterior
probability distribution is:

$$ P(\theta|\text{Data}) = \frac{P(\text{Data}|\theta)P(\theta)}{\int d\theta P(\text{Data}|\theta)P(\theta)}$$

In this tutorial we will consider a simple linear model
$ y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i} $ where $\epsilon \sim N(0, \tau)$.
Under this model, the parameters $\beta_0$ (y intercept) and $\beta_1$ (slope),
describe the relationship between a
predictor variable $x$ and a response variable $y$, with some unaccounted for
residual random error ($\epsilon$) which is normally distributed with a mean of $0$ and
standard deviation $\tau$.

We will estimate the values of the y-intercept ($\beta_{0}$), the slope
($\beta_{1}$), and the standard deviation of the residual random error ($\tau$)
from observed $x$ and $y$ data.

Expressing this with Bayes rule gives us:

$$ \displaystyle P(\beta_{0}, \beta_{1}, \tau | Data) =
  \frac{P(Data|\beta_{0}, \beta_{1}, \tau) P(\beta_{0}, \beta_{1}, \tau)}
  {\iiint d\beta_{0} d\beta_{1} d\tau P(Data|\beta_{0}, \beta_{1}, \tau) P(\beta_{0}, \beta_{1}, \tau)} $$


# Generate Data
We need some data to work with. Let's simulate data
under the model with $\beta_{0}=0$, $\beta_{1}=1$, and $\tau=1$:

$ y = 0 + 1x + \epsilon$ where $\epsilon \sim N(0, 1) $

"""

nbCode:
  import std/sequtils, std/random, std/stats
  var
    n = 100
    b0 = 0.0
    b1 = 1.0
    sd = 1.0
    x = newSeq[float](n)
    y = newSeq[float](n)
  for i in 0 ..< n:
    x[i] = rand(0.0..100.0)
    y[i] = b0 + (b1 * x[i]) + gauss(0.0, sd)

nbText: md"""
We can use `ggplotnim` to see what these data look like.
"""

nbCode:
  import datamancer, ggplotnim
  var sim = toDf(x, y)
  ggplot(sim, aes("x", "y")) +
    geom_point() +
    ggsave("images/simulated-data.png")
nbImage("images/simulated-data.png")

nbText: md"""
# Priors
We need to choose prior probability distributions for each of the model parameters
that we are estimating.  Let's use a normal distribution for the priors on
$\beta_{0}$ and $\beta_{1}$. The $\tau$ parameter must be a positive value
greater than 0 so let's use the gamma distribution as the prior on $\tau$.

$$ \beta_{0} \sim Normal(\mu_{0}, \sigma^{2})$$
$$ \beta_{1} \sim Normal(\mu_{1}, \sigma^{2})$$
$$ \tau \sim Gamma(\kappa_{0}, \theta_{0})$$

To calculate the prior probabilities of proposed model parameter values, we will need
the probability density functions for these distributions.

#### Normal PDF
$$ p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}} $$

#### Gamma PDF
$$ p(x) = \frac{1}{\Gamma(k)\theta^{k}} x^{k-1} e^{-\frac{x}{\theta}} $$
"""

nbCode:
  import std/math

  proc normalPdf(x, m, s: float): float =
    result = E.pow((-0.5 * ((x - m) / s)^2)) / (s * sqrt(2.0 * PI))

  proc gammaPdf(x, k, t: float): float =
    result = x.pow(k - 1.0) * E.pow(-(x / t)) / (gamma(k) * t.pow(k))

nbText: md"""
We now need to decide how to parameterize these prior probability densities.
Since this is for the purpose of demonstration, let's use fairly simple priors
that comfortably cover the true values so that we can quickly get a good sample
from the posterior (the normal priors are given as mean and standard deviation,
and the gamma prior as shape and scale):

$$ \beta_{0} \sim Normal(0, 10)$$
$$ \beta_{1} \sim Normal(1, 10)$$
$$ \tau \sim Gamma(1, 10)$$

In a real analysis, the priors should encompass all values that we consider possible and
it may also be a good idea to examine how sensitive the analysis is to the choice
of priors.
"""

nbCode:
  type
    Normal = object
      mu: float
      sigma: float

    Gamma = object
      k: float
      sigma: float

    Priors = object
      b0: Normal
      b1: Normal
      sd: Gamma

  var priors = Priors(
    b0: Normal(mu:0.0, sigma:10.0),
    b1: Normal(mu:1.0, sigma:10.0),
    sd: Gamma(k:1.0, sigma:10.0))

nbText: md"""
Now that we have prior probability density functions for each model parameter
that we are estimating, we need to be able to compute the joint prior probability
for all of the parameters of the model.
We will actually be using the $ln$ of the probabilities to
reduce rounding error since these values can be quite small.
"""

nbCode:
  proc logPrior(priors: Priors, b0, b1, sd: float ): float =
    let
      b0Prior = ln(normalPdf(b0, priors.b0.mu, priors.b0.sigma))
      b1Prior = ln(normalPdf(b1, priors.b1.mu, priors.b1.sigma))
      sdPrior = ln(gammaPdf(sd, priors.sd.k, priors.sd.sigma))
    result = b0Prior + b1Prior + sdPrior

nbText: md"""
# Likelihood
We need to be able to calculate the likelihood of the observed $y_{i}$ values
given the observed $x_{i}$ values and the model parameter values, $\beta_{0}$,
$\beta_{1}$, and $\tau$.

We can write the model in a slightly different way:

$$\mu = \beta_{0} +\beta_{1} x$$
$$ y \sim N(\mu, \tau) $$

Then to compute the likelihood for a given set of $\beta_{0}$, $\beta_{1}$, $\tau$
parameters and data values $x_{i}$ and $y_{i}$ we use the normal probability
density function which we wrote before to compute our prior probabilities.
We will work with the $ln$ of the likelihood as we did with the priors.
"""

nbCode:
  proc logLikelihood(x, y: seq[float], b0, b1, sd: float): float =
    var likelihoods = newSeq[float](y.len)
    for i in 0 ..< y.len:
      let pred = b0 + (b1 * x[i])
      likelihoods[i] = ln(normalPdf(y[i], pred, sd))
    result = sum(likelihoods)

nbText: md"""
# Posterior
We cannot analytically solve the posterior probability distribution of our
linear model as the integration of the marginal likelihood is intractable.

But we can approximate it with Markov chain Monte Carlo (MCMC) thanks to this
property of Bayes rule:

$$ \displaystyle P(\beta_{0}, \beta_{1}, \tau | Data) \propto
  P(Data|\beta_{0}, \beta_{1}, \tau) P(\beta_{0}, \beta_{1}, \tau) $$

The marginal likelihood is a normalizing constant. Without it, we no longer
have a probability density function. But the relative probability of a given set
of parameter values to another set is the same. We only care about which parameter
values are most probable so this is enough for us. We can determine which values
have higher probability by randomly walking through parameter space while accepting
or rejecting new values based on how probable they are.

"""

nbCode:
  proc logPosterior(x, y: seq[float], priors: Priors, b0, b1, sd: float): float =
    let
      like = logLikelihood(x, y, b0, b1, sd)
      prior = logPrior(priors, b0, b1, sd)
    result = like + prior

nbText: md"""
# MCMC
We will use a Metropolis-Hastings algorithm to approximate the unnormalized posterior.
The steps of this algorithm are as follows:
1. Initialize arbitrary starting values for each model parameter.
2. Compute the unnormalized posterior probability density for the initialized
model parameter values given the observed data.
3. Propose a new value for one of the model parameters by randomly drawing from
a symmetrical distribution centered on the present value. We will use a normal
distribution.
4. Compute the unnormalized posterior probability density for the proposed model
parameter values given the observed data.
5. Compute the ratio of the proposed probability density to the previous
probability density, $ \alpha = \frac{P(\beta_{0}\prime, \beta_{1}\prime, \tau\prime | Data)}{P(\beta_{0}, \beta_{1}, \tau | Data)} $.
This is called the acceptance ratio. Since we work with log probabilities, in
practice it is computed as the exponential of the difference of the log posterior densities.
6. All proposals with greater probability than the current state are accepted,
so a proposal is accepted whenever the acceptance ratio is greater than 1. If the
acceptance ratio is less than 1, the proposal is accepted with probability $\alpha$.
In practice we can make a random draw from a uniform distribution ranging from 0 to 1
and accept the proposal whenever that uniform variate is less than the acceptance
ratio, $ r < \alpha $ where $ r \sim Uniform(0, 1) $.
7. If a proposal is accepted then we update the parameter in our model with
the new state. Otherwise the state remains the same as before.
8. We then repeat steps 3-7 until reaching a desired number of iterations that
we believe give us an adequate sample from the unnormalized posterior distribution.

Note: when proposing values for $\tau$ from a normal distribution, it is possible for
proposals to be zero or negative. However, the gamma prior probability distribution
is only defined for values greater than 0, meaning such proposals are not valid and
must be rejected. Fortunately, this approach still gives us a proper sample of the
targeted unnormalized posterior probability distribution.

"""

nbCode:
  type
    MCMC = object
      x: seq[float]
      y: seq[float]
      nSamples: int
      priors: Priors
      propSd: float

    Samples = object
      n: int
      b0: seq[float]
      b1: seq[float]
      sd: seq[float]

  proc run(m: MCMC, b0Start, b1Start, sdStart: float): Samples =
    var
      b0Samples = newSeq[float](m.nSamples)
      b1Samples = newSeq[float](m.nSamples)
      sdSamples = newSeq[float](m.nSamples)
      logProbs = newSeq[float](m.nSamples)
    b0Samples[0] = b0Start
    b1Samples[0] = b1Start
    sdSamples[0] = sdStart
    logProbs[0] = logPosterior(m.x, m.y, m.priors, b0Start, b1Start, sdStart)

    for i in 1..<m.nSamples:
      let
        b0Proposed = gauss(b0Samples[i-1], m.propSd)
        b1Proposed = gauss(b1Samples[i-1], m.propSd)
        sdProposed = gauss(sdSamples[i-1], m.propSd)
      if sdProposed > 0.0:
        var
          logProbProposed = logPosterior(m.x, m.y, m.priors, b0Proposed, b1Proposed, sdProposed)
          ratio = exp(logProbProposed - logProbs[i-1])
        if rand(1.0) < ratio:
          b0Samples[i] = b0Proposed
          b1Samples[i] = b1Proposed
          sdSamples[i] = sdProposed
          logProbs[i] = logProbProposed
        else:
          b0Samples[i] = b0Samples[i-1]
          b1Samples[i] = b1Samples[i-1]
          sdSamples[i] = sdSamples[i-1]
          logProbs[i] = logProbs[i-1]
      else:
        b0Samples[i] = b0Samples[i-1]
        b1Samples[i] = b1Samples[i-1]
        sdSamples[i] = sdSamples[i-1]
        logProbs[i] = logProbs[i-1]
    result = Samples(n:m.nSamples, b0:b0Samples, b1:b1Samples, sd:sdSamples)

nbText: md"""
Let's use this code to generate 100,000 samples. We'll cheat a bit and use the parameters
that the data were simulated under as starting values to speed things up. A standard deviation of 0.1
seems to work pretty well for the proposal distribution of each parameter.
"""

nbCode:
  var
    mcmc = MCMC(x:x, y:y, nSamples:100000, priors:priors, propSd:0.1)
    samples = mcmc.run(0.0, 1.0, 1.0)

nbText: md"""
In some analyses, the posterior probability distribution could be multimodal
which might make the MCMC chain sensitive to the starting value as it may get stuck
in a local optimum and not sample from the full posterior distribution. It is therefore
a good idea to acquire multiple samples with different starting values to see if they
converge on the same parameter estimates. Here we will run our MCMC one more time
with some different starting values.
"""

nbCode:
  var samples2 = mcmc.run(0.2, 1.01, 1.1)

nbText: md"""
# Trace plots
We can get a sense for how well our MCMC performed, and therefore some sense for
how good our estimates might be, by looking at the trace plots. Trace plots
show the parameter value stored at each step in the MCMC chain: either
the accepted proposal value, or the previous value if the proposal was rejected. Trace
plots can be an unreliable indicator of MCMC performance, so it is a good
idea to assess it with other strategies as well.
"""

nbCode:
  import std/strformat

  proc plotTraces(samples: seq[Samples], prefix: string) =
    var
      sample = newSeq[seq[int]](samples.len)
      chain = newSeq[seq[int]](samples.len)
    for i in 0 ..< samples.len:
      sample[i] = toSeq(1 .. samples[i].n)
      chain[i] = repeat(i+1, samples[i].n )
    var
      df = toDf({
        "sample": concat(sample),
        "chain": concat(chain),
        "b0": concat(map(samples, proc(x: Samples): seq[float] = x.b0)),
        "b1": concat(map(samples, proc(x: Samples): seq[float] = x.b1)),
        "sd": concat(map(samples, proc(x: Samples): seq[float] = x.sd))
      })
    ggplot(df, aes(x="sample", y="b0")) +
      geom_line(aes(color="chain")) +
      ggsave(fmt"{prefix}b0.png")

    ggplot(df, aes(x="sample", y="b1")) +
      geom_line(aes(color="chain")) +
      ggsave(fmt"{prefix}b1.png")

    ggplot(df, aes(x="sample", y="sd")) +
      geom_line(aes(color="chain")) +
      ggsave(fmt"{prefix}sd.png")

  plotTraces(@[samples, samples2], "images/trace-")

nbImage("images/trace-b0.png")
nbImage("images/trace-b1.png")
nbImage("images/trace-sd.png")

nbText: md"""
# Burnin
Initially, the MCMC chain may spend time exploring unlikely regions of
parameter space. We can get a better approximation of the posterior if we
exclude these early steps in the chain. These excluded samples are referred to
as the burnin. A burnin of 10% seems to be more than enough with our informative priors
and starting values.
"""

nbCode:
  proc burn(samples: Samples, p: float): Samples =
    var
      burnIx = (samples.n.float * p).ceil.int
      n = samples.n - burnIx
      b0 = samples.b0[burnIx..^1]
      b1 = samples.b1[burnIx..^1]
      sd = samples.sd[burnIx..^1]
    result = Samples(n:n, b0:b0, b1:b1, sd:sd)

  var
    burnin1 = burn(samples, 0.1)
    burnin2 = burn(samples2, 0.1)

nbText: md"""
# Histograms
Now that we have our post burnin samples, let's see what our posterior probability
distributions look like for each model parameter.
"""

nbCode:
  proc plotHistograms(samples: seq[Samples], prefix: string) =
    var chain = newSeq[seq[int]](samples.len)
    for i in 0 ..< samples.len:
      chain[i] = repeat(i+1, samples[i].n )
    var
      df = toDf({
        "chain": concat(chain),
        "b0": concat(map(samples, proc(x: Samples): seq[float] = x.b0)),
        "b1": concat(map(samples, proc(x: Samples): seq[float] = x.b1)),
        "sd": concat(map(samples, proc(x: Samples): seq[float] = x.sd))
      })
    ggplot(df, aes(x="b0", fill="chain")) +
      geom_histogram(position="identity", alpha=some(0.5)) +
      ggsave(fmt"{prefix}b0.png")

    ggplot(df, aes(x="b1", fill="chain")) +
      geom_histogram(position="identity", alpha=some(0.5)) +
      ggsave(fmt"{prefix}b1.png")

    ggplot(df, aes(x="sd", fill="chain")) +
      geom_histogram(position="identity", alpha=some(0.5)) +
      ggsave(fmt"{prefix}sd.png")

  plotHistograms(@[burnin1, burnin2], "images/hist-")

nbImage("images/hist-b0.png")
nbImage("images/hist-b1.png")
nbImage("images/hist-sd.png")

nbText: md"""
# Posterior Summarization
It's always great to visualize your results. But it's also useful to calculate some
summary statistics for the posterior. There are several different ways you could
do this.

## Combine Chains
Before we summarize the posterior, let's combine the samples into one.
Since we have determined that our MCMC is performing well, we can be confident
that each post burnin sample is a sample from the same distribution.
"""

nbCode:
  proc concat(samples: seq[Samples]): Samples =
    var
      n =  concat(map(samples, proc(x: Samples): int = x.n))
      b0 = concat(map(samples, proc(x: Samples): seq[float] = x.b0))
      b1 = concat(map(samples, proc(x: Samples): seq[float] = x.b1))
      sd = concat(map(samples, proc(x: Samples): seq[float] = x.sd))
    result = Samples(n:sum(n), b0:b0, b1:b1, sd:sd)

  var burnin = concat(@[burnin1, burnin2])

nbText: md"""
## Posterior means
One way to summarize the estimates of the posterior distributions is to simply calculate
the mean. Let's see how close these values are to the true values of the parameters.
 """

nbCode:
  import stats

  var
    meanB0 = mean(burnin.b0).round(3)
    meanB1 = mean(burnin.b1).round(3)
    meanSd = mean(burnin.sd).round(3)

  echo "Mean b0: ", meanB0
  echo "Mean b1: ", meanB1
  echo "Mean sd: ", meanSd

nbText: md"""
## Credible Intervals
The means give us a point estimate for our parameter values but they tell us
nothing about the uncertainty of our estimates. We can get a sense for that by
looking at credible intervals. There are two widely used approaches for this:
equal-tailed intervals and highest density intervals. These will often match
each other closely when the target distribution is unimodal and symmetric.
We will calculate the 89% interval for each of these below. Why 89%? Why not?
Credible interval threshold values are completely arbitrary.

### Equal Tailed Interval
In this interval, the probability of being below the interval is equal to the 
probability of being above the interval.
"""

nbCode:
  import algorithm

  proc quantile(samples: seq[float], interval: float): float =
    let
      s = sorted(samples, system.cmp[float])
      k = float(s.len - 1) * interval
      f = floor(k)
      c = ceil(k)
    if f == c:
      result = s[int(k)]
    else:
      let
        d0 = s[int(f)] * (c - k)
        d1 = s[int(c)] * (k - f)
      result = d0 + d1

  proc eti(samples: seq[float], interval: float): (float, float) =
    let
      p = (1 - interval) / 2
    let
      q0 = quantile(samples, p)
      q1 = quantile(samples, 1 - p)
    result = (q0, q1)

  var
    (b0EtiMin, b0EtiMax) = eti(burnin.b0, 0.89)
    (b1EtiMin, b1EtiMax) = eti(burnin.b1, 0.89)
    (sdEtiMin, sdEtiMax) = eti(burnin.sd, 0.89)

  echo "Eti b0: ", b0EtiMin.round(3), " - ", b0EtiMax.round(3)
  echo "Eti b1: ", b1EtiMin.round(3), " - ", b1EtiMax.round(3)
  echo "Eti sd: ", sdEtiMin.round(3), " - ", sdEtiMax.round(3)


nbText: md"""
### Highest Posterior Density Interval
In this interval, all values inside of the interval have a higher probability 
density than values outside of the interval.
"""

nbCode:
  proc hdi(samples: seq[float], credMass: float): (float, float) =
    let
      sortedSamples = sorted(samples, system.cmp[float])
      ciIdxInc = int(floor(credMass * float(sortedSamples.len)))
      nCIs = sortedSamples.len - ciIdxInc
    var ciWidth = newSeq[float](nCIs)
    for i in 0..<nCIs:
      ciWidth[i] = sortedSamples[i + ciIdxInc] - sortedSamples[i]
    let
      minCiWidthIx = minIndex(ciWidth)
      hdiMin = sortedSamples[minCiWidthIx]
      hdiMax = sortedSamples[minCiWidthIx + ciIdxInc]
    result = (hdiMin, hdiMax)

  var
    (b0HdiMin, b0HdiMax) = hdi(burnin.b0, 0.89)
    (b1HdiMin, b1HdiMax) = hdi(burnin.b1, 0.89)
    (sdHdiMin, sdHdiMax) = hdi(burnin.sd, 0.89)

  echo "Hdi b0: ", b0HdiMin.round(3), " - ", b0HdiMax.round(3)
  echo "Hdi b1: ", b1HdiMin.round(3), " - ", b1HdiMax.round(3)
  echo "Hdi sd: ", sdHdiMin.round(3), " - ", sdHdiMax.round(3)

nbText: md"""
# Standardize Data
The $\beta_{0}$ (intercept) and $\beta_{1}$ (slope) parameters of a linear model
present a bit of a challenge for MCMC because believable values for them are
tightly correlated. This means that a lot of proposed values will be rejected and the
chain will not move efficiently. An easy way to get around this problem is
to standardize our data by rescaling them relative to their mean and standard deviation.

$$ \zeta_{x_{i}} = \frac{x_{i} - \bar{x}}{SD_{x}} $$
$$ \zeta_{y_{i}} = \frac{y_{i} - \bar{y}}{SD_{y}} $$
"""

nbCode:
  var
    xSt = newSeq[float](n)
    ySt = newSeq[float](n)
    xMean = mean(x)
    xSd = standardDeviation(x)
    yMean = mean(y)
    ySd = standardDeviation(y)
  for i in 0 ..< n:
    xSt[i] = (x[i] - xMean) / xSd
    ySt[i] = (y[i] - yMean) / ySd

nbText: md"""
We can see that both the $y$ and $x$ values are now centered around zero and have the same scale.
"""

nbCode:
  var standardized = toDf(xSt, ySt)
  ggplot(standardized, aes(x="xSt", y="ySt")) +
    geom_point() +
    ggsave("images/st-simulated-data.png")

nbImage("images/st-simulated-data.png")

nbText: md"""
We can now run the MCMC just as before with some minor changes. The standard deviation
$\tau$ for the standardized data is going to be much smaller. Let's make the prior
more informative. The standard deviation of the proposal distribution used before
would result in most proposals being rejected so we should use a smaller value
this time. Finally we should use some different starting values than before.
"""

nbCode:
  var
    st_priors = Priors(
      b0: Normal(mu:0.0, sigma:1.0),
      b1: Normal(mu:1.0, sigma:1.0),
      sd: Gamma(k:0.0035, sigma:1.0))
    st_mcmc = MCMC(x:xSt, y:ySt, nSamples:100000, priors:st_priors, propSd:0.001)
    st_samples1 = st_mcmc.run(0.0, 1.0, 0.0034)
    st_samples2 = st_mcmc.run(0.1, 1.01, 0.0036)

nbText: md"""
### Convert back to original scale
To interpret these estimates we need to convert back to the original scale.
$$ \beta_{0} = \zeta_{0} SD_{y} + \bar{y} - \zeta_{1} SD_{y} \bar{x} / SD_{x} $$
$$ \beta_{1} = \zeta_{1} SD_{y} / SD_{x} $$
$$ \tau = \zeta_{\tau} * SD_{y} $$
"""

nbCode:
  proc backTransform(samples: Samples, xMean, xSd, yMean, ySd: float): Samples =
    var
      b0 = newSeq[float](samples.n)
      b1 = newSeq[float](samples.n)
      sd = newSeq[float](samples.n)
    for i in 0 ..< samples.n:
      b0[i] = samples.b0[i] * ySd + yMean - samples.b1[i] * ySd * xMean / xSd
      b1[i] = samples.b1[i] * ySd / xSd
      sd[i] = samples.sd[i] * ySd
    result = Samples(n:samples.n, b0:b0, b1:b1, sd:sd)

  var
    st_samples_trans1 = backTransform(st_samples1, xMean, xSd, yMean, ySd)
    st_samples_trans2 = backTransform(st_samples2, xMean, xSd, yMean, ySd)

nbText: md"""
# Traceplots
Let's have a look at the trace plots.
"""

nbCode:
  plotTraces(@[st_samples_trans1, st_samples_trans2], "images/trace-st-")

nbImage("images/trace-st-b0.png")
nbImage("images/trace-st-b1.png")
nbImage("images/trace-st-sd.png")

nbText: md"""
It looks like more of the proposals were accepted and we have a much better
sample from the posterior.
"""

nbText: md"""
# Burnin
Let's get a post burnin sample as we did before.
"""

nbCode:
  var
    st_burnin1 = burn(st_samples_trans1, 0.1)
    st_burnin2 = burn(st_samples_trans2, 0.1)

nbText: md"""
# Histograms
Now we can have a look at the posterior distribution of our estimates.
"""

nbCode:
  plotHistograms(@[st_burnin1, st_burnin2], "images/hist-st-")


nbImage("images/hist-st-b0.png")
nbImage("images/hist-st-b1.png")
nbImage("images/hist-st-sd.png")

nbText: md"""
These look pretty good but it's hard to know how they compare to the previous
MCMC estimates. Let's summarize the posterior to get a better idea.
"""

nbCode:

  var st_burnin = concat(@[st_burnin1, st_burnin2])

  var
    st_meanB0 = mean(st_burnin.b0).round(3)
    st_meanB1 = mean(st_burnin.b1).round(3)
    st_meanSd = mean(st_burnin.sd).round(3)

  echo "Mean b0: ", meanB0
  echo "Mean b1: ", meanB1
  echo "Mean sd: ", meanSd

  var
    (st_b0EtiMin, st_b0EtiMax) = eti(st_burnin.b0, 0.89)
    (st_b1EtiMin, st_b1EtiMax) = eti(st_burnin.b1, 0.89)
    (st_sdEtiMin, st_sdEtiMax) = eti(st_burnin.sd, 0.89)

  echo "Eti b0: ", st_b0EtiMin.round(3), " - ", st_b0EtiMax.round(3)
  echo "Eti b1: ", st_b1EtiMin.round(3), " - ", st_b1EtiMax.round(3)
  echo "Eti sd: ", st_sdEtiMin.round(3), " - ", st_sdEtiMax.round(3)

  var
    (st_b0HdiMin, st_b0HdiMax) = hdi(st_burnin.b0, 0.89)
    (st_b1HdiMin, st_b1HdiMax) = hdi(st_burnin.b1, 0.89)
    (st_sdHdiMin, st_sdHdiMax) = hdi(st_burnin.sd, 0.89)

  echo "Hdi b0: ", st_b0HdiMin.round(3), " - ", st_b0HdiMax.round(3)
  echo "Hdi b1: ", st_b1HdiMin.round(3), " - ", st_b1HdiMax.round(3)
  echo "Hdi sd: ", st_sdHdiMin.round(3), " - ", st_sdHdiMax.round(3)


nbText: md"""
In comparison to the previous MCMC results shown below, the standardized
analysis produced very similar estimates. But we can have greater confidence in them
because the standardized analysis gave us a better posterior sample.
"""

nbCode:
  echo "Mean b0: ", meanB0
  echo "Mean b1: ", meanB1
  echo "Mean sd: ", meanSd

  echo "Eti b0: ", b0EtiMin.round(3), " - ", b0EtiMax.round(3)
  echo "Eti b1: ", b1EtiMin.round(3), " - ", b1EtiMax.round(3)
  echo "Eti sd: ", sdEtiMin.round(3), " - ", sdEtiMax.round(3)

  echo "Hdi b0: ", b0HdiMin.round(3), " - ", b0HdiMax.round(3)
  echo "Hdi b1: ", b1HdiMin.round(3), " - ", b1HdiMax.round(3)
  echo "Hdi sd: ", sdHdiMin.round(3), " - ", sdHdiMax.round(3)

nbText: md"""
# Final Note
We could have simply and more efficiently done all of this using least squares regression.
However, the Bayesian approach allows us to very easily and intuitively express
uncertainty about our estimates and can be easily extended to much more complex
models for which there are not such simple solutions. We could also incorporate prior
knowledge or assumptions in a way not possible with frequentist approaches.
The exercise above provides a starting point for this and shows just how easy it is to do with Nim!
"""

nbSave