I assume that readers already know Bayes' rule. If you are not familiar with it, read any textbook on probability, data science, or machine learning; I recommend the book from which I learned Bayes' rule. Bayesians say that you cannot do inference without making assumptions, and thus they use probabilities to describe inferences. The author introduces some rules of probability theory in chapter 2 and says more about assumptions in inference in chapter 3. I recommend reading those parts even if you already know Bayesian theory and how to use it.


I find the following two exercises quite simple and very good for understanding how Bayes' theorem works in more realistic problems. They are exercises 22.15 and 24.3 of David MacKay's book Information Theory, Inference, and Learning Algorithms.


Exercise 22.15


Question


$N$ data points $\{x_n\}$ are drawn from $N$ distributions, all of which are Gaussian with a common mean $\mu$ but with different, unknown standard deviations $\sigma_n$. What are the maximum likelihood parameters $\mu$, $\{\sigma_n\}$ given the data? For example, seven scientists (A, B, C, D, E, F, G) with wildly-differing experimental skills measure $\mu$.


$$ \begin{eqnarray} x_n(A) &=& -27.020 \\
x_n(B) &=& 3.570 \\
x_n(C) &=& 8.191 \\
x_n(D) &=& 9.898 \\
x_n(E) &=& 9.603 \\
x_n(F) &=& 9.945 \\
x_n(G) &=& 10.056 \end{eqnarray} $$


You expect some of them to do accurate work (i.e., to have small $\sigma_n$), and some of them to turn in wildly inaccurate answers (i.e., to have enormous $\sigma_n$). What is $\mu$, and how reliable is each scientist?


Solution


$$ N=7 $$

$$ x_n=(-27.02, 3.57, 8.191, 9.898, 9.603, 9.945, 10.056) $$

$$ {\sum_n x_n\over N}=3.46329 $$

This simple sample mean cannot be the correct estimate of $\mu$; it is dragged down by the outliers.


From the description of the problem, the likelihood is

$$ P(\{x_n\}|\sigma_n,\mu) = \left({1 \over 2\pi }\right)^{N/2} \prod_n{1\over \sigma_n } \exp\left(-\sum_n {(x_n-\mu)^2 \over 2\sigma^2_n}\right) $$

To find the maximum likelihood parameters, take the log-likelihood (dropping the constant term $-{N \over 2}\log 2\pi$, which does not affect the maximum):


$$ L = -\sum_n \log \sigma_n-\sum_n {(x_n -\mu)^2 \over 2\sigma_n^2} $$

$$ {\partial L \over \partial \mu} = \sum_n \left({x_n-\mu \over \sigma^2_n}\right)=\sum_n{x_n \over \sigma_n^2}-\mu\sum_n {1 \over \sigma_n^2} = 0 $$


Therefore,

$$ \mu= {\sum_n{x_n\over \sigma_n^2}\over \sum_n{1\over \sigma_n^2}} $$

Similarly, setting ${\partial L \over \partial \sigma_n} = -{1\over \sigma_n}+{(x_n-\mu)^2\over \sigma_n^3}=0$ gives the maximum likelihood noise levels $\sigma_n = |x_n-\mu|$.


For the two measurements $x_n=(-27.02,\ 3.57)$, the corresponding $\sigma_n$ are so large that these points contribute very little to $\mu$. The properly inferred mean should therefore be close to the mean of $(8.191, 9.898,$ $9.603, 9.945, 10.056)$.
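To see this numerically, here is a small Python sketch (my own illustration, not from the book). The $\sigma_n$ values are hypothetical, chosen only to model two unreliable scientists (A, B) and five reliable ones (C through G):

```python
import numpy as np

# The seven measurements from scientists A..G
x = np.array([-27.020, 3.570, 8.191, 9.898, 9.603, 9.945, 10.056])

# Plain sample mean: dragged far from the reliable cluster by A and B
naive_mean = x.mean()

# Hypothetical noise levels: large sigma_n for A and B, small for C..G
sigma = np.array([100.0, 100.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Inverse-variance weighted mean: mu = (sum x_n/sigma_n^2) / (sum 1/sigma_n^2)
w = 1.0 / sigma**2
mu = np.sum(w * x) / np.sum(w)

print(naive_mean)  # about 3.463
print(mu)          # about 9.538, close to the mean of the five reliable points
```

The two unreliable measurements get weights $10^4$ times smaller than the others, so they barely move $\mu$.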


Mathematica


I previously introduced Wolfram Mathematica when I used it to debug my Python code. For simple data and plots like these, Mathematica shows its convenience. I will also show brief snippets of its syntax while finding the solution of the problem. Readers do not need to understand the syntax; I simply want to show how well this algebraic formulation works.


Exercise 24.3


Question


Give a Bayesian solution to exercise 22.15, where seven scientists of varying capabilities have measured $\mu$ with personal noise levels $\sigma_n$, and we are interested in inferring $\mu$. Let the prior on each $\sigma_n$ be a broad prior, for example a gamma distribution with parameters $(s,c) = (10, 0.1)$. Find the posterior distribution of $\mu$. Plot it, and explore its properties for a variety of data sets such as the one given.


Solution


By Bayes' rule, the posterior probability of $\sigma_n$ is

$$ P(\sigma_n|\{x_n\},\mu)= {P(\{x_n\}|\sigma_n,\mu) P(\sigma_n)\over P(\{x_n\}|\mu)} $$

Given

$$ P(\{x_n\}|\sigma_n,\mu) = \left({1 \over 2\pi }\right)^{N/2} \left( \prod_n{1\over \sigma_n }\right) \exp\left(-\sum_n {(x_n-\mu)^2 \over 2\sigma^2_n}\right) $$

and

$$ P(\sigma_n)=\left(\frac{1}{\Gamma(c) \, s^c}\right)^N\prod_n\,\sigma_n^{c-1}\,\exp\left(-{\sum_n \sigma_n \over s}\right) $$

where $(s,~c)=(10,0.1)$.

and the normalizing constant is

$$ P(\{x_n\}|\mu)=\int^{\infty}_0\,\prod_n\,d\sigma_n \, P(\{x_n\}|\sigma_n,\mu)\, P(\sigma_n) $$

The posterior probability of $\mu$ is

$$ P(\mu|\{x_n\})={P(\{x_n\}|\mu)P(\mu)\over P(\{x_n\})} $$

and the prior on $\mu$ is determined by some assumption.

Taking a flat (improper) prior $P(\mu) = {1 \over \sigma_{\mu}}=\text{const.}$, the normalizing constant is


$$ P(\{x_n\})=\int^{\infty}_{-\infty} \,d\mu \,P(\{x_n\}|\mu)\frac{1}{\sigma_{\mu}} $$
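As a numerical cross-check of the derivation above, the following Python sketch (my own; scipy's quadrature stands in for Mathematica's integration) marginalizes each $\sigma_n$ numerically and scans the unnormalized posterior $P(\mu|\{x_n\})$ over a grid of $\mu$ values:

```python
import numpy as np
from scipy import integrate
from scipy.stats import gamma

x = np.array([-27.020, 3.570, 8.191, 9.898, 9.603, 9.945, 10.056])
s, c = 10.0, 0.1  # broad gamma prior on each sigma_n, as in the exercise

def marginal_likelihood(x_n, mu):
    """P(x_n|mu): Gaussian likelihood integrated against the gamma prior on sigma."""
    def integrand(sig):
        gauss = np.exp(-(x_n - mu) ** 2 / (2.0 * sig**2)) / (np.sqrt(2.0 * np.pi) * sig)
        return gauss * gamma.pdf(sig, a=c, scale=s)
    val, _ = integrate.quad(integrand, 0.0, np.inf, limit=200)
    return val

def unnormalized_posterior(mu):
    # Flat prior on mu: P(mu|{x_n}) is proportional to prod_n P(x_n|mu)
    return np.prod([marginal_likelihood(x_n, mu) for x_n in x])

mus = np.linspace(-30.0, 30.0, 121)
post = np.array([unnormalized_posterior(m) for m in mus])
print("grid mode near mu =", mus[np.argmax(post)])
```

The mode lands near the cluster of reliable measurements around $\mu \approx 10$, and a small secondary bump appears near $\mu = -27$, matching the plots.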



Mathematica analysis


Data


Exercise24.3_1.png


Exercise24.3_2.png


Sample mean


Exercise24.3_3.png


Exercise24.3_4.png


Normalizing constant of Gaussian distribution


We can obtain an analytic result for the integration!


Exercise24.3_5.png


Exercise24.3_6.png



Posterior distribution function without normalizing constant



Exercise24.3_7.png


Exercise24.3_8.png


Exercise24.3_9.png


Exercise24.3_10.png


Define function for posterior distribution of one data


We use the analytic result of the integration directly to plot.

Exercise24.3_11.png


Exercise24.3_12.png


Exercise24.3_13.png


Exercise24.3_14.png


Exercise24.3_15.png


Exercise24.3_16.png


Exercise24.3_17.png


Exercise24.3_18.png


Plot


Parameters


Exercise24.3_19.png


Exercise24.3_20.png


Exercise24.3_21.png


Posterior distributions


Posterior $P(\mu|x_n)$ for one datum from $x_n = -10$.

Exercise24.3_22.png


Exercise24.3_23.gif


Posterior $P(\mu|x_n)$ for one datum from $x_n = -27$.

Exercise24.3_24.png


Exercise24.3_25.gif


The posterior $P(\mu|\boldsymbol{x})$ is obtained as the product of the single-datum posteriors $P(\mu|x_i)$.

Exercise24.3_26.png


Exercise24.3_27.gif


Exercise24.3_28.png




Note the small bump around $\mu=-27$, contributed by the single-datum posterior $P(\mu|x_n = -27)$.

Exercise24.3_30.png


Exercise24.3_31.gif


Exercise24.3_32.png

