HealthyNumerics

HealthPoliticsEconomics | Quant Analytics | Numerics

Basic Stats 02: A mechanical approach to the Bayesian rule

Intro

The Bayesian rule

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

seems to be a rather abstract formula. But this impression can easily be corrected by considering a simple practical application. We call this approach mechanical because for this type of application there is no philosophical dispute between the frequentist and the Bayesian mindset. We will simply use the cells of a matrix in a mechanical, algorithmic manner. In this section we

• formulate a prototype of a probability problem (with red and green balls that have the letters A and B printed on them)
• summarize the problem in what is basically a 2x2 matrix (called a contingency table)
• use the frequencies first
• replace them by probabilities afterwards and recognize what conditional probabilities are
• recognize that applying the Bayesian formula is nothing more than walking from one side of the matrix to the other


A prototype of a probability problem

Given:

• 19 balls
• 14 balls are red, 5 balls are green
• among the 14 red balls, 4 have an A printed on them, 10 have a B printed on them
• among the 5 green balls, 1 has an A printed on it, 4 have a B printed on them

Typical questions:

• We take 1 ball.
• What is the probability that it is green?
• What is the probability that it is B under the precondition that it is green?
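Both questions can already be answered by plain counting, before any table is built. The following is a minimal sketch that uses only the standard library; the dictionary layout and the variable names are illustrative assumptions, not part of the original notebook.

```python
from fractions import Fraction

# Ball counts from the problem statement: counts[color][letter]
counts = {
    "red":   {"A": 4, "B": 10},
    "green": {"A": 1, "B": 4},
}

# Total number of balls and number of green balls
total = sum(sum(letters.values()) for letters in counts.values())
n_green = sum(counts["green"].values())

# P(green): green balls out of all balls
p_green = Fraction(n_green, total)

# P(B | green): green balls with a B out of all green balls
p_b_given_green = Fraction(counts["green"]["B"], n_green)

print("P(green)     =", p_green)
print("P(B | green) =", p_b_given_green)
```

Exact fractions are used here so that the results can be compared with the rounded values in the tables without worrying about rounding.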

We will use pandas to represent the problem and the solutions.

import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
import numpy as np
import pandas as pd
pd.set_option('display.precision', 3)  # full option name; the short form 'precision' is rejected by recent pandas versions



Contingency table with the frequencies

The core of the representation is a 2x2 matrix that summarizes how the balls are distributed over the colors and the letters. This matrix is extended with margins that contain the sums:

• sumL (the bottom row) holds the sum for each letter
• sumC (the rightmost column) holds the sum for each color

columns = np.array(['A', 'B'])
index   = np.array(['red', 'green'])
data    = np.array([[4,10],[1,4]])

df = pd.DataFrame(data=data, index=index, columns=columns)
df['sumC'] = df.sum(axis=1)  # append the sums of the rows
df.loc['sumL']= df.sum()     # append the sums of the columns
df

       A   B  sumC
red    4  10    14
green  1   4     5
sumL   5  14    19
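If the balls are available as individual observations rather than as ready-made counts, the same table can be tabulated with `pd.crosstab`. This is a sketch under that assumption: the `balls` frame and the margin name `sum` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# One row per ball, reconstructed from the counts in the problem statement
balls = pd.DataFrame(
    [("red", "A")] * 4 + [("red", "B")] * 10
    + [("green", "A")] * 1 + [("green", "B")] * 4,
    columns=["color", "letter"],
)

# margins=True appends the row sums and the column sums, here named "sum"
table = pd.crosstab(balls["color"], balls["letter"],
                    margins=True, margins_name="sum")
print(table)
```

Note that `pd.crosstab` sorts the labels, so the green row appears before the red row; the counts themselves are the same as in the table above.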

Append the relative contributions of B and of green

We expand the matrix by a further column and a further row and use them to compute relative frequencies (see below).

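def highlight_cells(x):
    # Build a frame of CSS strings with the same shape and labels as x
    css = pd.DataFrame('', index=x.index, columns=x.columns)
    css.iloc[1, 1] = 'background-color: #53ff1a'     # B and green
    css.iloc[1, 3] = 'background-color: lightgreen'  # B as a share of green
    css.iloc[3, 1] = 'background-color: lightblue'   # green as a share of B
    return css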

df[r'B/sumC']         = df.values[:,1]/df.values[:,2]
df.loc[r'green/sumL'] = df.values[1,:]/df.values[2,:]

t = df.style.apply(highlight_cells, axis=None)
t


              A      B   sumC  B/sumC
red           4     10     14   0.714
green         1      4      5     0.8
sumL          5     14     19   0.737
green/sumL  0.2  0.286  0.263    1.09

Let's focus on the row with the green balls

Of all green balls (5), the fraction that carries the letter B (4 balls) is 0.8.

$$\frac{green\: balls\: with \:B}{all\: green\: balls} = \frac{4}{5} = 0.8$$

Note that this value already corresponds to the conditional probability $$P(B \mid green)$$



Let's focus on the column with the balls with letter B

Of all balls with the letter B (14), the fraction that is green (4 balls) is 0.286.

$$\frac{green\; balls\; with\; B}{all\; balls\; with\; letter\; B} = \frac{4}{14} = 0.286$$

Note that this value, too, already corresponds to the conditional probability $$P(green \mid B)$$
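The two relative contributions are the result of normalizing the count matrix in two different directions: along the rows or along the columns. A minimal sketch of this idea with `DataFrame.div`; the names `letter_given_color` and `color_given_letter` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame([[4, 10], [1, 4]],
                      index=["red", "green"], columns=["A", "B"])

# Divide each row by its row sum: entries become P(letter | color)
letter_given_color = counts.div(counts.sum(axis=1), axis=0)

# Divide each column by its column sum: entries become P(color | letter)
color_given_letter = counts.div(counts.sum(axis=0), axis=1)

print(letter_given_color.round(3))
print(color_given_letter.round(3))
```

The cell (green, B) of the first result is the value 0.8 found in the green row, and the same cell of the second result is the value 0.286 found in the B column.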




Contingency table with the probabilities

We obtain the probabilities by dividing the frequencies by the total number of balls.

columns = np.array(['A', 'B'])
index   = np.array(['red', 'green'])
data    = np.array([[4,10],[1,4]])
data    = data/np.sum(data)

df = pd.DataFrame(data=data, index=index, columns=columns)
df['sumC'] = df.sum(axis=1)  # append the sums of the rows
df.loc['sumL']= df.sum()     # append the sums of the columns
df

           A      B   sumC
red    0.211  0.526  0.737
green  0.053  0.211  0.263
sumL   0.263  0.737  1.000
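Dividing every cell by the same total does not change any ratio between two cells, because the total cancels. This is why a ratio formed from the probabilities equals the corresponding ratio formed from the frequencies. For the green row, for example:

$$\frac{P(green \cap B)}{P(green)} = \frac{4/19}{5/19} = \frac{4}{5} = 0.8$$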


Append the relative contributions of B and of green

df[r'B/sumC']         = df.values[:,1]/df.values[:,2]
df.loc[r'green/sumL'] = df.values[1,:]/df.values[2,:]
t = df.style.apply(highlight_cells, axis=None)
t


                 A      B   sumC  B/sumC
red          0.211  0.526  0.737   0.714
green       0.0526  0.211  0.263     0.8
sumL         0.263  0.737      1   0.737
green/sumL     0.2  0.286  0.263    1.09



Note the formulas in the cells

columns = np.array(['-----------A----------', '---------B----------', '----------sumC--------',  '--------B/sumC--------'])
index   = np.array(['red', 'green', 'sumL',  'green/sumL'])
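# Raw strings keep the LaTeX backslashes (\cap, \mid) from being read as escape sequences
data    = np.array([['...', '...', '...', '...'],
                    ['...', r'$P(B \cap green)$', r'$P(green)$', r'$P(B \mid green)$'],
                    ['...', r'$P(B)$', '...', '...'],
                    ['...', r'$P(green \mid B)$', '...', '...']])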
df = pd.DataFrame(data=data, index=index, columns=columns)
t = df.style.apply(highlight_cells, axis=None)
t


            -----------A----------  ---------B----------  ----------sumC--------  --------B/sumC--------
red         ...                     ...                   ...                     ...
green       ...                     $$P(B \cap green)$$   $$P(green)$$            $$P(B \mid green)$$
sumL        ...                     $$P(B)$$              ...                     ...
green/sumL  ...                     $$P(green \mid B)$$   ...                     ...



Conditional probability: Let's focus on the row with the green balls

The probability of drawing a ball with a B, given that the ball is green, is 0.8.

$$P(B \mid green) = \frac{P(green\: balls\: with \:B)}{P(all\: green\: balls)} = \frac{P(green \cap B)}{P(green)} = \frac{0.211}{0.263} = 0.8$$


Conditional probability: Let's focus on the column with the balls with letter B

The probability of drawing a green ball, given that the ball has a B, is 0.286.

$$P(green \mid B) = \frac{P(green\; balls\; with\; B)}{P(all\; balls\; with\; letter\; B)} = \frac{P(green \cap B)}{P(B)} = \frac{0.211}{0.737} = 0.286$$
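Both conditional probabilities have the same numerator, the joint probability $$P(green \cap B)$$. Solving each of the two definitions for this joint probability links them to each other:

$$P(B \mid green) \cdot P(green) = P(green \cap B) = P(green \mid B) \cdot P(B)$$

With the rounded values from the table this reads

$$0.8 \cdot 0.263 \approx 0.211 \approx 0.286 \cdot 0.737$$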


Bayes rule

Given $$P(green \mid B)$$ find $$P(B \mid green)$$:

$$P(B \mid green) = \frac{P(green \mid B) \cdot P(B)}{P(green)} = \frac{0.286 \cdot 0.737}{0.263} = \frac{0.211}{0.263}= 0.8$$


and given $$P(B \mid green)$$ find $$P(green \mid B)$$:

$$P(green \mid B) = \frac{P(B \mid green) \cdot P(green)}{P(B)} = \frac{0.8 \cdot 0.263}{0.737} = \frac{0.211}{0.737} = 0.286$$
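The two computations can be checked numerically. The following is a minimal sketch with exact fractions; the helper `bayes` and its parameter names are hypothetical and only mirror the general formula from the intro.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Exact marginal and conditional probabilities read off the contingency table
p_green = Fraction(5, 19)
p_b = Fraction(14, 19)
p_b_given_green = Fraction(4, 5)
p_green_given_b = Fraction(4, 14)

# Given P(green | B), recover P(B | green)
print(bayes(p_green_given_b, p_b, p_green))   # 4/5

# Given P(B | green), recover P(green | B)
print(bayes(p_b_given_green, p_green, p_b))   # 2/7, i.e. 4/14
```

Because the fractions are exact, both directions reproduce the table values without rounding error: 4/5 = 0.8 and 2/7 ≈ 0.286.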

Applying the Bayes rule means that we walk from the rightmost element of the matrix to the element at the bottom of the matrix, and vice versa:

def show_frequencies():
    px = 4; py = 4   # the figure is a px x py grid of subplots

    figsize(10, 10)
    fontSize1 = 15   # titles
    fontSize2 = 12   # axis labels
    fontSize3 = 20   # numbers next to the pies

    # Frequencies with the row sums (3rd column) and the column sums (3rd row)
    A1 = 4;     A2 = 10;      A3 = A1+A2
    B1 = 1;     B2 =  4;      B3 = B1+B2
    C1 = A1+B1; C2 = A2+B2;   C3 = A3+B3

    data = np.array([[A1, A2, A3],
                     [B1, B2, B3],
                     [C1, C2, C3]])

    clr = np.array([['#ffc2b3', '#ff704d', '#ff0000'],
                    ['#b3ff99', '#53ff1a', '#208000'],
                    ['#00bfff', '#0000ff', '#bf00ff']])

    title = np.array([['A and red', 'B and red', 'sum of reds',
                       '$P(B|red) = \\frac {P(B\\cap red)}{P(red)}$'],
                      ['A and green', 'B and green', 'sum greens', '% (B of greens)'],
                      ['sum of A', 'sum of B', 'sum of all balls', ' '],
                      [' ', '% (greens of B)', '', ' ']])

    xlabel = np.array([['A', 'B', 'A+B', 'B/(A+B)'],
                       ['A', 'B', 'A+B', 'B/(A+B)'],
                       ['A red + A green', 'B red + B green', 'A+B', 'B/(A+B)'],
                       ['A', 'B green/(B red + B green)', 'A+B', 'B/(A+B)']])

    ylabel = np.array([['red', 'red', 'red', 'B of reds'],
                       ['green', 'green', 'green', 'B of greens'],
                       ['sum A', 'sum B', 'Total', ''],
                       ['green', 'greens of B', 'green', ' ']])

    f, ax = plt.subplots(px, py, sharex=True, sharey=True, edgecolor='none')

    for i in range(px-1):
        # First three columns: one full pie per frequency
        for j in range(py-1):
            patches, texts = ax[i,j].pie([data[i,j]], labels=[str(data[i,j])],
                                         colors=[clr[i,j]])
            texts[0].set_fontsize(fontSize3)
            ax[i,j].set_title(title[i,j], position=(0.5,1.2),
                              bbox=dict(facecolor='#f2f2f2', edgecolor='none'), fontsize=fontSize1)
            if i*j == 1:   # emphasize the cell 'B and green'
                ax[i,j].set_title(title[i,j], position=(0.5,1.2),
                                  bbox=dict(facecolor='#bfbfbf', edgecolor='none'), fontsize=fontSize1)

            ax[i,j].set_xlabel(xlabel[i,j], fontsize=fontSize2, color='c')
            ax[i,j].xaxis.set_label_position('top')
            ax[i,j].set_ylabel(ylabel[i,j], fontsize=fontSize2, color='c')
            ax[i,j].axis('equal')

        # Last column: share of B among the greens (green row only)
        j += 1
        if i == 1:
            p = data[i,-2]/data[i,-1]; o = p/(1-p)   # the wedges o : 1 show the share p
            patches, texts = ax[i,j].pie([o,1], colors=[clr[i,1], clr[i,2]])
            texts[0].set_fontsize(fontSize3)
            ax[i,j].set_title(title[i,j], position=(0.5,1.2),
                              bbox=dict(facecolor='#bfbfbf', edgecolor='none'), fontsize=fontSize1)
            ax[i,j].set_xlabel(xlabel[i,j], color='c', fontsize=fontSize2)
            ax[i,j].xaxis.set_label_position('top')
            ax[i,j].set_ylabel(ylabel[i,j], fontsize=fontSize2, color='c')
            ax[i,j].axis('equal')
        else:
            ax[i,j].plot(0,0)
            ax[i,j].set_frame_on(False)

    # Last row: share of greens among the balls with B (column B only)
    i = px-1
    for j in range(py):
        if j == 1:
            p = data[1,1]/data[2,1]; o = p/(1-p)
            patches, texts = ax[i,j].pie([o,1], colors=[clr[1,1], clr[2,1]])
            texts[0].set_fontsize(fontSize3)
            ax[i,j].set_title(title[i,j], position=(0.5,1.2),
                              bbox=dict(facecolor='#bfbfbf', edgecolor='none'), fontsize=fontSize1)
            ax[i,j].set_xlabel(xlabel[i,j], color='c', fontsize=fontSize2)
            ax[i,j].xaxis.set_label_position('top')
            ax[i,j].set_ylabel(ylabel[i,j], fontsize=fontSize2, color='c')
            ax[i,j].axis('equal')
            ax[i,j].set_facecolor('y')
        else:
            ax[i,j].plot(0,0)
            ax[i,j].set_frame_on(False)

    plt.tight_layout()
    plt.show()

show_frequencies()
