Model Set Up¶
Link to the dataset
Unfortunately, the database original source
does not report the units on each variable.
Complete example code
Build a linear regression model for predicting Sales using TV as a
predictor.
import pandas as pd
data = pd.read_csv(r'data/data.csv')
import matplotlib.pyplot as plt
fig_1, ax_1 = plt.subplots()
ax_1.plot(data['TV'].values, data['Sales'].values, marker = 'o', linestyle = '', alpha = 0.5)
ax_1.set_xlabel('TV')
ax_1.set_ylabel('Sales')
ax_1.tick_params(bottom = False, top = False, left = False, right = False)
plt.show()
Sales do not follow a perfect linear relationship with respet to
TV, but let’s try to fit a linear model anyway.
Set up a linear regression model, considering TV as regressor and
Sales as the response variable.
Use non-informative priors for regressor and variance:
from baypy.model import LinearModel
import baypy as bp
model_1 = LinearModel()
model_1.data = data
model_1.response_variable = 'Sales'
model_1.priors = {'intercept': {'mean': 0, 'variance': 1e6},
'TV': {'mean': 0, 'variance': 1e6},
'variance': {'shape': 1, 'scale': 1e-6}}
See LinearModel for
more information on this class and its attributes and methods.
Sampling¶
Run the regression sampling on 3 Markov chains, with 500 iterations per each chain and discarding the first 50 burn-in draws:
from baypy.regression import LinearRegression
LinearRegression.sample(model = model_1, n_iterations = 500,
burn_in_iterations = 50, n_chains = 3, seed = 137)
See
LinearRegression
for more information on this class and its attributes and methods.
Convergence Diagnostics¶
Asses the model convergence diagnostics:
bp.diagnostics.effective_sample_size(posteriors = model_1.posteriors)
intercept TV variance
Effective Sample Size 1371.51 1280.94 1428.17
bp.diagnostics.autocorrelaion_summary(posteriors = model_1.posteriors)
intercept TV variance
Lag 0 1.000000 1.000000 1.000000
Lag 1 -0.032025 -0.010411 0.000142
Lag 5 0.011572 0.015856 -0.033278
Lag 10 0.021948 0.010220 -0.003107
Lag 30 -0.028900 -0.023383 0.032368
bp.diagnostics.autocorrelation_plot(posteriors = model_1.posteriors)
See
effective_sample_size,
autocorrelation_summary
and
autocorrelation_plot
for more details on diagnostics functions.
All diagnostics show a low correlation, indicating the chains
converged to the stationary distribution.
Posteriors Analysis¶
Asses posterior analysis:
bp.analysis.trace_plot(posteriors = model_1.posteriors)
bp.analysis.residuals_plot(model = model_1)
See trace_plot and
residuals_plot for
more details on analysis functions.
Residuals form a shape and increase as the predicted variable increses.
This suggests that the model does not fit the data well.
This is consistent with previous exploration of the data: there is no
pure linear relationship between Sales and TV.
Comparing data to fitted model posteriors:
import numpy as np
posteriors_data = pd.DataFrame()
for posterior, posterior_sample in model_1.posteriors.items():
posteriors_data[posterior] = np.asarray(posterior_sample).reshape(-1)
posteriors_data['error 1'] = np.random.normal(loc = 0, scale = np.sqrt(posteriors_data['variance']), size = len(posteriors_data))
tv_1 = np.linspace(data['TV'].min(), data['TV'].max(), 50)
fig_2, ax_2 = plt.subplots()
for row in zip(*posteriors_data.to_dict('list').values()):
sales_1 = row[0] + row[1]*tv_1 + row[3]
ax_2.plot(tv_1, sales_1, color = 'blue', linewidth = 1, alpha = 0.1)
ax_2.plot(data['TV'].values, data['Sales'].values, marker = 'o', linestyle = '',
markerfacecolor = 'none', markeredgecolor = 'red', markeredgewidth = 1.2)
ax_2.set_xlabel('TV')
ax_2.set_ylabel('Sales')
ax_2.tick_params(bottom = False, top = False, left = False, right = False)
plt.tight_layout()
plt.show()
The posterior distribution is rather broad and includes all the data without identifying their true trend. Notice that some posteriors predict negative Sales for low values of TV.
Alternative Model Set Up¶
Try to check a relationship using the log-scale:
fig_3, ax_3 = plt.subplots()
ax_3.loglog(data['TV'].values, data['Sales'].values, marker = 'o', linestyle = '', alpha = 0.5)
ax_3.set_xlabel('TV')
ax_3.set_ylabel('Sales')
plt.show()
The relationship is almost linear in the log-scale. Notice that most of
the data are concentrated toward high values of TV.
Try to fit a linear model in the log-scale; it is required to transform
the data:
data['log TV'] = np.log(data['TV'])
data['log Sales'] = np.log(data['Sales'])
Set up a linear regression model, considering log TV as regressor and
log Sales as the response variable.
Use non-informative priors for regressor and variance:
model_2 = LinearModel()
model_2.data = data
model_2.response_variable = 'log Sales'
model_2.priors = {'intercept': {'mean': 0, 'variance': 1e6},
'log TV': {'mean': 0, 'variance': 1e6},
'variance': {'shape': 1, 'scale': 1e-6}}
Sampling¶
Run the regression sampling on 3 Markov chains, with 500 iteration per each chain and discarding the first 50 burn-in draws:
LinearRegression.sample(model = model_2, n_iterations = 500,
burn_in_iterations = 50, n_chains = 3, seed = 137)
Convergence Diagnostics¶
Asses the model convergence diagnostics:
bp.diagnostics.effective_sample_size(posteriors = model_2.posteriors)
intercept log TV variance
Effective Sample Size 1373.29 1321.11 1428.17
bp.diagnostics.autocorrelation_summary(posteriors = model_2.posteriors)
intercept log TV variance
Lag 0 1.000000 1.000000 1.000000
Lag 1 -0.032124 -0.027163 0.000142
Lag 5 0.011421 0.014532 -0.033278
Lag 10 0.021877 0.021076 -0.003107
Lag 30 -0.028748 -0.030245 0.032368
bp.diagnostics.autocorrelation_plot(posteriors = model_2.posteriors)
All diagnostics show a low correlation, indicating the chains converged to the stationary distribution.
Posteriors Analysis¶
Asses posterior analysis:
bp.analysis.trace_plot(posteriors = model_2.posteriors)
bp.analysis.residuals_plot(model = model_2)
Residuals are generally improved with respect to the original model.
Residuals appear to reflect an increasing dispersion as predicted
variabl increase. However, as already mentioned, notice that most of the
data are concentrated toward high values of TV. The slight growing
pattern is partially caused by this data heterogeneity.
bp.analysis.summary(posteriors = model_2.posteriors)
Number of chains: 3
Sample size per chian: 500
Empirical mean, standard deviation, 95% HPD interval for each variable:
Mean SD HPD min HPD max
intercept 0.905983 0.073676 0.768024 1.057664
log TV 0.354768 0.015388 0.325168 0.384996
variance 0.044958 0.004671 0.035855 0.053536
Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
intercept 0.754455 0.858332 0.905386 0.952843 1.050115
log TV 0.325689 0.344946 0.354906 0.364422 0.386279
variance 0.036865 0.041719 0.044506 0.048087 0.054838
See summary for more
details on this analysis function.
The summary reports a statistical evidence for a positive effect of
log TV: \(10\%\) percent increase in TV would result in
\(1.10^{0.354768} - 1 = 3.44\%\) percent increase in Sales.
Comparing data to fitted model posteriors:
posteriors_data = pd.DataFrame()
for posterior, posterior_sample in model_2.posteriors.items():
posteriors_data[posterior] = np.asarray(posterior_sample).reshape(-1)
posteriors_data['error 2'] = np.random.normal(loc = 0, scale = np.sqrt(posteriors_data['variance']), size = len(posteriors_data))
log_tv_2 = np.linspace(data['log TV'].min(), data['log TV'].max(), 50)
fig_4, ax_4 = plt.subplots()
for row in zip(*posteriors_data.to_dict('list').values()):
log_sales_2 = row[0] + row[1]*log_tv_2 + row[3]
ax_4.plot(log_tv_2, log_sales_2, color = 'blue', linewidth = 1, alpha = 0.1)
ax_4.plot(data['log TV'].values, data['log Sales'].values, marker = 'o', linestyle = '',
markerfacecolor = 'none', markeredgecolor = 'red', markeredgewidth = 1.2)
ax_4.set_xlabel('log TV')
ax_4.set_ylabel('log Sales')
ax_4.tick_params(bottom = False, top = False, left = False, right = False)
plt.tight_layout()
plt.show()
fig_5, ax_5 = plt.subplots()
for row in zip(*posteriors_data.to_dict('list').values()):
log_sales_2 = row[0] + row[1]*log_tv_2 + row[3]
ax_5.plot(np.exp(log_tv_2), np.exp(log_sales_2), color = 'blue', linewidth = 1, alpha = 0.1)
ax_5.plot(data['TV'].values, data['Sales'].values, marker = 'o', linestyle = '',
markerfacecolor = 'none', markeredgecolor = 'red', markeredgewidth = 1.2)
ax_5.set_xlabel('TV')
ax_5.set_ylabel('Sales')
ax_5.tick_params(bottom = False, top = False, left = False, right = False)
plt.tight_layout()
plt.show()
The alternative model’s posteriors catch in a better way the trend of
the data and take into account data dispersion.
For completeness, compare the two models using the DIC:
bp.analysis.compute_DIC(model = model_1)
Deviance at posterior means 1038.13
Posterior mean deviance 1039.18
Effective number of parameters 1.05
Deviance Information Criterion 1040.23
bp.analysis.compute_DIC(model = model_2)
Deviance at posterior means -56.87
Posterior mean deviance -55.82
Effective number of parameters 1.05
Deviance Information Criterion -54.77
See compute_DIC for
more details on this analysis function.
DIC is lower for the alternative model, indicating a preference for
the alternative model in log-scale.