Model Set Up¶
Link to the dataset
Unfortunately, the database original source
does not report the units on each variable.
Complete example code
Determine the effect that the independent variables biking and
smoking have on the dependent variable heart disease using a
multiple linear regression model.
import pandas as pd
data = pd.read_csv(r'data/data.csv')
Set up a multiple linear regression model, considering biking and smoking as regressors and heart disease as the response variable. Use non-informative priors for regressors and variance:
from baypy.model import LinearModel
import baypy as bp
model = LinearModel()
model.data = data
model.response_variable = 'heart disease'
model.priors = {
'intercept': {'mean': 0, 'variance': 1e6},
'biking': {'mean': 0, 'variance': 1e9},
'smoking': {'mean': 0, 'variance': 1e9},
'variance': {'shape': 1, 'scale': 1e-9}
}
See LinearModel for
more information on this class and its attributes and methods.
Sampling¶
Run the regression sampling on 3 Markov chains, with 500 iterations per each chain and discarding the first 50 burn-in draws:
from baypy.regression import LinearRegression
LinearRegression.sample(
model=model,
n_iterations=500,
burn_in_iterations=50,
n_chains=3,
seed=137
)
See
LinearRegression
for more information on this class and its attributes and methods.
Convergence Diagnostics¶
Asses the model convergence diagnostics:
bp.diagnostics.effective_sample_size(posteriors=model.posteriors)
intercept biking smoking variance
Effective Sample Size 1389.56 1449.73 1362.26 1426.75
bp.diagnostics.autocorrelation_summary(posteriors=model.posteriors)
intercept biking smoking variance
Lag 0 1.000000 1.000000 1.000000 1.000000
Lag 1 -0.025015 -0.021166 0.009275 -0.021082
Lag 5 0.027681 -0.007564 0.046201 0.030989
Lag 10 0.015334 0.014290 0.043676 -0.057992
Lag 30 -0.041058 -0.008922 -0.013752 -0.040056
bp.diagnostics.autocorrelation_plot(posteriors=model.posteriors)

See
effective_sample_size,
autocorrelation_summary
and
autocorrelation_plot
for more details on diagnostics functions.
All diagnostics show a low correlation, indicating the chains
converged to the stationary distribution.
Posteriors Analysis¶
Asses posterior analysis:
bp.analysis.trace_plot(posteriors=model.posteriors)

Traces are quite good, indicating draws from the stationary distribution.
bp.analysis.residuals_plot(model=model)

Also, the residuals plot is good: no evidence for patterns, shapes or outliers.
bp.analysis.summary(posteriors=model.posteriors)
Number of chains: 3
Sample size per chian: 500
Empirical mean, standard deviation, 95% HPD interval for each variable:
Mean SD HPD min HPD max
intercept 14.985169 0.079494 14.811145 15.126328
biking -0.200122 0.001387 -0.203015 -0.197531
smoking 0.178261 0.003535 0.171384 0.185280
variance 0.427870 0.027745 0.374502 0.480325
Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
intercept 14.822835 14.933689 14.986087 15.039621 15.141583
biking -0.202909 -0.201028 -0.200086 -0.199219 -0.197334
smoking 0.171140 0.175893 0.178261 0.180621 0.185169
variance 0.380265 0.408345 0.426025 0.446627 0.488800
See trace_plot,
residuals_plot and
summary for more details
on analysis functions.
The summary reports a statistical evidence for:
negative effect of biking: \(1\) point increase in biking would result in \(0.2\) points decrease in heart disease
positive effect of smoking: \(1\) point increase in smoking would result \(0.18\) points increase in heart disease