Auto MPG¶
Model Set Up¶
Link to the dataset
Dataset original source
Complete example code
Determine the effect of car attributes on fuel consumption.
The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.
Quinlan, 1993
import pandas as pd
data = pd.read_csv(r'data/data.csv')
data.dropna(inplace=True)
Data exploration:
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(frame=data, figsize=(10, 10))
plt.tight_layout()
plt.show()
The scatter matrix shows that horsepower, displacement and weight
are strongly correlated with one another, so they cannot all be
used as independent regressors. For this reason, horsepower and
displacement are discarded and weight is kept as a regressor.
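The visual impression from the scatter matrix can be backed by a correlation matrix. The following is a minimal sketch on synthetic data: the `demo` frame and its coefficients are invented stand-ins, not the Auto MPG values, and only illustrate how pairwise correlations between three related columns can be inspected with pandas.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three related car attributes (NOT the real dataset):
# displacement and horsepower are built as noisy linear functions of weight.
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5000, size=300)
demo = pd.DataFrame({
    'weight': weight,
    'displacement': 0.08 * weight - 40 + rng.normal(0, 20, size=300),
    'horsepower': 0.03 * weight + 15 + rng.normal(0, 10, size=300),
})

# Pairwise Pearson correlations; values close to 1 signal near-collinear columns
corr = demo.corr()
print(corr.round(2))
```

On the real data the same check would be `data[['horsepower', 'displacement', 'weight']].corr()`, assuming those column names match the CSV header.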
Moreover, mpg and weight are not normally distributed: their
distributions are skewed, with a long tail toward high values. For
this reason, these columns are transformed to log-scale:
import numpy as np
data['log mpg'] = np.log(data['mpg'])
data['log weight'] = np.log(data['weight'])
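The effect of the log transform on skewness can be checked numerically. This is a small sketch on a hypothetical right-skewed (log-normal) sample, not on the Auto MPG columns; it assumes the skew in question is a long right tail, which is the case a logarithm typically reduces.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical right-skewed variable standing in for a column such as mpg
x = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=2000))

skew_raw = x.skew()          # sample skewness before the transform
skew_log = np.log(x).skew()  # sample skewness after the log transform

print(f'skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}')
```

A skewness near zero after the transform suggests the logged column is closer to symmetric; on the real data, `data['mpg'].skew()` and `data['log mpg'].skew()` would give the corresponding comparison.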
Set up a linear regression model, considering cylinders, log weight,
acceleration and model year as the regressors and log mpg as the
response variable.
Use non-informative priors for the regressors and the variance:
from baypy.model import LinearModel
import baypy as bp
model = LinearModel()
model.data = data
model.response_variable = 'log mpg'
model.priors = {
    'intercept': {'mean': 0, 'variance': 1e6},
    'cylinders': {'mean': 0, 'variance': 1e6},
    'log weight': {'mean': 0, 'variance': 1e6},
    'acceleration': {'mean': 0, 'variance': 1e6},
    'model year': {'mean': 0, 'variance': 1e6},
    'variance': {'shape': 1, 'scale': 1e-6}
}
See LinearModel for more information on this class and its attributes and methods.
Sampling¶
Run the regression sampling on 3 Markov chains, with 1000 iterations per chain, discarding the first 50 burn-in draws:
from baypy.regression import LinearRegression
LinearRegression.sample(
    model=model,
    n_iterations=1000,
    burn_in_iterations=50,
    n_chains=3,
    seed=137
)
See LinearRegression for more information on this class and its attributes and methods.
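To give an intuition of what sampling a model like this involves, here is a minimal sketch of a Gibbs sampler for a linear regression with normal priors on the coefficients and an inverse-gamma prior on the variance. It is one standard way to sample such a model and is not presented as baypy's implementation; the function name `gibbs_linear_regression`, its parameters and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gibbs_linear_regression(X, y, prior_mean, prior_cov, shape, scale,
                            n_iterations=1000, burn_in=50, seed=137):
    """Single-chain Gibbs sampler for y = X @ beta + noise, with
    beta ~ Normal(prior_mean, prior_cov) and
    noise variance ~ InverseGamma(shape, scale)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    prior_prec = np.linalg.inv(prior_cov)
    XtX, Xty = X.T @ X, X.T @ y
    sigma2 = 1.0  # arbitrary starting value for the variance
    betas, sigma2s = [], []
    for it in range(burn_in + n_iterations):
        # beta | sigma2, y ~ Normal(post_mean, post_cov)
        post_cov = np.linalg.inv(prior_prec + XtX / sigma2)
        post_mean = post_cov @ (prior_prec @ prior_mean + Xty / sigma2)
        beta = post_mean + np.linalg.cholesky(post_cov) @ rng.standard_normal(p)
        # sigma2 | beta, y ~ InverseGamma(shape + n/2, scale + SSR/2),
        # drawn as the reciprocal of a gamma variate
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(shape + n / 2, 1.0 / (scale + resid @ resid / 2))
        if it >= burn_in:  # keep only post burn-in draws
            betas.append(beta)
            sigma2s.append(sigma2)
    return np.array(betas), np.array(sigma2s)

# Synthetic data with known coefficients (NOT the Auto MPG data)
data_rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), data_rng.normal(size=500)])
true_beta = np.array([1.0, -0.8])
y = X @ true_beta + data_rng.normal(scale=0.1, size=500)

betas, sigma2s = gibbs_linear_regression(
    X, y, prior_mean=np.zeros(2), prior_cov=1e6 * np.eye(2),
    shape=1, scale=1e-6)
print(betas.mean(axis=0), sigma2s.mean())
```

With vague priors like these, the posterior means should land close to the coefficients used to generate the data. Running several chains would amount to calling the function with different seeds.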
Convergence Diagnostics¶
Assess the model convergence diagnostics:
bp.diagnostics.effective_sample_size(posteriors=model.posteriors)
                       intercept  cylinders  log weight  acceleration  model year  variance
Effective Sample Size    2873.12    2754.12     2685.45       2510.45     2338.35   2818.65
bp.diagnostics.autocorrelation_summary(posteriors=model.posteriors)
        intercept  cylinders  log weight  acceleration  model year   variance
Lag 0    1.000000   1.000000    1.000000      1.000000    1.000000   1.000000
Lag 1   -0.037663  -0.017715   -0.034716      0.032738   -0.012336   0.002950
Lag 5    0.020242   0.002023    0.009178      0.015885    0.037428  -0.035680
Lag 10  -0.001631  -0.007542   -0.017864     -0.009563    0.019790   0.031953
Lag 30  -0.023641  -0.010533   -0.026705      0.005014    0.002749   0.021754
bp.diagnostics.autocorrelation_plot(posteriors=model.posteriors)
See effective_sample_size, autocorrelation_summary and autocorrelation_plot for more details on diagnostics functions.
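All diagnostics show low autocorrelation: the effective sample sizes are
close to the 3000 retained draws, and the autocorrelations are near zero
at every nonzero lag. This is consistent with the chains having
converged to the stationary distribution.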
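For reference, the lag autocorrelation reported above can be computed by hand. The sketch below is an illustration in plain NumPy, not baypy's code: the helper `autocorrelation` and the two synthetic chains are assumptions. It contrasts independent draws with a strongly autocorrelated AR(1) chain.

```python
import numpy as np

def autocorrelation(draws, lag):
    """Sample autocorrelation of a 1-D sequence of draws at the given lag."""
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    if lag == 0:
        return 1.0
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(137)

# Well-mixed chain: independent draws
independent = rng.normal(size=3000)

# Poorly-mixed chain: AR(1) process where each draw depends on the previous one
noise = rng.normal(size=3000)
sticky = np.empty(3000)
sticky[0] = noise[0]
for t in range(1, 3000):
    sticky[t] = 0.9 * sticky[t - 1] + noise[t]

for lag in (0, 1, 5, 10, 30):
    print(lag, round(autocorrelation(independent, lag), 3),
          round(autocorrelation(sticky, lag), 3))
```

Values near zero at nonzero lags, as in the first column, are what the summary table above shows for every parameter.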
Posteriors Analysis¶
Analyze the posteriors:
bp.analysis.trace_plot(posteriors=model.posteriors)
The traces look good: they show no trends or drifts, which is consistent with draws from the stationary distribution.
bp.analysis.residuals_plot(model=model)
The residuals plot also looks good: there is no evidence of patterns, shapes or outliers.
bp.analysis.summary(posteriors=model.posteriors)
Number of chains:           3
Sample size per chain:   1000

Empirical mean, standard deviation, 95% HPD interval for each variable:

                  Mean        SD   HPD min   HPD max
intercept     7.396536  0.355956  6.645237  8.071871
cylinders    -0.016291  0.008272 -0.033670 -0.001302
log weight   -0.837066  0.046627 -0.928362 -0.746617
acceleration  0.003610  0.002505 -0.001175  0.008532
model year    0.031550  0.001735  0.028306  0.035048
variance      0.013651  0.000981  0.011704  0.015532

Quantiles for each variable:

                  2.5%       25%       50%       75%     97.5%
intercept     6.673473  7.168108  7.401501  7.626908  8.105342
cylinders    -0.032550 -0.021880 -0.016244 -0.010672  0.000010
log weight   -0.927876 -0.868419 -0.837485 -0.805797 -0.746021
acceleration -0.001315  0.001901  0.003640  0.005298  0.008422
model year    0.028102  0.030384  0.031521  0.032740  0.034908
variance      0.011840  0.012967  0.013594  0.014286  0.015681
See trace_plot, residuals_plot and summary for more details on analysis functions.
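The summary lists both an HPD interval and quantiles, and the two need not coincide. As an illustration, the sketch below computes a highest posterior density interval as the shortest window containing 95% of the sorted draws. This is a common textbook approach and an assumption here, not necessarily how `summary` computes it; `hpd_interval` is a hypothetical helper.

```python
import numpy as np

def hpd_interval(draws, level=0.95):
    """Shortest interval containing a `level` fraction of the draws."""
    x = np.sort(np.asarray(draws, dtype=float))
    n = x.size
    window = int(np.ceil(level * n))  # number of draws inside the interval
    # Width of every candidate window of `window` consecutive sorted draws
    widths = x[window - 1:] - x[:n - window + 1]
    start = int(np.argmin(widths))
    return x[start], x[start + window - 1]

# Skewed synthetic draws (exponential), where the HPD interval and the
# equal-tailed quantile interval differ visibly
rng = np.random.default_rng(0)
draws = rng.exponential(scale=1.0, size=20000)

hpd_low, hpd_high = hpd_interval(draws)
q_low, q_high = np.quantile(draws, [0.025, 0.975])
print(f'HPD: ({hpd_low:.3f}, {hpd_high:.3f})')
print(f'Equal-tailed: ({q_low:.3f}, {q_high:.3f})')
```

For nearly symmetric posteriors, such as the ones in the summary above, the two intervals are almost identical; for skewed ones the HPD interval is shorter.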
The summary reports statistical evidence (95% HPD intervals that exclude zero) for:
a negative effect of cylinders: a \(1\) cylinder increase would change mpg by \(e^{-0.016291} - 1 = -1.62\%\), i.e. a \(1.62\%\) decrease
a negative effect of log weight: a \(10\%\) increase in weight would change mpg by \(1.10^{-0.837066} - 1 = -7.67\%\), i.e. a \(7.67\%\) decrease
a positive effect of model year: because the response is log mpg, a \(1\) year increase would change mpg by \(e^{0.031550} - 1 = +3.21\%\), i.e. about a \(3.2\%\) increase. This effect may reflect the efficiency improvements made over the years to reduce fuel consumption
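Because the response is log mpg, each coefficient translates into a relative change in mpg. The arithmetic can be reproduced with the posterior means from the summary table; the variable names below are chosen for illustration only.

```python
import math

# Posterior means taken from the summary table above
coefficients = {
    'cylinders': -0.016291,
    'log weight': -0.837066,
    'model year': 0.031550,
}

# Regressor on its original scale: a 1-unit increase multiplies mpg by exp(coefficient)
cylinders_pct = (math.exp(coefficients['cylinders']) - 1) * 100
model_year_pct = (math.exp(coefficients['model year']) - 1) * 100

# Log-transformed regressor: a 10% increase in weight multiplies mpg by 1.10 ** coefficient
weight_pct = (1.10 ** coefficients['log weight'] - 1) * 100

print(f'+1 cylinder:   {cylinders_pct:.2f}% mpg')
print(f'+10% weight:   {weight_pct:.2f}% mpg')
print(f'+1 model year: {model_year_pct:.2f}% mpg')
```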