Salary¶
Model Set Up¶
Link to the dataset
Dataset original source
Complete example code
Determine the effect of the years of experience on salary of jobholders
using a simple linear regression model.
import pandas as pd
data = pd.read_csv(r'data/data.csv')
Set up a linear regression model, considering YearsExperience as the
regressor and Salary as the response variable.
Using non-informative priors for regressors and variance:
from baypy.model import LinearModel
import baypy as bp
model = LinearModel()
model.data = data
model.response_variable = 'Salary'
model.priors = {'intercept': {'mean': 0, 'variance': 1e12},
'YearsExperience': {'mean': 0, 'variance': 1e12},
'variance': {'shape': 1, 'scale': 1e-12}}
See LinearModel for
more information on this class and its attributes and methods.
Sampling¶
Run the regression sampling on 3 Markov chains, with 5000 iterations per each chain and discarding the first 50 burn-in draws:
from baypy.regression import LinearRegression
regression = bp.regression.LinearRegression(model = model)
LinearRegression.sample(model = model, n_iterations = 5000,
burn_in_iterations = 50, n_chains = 3, seed = 137)
See
LinearRegression
for more information on this class and its attributes and methods.
Convergence Diagnostics¶
Asses the model convergence diagnostics:
bp.diagnostics.effective_sample_size(posteriors = model.posteriors)
intercept YearsExperience variance
Effective Sample Size 14757.25 14718.82 12692.31
bp.diagnostics.autocorrelation_summary(posteriors = model.posteriors)
intercept YearsExperience variance
Lag 0 1.000000 1.000000 1.000000
Lag 1 -0.003427 0.003354 0.062828
Lag 5 0.014196 0.014811 -0.019225
Lag 10 0.002163 0.006671 0.008239
Lag 30 -0.000465 -0.002819 0.000682
bp.diagnostics.autocorrelation_plot(posteriors = model.posteriors)
See
effective_sample_size,
autocorrelation_summary
and
autocorrelation_plot
for more details on diagnostics functions.
All diagnostics show a low correlation, indicating the chains
converged to the stationary distribution.
Posteriors Analysis¶
Asses posterior analysis:
bp.analysis.trace_plot(posteriors = model.posteriors)
Traces are good, incidating draws from the stationary distribution.
bp.analysis.residuals_plot(model = model)
Also the residuals plot is good: no evidence for patterns, shapes or outliers.
bp.analysis.summary(posteriors = model.posteriors)
Number of chains: 3
Sample size per chian: 5000
Empirical mean, standard deviation, 95% HPD interval for each variable:
Mean SD HPD min HPD max
intercept 2.483031e+04 2.365229e+03 2.015569e+04 2.947465e+04
YearsExperience 9.453053e+03 3.863412e+02 8.702788e+03 1.022794e+04
variance 3.477599e+07 9.970182e+06 1.838234e+07 5.414517e+07
Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
intercept 2.013048e+04 2.329292e+04 2.484871e+04 2.639967e+04 2.946356e+04
YearsExperience 8.696984e+03 9.198969e+03 9.447750e+03 9.703882e+03 1.022423e+04
variance 2.049029e+07 2.779544e+07 3.309215e+07 3.986484e+07 5.835968e+07
See trace_plot,
residuals_plot and
summary for more details
on analysis functions.
The summary reports a statistical evidence for a positive effect of
years of experience: \(1\) year increase in experience would result in
\(\sim 9.500\) increase in salary.
Predict the salary distribution for a jobholder with 5 year of
experience, so the predictor is YearsExperience = 5:
import matplotlib.pyplot as plt
import numpy as np
distribution = model.predict_distribution(predictors = {'YearsExperience': 5})
fig_2, ax_2 = plt.subplots()
ax_2.hist(distribution, bins = int(np.sqrt(len(distribution))), color = 'blue', alpha = 0.5, density = True)
ax_2.set_xlabel('Salary')
ax_2.set_ylabel('Probability Density')
ax_2.tick_params(bottom = False, top = False, left = False, right = False)
plt.tight_layout()
plt.show()
See
LinearModel.predict_distribution
for more information on this method.
Comparing data to fitted model posteriors:
posteriors_data = pd.DataFrame()
for posterior, posterior_sample in model.posteriors.items():
posteriors_data[posterior] = np.asarray(posterior_sample).reshape(-1)
posteriors_data['error'] = np.random.normal(loc = 0, scale = np.sqrt(posteriors_data['variance']), size = len(posteriors_data))
years_experience = np.linspace(data['YearsExperience'].min(), data['YearsExperience'].max(), 50)
fig_1, ax_1 = plt.subplots()
for row in zip(*posteriors_data.to_dict('list').values()):
salary = row[0] + row[1]*years_experience + row[3]
ax_1.plot(years_experience, salary, color = 'blue', linewidth = 1, alpha = 0.1)
ax_1.plot(data['YearsExperience'].values, data['Salary'].values, marker = 'o', linestyle = '',
markerfacecolor = 'none', markeredgecolor = 'red', markeredgewidth = 1.2)
ax_1.set_xlabel('YearsExperience')
ax_1.set_ylabel('Salary')
ax_1.tick_params(bottom = False, top = False, left = False, right = False)
plt.tight_layout()
plt.show()