Salary¶

Model Set Up¶

Link to the dataset
Dataset original source
Complete example code
Determine the effect of the years of experience on salary of jobholders using a simple linear regression model.

import pandas as pd

data = pd.read_csv(r'data/data.csv')

Set up a linear regression model, considering YearsExperience as the regressor and Salary as the response variable.
Using non-informative priors for regressors and variance:

from baypy.model import LinearModel
import baypy as bp

model = LinearModel()
model.data = data
model.response_variable = 'Salary'
model.priors = {
    'intercept': {'mean': 0, 'variance': 1e12},
    'YearsExperience': {'mean': 0, 'variance': 1e12},
    'variance': {'shape': 1, 'scale': 1e-12}
}

See LinearModel for more information on this class and its attributes and methods.

Sampling¶

Run the regression sampling on 3 Markov chains, with 5000 iterations per each chain and discarding the first 50 burn-in draws:

from baypy.regression import LinearRegression

LinearRegression.sample(
    model=model,
    n_iterations=5000,
    burn_in_iterations=50,
    n_chains=3,
    seed=137
)

See LinearRegression for more information on this class and its attributes and methods.

Convergence Diagnostics¶

Asses the model convergence diagnostics:

bp.diagnostics.effective_sample_size(posteriors=model.posteriors)

                       intercept  YearsExperience  variance
Effective Sample Size   14757.25         14718.82  12692.31

bp.diagnostics.autocorrelation_summary(posteriors=model.posteriors)

        intercept  YearsExperience  variance
Lag 0    1.000000         1.000000  1.000000
Lag 1   -0.003427         0.003354  0.062828
Lag 5    0.014196         0.014811 -0.019225
Lag 10   0.002163         0.006671  0.008239
Lag 30  -0.000465        -0.002819  0.000682

bp.diagnostics.autocorrelation_plot(posteriors=model.posteriors)

See effective_sample_size, autocorrelation_summary and autocorrelation_plot for more details on diagnostics functions.
All diagnostics show a low correlation, indicating the chains converged to the stationary distribution.

Posteriors Analysis¶

Asses posterior analysis:

bp.analysis.trace_plot(posteriors=model.posteriors)

Traces are good, indicating draws from the stationary distribution.

bp.analysis.residuals_plot(model=model)

Also, the residuals plot is good: no evidence for patterns, shapes or outliers.

bp.analysis.summary(posteriors=model.posteriors)

Number of chains:           3
Sample size per chian:   5000

Empirical mean, standard deviation, 95% HPD interval for each variable:

                         Mean            SD       HPD min       HPD max
intercept        2.483031e+04  2.365229e+03  2.015569e+04  2.947465e+04
YearsExperience  9.453053e+03  3.863412e+02  8.702788e+03  1.022794e+04
variance         3.477599e+07  9.970182e+06  1.838234e+07  5.414517e+07

Quantiles for each variable:

                         2.5%           25%           50%           75%         97.5%
intercept        2.013048e+04  2.329292e+04  2.484871e+04  2.639967e+04  2.946356e+04
YearsExperience  8.696984e+03  9.198969e+03  9.447750e+03  9.703882e+03  1.022423e+04
variance         2.049029e+07  2.779544e+07  3.309215e+07  3.986484e+07  5.835968e+07

See trace_plot, residuals_plot and summary for more details on analysis functions.
The summary reports a statistical evidence for a positive effect of years of experience: \(1\) year increase in experience would result in \(\sim 9.500\) increase in salary.
Predict the salary distribution for a jobholder with 5 year of experience, so the predictor is YearsExperience = 5:

import matplotlib.pyplot as plt
import numpy as np

distribution = model.predict_distribution(predictors={'YearsExperience': 5})

fig_2, ax_2 = plt.subplots()

ax_2.hist(
    distribution,
    bins=int(np.sqrt(len(distribution))),
    color='blue',
    alpha=0.5,
    density=True
)

ax_2.set_xlabel('Salary')
ax_2.set_ylabel('Probability Density')
ax_2.tick_params(bottom=False, top=False, left=False, right=False)

plt.tight_layout()

plt.show()

See LinearModel.predict_distribution for more information on this method.
Comparing data to fitted model posteriors:

posteriors_data = pd.DataFrame()
for posterior, posterior_sample in model.posteriors.items():
    posteriors_data[posterior] = np.asarray(posterior_sample).reshape(-1)
posteriors_data['error'] = np.random.normal(
    loc=0,
    scale=np.sqrt(posteriors_data['variance']),
    size=len(posteriors_data)
)

years_experience = np.linspace(
    data['YearsExperience'].min(),
    data['YearsExperience'].max(),
    50
)


fig_1, ax_1 = plt.subplots()

for row in zip(*posteriors_data.to_dict('list').values()):
    salary = row[0] + row[1]*years_experience + row[3]
    ax_1.plot(
        years_experience,
        salary,
        color='blue',
        linewidth=1,
        alpha=0.1
    )
ax_1.plot(
    data['YearsExperience'].values,
    data['Salary'].values,
    marker='o',
    linestyle='',
    markerfacecolor='none',
    markeredgecolor='red',
    markeredgewidth=1.2
)

ax_1.set_xlabel('YearsExperience')
ax_1.set_ylabel('Salary')
ax_1.tick_params(bottom=False, top=False, left=False, right=False)

plt.tight_layout()

plt.show()